node.js - A question about being blocked when scraping a web page
PHP中文网
PHP中文网 2017-04-17 11:30:38

I'm using Node.js's http module to fetch a page:

var http = require('http');

http.get('http://example.com', function (res) {
  console.log(res.statusCode);
});

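The returned status code is 404, but the page loads normally in a browser. I get the same result when I test from my own server, so my IP probably hasn't been banned. Does this mean the site is blocking scrapers on the server side?
Also, while I'm at it: how is that done?

These are just guesses, so I'd appreciate any pointers. That's all.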


All replies (6)
左手右手慢动作

Check the User-Agent and Referer headers, look at the cookies, and check whether the page is generated dynamically with Ajax.
You can use Chrome's "Developer Tools" or Firebug to see exactly what the browser sends when it loads the page, and then add those headers to your own request.
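
As a rough sketch of that advice in Node.js, the snippet below passes browser-like headers to http.get. The helper names buildOptions and fetchStatus and all the header values are assumptions for illustration; the real values should be copied from the browser's developer tools for the page in question.

```javascript
const http = require('http');

// Hypothetical header values: replace them with what your browser actually sends.
const BROWSER_HEADERS = {
  'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 ' +
    '(KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36',
  'Referer': 'http://example.com/',
  'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
};

// Turn a URL string into an options object for http.get, with headers attached.
// This sketch handles plain http only, as in the question.
function buildOptions(pageUrl, headers = BROWSER_HEADERS) {
  const u = new URL(pageUrl);
  return {
    hostname: u.hostname,
    port: u.port || 80,
    path: u.pathname + u.search,
    headers: { ...headers },
  };
}

// Request the page and report only the status code.
function fetchStatus(pageUrl, callback) {
  http.get(buildOptions(pageUrl), (res) => {
    res.resume(); // discard the body so the socket is released
    callback(null, res.statusCode);
  }).on('error', callback);
}
```

Calling `fetchStatus('http://example.com', (err, status) => console.log(err || status));` should then show whether the status code changes once the headers are present. If it is still 404, a cookie may also be required.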

巴扎黑

Take a look at "pyspider crawler tutorial (2): AJAX and HTTP".
Although it is written for pyspider, it explains the underlying principles.

阿神

I can crawl the website you mentioned without any problem.

I wonder whether there is a bug in your program. Does it work when you crawl other websites?

<?php

$res = get('http://www.1yyg.com/');

echo $res[0];
echo $res[1];

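// Fetch $url with browser-like headers and return [response headers, body].
function get($url, $cookie = '', $referer = '') {
    $ch = curl_init();
    curl_setopt_array($ch, [
        CURLOPT_URL => $url,
        CURLOPT_REFERER => $referer,
        CURLOPT_USERAGENT => 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2164.0 Safari/537.36',
        CURLOPT_COOKIE => $cookie,
        CURLOPT_HEADER => true,         // include the response headers in the output
        CURLOPT_RETURNTRANSFER => true, // return the response instead of printing it
        CURLOPT_TIMEOUT => 4            // seconds
    ]);

    $response = curl_exec($ch);
    if ($response === false) {
        // The request itself failed (timeout, DNS error, ...); nothing to split.
        curl_close($ch);
        return ['', ''];
    }

    // Split the raw response into headers and body.
    $header_size = curl_getinfo($ch, CURLINFO_HEADER_SIZE);
    $header = substr($response, 0, $header_size);
    $body = substr($response, $header_size);

    curl_close($ch);
    return [$header, $body];
}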

Could you try this function I wrote? It works for me, at any rate.

伊谢尔伦

My guess is that a Content-Security-Policy may be set.

大家讲道理

Add all the headers that the browser sends with its request, and it should work.

PHPzhong

It's very simple. Your request has no User-Agent (UA) header, so the server treats it as an attack and blocks it outright. Add the UA, Referer, and so on, and you will be able to scrape the page.
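
On the "how is this done" part of the question, one plausible mechanism is a server-side check on the request headers. The sketch below is only an illustration of that idea, not the target site's actual logic: statusFor and createPickyServer are hypothetical names, and the rule (404 for a missing or empty User-Agent) is an assumption.

```javascript
const http = require('http');

// Hypothetical rule: a request with no User-Agent is treated as a bot.
// Node lowercases incoming header names, hence 'user-agent'.
function statusFor(headers) {
  const ua = headers['user-agent'];
  const hasUa = typeof ua === 'string' && ua.trim() !== '';
  return hasUa ? 200 : 404;
}

// A tiny server that applies the rule to every request.
function createPickyServer() {
  return http.createServer((req, res) => {
    const status = statusFor(req.headers);
    res.writeHead(status, { 'Content-Type': 'text/plain' });
    res.end(status === 200 ? 'hello' : 'not found');
  });
}
```

Node's http.get does not add a User-Agent by default, so after `createPickyServer().listen(8080)` the code from the question pointed at `http://localhost:8080` would be answered with 404 under this rule, while a browser would get 200.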
