node.js - A question about being blocked when scraping a web page

Question

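I'm using Node.js's http module to fetch a page: {code...} The status code returned is 404, but the page opens normally in a browser. I get the same result when testing from my own server, so my IP probably hasn't been banned. Does this mean the site is blocking scraping on the server side? Also...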

ringa_lee · Answer

Check the User-Agent and Referer headers, look at the cookies, and check whether the page content is generated dynamically with Ajax.
Use Chrome's Developer Tools or Firebug to see what the browser sends when you open the page, then add those headers to your request.
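
As a minimal sketch of that advice in Node.js, assuming the values have been copied from the browser's developer tools: the helper names `browserLikeHeaders` and `fetchPage`, the User-Agent string, and the example URL in the comment are illustrative and not taken from the asker's code.

```javascript
const http = require('http');
const https = require('https');

// Build request headers that resemble what a browser sends.
// The User-Agent value is only an example; copy the real one from your browser.
function browserLikeHeaders({ referer, cookie } = {}) {
  const headers = {
    'User-Agent':
      'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.36 ' +
      '(KHTML, like Gecko) Chrome/39.0.2164.0 Safari/537.36',
    Accept: 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
  };
  if (referer) headers.Referer = referer;
  if (cookie) headers.Cookie = cookie;
  return headers;
}

// Fetch a page with those headers; resolves with status, headers and body.
// The body is decoded as UTF-8 here, which is an assumption about the site.
function fetchPage(url, extra = {}) {
  const client = url.startsWith('https:') ? https : http;
  return new Promise((resolve, reject) => {
    const req = client.get(url, { headers: browserLikeHeaders(extra) }, (res) => {
      const chunks = [];
      res.on('data', (chunk) => chunks.push(chunk));
      res.on('end', () =>
        resolve({
          status: res.statusCode,
          headers: res.headers,
          body: Buffer.concat(chunks).toString('utf8'),
        })
      );
    });
    req.on('error', reject);
  });
}

// Example call (not executed here):
// fetchPage('http://example.com/', { referer: 'http://example.com/' })
//   .then((res) => console.log(res.status));
```

If the status is still 404 with these headers, comparing `res.headers` with what the browser receives may show what else differs.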

巴扎黑 · Answer

Take a look at the pyspider crawler tutorial (2): AJAX and HTTP.
Although it is written for pyspider, it explains the underlying principles.

阿神 · Answer

I can crawl the website you mentioned without any problem.

I wonder whether there is a bug in your program. Does crawling other websites work?

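// The opening lines of this snippet were lost in the original post; the
// function name and parameter list below are assumed.
function fetch_page($url, $referer = '', $cookie = '')
{
    $ch = curl_init();
    curl_setopt_array($ch, [
        CURLOPT_URL => $url,
        CURLOPT_REFERER => $referer,
        CURLOPT_USERAGENT => 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2164.0 Safari/537.36',
        CURLOPT_COOKIE => $cookie,
        CURLOPT_HEADER => 1,          // include the response headers in the output
        CURLOPT_RETURNTRANSFER => 1,  // return the response instead of printing it
        CURLOPT_TIMEOUT => 4          // give up after 4 seconds
    ]);

    $response = curl_exec($ch);
    $header_size = curl_getinfo($ch, CURLINFO_HEADER_SIZE);
    $header = substr($response, 0, $header_size); // raw header block (could be passed to http_parse_headers)
    $body = substr($response, $header_size);

    curl_close($ch);
    return [$header, $body];
}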

Try this function I wrote. It works for me, at least.
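
On the question of whether the program itself has a bug: the asker's code is not shown, so this is only a guess, but one mistake that can produce a 404 in Node.js is splitting the URL into `hostname` and `path` by hand and losing the path or query string. A hedged sketch that derives the options from the full URL instead (the helper name `optionsFromUrl` is hypothetical):

```javascript
// Derive http.request() options from a full URL instead of splitting it by hand.
function optionsFromUrl(urlString) {
  const u = new URL(urlString);
  return {
    protocol: u.protocol, // 'http:' or 'https:'; use the matching http/https module
    hostname: u.hostname, // host name only, without scheme, port or path
    port: u.port ? Number(u.port) : (u.protocol === 'https:' ? 443 : 80),
    path: u.pathname + u.search, // path plus query string
  };
}

// Example (not executed here):
// const opts = optionsFromUrl('http://example.com/list?page=2');
// require('http').request(opts, (res) => console.log(res.statusCode)).end();
```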

伊谢尔伦 · Answer

My guess is that a Content-Security-Policy header may be set.

大家讲道理 · Answer

Add all the headers that the browser sends with its request, and it should work.
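
One way to do that without retyping each header, sketched here under the assumption that the request headers were copied from the developer tools as plain "Name: value" lines, is to parse the copied text into an object for Node.js. The helper name `parseCopiedHeaders` and the list of skipped headers are my own choices, not part of any library.

```javascript
// Headers best left for Node.js to set itself, or that change how the body
// must be decoded (Accept-Encoding may make the server send gzip).
const SKIPPED = new Set(['host', 'connection', 'content-length', 'accept-encoding']);

// Turn a block of "Name: value" lines into a headers object that can be
// passed as the `headers` option of http.request() or http.get().
function parseCopiedHeaders(raw) {
  const headers = {};
  for (const line of raw.split(/\r?\n/)) {
    const trimmed = line.trim();
    // Skip blank lines and HTTP/2 pseudo-headers such as ":authority".
    if (!trimmed || trimmed.startsWith(':')) continue;
    const idx = trimmed.indexOf(':');
    if (idx === -1) continue; // not a "Name: value" line
    const name = trimmed.slice(0, idx).trim();
    const value = trimmed.slice(idx + 1).trim();
    if (SKIPPED.has(name.toLowerCase())) continue;
    headers[name] = value;
  }
  return headers;
}
```

Once the request succeeds with the full set, headers can be removed one at a time to find out which ones the server actually checks.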

PHPz · Answer

It's very simple. Your request has no User-Agent header, so the server treats it as an attack and blocks it. Add a User-Agent, Referer, and so on, and you will be able to fetch the page.