I'm using Node.js's http module to fetch a page:
var http = require('http');
// Request the page and print the HTTP status code of the response.
http.get('http://example.com', function (res) {
  console.log(res.statusCode);
});
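The status code returned is 404, but the page loads fine in a browser. I get the same result when testing from my own server, so my IP probably has not been banned. Does this mean the site is blocking scraping on the server side?
Also, while I'm at it: how is that done?
I'm only guessing here, so any pointers would be appreciated. Thanks.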
Check User-Agent and Referer, and look at Cookie. Also check whether the page is generated dynamically with Ajax. You can use Chrome's "Developer Tools" or Firebug to see what the browser sends when you open the page again, and then add those things to your request.
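As a rough sketch of the advice above, the request below sends a browser-like User-Agent, Referer and Cookie using Node's built-in http module. The helper names (buildOptions, fetchStatus) and all header values are placeholders chosen for illustration; the real values should be copied from what your browser's developer tools show.

```javascript
const http = require('http');

// Hypothetical helper: build request options with browser-like headers.
// All header values are placeholders; replace them with the ones your
// browser actually sends (see the Network tab of the developer tools).
function buildOptions(pageUrl, cookie) {
  const u = new URL(pageUrl);
  const headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36',
    'Referer': u.origin + '/',
    'Accept': 'text/html,application/xhtml+xml'
  };
  if (cookie) {
    headers['Cookie'] = cookie;
  }
  return {
    hostname: u.hostname,
    port: u.port || 80,
    path: u.pathname + u.search,
    headers: headers
  };
}

// Fetch the page and report the status code through the callback.
function fetchStatus(pageUrl, cookie, callback) {
  http.get(buildOptions(pageUrl, cookie), function (res) {
    res.resume(); // drain the body; only the status code is needed here
    callback(null, res.statusCode);
  }).on('error', callback);
}

// Example (performs a real request):
// fetchStatus('http://example.com/', null, function (err, code) {
//   console.log(err ? err.message : code);
// });
```

If the status code changes once these headers are present, the server is most likely filtering on them.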
You can take a look at the pyspider crawler tutorial (2): AJAX and HTTP.
Although it is written around pyspider, it explains the underlying principles.
I can crawl the website you mentioned without any problem.
I wonder whether there is a bug in your program. What happens when you crawl other websites?
Can you try the one I wrote? It works for me, at any rate.
My guess is that a Content-Security-Policy may be set.
Add all the headers that the browser sends with its request, and it should work.
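One way to carry over everything the browser sent is to copy the request headers from the developer tools as text and parse them into an object. The sketch below assumes the copied text is a list of "Name: value" lines; the function name parseCopiedHeaders and the list of skipped headers are my own assumptions, not part of any library.

```javascript
// Headers that are better left for Node to set, or that change how the
// response must be decoded. This list is an assumption, not exhaustive.
const SKIPPED = ['host', 'content-length', 'connection', 'accept-encoding'];

// Turn a block of "Name: value" lines into a headers object.
function parseCopiedHeaders(rawText) {
  const headers = {};
  rawText.split(/\r?\n/).forEach(function (line) {
    const trimmed = line.trim();
    // Skip blank lines and HTTP/2 pseudo-headers such as ":authority".
    if (trimmed === '' || trimmed.startsWith(':')) return;
    const idx = trimmed.indexOf(':');
    if (idx <= 0) return; // not a "Name: value" line
    const name = trimmed.slice(0, idx).trim();
    const value = trimmed.slice(idx + 1).trim();
    if (SKIPPED.indexOf(name.toLowerCase()) !== -1) return;
    headers[name] = value;
  });
  return headers;
}
```

The result can be passed as the headers option of http.get. Accept-Encoding is dropped here because a gzip-compressed response would otherwise have to be decompressed before reading it.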
It's very simple. Your request has no User-Agent header, so the server treats it as an attack and blocks it. Adding User-Agent, Referer, etc. should let you fetch the page.
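As for how a server might do this: it can inspect the request headers and answer with an error status when they do not look like a browser. The following is only a minimal sketch of that idea, not a description of what the site in question actually does; the function names, the choice of 404, and the port are all assumptions.

```javascript
const http = require('http');

// Hypothetical rule: treat a request without a User-Agent header as a bot.
function statusFor(headers) {
  const ua = headers['user-agent']; // Node lower-cases incoming header names
  return ua && ua.trim() !== '' ? 200 : 404;
}

// Create a server that applies the rule to every request.
function createFilteringServer() {
  return http.createServer(function (req, res) {
    const status = statusFor(req.headers);
    res.writeHead(status, { 'Content-Type': 'text/plain' });
    res.end(status === 200 ? 'hello\n' : 'not found\n');
  });
}

// Example (starts a real server; the port is arbitrary):
// createFilteringServer().listen(8080);
```

Node's http.get does not add a User-Agent header on its own, so a bare http.get against a server like this would see 404 while a browser sees 200.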