Web crawler - How can a Python crawler correctly determine whether a page can be crawled?

Question

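I'm writing a crawler in Python 2.7 and want to crawl some websites, so I need to determine whether a page can be crawled. My first thought was to check the status code, but after writing and running the crawler I found that many target sites, when asked for a page that does not exist, return a 404 error page whose status code is nevertheless 200, so the crawler ends up fetching...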

阿神 · Answer

First of all, a 200 status code only tells you that the HTTP request itself succeeded. It says nothing about whether the page contains what you want, so checking for 200 alone will not work for every website.

Secondly, when writing a crawler you need to look at how each target website actually behaves. Inspect a few pages by hand first and look for patterns, for example whether the content returned for a missing page has any distinguishing features.
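As a rough illustration of checking the returned content for a distinguishing feature, here is a minimal sketch that looks at the page title. The function name `looks_like_soft_404` and the marker phrases are assumptions made up for this example; the real markers have to come from inspecting each target site by hand. The sketch is written for Python 3.

```python
import re

# Hypothetical marker phrases. Replace them with whatever the target
# site's error pages actually put in their <title>.
SOFT_404_MARKERS = ("404", "not found", "page does not exist")

TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def looks_like_soft_404(html, markers=SOFT_404_MARKERS):
    """Return True if the page title contains one of the marker phrases."""
    match = TITLE_RE.search(html)
    if match is None:
        # No title at all: this check cannot decide, so do not flag the page.
        return False
    title = match.group(1).strip().lower()
    return any(marker in title for marker in markers)
```

A crawler would call this only after the status code check has passed, and skip the page when it returns True.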

黄舟 · Answer

Check the content of the page. If the page has no content, return immediately and skip it.
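One possible way to test for "no content" is to strip the markup and see whether any visible text is left. The sketch below uses only the standard library's `html.parser`; the names `visible_text` and `has_content` and the `min_chars` threshold are assumptions for illustration. It targets Python 3 (on Python 2.7 the module is named `HTMLParser` and the class setup would need adjusting).

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes, ignoring anything inside <script> or <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def visible_text(html):
    """Return the page's text with tags removed and whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


def has_content(html, min_chars=1):
    """Return True if at least min_chars of visible text remain."""
    return len(visible_text(html)) >= min_chars
```

In the crawl loop this would be used as an early exit: if `has_content(body)` is false, return without parsing further. Raising `min_chars` makes the check stricter for sites whose empty pages still carry a little navigation text.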

怪我咯 · Answer

Even if the page status code is 200, the returned 404 page should contain different HTML elements from a normal, crawlable page. You can therefore also decide whether a page is a 404 page by checking whether specific HTML elements are present.
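A minimal sketch of the element-based check, again with the standard library only. The helper names and the `div` with class `error-404` are hypothetical; substitute whichever tag, attribute, and value actually distinguish the error page on the site being crawled. Written for Python 3.

```python
from html.parser import HTMLParser


class _ElementFinder(HTMLParser):
    """Sets .found when a start tag carries the given attribute value."""

    def __init__(self, tag, attr, value):
        super().__init__()
        self.tag = tag
        self.attr = attr
        self.value = value
        self.found = False

    def handle_starttag(self, tag, attrs):
        if tag != self.tag:
            return
        for name, val in attrs:
            # Split on whitespace so one class among several still matches.
            if name == self.attr and val is not None and self.value in val.split():
                self.found = True


def has_element(html, tag, attr, value):
    """Return True if html contains <tag attr="... value ...">."""
    finder = _ElementFinder(tag, attr, value)
    finder.feed(html)
    finder.close()
    return finder.found


def is_error_page(html):
    # Hypothetical signature of an error page on the target site.
    return has_element(html, "div", "class", "error-404")
```

The same helper can be turned around to require an element that only real pages have, for example `has_element(body, "div", "id", "article")`, which is often more robust than listing every possible error layout.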