Web crawler - [See image] HTML fetched with Python differs from the source shown in the browser

Question

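As shown in the image, the HTML page fetched with Python differs slightly from what the browser displays. Both should be the same static HTML page from the server, so why would they differ? Site URL: click here. Crawler source: {code...} This question may be a bit "Kong Yiji" (pedantic), but I really am quite...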

伊谢尔伦 · Answer

I tested this, and the conclusion is that bs4 changes the order of the attributes; the HTML returned by the server is the same.

1. Right-click the page in the browser and select:

Inspect Element

View Page Source

2. Compare the three sources in a Python 3 program:

import re
from urllib.request import urlopen

import requests as req
from bs4 import BeautifulSoup

URL = 'http://www.pythonscraping.com/pages/page3.html'
# Match a <tr ...> opening tag that carries at least one attribute.
ptn_tr = re.compile(r'<tr[^>]+>')

# 1) Raw HTML as returned by requests
html = req.get(URL).text
print('requests:\t', ptn_tr.findall(html)[0])

# 2) Raw HTML as returned by urllib
html = urlopen(URL).read().decode()
print('urlopen:\t', ptn_tr.findall(html)[0])

# 3) The same HTML after a round trip through BeautifulSoup
html = str(BeautifulSoup(html, "lxml"))
print('bs4Soup:\t', ptn_tr.findall(html)[0])

Result:

requests:     
urlopen:     
bs4Soup:

阿神 · Answer

Only the order of the class and id attributes differs.
If you view the source of the same page in Chrome and in Firefox, the order differs there too.
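To make the point concrete, here is a minimal sketch showing that two tags differing only in attribute order are different strings but the same element once parsed. The two sample tags are illustrative values I made up for the demonstration, not output copied from the page, and `FirstTagAttrs` and `parse_tag` are hypothetical helper names.

```python
from html.parser import HTMLParser


class FirstTagAttrs(HTMLParser):
    """Record the name and attributes of the first start tag seen."""

    def __init__(self):
        super().__init__()
        self.first_tag = None
        self.first_attrs = None

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs in source order;
        # storing it as a dict discards that order.
        if self.first_tag is None:
            self.first_tag = tag
            self.first_attrs = dict(attrs)


def parse_tag(markup):
    """Return (tag name, attribute dict) for the first tag in markup."""
    parser = FirstTagAttrs()
    parser.feed(markup)
    parser.close()
    return parser.first_tag, parser.first_attrs


# Illustrative sample tags (assumed, not taken from the real page output).
raw_tag = '<tr id="gift1" class="gift">'
reordered_tag = '<tr class="gift" id="gift1">'

print(raw_tag == reordered_tag)                        # False: the strings differ
print(parse_tag(raw_tag) == parse_tag(reordered_tag))  # True: same element
```

So if you need to check whether two copies of a page are really the same, compare the parsed tags and attributes rather than the raw text.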

高洛峰 · Answer

I suggest the asker post the site URL and the code so that people can help debug it. Some difference is normal. If the content your crawler saves as a static page differs from what you see in the browser, the site's anti-crawler mechanism may have identified your request, in which case the server returns different content. There are many ways to detect a crawler. If you still have questions, feel free to ask again.
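One way to check whether the server really sent different content is to diff the crawler's copy against a copy saved from the browser. The sketch below uses the standard `difflib` module; the two HTML strings are made-up stand-ins for the saved files, and `diff_html` is a hypothetical helper name.

```python
import difflib


def diff_html(crawled, browser):
    """Return the unified-diff lines between two HTML strings."""
    return list(difflib.unified_diff(
        crawled.splitlines(),
        browser.splitlines(),
        fromfile="crawler.html",
        tofile="browser.html",
        lineterm="",
    ))


# Made-up sample content standing in for the two saved pages.
crawled = "<html>\n<body>\n<p>Access denied</p>\n</body>\n</html>"
browser = "<html>\n<body>\n<p>Welcome</p>\n</body>\n</html>"

for line in diff_html(crawled, browser):
    print(line)
```

An empty diff means both copies are identical. A diff that shows only reordered attributes or whitespace points at the parser or serializer, not at anti-crawler behaviour.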

巴扎黑 · Answer

I suggest you post the full source code, because a website can tell whether a request comes from a person using a browser or from a crawler.

Judging from the code posted so far, I recommend adding request headers, in particular the User-Agent line!
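A minimal sketch of adding that header with the standard library, assuming the site only checks User-Agent. The `BROWSER_UA` string and the helper names `build_request` and `fetch` are my own placeholders; substitute the User-Agent your own browser sends.

```python
from urllib.request import Request, urlopen

# Hypothetical browser-like User-Agent string (an assumption for this sketch).
BROWSER_UA = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0 Safari/537.36"
)


def build_request(url, user_agent=BROWSER_UA):
    """Build a GET request that carries a browser-like User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent})


def fetch(url):
    """Download url with the custom header and return the decoded body."""
    with urlopen(build_request(url), timeout=10) as rsp:
        charset = rsp.headers.get_content_charset() or "utf-8"
        return rsp.read().decode(charset, errors="replace")


# Building the request does not touch the network, so the header can be inspected.
# urllib stores header names capitalized, hence "User-agent" here.
request = build_request("http://www.pythonscraping.com/pages/page3.html")
print(request.get_header("User-agent"))
```

Call `fetch(url)` to actually download the page. With requests, the equivalent is passing `headers={"User-Agent": ...}` to `req.get`.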