Web crawler - [see screenshot] The HTML page fetched with Python differs from the source code shown in the browser
高洛峰 2017-04-18 09:31:41

All replies (4)
伊谢尔伦

I tested this, and the conclusion is that bs4 (BeautifulSoup) changes the order of the attributes when it writes the parsed page back out as text.

1. Right-click the page in the browser and choose either of:

Inspect Element

View Page Source

2. Compare the output of each method in a Python 3 program:

import re
# Match the opening <tr ...> tag together with its attributes
ptn_tr = re.compile(r'<tr[^>]+>')

# 1) Raw HTML as returned by requests
import requests as req
rsp = req.get('http://www.pythonscraping.com/pages/page3.html')
html = rsp.text
print('requests:\t', ptn_tr.findall(html)[0])

# 2) Raw HTML as returned by urllib
from urllib.request import urlopen
rsp = urlopen("http://www.pythonscraping.com/pages/page3.html")
html = rsp.read().decode()
print('urlopen:\t', ptn_tr.findall(html)[0])

# 3) The same HTML after BeautifulSoup has parsed it and written it back out
from bs4 import BeautifulSoup
html = str(BeautifulSoup(html, "lxml"))
print('bs4Soup:\t', ptn_tr.findall(html)[0])

Result:

requests:     <tr id="gift1" class="gift">
urlopen:     <tr id="gift1" class="gift">
bs4Soup:     <tr class="gift" id="gift1">
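
Since the result above differs only in attribute order, a string comparison is the wrong test. A minimal sketch, using only the standard library, that compares the attributes of each tag as dictionaries instead. The names TagAttrCollector and tag_attrs are made up for this illustration and are not part of bs4 or requests:

```python
from html.parser import HTMLParser


class TagAttrCollector(HTMLParser):
    """Collect the attributes of every start tag with a given name."""

    def __init__(self, tag_name):
        super().__init__()
        self.tag_name = tag_name
        self.found = []  # one attribute dict per matching tag

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs in source order;
        # turning it into a dict makes later comparisons order-insensitive.
        if tag == self.tag_name:
            self.found.append(dict(attrs))


def tag_attrs(html, tag_name="tr"):
    """Return the attribute dicts of all <tag_name> start tags in html."""
    collector = TagAttrCollector(tag_name)
    collector.feed(html)
    collector.close()
    return collector.found


# The two tags from the result above
raw = '<tr id="gift1" class="gift">'
reserialized = '<tr class="gift" id="gift1">'

print(raw == reserialized)                        # False: the strings differ
print(tag_attrs(raw) == tag_attrs(reserialized))  # True: same attributes
```

The same idea applies to whole pages: compare parsed structures, not serialized text.
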
阿神

Only the order of the class and id attributes differs.
If you view the source of the same page in Chrome and in Firefox, the order can differ there as well.

小葫芦

I suggest the poster share the URL, or better still the code, so that everyone can help debug it. Some difference is normal. However, if the page your crawler saves as a static file differs in content from what you see in the browser, the site's anti-crawler mechanism has most likely recognized the crawler, and the server is returning different content to it. There are many ways for a site to identify crawlers. If you still have questions, feel free to ask again.
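
To check whether the saved page really differs in content, a line-by-line diff of the two versions can help. A rough sketch with the standard library's difflib; the two HTML strings are invented sample data and html_diff is a hypothetical helper name:

```python
import difflib


def html_diff(browser_html, crawler_html):
    """Return unified-diff lines between two versions of a page."""
    return list(difflib.unified_diff(
        browser_html.splitlines(),
        crawler_html.splitlines(),
        fromfile="browser",
        tofile="crawler",
        lineterm="",
    ))


# Invented sample data: what the browser saved vs. what the crawler received
browser_html = "<html>\n<body>\n<p>Gift list</p>\n</body>\n</html>"
crawler_html = "<html>\n<body>\n<p>Access denied</p>\n</body>\n</html>"

for line in html_diff(browser_html, crawler_html):
    print(line)
```

An empty diff means the two versions are identical line for line. Note that a plain text diff will also flag harmless differences such as attribute order.
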

巴扎黑

I suggest the poster share the full source code, because a website can tell whether a request comes from a person using a browser or from a crawler.

Judging from the code shown so far, you should add request headers, in particular the User-Agent line.
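
A minimal sketch of adding a User-Agent header, using urllib from the standard library. The User-Agent string below is a made-up placeholder, and build_request and fetch are hypothetical helper names; whether a given site then serves the same page as it does to a browser depends on how that site detects crawlers:

```python
import urllib.request

# Placeholder value (an assumption): replace it with the User-Agent string
# that your own browser sends.
BROWSER_UA = "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0"


def build_request(url, user_agent=BROWSER_UA):
    """Build a request object that carries a browser-like User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url, user_agent=BROWSER_UA):
    """Download url with the given User-Agent and return the decoded text."""
    with urllib.request.urlopen(build_request(url, user_agent), timeout=10) as rsp:
        charset = rsp.headers.get_content_charset() or "utf-8"
        return rsp.read().decode(charset, errors="replace")


# Building the request does not touch the network; only fetch() does.
req = build_request("http://www.pythonscraping.com/pages/page3.html")
print(req.get_header("User-agent"))  # urllib stores the header name as "User-agent"
```

With requests, the equivalent is to pass the same dictionary as the headers argument: req.get(url, headers={"User-Agent": BROWSER_UA}).
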
