For the same page, with nearly identical code, everything parses and runs fine under Python 3 on Windows 8. But after porting the code to Python 2.7 on Ubuntu, the fetched page can no longer be parsed by BeautifulSoup, and find_all('table') returns an empty result.
Part of the problematic code (it does run):
```python
#coding:utf-8
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
import urllib2
from bs4 import BeautifulSoup

postdata = "T1=&T2=1&T3=&T4=&T5=&APPDate=&T7=&T8=&T9=&PRDate=&T11=&SQDate=&JDDate=&T14=&T15=&T16=&T17=&SDDate=&T19=&T20=&T21=&D1=%B8%B4%C9%F3&D2=jdr&D3=%C9%FD%D0%F2&C1=fm&C2=&C3=&page=70"
postdata = postdata.encode('utf-8')
headers = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6',
           'Referer': 'http://app.sipo-reexam.gov.cn/reexam_out/searchdoc/searchfs.jsp'}
req = urllib2.Request(
    url="http://app.sipo-reexam.gov.cn/reexam_out/searchdoc/searchfs.jsp",
    headers=headers,
    data=postdata)
fp = urllib2.urlopen(req)
mybytes = fp.read().decode('gbk').encode('utf-8')
soup = BeautifulSoup(mybytes, from_coding="uft-8")
print soup.original_encoding
print soup.prettify()
```
Any pointers would be much appreciated.
Have you tried changing the parser?
Python 2.7's built-in HTML parser is very poor at tolerating malformed markup.
lxml is recommended.
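Not a full rewrite, just a minimal sketch of what switching parsers looks like, assuming lxml is installed (pip install lxml) and reusing the mybytes variable already read in the question:

```python
from bs4 import BeautifulSoup

# mybytes is the response body from the question's code; only the parser choice changes here.
soup = BeautifulSoup(mybytes, 'lxml')   # second positional argument names the parser
tables = soup.find_all('table')
print len(tables)                       # non-empty if lxml copes with the broken markup
```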
Well, this is mainly an encoding issue... if you don't understand how Python handles encodings, it is definitely a big pitfall.

A few lines in your code look problematic, in particular the `.decode('gbk').encode('utf-8')` round trip and the `from_coding="uft-8"` argument. Specifically:

- No encoding conversion is required; bs can accept any encoding, and unicode is best. So even if you do convert, stop at the `decode` step rather than encoding back to bytes.
- The bs constructor is used as `BeautifulSoup(html, 'html5lib')`; the second parameter is the parser, not the encoding.
- Just `print soup` and you will get the result. Whether Chinese displays correctly is mainly an encoding matter. bs's own encoding conversion is not that strong, so calling it on plain text can also cause problems; `soup.prettify('utf-8')` can ensure the output encoding is correct.
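Putting those points together, here is a minimal sketch of the corrected tail of the script (not your literal code; it assumes html5lib is installed and reuses the `req` object built earlier in the question):

```python
# -*- coding: utf-8 -*-
# Sketch: decode once to unicode, name the parser explicitly, and let
# prettify() handle the output encoding. Variable names follow the question.
import urllib2
from bs4 import BeautifulSoup

fp = urllib2.urlopen(req)               # `req` is built exactly as in the question
html = fp.read().decode('gbk')          # decode only -- no .encode() back to bytes
soup = BeautifulSoup(html, 'html5lib')  # second argument is the parser, not an encoding
print soup.prettify('utf-8')            # force utf-8 output so Chinese prints correctly
print len(soup.find_all('table'))       # should no longer be empty
```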