python - Encoding problems when parsing Chinese web pages with BeautifulSoup
大家讲道理 2017-04-17 14:26:52

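For the same page and almost identical code, parsing works fine under Python 3 on Windows 8. After porting the code to Ubuntu with Python 2.7, however, the fetched page can no longer be parsed by BeautifulSoup: find_all('table') returns an empty result.
Part of the problematic code (runnable):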

#coding:utf-8
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
import urllib2
from bs4 import BeautifulSoup
postdata = "T1=&T2=1&T3=&T4=&T5=&APPDate=&T7=&T8=&T9=&PRDate=&T11=&SQDate=&JDDate=&T14=&T15=&T16=&T17=&SDDate=&T19=&T20=&T21=&D1=%B8%B4%C9%F3&D2=jdr&D3=%C9%FD%D0%F2&C1=fm&C2=&C3=&page=70"
postdata = postdata.encode('utf-8')
headers = {'User-Agent':'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6','Referer':'http://app.sipo-reexam.gov.cn/reexam_out/searchdoc/searchfs.jsp'}
req = urllib2.Request(
      url = "http://app.sipo-reexam.gov.cn/reexam_out/searchdoc/searchfs.jsp",
      headers = headers,
      data = postdata)
fp  = urllib2.urlopen(req)
mybytes = fp.read().decode('gbk').encode('utf-8')
soup = BeautifulSoup(mybytes,from_coding="uft-8")
print soup.original_encoding
print soup.prettify()

Any pointers would be appreciated.


reply all(2)
阿神

Have you tried a different parser?
The HTML parser built into Python 2.7 tolerates malformed markup poorly.
lxml is recommended.
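
One way to act on this suggestion without hard-coding a parser is to pick the most lenient one that happens to be installed and fall back to the built-in html.parser otherwise. The sketch below is only an illustration: the helper name pick_parser and the preference order are assumptions, not part of bs4.

```python
def pick_parser(candidates=(("lxml", "lxml"), ("html5lib", "html5lib"))):
    """Return the first bs4 parser name whose backing module can be imported.

    `candidates` is a sequence of (bs4 parser name, module name) pairs.
    The default order is an assumption made for this sketch.
    """
    for parser_name, module_name in candidates:
        try:
            __import__(module_name)
        except ImportError:
            continue  # not installed, try the next candidate
        return parser_name
    # Always available, because it ships with Python itself.
    return "html.parser"


chosen = pick_parser()
print(chosen)
```

The result would then be passed as the second argument, for example BeautifulSoup(html, pick_parser()).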

大家讲道理

Well, this is mainly an encoding issue. If you don't understand how Python handles encodings, it is a real pitfall.
These lines look problematic to me:

1. mybytes = fp.read().decode('gbk').encode('utf-8')
2. soup = BeautifulSoup(mybytes,from_coding="uft-8")
3. print soup.original_encoding
4. print soup.prettify()

Taking them in turn:

  1. No encoding conversion is needed. bs accepts input in any encoding, and unicode is best, so if you convert at all, stop after decode.

  2. The constructor is used as BeautifulSoup(html, 'html5lib'): the second argument is the parser, not the encoding. (An encoding hint goes in the from_encoding keyword; the from_coding="uft-8" above misspells both the keyword and the encoding name.)

  3. Just print soup to see the result. Whether the Chinese text displays correctly depends mainly on the encoding. bs's own encoding conversion is not very robust, so handing it undecoded byte strings can also cause problems.

  4. soup.prettify('utf-8') ensures the output is encoded as UTF-8.
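
Taken together, the four points above could be sketched roughly as follows. This is a minimal illustration, not a drop-in fix: it assumes the server really sends GBK, that bs4 is installed, and that html.parser is good enough for the page. The helper names decode_page and find_tables are made up for this example, and an inline sample stands in for fp.read().

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals


def decode_page(raw_bytes, encoding="gbk"):
    """Decode raw response bytes to unicode once, with no re-encode step."""
    # "replace" keeps going on a stray bad byte (an assumption of this
    # sketch; use "strict" if silent replacement is not acceptable).
    return raw_bytes.decode(encoding, "replace")


def find_tables(raw_bytes, parser="html.parser"):
    """Parse GBK bytes; return (list of <table> tags, prettified UTF-8 bytes)."""
    # Imported here so decode_page can be used without bs4 installed.
    from bs4 import BeautifulSoup

    # The second positional argument is the parser name, not an encoding.
    soup = BeautifulSoup(decode_page(raw_bytes), parser)
    return soup.find_all("table"), soup.prettify("utf-8")


# Stand-in for fp.read(): a tiny GBK-encoded page (hypothetical content).
sample = "<html><body><table><tr><td>复审</td></tr></table></body></html>".encode("gbk")
text = decode_page(sample)
print(len(text))
```

With a real response, find_tables(fp.read()) would replace the decode/encode round trip and the misspelled keyword in the original code.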
