html - python decode('utf-8') error: invalid start byte?
阿神 2017-04-17 17:27:16

I'm writing a Python crawler, and while building the downloader I found that some pages (others work fine) can't be decoded with decode('utf-8'). Looking at the page source, it does contain <meta charset=UTF-8>, which says the page is UTF-8 encoded, so why does decoding fail?

Error output for a page that fails to decode:

Traceback (most recent call last):
  File "E:/python爬虫/test.py", line 13, in <module>
    print(data.decode('utf-8'))
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

Here is the code I use to fetch the HTML data and decode it:

import urllib.request

url = 'http://wiki.52poke.com/wiki/%E8%B7%AF%E5%8D%A1%E5%88%A9%E6%AC%A7'
req = urllib.request.Request(url)
res = urllib.request.urlopen(req)
data = res.read()
print(data)
print(data.decode('utf-8'))

Output (the data from a page that fails to decode; only part of it is shown, since it's long):

b'\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x03\xed\xbdys\x1bG\x96/\xfa\xf7\xe8S\xa0\xe1\xcbi{\xc6\xd8wJB\x07 Q\xe3~\xaf\xdd\xa3\xb1=3vx\xfa9@\xa2D\xa2\x05\x02\xb8\x00\xa8\xc5=\xfd\x02\x94Lq\'\xb5P\xd4Bj\xa1,J\xd4FR\x12-q\x15#\xde\xfd&\x16\xaa\x00\xc4\xbd\x11\xfe\n\xef\x9c\xcc\xaaBU\xa1\xb0\x14\tR\x90\x94\x9e\x1e\xb1\x90U\x95u2\xf3\xe4\xd9\xf2\xe4/\x0f\xfd\xee\xe8\xbf\x1e\xf9\xe6\xbb\xe3\x1d\xa6\x9elo<x\xe0\x10\xfe1\xc5#\x89\xee\xc3?\xf6\x98\xa2\xb1\xf4\xe1x6m\xea\x8aG2\x99\xc3]\xf1\x18\x97\xc8Z\x12\xc9\xbff\xf0A.\x12\x85?\xbd\\6b\xea\xea\x89\xa43\\\xf6\xf0\xbf\x7fs\xcc\xe2\x87\xc2l,

Output (from a page that decodes successfully):

b'<!DOCTYPE html>\n<html lang=zh dir=ltr class=client-nojs>\n<head>\n<meta charset=UTF-8>\n<title>\xe6\x80\xaa\xe6\xb2\xb3\xe9\xa9\xac - \xe7\xa5\x9e\xe5\xa5\x87\xe5\xae\x9d\xe8\xb4\x9d\xe7\x99\xbe\xe7\xa7\x91</title>\n<script>document.documentElement.className = document.documentElement.className.replace( /(^|\\s)client-nojs(\\s|$)/, "$1client-js$2" );</script>\n<script>window.RLQ = window.RLQ || []; window.RLQ.push( function () {\nmw.config.set({"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"\xe6\x80\xaa\xe6\xb2\xb3\xe9\xa9\xac","wgTitle":"\xe6\x80\xaa\xe6\xb2\xb3\xe9\xa9\xac","wgCurRevisionId":651454,"wgRevisionId":651454,"wgArticleId":602,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["\xe6\x8b\xa5\xe6\x9c\x89\xe6\xb2\x99\xe6\xb5\x81\xe7\x89\xb9\xe6\x80\xa7\xe7\x9a\x84\xe7\xa5\x9e\xe5\xa5\x87\xe5\xae\x9d\xe8\xb4\x9d","\xe7\xa5\x9e\xe5\xa5\xa5\xe5\x9c\xb0\xe6\x96\xb9\xe5\xae\x9d\xe5\x8f\xaf\xe6\xa2\xa6","\xe5\x8d\xa1\xe6\xb4\x9b\xe6\x96\xaf\xe5\x9c\xb0\xe6\x96\xb9\xe5\xae\x9d\xe5\x8f\xaf\xe6\xa2\xa6",

I don't understand why some pages come back as readable bytes like <!DOCTYPE html>\n<html lang=zh dir=ltr class=client-nojs>\n<head>\n<meta charset=UTF-8>, while others come back entirely as \xkk-style bytes.
I've been stuck on this all morning. My guess is that something about the byte count makes some pages undecodable? Hoping an expert can clear this up.


All replies (2)
洪涛

The part that fails doesn't look like any text encoding; maybe it's binary data?

迷茫

The returned data is gzip-compressed; you need to decompress it before decoding. Also, since you already know how to use BeautifulSoup for crawlers, why not use requests instead of urllib? requests decompresses gzipped responses for you.
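A minimal sketch of that fix, assuming you stay with urllib as in the question (the helper name decode_body is mine, not from the original code). The clue is in the failed output itself: the first two bytes, \x1f\x8b, are the gzip magic number.

```python
import gzip

def decode_body(data: bytes) -> str:
    """Decompress a gzip-compressed HTTP body if needed, then decode as UTF-8."""
    # gzip streams begin with the magic bytes 0x1f 0x8b -- exactly the
    # \x1f\x8b at the start of the "failed" output above.
    if data[:2] == b'\x1f\x8b':
        data = gzip.decompress(data)
    return data.decode('utf-8')
```

With the question's code, you would call print(decode_body(res.read())) instead of data.decode('utf-8'). The more robust check is the response's Content-Encoding header rather than sniffing magic bytes, but the sniff matches what the pasted output shows.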
