Python: encoding problem when scraping a web page
大家讲道理 2017-04-18 09:26:01

When I scrape ifeng.com (凤凰网), I get this error:
UnicodeEncodeError: 'gbk' codec can't encode character '\xa0' in position 151120: illegal multibyte sequence

Here is my code:

__author__ = 'my'
import urllib.request
url = 'http://www.ifeng.com/'
req = urllib.request.urlopen(url)
req = req.read()
req = req.decode('utf-8')
print(req)

Why does the error mention GBK when I am decoding as UTF-8?


All replies (5)
Peter_Zhu

This is a limitation of cmd.exe; other software, such as Notepad or a browser, can display the text correctly. For example, save the page to a file and open it in Notepad:

import urllib.request
import os
url = 'http://www.ifeng.com/'
rsp = urllib.request.urlopen(url)
body = rsp.read()
html = r'C:\ifeng.html' # file path; change it to whatever you like
with open(html, 'wb') as w:
    w.write(body) # write the raw bytes directly; no decoding needed
os.popen('notepad.exe ' + html) # open it in Notepad to see the content
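
To see why a GBK error can appear even though the decode step uses UTF-8, here is a minimal sketch that reproduces the failure without any network request. It assumes the console encoding is GBK, as on a Chinese-language Windows cmd.exe; the helper name `try_encode` is made up for illustration.

```python
def try_encode(text, encoding):
    """Return None if text can be encoded, else the error message."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError as exc:
        return str(exc)
    return None


# Decoding the page bytes as UTF-8 works; the result is an ordinary str.
page = b'ifeng\xc2\xa0news'.decode('utf-8')  # contains U+00A0 (no-break space)

# print() has to encode that str again using the console's encoding.
# With a GBK console, this second step is the one that raises the error.
print(try_encode(page, 'gbk'))
print(try_encode(page, 'utf-8'))
```

The decode succeeds in both cases; only the re-encoding for output depends on the console.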

Added:
You can also switch the cmd.exe code page to UTF-8 (cp65001).
Steps:
1. Run cmd.exe
2. Run chcp 65001
3. Change the font in the window properties: right-click the CMD window title bar, select "Properties" -> "Font", and choose the TrueType font "Lucida Console"
4. Run python

Contents of x.py:

import urllib.request

url = 'http://www.ifeng.com/'
rsp = urllib.request.urlopen(url)
body = rsp.read()
html = body.decode('utf-8')
print(html[:500]) # the first 500 characters
#print(html) # or print everything and check whether an error occurs
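
After changing the code page, it can help to confirm which encoding Python is actually using for its output stream. A small sketch follows; `console_can_show` is a hypothetical helper name, and the fallback to 'ascii' for streams that report no encoding is an assumption.

```python
import io
import sys


def console_can_show(text, stream=None):
    """Report whether text can be encoded with the stream's encoding."""
    stream = sys.stdout if stream is None else stream
    encoding = getattr(stream, 'encoding', None) or 'ascii'
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


# Stand-ins for a GBK console and a UTF-8 console, so the check can be
# tried without touching the real terminal settings.
gbk_console = io.TextIOWrapper(io.BytesIO(), encoding='gbk')
utf8_console = io.TextIOWrapper(io.BytesIO(), encoding='utf-8')

# The encoding Python chose for this session's output, if it reports one.
print(getattr(sys.stdout, 'encoding', None))
print(console_can_show('\xa0', gbk_console))
print(console_can_show('\xa0', utf8_console))
```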
洪涛

I pasted the code from the question into PyCharm and the problem did not occur. When I typed it line by line at the Windows command prompt, it did. The Windows command prompt uses GBK, while the web page itself is encoded in UTF-8. To run it from the command line, write:

__author__ = 'my'
import urllib.request
url = 'http://www.ifeng.com/'
req = urllib.request.urlopen(url)
req = req.read()
req = req.decode('utf-8')  # the page is UTF-8, so decode it as UTF-8
# drop characters that GBK cannot represent (such as '\xa0') before printing
print(req.encode('gbk', 'ignore').decode('gbk'))

To explain req.encode('gbk', 'ignore').decode('gbk'): the Windows command prompt can only display text that can be encoded as GBK, and a UTF-8 page may contain characters that GBK lacks. The second argument, 'ignore', discards the characters that cannot be encoded. Decoding the raw bytes with req.decode('gbk', 'ignore') would also silence the error, but it misreads the UTF-8 bytes and garbles the text.
As an aside, the same issue can come up with the requests library, whose response text is already a str rather than bytes. There, too, str.encode(encoding, 'ignore').decode(encoding) solves similar problems.
If you don’t understand, you can read this blog of mine
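
The encode-then-decode round trip mentioned above can be wrapped in a small helper. This is only a sketch: the name `console_safe` is hypothetical, and falling back to UTF-8 when the output stream reports no encoding is an assumption.

```python
import sys


def console_safe(text, encoding=None, errors='ignore'):
    """Round-trip text through encoding, dropping what it cannot hold."""
    if encoding is None:
        # Fall back to UTF-8 if the stream does not report an encoding.
        encoding = getattr(sys.stdout, 'encoding', None) or 'utf-8'
    return text.encode(encoding, errors).decode(encoding)


sample = 'ifeng\xa0news'
print(console_safe(sample, 'gbk'))             # no-break space removed
print(console_safe(sample, 'gbk', 'replace'))  # no-break space becomes '?'
```

Passing 'replace' instead of 'ignore' leaves a visible '?' where a character was lost, which can make missing text easier to spot.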

To answer the asker's follow-up: some web pages print without errors, possibly because they are GBK-encoded, or because their text contains only characters that GBK can also represent.
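
Since pages differ in encoding, one option is to read the charset the server declares rather than hard-coding 'utf-8'. The sketch below uses `get_content_charset()` on the response headers; the helper names `pick_charset` and `fetch_text` are hypothetical, and it assumes the server states the charset in its Content-Type header (some sites declare it only inside the HTML).

```python
from email.message import Message
import urllib.request


def pick_charset(headers, default='utf-8'):
    """Return the charset named in Content-Type, or default if absent."""
    return headers.get_content_charset() or default


def fetch_text(url, default='utf-8'):
    """Download url and decode it with the charset the server declares."""
    with urllib.request.urlopen(url) as rsp:
        charset = pick_charset(rsp.headers, default)
        # 'replace' keeps going if a few bytes do not match the charset.
        return rsp.read().decode(charset, 'replace')


# Offline demonstration with hand-built headers instead of a real response.
gbk_headers = Message()
gbk_headers['Content-Type'] = 'text/html; charset=GBK'
bare_headers = Message()
bare_headers['Content-Type'] = 'text/html'

print(pick_charset(gbk_headers))
print(pick_charset(bare_headers))
```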

大家讲道理

Your system's default encoding is probably GBK. You can try this:

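# Python 2 only: reload() is not a builtin in Python 3, and
# sys.setdefaultencoding() does not exist there.
import sys
reload(sys)
sys.setdefaultencoding('utf-8')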
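
The snippet above relies on Python 2 APIs, while the question's code uses urllib.request from Python 3. On Python 3.7 or later, a roughly comparable effect can be sketched with `reconfigure()` on the output stream. The helper name `make_tolerant` is hypothetical, and replacing unencodable characters with '?' is one possible policy, not the only one.

```python
import io
import sys


def make_tolerant(stream):
    """Make stream replace unencodable characters instead of raising."""
    # reconfigure() exists on io.TextIOWrapper from Python 3.7 onward.
    if hasattr(stream, 'reconfigure'):
        stream.reconfigure(errors='replace')
    return stream


# Stand-in for a GBK console so the effect can be observed offline.
raw = io.BytesIO()
fake_console = make_tolerant(io.TextIOWrapper(raw, encoding='gbk'))
fake_console.write('ifeng\xa0news')
fake_console.flush()
print(raw.getvalue())

# For real output, call this once near the top of the script:
make_tolerant(sys.stdout)
```

This keeps the console's own encoding and only changes how unencodable characters are handled, so nothing needs to change in the console settings.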
Ty80

Are you running it in the Windows console? The console's default encoding is GBK.
There is no problem if you use the interpreter that comes with Python, or another tool instead of the console.

巴扎黑

# _*_ coding: utf-8 _*_
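The line above specifies the encoding of the source file.

# Python 2 only: reload() is not a builtin in Python 3, and
# sys.setdefaultencoding() does not exist there.
import sys
reload(sys)
sys.setdefaultencoding('utf-8')

This declares the default encoding used by your program.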
