About Chinese encoding issues in Python
1. Chinese encoding issues in python
1.1 Encoding in .py files
Python script files default to ASCII encoding. When a file contains characters outside the ASCII range, an encoding declaration is required to fix it. If a .py module contains Chinese characters (strictly speaking, any non-ASCII characters), you need to put an encoding declaration on the first or second line:
# -*- coding: utf-8 -*- or # coding=utf-8. Other encodings such as gbk and gb2312 are also acceptable. Without the declaration, Python reports something like: SyntaxError: Non-ASCII character '\xe4' in file ChineseTest.py on line 1, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details.
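As a minimal sketch (assuming Python 2 and a file saved as UTF-8; the file name chinese_demo.py is made up for illustration), the declaration on the first line is what lets the script below run at all:
# -*- coding: utf-8 -*-
# chinese_demo.py: without the line above, Python 2 rejects this file with the SyntaxError shown earlier
print '中文'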
1.2 Encoding and decoding in python
First, a word about string types in python. There are two: str and unicode, and both derive from basestring. A str is a sequence of (at least 8-bit) bytes, while each element of a unicode object is a Unicode character; so:
the value of len(u'中国') is 2; the value of len('ab') is also 2;
The documentation for str says: "The string data type is also used to represent arrays of bytes, e.g., to hold data read from a file." In other words, content read from a file or from the network is held in a str object. To convert a str into a specific encoding, first decode the str into unicode, and then encode that unicode into the target encoding, such as utf-8, gb2312, etc.
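For instance, here is a minimal sketch of that two-step str -> unicode -> target-encoding conversion (assuming Python 2 and a hypothetical file data.txt known to be UTF-8 encoded):
# -*- coding: utf-8 -*-
raw = open('data.txt').read()       # raw is a str, i.e. a sequence of bytes
text = raw.decode('utf-8')          # str -> unicode
gb = text.encode('gb2312')          # unicode -> gb2312 bytes; raises UnicodeEncodeError for characters outside gb2312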
Conversion functions provided in python:
unicode to gb2312, utf-8, etc.
# -*- coding=UTF-8 -*-
if __name__ == '__main__':
    s = u'中国'
    s_gb = s.encode('gb2312')
From utf-8, GBK, etc. to unicode, use the function unicode(s, encoding) or s.decode(encoding):
# -*- coding=UTF-8 -*-
if __name__ == '__main__':
    s = u'中国'
    # s is unicode; first convert it to utf-8
    s_utf8 = s.encode('UTF-8')
    assert(s_utf8.decode('utf-8') == s)
Convert ordinary str to unicode
# -*- coding=UTF-8 -*-
if __name__ == '__main__':
    s = '中国'
    su = u'中国'
    # the bytes of s are utf-8, because the .py file containing s is declared as utf-8 (# -*- coding=UTF-8 -*-)
    s_unicode = s.decode('UTF-8')
    assert(s_unicode == su)
    # to convert s to gb2312, first decode to unicode, then encode to gb2312
    s.decode('utf-8').encode('gb2312')
# -*- coding=UTF-8 -*-
if __name__ == '__main__':
    s = '中国'
    # what happens if we call s.encode('gb2312') directly?
    s.encode('gb2312')
An exception will occur here:
Python will first automatically decode s to unicode and then encode it to gb2312. Because the decoding is done automatically and we did not specify a codec, Python uses the default encoding (the value returned by sys.getdefaultencoding()). In many setups this default is ASCII, so an error occurs whenever s is not actually ASCII.
In the case above, my default encoding is ascii, while s is encoded the same way as the source file, namely utf-8, so the error is: UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128)
In this case, we have two ways to correct the error:
One is to state the encoding of s explicitly:
#! /usr/bin/env python
# -*- coding: utf-8 -*-
s = '中文'
s.decode('utf-8').encode('gb2312')
The second is to change the default encoding to match the file encoding:
#! /usr/bin/env python
# -*- coding: utf-8 -*-
import sys
reload(sys)  # Python 2.5 removes sys.setdefaultencoding after initialization, so we must reload sys to get it back
sys.setdefaultencoding('utf-8')
str = '中文'
str.encode('gb2312')
File encoding and print function
Create a file test.txt in ANSI format with the content:
abc中文
Read it with python:
# coding=gbk
print open("Test.txt").read()
Result: abc中文
Change the file format to UTF-8:
Result: abc涓枃
Obviously, decoding is needed here:
# coding=gbk
import codecs
print open("Test.txt").read().decode("utf-8")
Result: abc中文
I edited the above test.txt with EditPlus, but when I edited it with Windows' built-in Notepad and saved it in UTF-8 format, running the script produced an error:
Traceback (most recent call last):
  File "ChineseTest.py", line 3, in <module>
    print open("Test.txt").read().decode("utf-8")
UnicodeEncodeError: 'gbk' codec can't encode character u'\ufeff' in position 0: illegal multibyte sequence
It turns out that some software, such as Notepad, inserts three invisible bytes (0xEF 0xBB 0xBF, i.e. the BOM) at the beginning of a file when saving it as UTF-8.
So we need to strip these bytes ourselves when reading. Python's codecs module defines this constant:
# coding=gbk
import codecs
data = open("Test.txt").read()
if data[:3] == codecs.BOM_UTF8:
    data = data[3:]
print data.decode("utf-8")
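Alternatively, as a sketch that departs from the article's approach (it assumes Python 2.5+, where the 'utf-8-sig' codec is available), the BOM can be stripped automatically by decoding with 'utf-8-sig', which behaves like 'utf-8' but drops a leading BOM if present:
# coding=gbk
data = open("Test.txt").read()
print data.decode("utf-8-sig")  # same as 'utf-8', except a leading BOM is silently removed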
Result: abc中文
(4) Some remaining issues
In the second part we used the unicode function and the decode method to convert str into unicode. Why do those two calls take "gbk" as their argument?
The first reaction is that it is because our coding declaration uses gbk (# coding=gbk), but is that really the reason?
Modify the source file:
# coding=utf-8 s = "中文" print unicode(s, "utf-8")
Run it, and an error appears:
Traceback (most recent call last):
  File "ChineseTest.py", line 3, in <module>
    s = unicode(s, "utf-8")
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 0-1: invalid data
Apparently, if the earlier case worked because gbk was used on both sides, then keeping utf-8 on both sides here should work just as well and raise no error — yet it does.
Going one step further, what if we still use gbk for the conversion here:
# coding=utf-8 s = "中文" print unicode(s, "gbk")
Result: 中文
How print works in python:
When Python executes a print statement, it simply passes the output to the operating system (using fwrite() or something like it), and some other program is responsible for actually displaying that output on the screen. For example, on Windows, it might be the Windows console subsystem that displays the result. Or if you're using Windows and running Python on a Unix box somewhere else, your Windows SSH client is actually responsible for displaying the data. If you are running Python in an xterm on Unix, then xterm and your X server handle the display.
To print data reliably, you must know the encoding that this display program expects.
Simply put, print in Python passes the string directly to the operating system, so you need to decode the str into a form consistent with what the operating system expects. Windows uses CP936 (almost identical to gbk), which is why gbk can be used here.
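As a rough, more portable sketch (not from the original article; it assumes Python 2 with stdout attached to a console), you can ask Python which encoding the console reports instead of hard-coding cp936:
# -*- coding: utf-8 -*-
import sys

u = u'中文'
# sys.stdout.encoding is set when stdout is a real console; it may be None when output is redirected
console_encoding = sys.stdout.encoding or 'utf-8'
print u.encode(console_encoding, 'replace')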
A final test:
# coding=utf-8 s = "中文" rint unicode(s, "cp936") # 结果:中文
This also explains why the following outputs are inconsistent:
>>> s="哈哈" >>> s' \xe5\x93\x88\xe5\x93\x88' >>> print s #这里为啥就可以呢? 见上文对print的解释 哈哈>>> import sys >>> sys.getdefaultencoding() ' ascii' >>> print s.encode('utf8') # s在encode之前系统默认按ascii模式把s解码为unicode,然后再encode为utf8 Traceback (most recent call last): File "<stdin>", line 1, in ? UnicodeDecodeError: 'ascii' codec can't decode byte 0xe5 in position 0: ordinal not in range(128) >>> print s.decode('utf-8').encode('utf8') 哈哈 >>>
Encoding detection
With chardet it is easy to detect the encoding of a string or a file.
For example:
>>> import urllib
>>> rawdata = urllib.urlopen('http://www.google.cn/').read()
>>> import chardet
>>> chardet.detect(rawdata)
{'confidence': 0.98999999999999999, 'encoding': 'GB2312'}
>>>
chardet can be downloaded from http://chardet.feedparser.org/
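A small sketch (assuming Python 2 with chardet installed; the variable names are made up) that feeds the detected encoding straight into decode:
import urllib
import chardet

rawdata = urllib.urlopen('http://www.google.cn/').read()
guess = chardet.detect(rawdata)     # e.g. {'confidence': 0.99, 'encoding': 'GB2312'}
# fall back to utf-8 if detection returns None, and tolerate stray bytes
text = rawdata.decode(guess['encoding'] or 'utf-8', 'replace')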
A special tip:
At work you will often read a file, or fetch a page, that certainly looks like gb2312-encoded text, yet decode keeps failing on it. In that case, try decoding with the 'gb18030' character set instead. If that still fails, remember that decode takes a second parameter. For example, to convert a String object s from gbk to UTF-8, you can write:
s.decode('gbk').encode('utf-8')
In real development, however, I found that this approach often raises an exception:
UnicodeDecodeError: 'gbk' codec can't decode bytes in position 30664-30665: illegal multibyte sequence
This happens when an illegal character is encountered. In particular, in some programs written in C/C++ the full-width space is produced in several different ways, such as \xa3\xa0 or \xa4\x57. These all look like full-width spaces, but they are not "legal" full-width spaces (the real one is \xa1\xa1), so the conversion raises an exception.
Problems like this are a real headache: as soon as a single illegal character appears in a string, the whole string — sometimes an entire article — can no longer be converted.
The solution:
s.decode('gbk', 'ignore').encode('utf-8')
The signature of decode is decode([encoding], [errors='strict']), so the second parameter controls the error-handling strategy. The default, 'strict', raises an exception when an illegal character is encountered;
if it is set to 'ignore', illegal characters are skipped;
if it is set to 'replace', illegal characters are replaced with a placeholder (? / the replacement character);
if it is set to 'xmlcharrefreplace', XML character references are used (see the sketch after this list).
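A quick sketch of these error handlers (assuming Python 2; the byte string is a made-up example: the gbk bytes for 中文 followed by a deliberately truncated lead byte):
# -*- coding: utf-8 -*-
s = '\xd6\xd0\xce\xc4' + '\xd6'               # gbk for 中文, then an incomplete multibyte sequence
# s.decode('gbk')                             # 'strict' (the default) raises UnicodeDecodeError here
print repr(s.decode('gbk', 'ignore'))         # the offending byte is dropped
print repr(s.decode('gbk', 'replace'))        # the offending byte becomes u'\ufffd'
# 'xmlcharrefreplace' applies when encoding: unmappable characters become XML character references
print u'中文'.encode('ascii', 'xmlcharrefreplace')   # -> '&#20013;&#25991;'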
The Python documentation:
decode([encoding[, errors]])
Decodes the string using the codec registered for encoding. encoding defaults to the default string encoding. errors may be given to set a different error handling scheme. The default is 'strict', meaning that encoding errors raise UnicodeError. Other possible values are 'ignore', 'replace' and any other name registered via codecs.register_error, see section 4.8.1.
For the full original, see: http://www.jb51.net/article/16104.htm