python中的编码知识整理汇总
问题
在平时工作中,遇到了这样的错误:
UnicodeDecodeError: 'ascii' codec can't decode byte
想必大家也都碰到过,很常见 。于是决定对python的编码做一个整理和学习。
基础知识
在python2.x中,有两种数据类型,unicode和str,这两个都是basestring的子类
>>> a = '中' >>> type(a) <type 'str'> >>> isinstance(a,basestring) True >>> a = u'中' >>> type(a) <type 'unicode'> >>> isinstance(a,basestring) True
两者的区别,概括来讲,str是字节串,由unicode经过编码(encode)后的字节组成的(好比与python3.x的byte);unicode是对象,才是真正意义上的字符串,由字符组成
>>> a='中文' >>> len(a) 6 >>> repr(a) "'\\xe4\\xb8\\xad\\xe6\\x96\\x87'" >>> b=u'中文' >>> len(b) 2 >>> repr(b) "u'\\u4e2d\\u6587'"
控制台和脚本
在linux下的python控制台执行以下命令,所得的结果和执行脚本是不同的
>>> a = u'中文' >>> repr(a) "u'\\xe4\\xb8\\xad\\xe6\\x96\\x87'" >>> b = unicode('中文','utf-8')b) >>> repr(b) "u'\\u4e2d\\u6587'"
可以看到,u'中文'初始化的对象a不是我们所期望的,那究竟是什么原因呢?
将python看成是一根管子,管子里头处理的中间过程都是使用unicode的。入口处,全部转成unicode;出口处,再转成目标编码(当然,有例外,处理逻辑中要用到具体编码的情况)。
在控制台执行命令a = u'中文',可以将解释为命令,a = ‘中文'.decode(encode),从而到到unicode对象a。那么这里的encode是什么呢?对于控制台来说,就是标准输入,即sys.stdin.encoding
>>> sys.stdin.encoding 'ISO-8859-1'
我的这边控制台默认的编码是ISO-8859-1,故a = u'中文' <=> a = '中文'.decode('ISO-8859-1')
这里的'中文'是控制台理解的,即使根据终端编码方式编码后的字节码,对于utf-8编码的终端,'中文'='\\xe4\\xb8\\xad\\xe6\\x96\\x87'
>>> a='中文'.decode('ISO-8859-1') >>> repr(a) "u'\\xe4\\xb8\\xad\\xe6\\x96\\x87'"
那如何修改此编码值呢,设置为什么呢?在linux环境中设置环境变量方法如下,具体设置什么只要与终端编码方式一直即可
export PYTHONIOENCODING=UTF-8
总结
重新回到最初的那个问题,造成问题的原因是没有搞清楚unicode和str的区别,将两者进行了混用。
>>> a = '中文' >>> a.encode('gbk') Traceback (most recent call last): File "<stdin>", line 1, in <module> UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128)
以上的对象a其实是str,即字节码,若终端是utf-8编码的话,那么a就是用utf-8 encode的字节码。a.encode('gbk') 等价于a.decode(encoding).encode('gbk'),即先将字节码解码为unicode字符,然后再encode为字节码。unicode对象作为中转站。那么这里的encoding是什么呢?
>>> import sys >>> sys.getdefaultencoding() 'ascii'
默认是ascii,这正是错误为什么报无法用ascii解码的原因
>>> reload(sys) <module 'sys' (built-in)> >>> sys.setdefaultencoding('utf-8') >>> a = '中文' >>> repr(a) "'\\xe4\\xb8\\xad\\xe6\\x96\\x87'" >>> a.encode('gbk') '\xd6\xd0\xce\xc4'
将默认编码改为utf-8,即可。不鼓励对str使用encode方法,因为其中隐式对str进行了解码。decode只对str,encode只对unicode,一切decode/encode都显示指定编码方式。

Hot AI Tools

Undresser.AI Undress
AI-powered app for creating realistic nude photos

AI Clothes Remover
Online AI tool for removing clothes from photos.

Undress AI Tool
Undress images for free

Clothoff.io
AI clothes remover

AI Hentai Generator
Generate AI Hentai for free.

Hot Article

Hot Tools

Notepad++7.3.1
Easy-to-use and free code editor

SublimeText3 Chinese version
Chinese version, very easy to use

Zend Studio 13.0.1
Powerful PHP integrated development environment

Dreamweaver CS6
Visual web development tools

SublimeText3 Mac version
God-level code editing software (SublimeText3)

Hot Topics

This tutorial demonstrates how to use Python to process the statistical concept of Zipf's law and demonstrates the efficiency of Python's reading and sorting large text files when processing the law. You may be wondering what the term Zipf distribution means. To understand this term, we first need to define Zipf's law. Don't worry, I'll try to simplify the instructions. Zipf's Law Zipf's law simply means: in a large natural language corpus, the most frequently occurring words appear about twice as frequently as the second frequent words, three times as the third frequent words, four times as the fourth frequent words, and so on. Let's look at an example. If you look at the Brown corpus in American English, you will notice that the most frequent word is "th

This article explains how to use Beautiful Soup, a Python library, to parse HTML. It details common methods like find(), find_all(), select(), and get_text() for data extraction, handling of diverse HTML structures and errors, and alternatives (Sel

This article compares TensorFlow and PyTorch for deep learning. It details the steps involved: data preparation, model building, training, evaluation, and deployment. Key differences between the frameworks, particularly regarding computational grap

Python's statistics module provides powerful data statistical analysis capabilities to help us quickly understand the overall characteristics of data, such as biostatistics and business analysis. Instead of looking at data points one by one, just look at statistics such as mean or variance to discover trends and features in the original data that may be ignored, and compare large datasets more easily and effectively. This tutorial will explain how to calculate the mean and measure the degree of dispersion of the dataset. Unless otherwise stated, all functions in this module support the calculation of the mean() function instead of simply summing the average. Floating point numbers can also be used. import random import statistics from fracti

Serialization and deserialization of Python objects are key aspects of any non-trivial program. If you save something to a Python file, you do object serialization and deserialization if you read the configuration file, or if you respond to an HTTP request. In a sense, serialization and deserialization are the most boring things in the world. Who cares about all these formats and protocols? You want to persist or stream some Python objects and retrieve them in full at a later time. This is a great way to see the world on a conceptual level. However, on a practical level, the serialization scheme, format or protocol you choose may determine the speed, security, freedom of maintenance status, and other aspects of the program

The article discusses popular Python libraries like NumPy, Pandas, Matplotlib, Scikit-learn, TensorFlow, Django, Flask, and Requests, detailing their uses in scientific computing, data analysis, visualization, machine learning, web development, and H

In this tutorial you'll learn how to handle error conditions in Python from a whole system point of view. Error handling is a critical aspect of design, and it crosses from the lowest levels (sometimes the hardware) all the way to the end users. If y

This tutorial builds upon the previous introduction to Beautiful Soup, focusing on DOM manipulation beyond simple tree navigation. We'll explore efficient search methods and techniques for modifying HTML structure. One common DOM search method is ex
