I recently started working with Python scripts and ran into garbled Chinese characters almost immediately.
Drawing on information I found online, here is a summary:
Python represents text internally as unicode. Encoding conversion therefore usually goes through unicode as the intermediate form: first decode the string from its current encoding to unicode, then encode from unicode to the target encoding.
decode converts a string in some other encoding to unicode. For example, str1.decode('gb2312') converts the gb2312-encoded string str1 to unicode.
encode converts unicode to a string in some other encoding. For example, str2.encode('gb2312') converts the unicode string str2 to gb2312.
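The decode-then-encode round trip described above can be sketched as follows. This is only an illustration: the helper name transcode is hypothetical, and the snippet is written so that it should run on both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
# Sketch: convert gb2312 bytes to utf-8 bytes by going through unicode.

def transcode(data, source_encoding, target_encoding):
    """Decode bytes from source_encoding, then encode them to target_encoding."""
    text = data.decode(source_encoding)    # bytes -> unicode
    return text.encode(target_encoding)    # unicode -> bytes

gb_bytes = b"\xd6\xd0\xce\xc4"             # "中文" encoded as gb2312
utf8_bytes = transcode(gb_bytes, "gb2312", "utf-8")
print(repr(utf8_bytes))                    # shows the utf-8 byte sequence
```

The same helper works in the other direction, for example transcode(utf8_bytes, "utf-8", "gb2312").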
If a string is already unicode, decoding it can raise an error (Python 2 first implicitly encodes it with the default ascii codec), so it is usually necessary to check whether it is unicode first:
isinstance(s, unicode)  # True if s is already a unicode object
Similarly, calling encode on a str that is not unicode can raise an error, because Python 2 first implicitly decodes it with the default ascii codec.
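One way to apply this check is a small guard that decodes only when it has to. This is a sketch: to_unicode and text_type are hypothetical names, and the default of utf-8 is an assumption rather than something the check itself requires. The type(u"") trick lets the same code run on Python 2.7 (where it yields unicode) and Python 3 (where it yields str).

```python
# -*- coding: utf-8 -*-
# Sketch: return unicode text whether the input is unicode or a byte string.

text_type = type(u"")    # unicode on Python 2, str on Python 3

def to_unicode(value, encoding="utf-8"):
    """Return value as unicode, decoding only if it is a byte string."""
    if isinstance(value, text_type):
        return value                   # already unicode: do not decode again
    return value.decode(encoding)      # byte string: decode with the given encoding

# Both inputs end up as the same unicode string.
print(to_unicode(u"中文") == to_unicode(b"\xe4\xb8\xad\xe6\x96\x87"))
```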
How do you obtain Python's default encoding?
#!/usr/bin/python
#coding=utf-8
import sys
print sys.getdefaultencoding()
On English Windows XP, this program outputs: ascii
In some IDEs, string output always appears garbled, or even raises an error. This is because the IDE's output console cannot display the string's encoding, not because of a problem in the program itself.
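To see which encoding a given console expects, it can help to print the interpreter's default encoding next to the encoding reported by the output stream. This is a sketch: describe_encodings is a hypothetical helper, and getattr is used because some IDE consoles may replace sys.stdout with an object that has no encoding attribute.

```python
import sys

def describe_encodings():
    """Return (default_encoding, stdout_encoding); the second may be None."""
    default_encoding = sys.getdefaultencoding()
    # Fall back to None when the console object does not report an encoding.
    stdout_encoding = getattr(sys.stdout, "encoding", None)
    return default_encoding, stdout_encoding

print(describe_encodings())
```

If the two values differ, or the second is None, explicitly encoding a string before printing is the safer option.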
If you run the following code in UliPad:
s = u"中文"  # a unicode string
print s
the result is: UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128). This is because UliPad's console output window on English Windows XP uses ascii encoding, and the ascii codec cannot represent Chinese characters.
Change the last line to: print s.encode('gb2312')
and "中文" is output correctly.
If the last line is instead changed to: print s.encode('utf8')
then the output is: \xe4\xb8\xad\xe6\x96\x87, which is the result of the console output window displaying the utf8-encoded string as ascii.
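The byte sequences behind this behaviour can be inspected without relying on any particular console. This is a sketch that should run on both Python 2.7 and Python 3; it prints only the repr of the encoded bytes, so the output stays within ascii.

```python
# -*- coding: utf-8 -*-
# Sketch: compare how the same unicode string encodes under different codecs.

s = u"中文"

print(repr(s.encode("utf-8")))     # six bytes: e4 b8 ad e6 96 87
print(repr(s.encode("gb2312")))    # four bytes: d6 d0 ce c4

try:
    s.encode("ascii")              # ascii has no code for these characters
except UnicodeEncodeError as error:
    # The same kind of failure an ascii-only console triggers on print.
    print(error)
```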
unicode(str, 'gb2312') is equivalent to str.decode('gb2312'): both convert the gb2312-encoded str to unicode.
Use str.__class__ to check whether str is a str or a unicode object.
After all that theory, here is the code:
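#!/usr/bin/python
#coding=utf-8
s = "中文"  # a utf-8 byte string, assuming this source file is saved as utf-8
if isinstance(s, unicode):
    # Already unicode: encode directly
    print s.encode('gb2312')
else:
    # Byte string: decode to unicode first, then encode
    print s.decode('utf-8').encode('gb2312')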