Python and Encoding

黄舟
Release: 2016-12-16 11:34:24

Text objects in Python

The objects that handle text in Python 3.x include str, bytes, and bytearray.

bytes and bytearray support almost all str methods except those used for formatting (format, format_map) and a few that only make sense for Unicode text (casefold, isdecimal, isidentifier, isnumeric, isprintable, encode).
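As a quick illustration, most familiar str-style methods also work on bytes (the specific byte strings below are just made-up examples):

```python
# bytes supports most str-style methods; they operate on bytes, not text
b = b'hello world'
print(b.upper())                       # b'HELLO WORLD'
print(b.replace(b'world', b'python'))  # b'hello python'
print(b.endswith(b'world'))            # True

# ...but the Unicode-only methods are absent on bytes
print(hasattr(b, 'casefold'), hasattr('abc', 'casefold'))  # False True
```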

bytes has a class method, fromhex, that builds a bytes object from a string of hexadecimal digits; str has no such method.

>>> b = bytes.fromhex('E4 B8 AD')
>>> b
b'\xe4\xb8\xad'
>>> b.decode('utf-8')
'中'
>>> str(b)
"b'\\xe4\\xb8\\xad'"

Unicode and character conversion

Use chr to convert a Unicode code point into a character; ord performs the reverse operation.

>>> ord('A')
65
>>> ord('中')
20013
>>> chr(65)
'A'
>>> '中'.encode('utf-8')
b'\xe4\xb8\xad'
>>> len('中'.encode('utf-8')) #length of the bytes object: 3 integer elements
3

Python and encoding

The way Python handles encoding internally

When Python accepts our input, it should be converted to Unicode first, and the earlier this happens, the better.
All processing is then performed on Unicode str objects; no encoding conversion should happen during this stage.
When Python returns results to us, the Unicode text is converted to the encoding we need, and the later this happens, the better.
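These three rules are often summarized as the "Unicode sandwich"; a minimal sketch (the byte string here is just an assumed input):

```python
# the "Unicode sandwich": decode at the edges, process str in the middle
raw = '中文,test'.encode('utf-8')  # pretend these bytes arrived from a file or socket

text = raw.decode('utf-8')         # 1. decode to str as early as possible
processed = text.upper()           # 2. all processing is done on str
out = processed.encode('utf-8')    # 3. encode back to bytes as late as possible

print(processed)                   # 中文,TEST
```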

Python source code encoding method

Python uses utf-8 encoding by default.

If we want to save Python source code in a different encoding, we can place an encoding declaration on the first line of the file, or the second line if the first is taken by a hash-bang:

# -*- coding: windows-1252 -*-

Encodings used by Python

C:\Users\JL> chcp #find the code page used by the operating system

Active code page: 936

>>> import sys, locale
>>> locale.getpreferredencoding() #This is the most important setting
'cp936'
>>> my_file = open('cafe.txt','r')
>>> type(my_file)
<class '_io.TextIOWrapper'>
>>> my_file.encoding #file objects default to the value of locale.getpreferredencoding()
'cp936'
>>> sys.stdout.isatty(), sys.stdin.isatty(), sys.stderr.isatty() #whether each stream is attached to a console
(True, True, True)
>>> sys.stdout.encoding, sys.stdin.encoding, sys.stderr.encoding #if a standard stream is redirected to a file, its encoding comes from the environment variable PYTHONIOENCODING, then the console encoding, then locale.getpreferredencoding(), in decreasing priority
('cp936', 'cp936', 'cp936')
>>> sys.getdefaultencoding() #used by default when Python converts binary data to str
'utf-8'
>>> sys.getfilesystemencoding() #used by default when Python encodes or decodes file names (not file contents)
'mbcs'


The above are results on Windows. On GNU/Linux or OS X, all of these return UTF-8.
For the difference between mbcs and utf-8, see http://stackoverflow.com/questions/3298569/difference-between-mbcs-and-utf-8-on-windows

Encoding for file reading and writing

>>> open('cafe.txt','w', encoding='utf-8').write('café')
4
>>> fp = open('cafe.txt','r')
>>> fp.read()
'caf茅'
>>> fp.encoding
'cp936'
>>> open('cafe.txt','r', encoding='cp936').read()
'caf茅'
>>> open('cafe.txt','r', encoding='latin1').read()
'cafÃ©'
>>> fp = open('cafe.txt','r', encoding='utf-8')
>>> fp.encoding
'utf-8'

As the example shows, never rely on the default encoding: code that works on one machine will produce garbage or errors on another.

How Python handles troubles from Unicode

Python always compares strings by code point, both for ordering and for equality.

Unicode can represent an accented character in two ways: as a single precomposed code point, or as a base letter followed by a combining accent. Unicode treats the two forms as canonically equivalent, but because Python compares code points, they are not equal in Python.

>>> c1 = 'cafe\u0301'
>>> c2 = 'café'
>>> c1 == c2
False
>>> len(c1), len(c2)
(5, 4)

The solution is the normalize function in the unicodedata module. Its first parameter accepts one of four values: 'NFC', 'NFD', 'NFKC', 'NFKD'.
NFC (Normalization Form Canonical Composition): decompose by canonical equivalence, then recompose by canonical equivalence. This yields the shortest equivalent form, so the two code points of 'e\u0301' are composed into the single code point 'é'. (For a singleton, the recomposed result may differ from the original character.)
NFD (Normalization Form Canonical Decomposition): decompose by canonical equivalence.
NFKD (Normalization Form Compatibility Decomposition): decompose by compatibility equivalence.
NFKC (Normalization Form Compatibility Composition): decompose by compatibility equivalence, then recompose by canonical equivalence.
NFKC and NFKD may cause data loss.

>>> from unicodedata import normalize
>>> c3 = normalize('NFC',c1) #compose c1 into the shorter form
>>> len(c3)
4
>>> c3 == c2
True
>>> c4 = normalize('NFD',c2)
>>> len(c4)
5
>>> c4 == c1
True

Western keyboards usually produce the shortest (precomposed) form, i.e. text that is already in NFC, but it is safer to normalize both strings to NFC before comparing them for equality. The W3C also recommends NFC.

Some characters that look identical have two different code points in Unicode; normalize can map one to the other:

>>> from unicodedata import name
>>> o1 = '\u2126'
>>> o2 = '\u03a9'
>>> o1, o2
('Ω', 'Ω')
>>> o1 == o2
False
>>> name(o1), name(o2)
('OHM SIGN', 'GREEK CAPITAL LETTER OMEGA')
>>> o3 = normalize('NFC',o1)
>>> name(o3)
'GREEK CAPITAL LETTER OMEGA'
>>> o3 == o2
True

Another example is

>>> u1 = '\u00b5'
>>> u2 = '\u03bc'
>>> u1, u2
('µ', 'μ')
>>> name(u1), name(u2)
('MICRO SIGN', 'GREEK SMALL LETTER MU')
>>> u3 = normalize('NFKD',u1)
>>> name(u3)
'GREEK SMALL LETTER MU'

Another example

>>> h1 = '\u00bd'
>>> h2 = normalize('NFKC',h1)
>>> h1, h2
('½', '1⁄2')
>>> len(h1), len(h2)
(1, 3)

Sometimes we want case-insensitive comparison. The tool for this is str.casefold(), which maps characters to a folded lowercase form: 'A' becomes 'a', and MICRO SIGN 'µ' becomes GREEK SMALL LETTER MU 'μ'.
In the vast majority (about 98.9%) of cases, str.casefold() and str.lower() give the same result.
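A short demonstration of the difference between lower() and casefold():

```python
from unicodedata import name

micro = '\u00b5'                   # MICRO SIGN
print(name(micro.casefold()))      # GREEK SMALL LETTER MU: casefold() folds it
print(micro.lower() == micro)      # True: lower() leaves it unchanged

eszett = '\u00df'                  # LATIN SMALL LETTER SHARP S
print(eszett.lower(), eszett.casefold())  # ß ss
```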

Text sorting
Because collation rules differ between languages, sorting by code point, as Python does by default, often produces results users do not expect.
locale.strxfrm is usually used as the sort key.

>>> import locale
>>> locale.setlocale(locale.LC_COLLATE, 'pt_BR.UTF-8')
'pt_BR.UTF-8'
>>> fruits = ['caju', 'atemoia', 'cajá', 'açaí', 'acerola']  #a sample list to sort
>>> sorted(fruits, key=locale.strxfrm)
['açaí', 'acerola', 'atemoia', 'cajá', 'caju']

Encoding and decoding errors

A decoding error in Python source code raises a SyntaxError.
Elsewhere, encoding and decoding errors raise UnicodeEncodeError and UnicodeDecodeError respectively.
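Both encode and decode accept an errors= argument that controls what happens instead of raising these exceptions; a short sketch with made-up sample text:

```python
# 'strict' (the default) raises; other handlers substitute or drop characters
city = 'São Paulo'
print(city.encode('ascii', errors='ignore'))            # b'So Paulo'
print(city.encode('ascii', errors='replace'))           # b'S?o Paulo'
print(city.encode('ascii', errors='xmlcharrefreplace')) # b'S&#227;o Paulo'

octets = b'Montr\xe9al'                  # latin1 bytes, invalid as UTF-8
print(octets.decode('utf-8', errors='replace'))         # 'Montr<U+FFFD>al'
```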

Several useful helpers adapted from Fluent Python

from unicodedata import normalize, combining

def nfc_equal(s1, s2):
    '''Return True if s1 equals s2 after NFC normalization.'''
    return normalize('NFC', s1) == normalize('NFC', s2)

def fold_equal(s1, s2):
    '''Return True if s1 equals s2 after NFC normalization and casefold().'''
    return normalize('NFC', s1).casefold() == normalize('NFC', s2).casefold()

def shave_marks(txt):
    '''Remove all diacritic marks.
    Intended mainly to turn Latin text into pure ASCII, but this version
    strips marks from Greek (and other) letters too; shave_latin_marks
    below is more precise.'''

   normal_txt = normalize('NFD',txt)
   shaved = ''.join(c for c in normal_txt if not combining(c))
   return normalize('NFC',shaved)

import string

def shave_latin_marks(txt):
    '''Remove diacritic marks only from Latin base characters.'''
    normal_txt = normalize('NFD', txt)
    keeping = []
    latin_base = False
    for c in normal_txt:
        if combining(c) and latin_base:
            continue    #ignore diacritic marks on a Latin base char
        keeping.append(c)
        #if it's not a combining char, it is a new base char
        if not combining(c):
            latin_base = c in string.ascii_letters
    shaved = ''.join(keeping)
    return normalize('NFC', shaved)
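A quick usage check of shave_marks (redefined here so the snippet runs on its own):

```python
from unicodedata import normalize, combining

def shave_marks(txt):
    '''Remove all diacritic marks (same approach as above).'''
    normal_txt = normalize('NFD', txt)
    shaved = ''.join(c for c in normal_txt if not combining(c))
    return normalize('NFC', shaved)

print(shave_marks('café'))     # cafe
print(shave_marks('Ζέφυρος'))  # Greek accents are stripped too
```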

Encoding detection with Chardet

Chardet is a widely used Python package for guessing the encoding of byte sequences. (Note: it is a third-party package installed with pip, not part of the standard library.)



