python - Unicode encoding in GBK

Question

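I have recently been using Python to process a batch of news corpus data. The main task is to extract the text inside the tags and separate the characters with spaces. Part of the file looks like this: {code...} Observant readers may have noticed a garbled character in front of 北京; in addition, the digits 1, 6 and 4 are all full-width. Full-width to half-width conversion...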

高洛峰 · Answer

>>> '组图：震前汶川风光\ue40c震前汶川风光\u3000ＱＱ群４９１４６６７．作者肚螂皮'.encode('gbk')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'gbk' codec can't encode character '\ue40c' in position 9: illegal multibyte sequence
>>> '组图：震前汶川风光\ue40c震前汶川风光\u3000ＱＱ群４９１４６６７．作者肚螂皮'.encode('gb18030')
b'\xd7\xe9\xcd\xbc\xa3\xba\xd5\xf0\xc7\xb0\xe3\xeb\xb4\xa8\xb7\xe7\xb9\xe2\xfd\xa3\xd5\xf0\xc7\xb0\xe3\xeb\xb4\xa8\xb7\xe7\xb9\xe2\xa1\xa1\xa3\xd1\xa3\xd1\xc8\xba\xa3\xb4\xa3\xb9\xa3\xb1\xa3\xb4\xa3\xb6\xa3\xb6\xa3\xb7\xa3\xae\xd7\xf7\xd5\xdf\xb6\xc7\xf2\xeb\xc6\xa4'


>>> '组图：震前汶川风光\ue40c震前汶川风光\u3000ＱＱ群４９１４６６７．作者肚螂皮'.encode('gbk', errors='replace').decode('gbk')
'组图：震前汶川风光?震前汶川风光\u3000ＱＱ群４９１４６６７．作者肚螂皮'

Both of the problems you listed go away if you use the GB18030 encoding.
The GB* encodings are not inherently fault-tolerant: most characters occupy two bytes, and the decoder has to stay aligned on those pairs. So when you hit something that cannot be converted, replace it with a substitute character instead of simply dropping it; otherwise the characters that follow can be misread as garbage. (This is why "deleting half a Chinese character" produced garbled text in the DOS era.)
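A minimal sketch of the misalignment described above, assuming Python 3 and a short made-up sample string rather than the asker's file. It drops one byte from GBK data to imitate "deleting half a Chinese character", then shows that replacing an unencodable character with `?` keeps the neighbouring characters intact.

```python
# Sample text is an assumption for illustration; Python produces the GBK bytes.
text = '汶川风光'
raw = text.encode('gbk')  # each of these characters takes two bytes

# Drop the first byte, i.e. "delete half a Chinese character".
damaged = raw[1:]
# The decoder now pairs up the wrong bytes, so the later characters change too.
garbled = damaged.decode('gbk', errors='replace')

# Substituting a whole character keeps the byte pairs aligned.
safe = '汶川\ue40c风光'.encode('gbk', errors='replace').decode('gbk')

print(garbled)
print(safe)
```

None of the original characters after the deleted half survive in `garbled`, whereas `safe` differs from its input only at the one character that GBK cannot represent.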

I had no problem decoding your file using GB18030:

Python 3.3.3 (default, Nov 26 2013, 13:33:18) 
[GCC 4.8.2] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> d = open('news_test.xml', encoding='gb18030')
>>> c = d.read()
>>> 


Python 2.7.6 (default, Nov 26 2013, 12:52:49) 
[GCC 4.8.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> d = open('news_test.xml') 
>>> b = d.read()
>>> c = b.decode('gb18030')
>>>

天蓬老师 · Answer

I ran into the same problem. Here is what I did:
1. Read the file as GB18030 and decode it into Unicode
2. Convert full-width characters to half-width
3. Replace u'\ue40c' with a space
4. Finally, write the output as UTF-8
PS: I did not do step 3 at first. I only discovered the problem later, during word segmentation, and went back to filter it out. I do not know whether there are other hidden problems; if you are worried, you could filter the text once more against GB18030 before writing it out.
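The four steps above could be sketched roughly as follows, assuming Python 3. The names `TABLE` and `clean_bytes` are hypothetical, and `str.translate` stands in for whichever conversion routine you prefer. The function works on bytes in memory so it can be tried without the original file.

```python
# Mapping used by str.translate (keys and values are code points):
# full-width forms U+FF01..U+FF5E -> their ASCII counterparts,
# full-width space U+3000 -> ordinary space,
# private-use character U+E40C -> ordinary space.
TABLE = {code: code - 0xFEE0 for code in range(0xFF01, 0xFF5F)}
TABLE[0x3000] = 0x20
TABLE[0xE40C] = 0x20


def clean_bytes(raw):
    """Decode GB18030 bytes, normalise the text, and return UTF-8 bytes."""
    text = raw.decode('gb18030')   # step 1: GB18030 -> Unicode
    text = text.translate(TABLE)   # steps 2 and 3 in a single pass
    return text.encode('utf-8')    # step 4: output as UTF-8


# Made-up sample modelled on the title quoted earlier in the thread.
sample = '风光\ue40c震前\u3000ＱＱ群４９１'.encode('gb18030')
print(clean_bytes(sample).decode('utf-8'))
```

For a real file, the same function could be applied to the result of `open(path, 'rb').read()`, with the returned bytes written to a new file opened in `'wb'` mode.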

Attached is my full-width to half-width conversion code:

def strq2b(ustring):
    """Convert full-width characters to half-width."""
    chars = []
    for uchar in ustring:
        inside_code = ord(uchar)
        if inside_code == 12288:  # full-width space (U+3000) maps directly to U+0020
            inside_code = 32
        elif 65281 <= inside_code <= 65374:  # other full-width characters sit at a fixed offset
            inside_code -= 65248
        chars.append(chr(inside_code))  # Python 3; use unichr() on Python 2
    return "".join(chars)
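As an alternative to a hand-written loop, the standard library's `unicodedata.normalize` with the `NFKC` form also folds full-width letters, digits, punctuation and the full-width space into their half-width equivalents. This swaps in a different technique from the one above, and the helper name `to_halfwidth` is hypothetical. NFKC rewrites other compatibility characters as well (ligatures and half-width katakana, for example), so it is only a sketch for text where that side effect is acceptable.

```python
import unicodedata


def to_halfwidth(text):
    """Fold compatibility characters, including full-width forms, via NFKC."""
    return unicodedata.normalize('NFKC', text)


print(to_halfwidth('ＱＱ群４９１\u3000'))
```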

迷茫 · Answer

What encoding is the file D:/news_test.txt you wrote?

怪我咯 · Answer

You could try chardet.
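A minimal sketch of what that might look like, assuming the third-party `chardet` package is installed (`pip install chardet`). The helper name `guess_encoding` and the sample bytes are made up for illustration. `chardet.detect` takes bytes and returns a dictionary that includes an `encoding` guess and a `confidence` value.

```python
def guess_encoding(raw):
    """Return chardet's best guess for the encoding of `raw`, or None."""
    try:
        import chardet  # third-party package, not part of the standard library
    except ImportError:
        return None
    return chardet.detect(raw)['encoding']


# Made-up sample; for a real file, pass open(path, 'rb').read() instead.
sample = '组图：震前汶川风光'.encode('gbk')
print(guess_encoding(sample))
```

The result is only a statistical guess, so treat it as a hint. A file reported under a narrower GB name may still need GB18030 before every character decodes.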