python - Unicode encoding in GBK
迷茫 2017-04-17 11:58:15

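I have recently been using Python to process a batch of news corpora. The main task is to extract the text inside the tags and separate the characters with spaces. Part of the file looks like this: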

<doc>
<url>http://sports.sina.com.cn/euro2008/video/601/2008-06-06/105.html</url>
<docno>005e46d7ec87bc5b-63207783d4cca6e0</docno>
<contenttitle>视频-揭幕战捷克即将亮相 “东欧铁骑”信心十足</contenttitle>
<content>评论:1北京时间6月4日,捷克国家足球队离开奥地利的因斯布鲁克训练基地,赶赴瑞士准备与东道主的揭幕战。以上是相关视频报道。</content>
</doc>

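Careful readers may notice a garbled character in front of 北京, and that the digits 1, 6, and 4 are full-width. Converting full-width to half-width is easy once the text is Unicode, but if the file itself uses a non-Unicode encoding (gbk, for example), it has to be decoded first. Even with what should be the correct decoder, though, the file cannot be decoded properly; reading the very first line already fails: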

>>> news_file.readline()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'gbk' codec can't decode byte 0xbf in position 2: illegal multibyte sequence

Some escaped characters that show up in the gbk-encoded text also trigger encoding and decoding errors:

>>> s = "组图:震前汶川风光震前汶川风光 QQ群4914667.作者肚螂皮"
>>> s
'组图:震前汶川风光\ue40c震前汶川风光\u3000QQ群4914667.作者肚螂皮'
>>> news_file = open("D:/news_test.txt", "w")
>>> news_file.write(s)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'gbk' codec can't encode character '\ue40c' in position 9: illegal multibyte sequence
>>> 

I tried passing errors="ignore" to open(). The error goes away, but the text read back is garbled:

>>> news_file.readline()
'<content>组图:震前汶川风光U鹎般氪ǚ绻狻。眩讶海矗梗保矗叮叮罚作者肚螂皮</content>\n'

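How can I handle the two exceptions above and still read and write the file correctly? Is there an open-source package that deals with this?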


reply all(4)
小葫芦
>>> '组图:震前汶川风光\ue40c震前汶川风光\u3000QQ群4914667.作者肚螂皮'.encode('gbk')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'gbk' codec can't encode character '\ue40c' in position 9: illegal multibyte sequence
>>> '组图:震前汶川风光\ue40c震前汶川风光\u3000QQ群4914667.作者肚螂皮'.encode('gb18030')
b'\xd7\xe9\xcd\xbc\xa3\xba\xd5\xf0\xc7\xb0\xe3\xeb\xb4\xa8\xb7\xe7\xb9\xe2\xfd\xa3\xd5\xf0\xc7\xb0\xe3\xeb\xb4\xa8\xb7\xe7\xb9\xe2\xa1\xa1\xa3\xd1\xa3\xd1\xc8\xba\xa3\xb4\xa3\xb9\xa3\xb1\xa3\xb4\xa3\xb6\xa3\xb6\xa3\xb7\xa3\xae\xd7\xf7\xd5\xdf\xb6\xc7\xf2\xeb\xc6\xa4'


>>> '组图:震前汶川风光\ue40c震前汶川风光\u3000QQ群4914667.作者肚螂皮'.encode('gbk', errors='replace').decode('gbk')
'组图:震前汶川风光?震前汶川风光\u3000QQ群4914667.作者肚螂皮'
  1. Neither of the two problems you listed occurs if you use the GB18030 encoding.
  2. The GB* encodings are not inherently fault-tolerant, so when you hit something that cannot be converted, replace it with another character instead of simply ignoring it. Otherwise the characters that follow can turn into garbage (this is why "deleting half a Chinese character" produced garbled text in the DOS era).
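To illustrate point 2, here is a minimal sketch of what "deleting half a Chinese character" does to GBK text. The sample string is taken from the question; the helper name `decode_lossy` is made up for this example.

```python
# Four characters from the question's sample, two GBK bytes each.
text = "震前汶川"
raw = text.encode("gbk")

# Simulate "deleting half a Chinese character": drop the first byte,
# so every following byte pair is shifted by one.
damaged = raw[1:]

def decode_lossy(data, errors):
    """Decode GBK bytes with the given error handler (hypothetical helper)."""
    return data.decode("gbk", errors=errors)

ignored = decode_lossy(damaged, "ignore")    # silently drops the dangling byte
replaced = decode_lossy(damaged, "replace")  # marks the dangling byte with U+FFFD

print(repr(ignored))
print(repr(replaced))
```

Neither handler recovers the original characters, because the bytes are now paired differently, but "replace" at least leaves a visible marker where data was lost.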

I had no problem decoding your file using GB18030:

Python 3.3.3 (default, Nov 26 2013, 13:33:18) 
[GCC 4.8.2] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> d = open('news_test.xml', encoding='gb18030')
>>> c = d.read()
>>> 


Python 2.7.6 (default, Nov 26 2013, 12:52:49) 
[GCC 4.8.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> d = open('news_test.xml') 
>>> b = d.read()
>>> c = b.decode('gb18030')
>>> 
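
The same idea should apply to writing. The UnicodeEncodeError in the question mentions the 'gbk' codec, which suggests open() fell back to the platform default encoding, so passing an explicit encoding avoids it. A minimal sketch, where `write_text` and the temporary path are assumptions made for illustration:

```python
import os
import tempfile

# Sample string from the question, including the private-use character U+E40C.
s = "组图:震前汶川风光\ue40c震前汶川风光\u3000QQ群4914667.作者肚螂皮"

def write_text(text, path, encoding="gb18030"):
    """Write text with an explicit encoding instead of the platform default."""
    with open(path, "w", encoding=encoding) as f:
        f.write(text)

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "news_test.txt")  # hypothetical file name
    write_text(s, out_path)
    with open(out_path, "rb") as f:
        written = f.read()

print(written == s.encode("gb18030"))
```

Passing encoding="utf-8" works the same way if the output does not have to stay in a GB* encoding.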
Peter_Zhu

I ran into the same problem. Here is what I did:
1. Read the file as GB18030 and decode it into Unicode
2. Convert full-width characters to half-width
3. Replace u'\ue40c' with a space
4. Write the result out as utf-8
PS: I did not do step 3 at first. I only discovered the problem later, during word segmentation, and went back to filter it out. I do not know whether other hidden problems remain. If you want to be safe, filter the text once more against GB18030 before writing it out.

My full-width to half-width conversion code is attached below:

def strq2b(ustring):
    """Convert full-width characters to half-width."""
    rstring = ""
    for uchar in ustring:
        inside_code = ord(uchar)
        if inside_code == 12288:  # full-width space maps directly to an ASCII space
            inside_code = 32
        elif 65281 <= inside_code <= 65374:  # other full-width characters are offset by 65248
            inside_code -= 65248
        rstring += unichr(inside_code)  # Python 2; use chr() on Python 3
    return rstring
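
For Python 3, the four steps above could be sketched roughly as follows. This is not the original code: it swaps the character loop for str.translate with a lookup table, and the names `FULL_TO_HALF`, `clean_line`, and `convert_file` are invented for the example.

```python
import os
import tempfile

# U+3000 (full-width space) and U+FF01..U+FF5E map onto ASCII 0x20..0x7E.
FULL_TO_HALF = {0x3000: 0x20}
FULL_TO_HALF.update({code: code - 0xFEE0 for code in range(0xFF01, 0xFF5F)})

def clean_line(text, bad_chars=("\ue40c",)):
    """Convert full-width characters to half-width and blank out unwanted ones."""
    text = text.translate(FULL_TO_HALF)
    for ch in bad_chars:
        text = text.replace(ch, " ")
    return text

def convert_file(src_path, dst_path):
    """Read src_path as GB18030, clean each line, write dst_path as UTF-8."""
    with open(src_path, encoding="gb18030") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(clean_line(line))

# Small demonstration on a temporary file.
with tempfile.TemporaryDirectory() as tmp_dir:
    src_path = os.path.join(tmp_dir, "in.txt")
    dst_path = os.path.join(tmp_dir, "out.txt")
    with open(src_path, "wb") as f:
        f.write("\uff11\uff16\uff14北京\n".encode("gb18030"))
    convert_file(src_path, dst_path)
    with open(dst_path, encoding="utf-8") as f:
        converted = f.read()

print(repr(converted))
```

The `bad_chars` tuple only covers the one private-use character mentioned in this thread; other stray code points would have to be added as they turn up.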
迷茫

What encoding is the file D:/news_test.txt you wrote?

刘奇

You could also try chardet.
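
A rough sketch of how that might look. chardet is a third-party package and only guesses statistically, so this example falls back to trying a short list of candidate encodings when chardet is not installed or returns nothing. The function names and the candidate list are assumptions for illustration.

```python
def try_encodings(raw, candidates=("utf-8", "gb18030")):
    """Return the first candidate that decodes raw without error, else None."""
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None

def guess_encoding(raw):
    """Ask chardet for a guess if it is installed; otherwise try candidates."""
    try:
        import chardet  # third-party: pip install chardet
    except ImportError:
        return try_encodings(raw)
    guess = chardet.detect(raw)["encoding"]
    return guess or try_encodings(raw)

sample = "北京时间6月4日,捷克国家足球队离开奥地利".encode("gbk")
print(guess_encoding(sample))
```

A detector may report a narrower label than the file really needs, so decoding with gb18030 is still the safer choice when the guess is any GB* encoding.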
