Python: how to filter out special and garbled characters
迷茫 2017-04-17 13:14:51

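Problem: I have more than ten thousand files, and many of the file names contain special characters (garbled characters, to be precise). I want to write a Python script that filters them out and keeps only normal text (letters, digits, and Chinese characters).
My first thought was to match with a regular expression, but I don't know regex well. I'd appreciate a hint from someone more experienced. Thanks!
Examples of the garbled characters: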

2W4mhTO?!t6X tX]错3窢塠朞?飙l?I汿?瓓?m:?识3I?霜???豚壥冂騏渖?慮玍0?w?N騃V?,腳?赿?Q?鸊ε`S
栳舅4Um瞘S?U{岁匭陈ybIPIh蟷(U剦缳h滑猈
留+&HR1錔碢s??Z邎遣?Zx趑U.w軎蝜锥e躸Y5z瓄埵涩?涨(<|I勀)??]t}  8?'鬖'抭??z?Ak栗醏胤?珇?g?5q顛J+乀?:pq陻謩BA$窳??+;?攉憴kAF?仇藅肆凶鬤~?闵楍曚H颴 €隔C 摶┦?K褡輈j?鹬嘙? Y肠颀爏? %y嫿3牏?瓎e?瞟蓐鯲
[妉灓€紜Z鸧旬墺asqp騚Q|?痘麱檎../mZe耪m??噡輍絙]宠s琗詬禈鈞
2S:陜??椣:_尙l譸氠彋氪?6棣?播9赲?UK蛌嬨zg璕}2?鑧嵉藴;抒库k
T7bc饓%p?鸃恫╤丛℡梯耽O^躹AyKI?m瀾▁跮滁u李'+煰鰰cM?竧堷傭媇SQ}走n-扉8I鈴淕夨?m猨+揠跶?"広`s
h鳩x

These are just samples I picked at random; anything that filters out the abnormal characters will do.


All replies (1)
洪涛

None of the characters here are genuine; this is a decoding problem. For example, bytes that were written in one encoding were later decoded with a different one, such as UTF-8 or ISO-Latin-1, so they naturally came out garbled.
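To make the mismatch concrete, here is a minimal Python 3 sketch. The sample string and the UTF-8/Latin-1 pair are illustrative assumptions, not taken from the asker's files. It encodes text as UTF-8, decodes the bytes with the wrong codec, and then reverses the mistake.

```python
# Illustrative file name; an assumption, not one of the asker's files.
original = "文件名"

# The bytes as they would be stored when written as UTF-8.
raw = original.encode("utf-8")

# Decoding with the wrong codec yields mojibake. Latin-1 maps every byte
# to a character, so this never raises; it just produces nonsense.
garbled = raw.decode("latin-1")

# If the wrong codec is known and was lossless, the mistake can be undone
# by re-encoding with it and decoding with the right one.
repaired = garbled.encode("latin-1").decode("utf-8")

print(repr(garbled))
print(repaired)
```

The round trip only works when the wrong decoding kept every byte. If characters were already replaced with `?`, or the names were never text to begin with, the original cannot be recovered and filtering is the only option left.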

========

Since you updated the question, I will also update the answer.

Write a loop that scans the characters one by one, and use hex(ord(VARIABLE)) to get each character's code point and check whether it falls within the ranges you want to keep.
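A minimal sketch of that loop, assuming Python 3 and assuming that "normal" means ASCII digits, ASCII letters, and the basic CJK Unified Ideographs block (U+4E00 to U+9FFF). The names `is_normal`, `keep_normal`, and `keep_normal_re` are hypothetical helpers, and the regex variant is included only because the question asked about it.

```python
import re

def is_normal(ch):
    """Return True if ch is an ASCII digit, ASCII letter, or basic CJK ideograph."""
    cp = ord(ch)
    return (
        0x30 <= cp <= 0x39         # 0-9
        or 0x41 <= cp <= 0x5A      # A-Z
        or 0x61 <= cp <= 0x7A      # a-z
        or 0x4E00 <= cp <= 0x9FFF  # CJK Unified Ideographs (assumed range)
    )

def keep_normal(name):
    """Drop every character that is not in the accepted ranges."""
    return "".join(ch for ch in name if is_normal(ch))

# The same filter written as a regular expression.
_NOT_NORMAL = re.compile(r"[^0-9A-Za-z\u4e00-\u9fff]")

def keep_normal_re(name):
    return _NOT_NORMAL.sub("", name)

print(keep_normal("报告_2017?!.txt"))     # → 报告2017txt
print(keep_normal_re("报告_2017?!.txt"))  # → 报告2017txt
```

Two caveats. The filter also removes `.` and `_`, so add them to the accepted set if extensions should survive. It also cannot tell a real Chinese character from a garbled one: much of the sample above decodes to valid ideographs that fall inside the CJK range and would be kept.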

Of course, whichever way you do it, the result will look clumsy.

References:
Chinese UTF-8 Range
Unicode Howto
Unicode In Python, Completely Demystified
