python3.x - How should maketrans be used on a UTF-8 file in Python?
过去多啦不再A梦 2017-05-18 10:58:56

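I wrote a script that processes text by replacing every symbol in it with a space, using Python's maketrans and translate. It works fine on ASCII-encoded files, but on a UTF-8 file it raises an error saying the maketrans arguments are not of equal length, even though they look the same length to me: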

File "/Users/lgq/Desktop/p3.py", line 10, in text_to_words

"abcdefghijklmnopqrstuvwxyz                                                   ") 

ValueError: the first two maketrans arguments must have equal length

I looked it up and read that maketrans cannot be used with UTF-8. How should I replace characters in UTF-8 text, then? Any pointers would be appreciated.

def text_to_words(the_text):
    """ 
        Return a list of words with all punctuation removed,
        and all in lowercase.
    """
    my_substitutions = the_text.maketrans(
        # If you find any of these
        "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789!\"#$%&()*+,-./:;<=>?@[]^_`{|}~'\\",
        # Replace them by these
        "abcdefghijklmnopqrstuvwxyz                                            ")
    # Translate the text now.
    cleaned_text = the_text.translate(my_substitutions)
    wds = cleaned_text.split()
    return wds


def get_words_in_book(filename):
    """ Read a book from filename, and return a list of its words."""
    with open(filename, "r", encoding="utf-8") as f:
        content = f.read()
    wds = text_to_words(content)
    return wds


book_words = get_words_in_book("alice.txt")
print("There are {0} words in the book, the first 100 are\n{1}".
        format(len(book_words), book_words[:100]))

All replies (1)
滿天的星座

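First, the two strings really are different lengths: the escape \" counts as a single character, and \\ counts as a single character too.
You can check this with len().
Also, for questions about strings, it is best to state which Python version you are using.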
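To make the len() check concrete, here is a minimal Python 3 sketch. The names find_chars and replace_chars are mine, not from the question, and the search string assumes the trailing backslash was written as \\ in the original file. Building the replacement from the measured length is just one way to avoid hand-counting spaces.

```python
# Search string from the question, with the trailing backslash written as "\\".
find_chars = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789!\"#$%&()*+,-./:;<=>?@[]^_`{|}~'\\"

# An escape sequence such as \" or \\ produces exactly one character.
print(len("\""), len("\\"))  # 1 1
print(len(find_chars))       # 68

# Hypothetical fix: derive the number of spaces instead of typing them by hand.
# 26 lowercase letters replace the 26 uppercase ones; everything else becomes a space.
replace_chars = "abcdefghijklmnopqrstuvwxyz" + " " * (len(find_chars) - 26)

# With equal lengths, str.maketrans no longer raises ValueError.
table = str.maketrans(find_chars, replace_chars)
print(len(replace_chars) == len(find_chars))  # True
```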

The maketrans arguments have different lengths

 my_substitutions = the_text.maketrans(
        # If you find any of these
        "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789!\"#$%&()*+,-./:;<=>?@[]^_`{|}~'\\",
        # Replace them by these
        "abcdefghijklmnopqrstuvwxyz                                            ")

Test code:

# -*- coding: utf-8 -*-
# Python 2 only: the string module has no maketrans function in Python 3.
from string import maketrans

def text_to_words(the_text):
    """ 
        Return a list of words with all punctuation removed,
        and all in lowercase.
    """
    my_substitutions = maketrans(
        # If you find any of these
        "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789!\"#$%&()*+,-./:;<=>?@[]^_`{|}~'\\",
        # Replace them by these
        "abcdefghijklmnopqrstuvwxyz                                          ")
    # Translate the text now.
    cleaned_text = the_text.translate(my_substitutions)
    wds = cleaned_text.split()
    return wds

text_to_words('ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789!\"#$%&()*+,-./:;<=>?@[]^_`{|}~\'\\测试')

Output:

['abcdefghijklmnopqrstuvwxyz', '\xe6\xb5\x8b\xe8\xaf\x95']

This is the result when run under Python 2.
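For Python 3, where str is already Unicode, something along these lines should work on text read from a UTF-8 file. This is only a sketch under assumptions: taking the character sets from the string module is my substitution for the hand-typed literals in the question, and only ASCII punctuation and digits are replaced, so full-width punctuation would pass through unchanged.

```python
import string


def text_to_words(the_text):
    """Return a list of words, lowercased, with ASCII digits and punctuation removed."""
    # Characters to look for: uppercase letters, digits, ASCII punctuation.
    find_chars = string.ascii_uppercase + string.digits + string.punctuation
    # Replacements: lowercase letters, then one space per digit or punctuation mark,
    # so both arguments are guaranteed to have the same length.
    replace_chars = string.ascii_lowercase + " " * (
        len(string.digits) + len(string.punctuation)
    )
    table = str.maketrans(find_chars, replace_chars)
    # Characters not in the table, such as Chinese text, are left as they are.
    return the_text.translate(table).split()


# In the question, the text would come from open(filename, encoding="utf-8").read().
words = text_to_words("Alice's Adventures, 测试!")
print(words)  # ['alice', 's', 'adventures', '测试']
```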
