Write a simple Chinese word segmenter in Python

高洛峰

Released: 2016-10-18 11:45:53

After unzipping, take out the following files:

Training data: icwb2-data/training/pku_training.utf8

Test data: icwb2-data/testing/pku_test.utf8

Correct word segmentation result: icwb2-data/gold/pku_test_gold.utf8

Scoring tool: icwb2-data/script/score

2 Algorithm description

The algorithm is the simplest forward maximum matching (FMM):

Generate a dictionary from the training data

Scan the test data from left to right; at each position, cut off the longest dictionary word that matches, and repeat until the end of the sentence. Note: this is the original algorithm, which keeps the code within 60 lines. The test results later showed that numbers were not handled well, so rule-based handling of numbers was added.
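The two steps above can be sketched in a few lines. This is a minimal illustration, not the script from this article: the name fmm_segment, the max_word_len parameter, and the toy vocabulary are assumptions made for the example, and the vocabulary is a plain set rather than the first-character index used in the full source.

```python
# -*- coding: utf-8 -*-

def fmm_segment(sentence, vocabulary, max_word_len=None):
    """Forward maximum matching: at each position take the longest vocabulary word."""
    if max_word_len is None:
        # Longest word in the vocabulary bounds how far ahead we need to look.
        max_word_len = max(len(w) for w in vocabulary) if vocabulary else 1
    result = []
    start = 0
    while start < len(sentence):
        longest = min(max_word_len, len(sentence) - start)
        # Try the longest candidate first and shrink until something matches.
        for length in range(longest, 0, -1):
            candidate = sentence[start:start + length]
            # A single character is always accepted as a fallback.
            if length == 1 or candidate in vocabulary:
                result.append(candidate)
                start += length
                break
    return result


# Hypothetical toy vocabulary, only for illustration.
toy_vocabulary = {u"研究", u"研究生", u"生命", u"起源"}
segments = fmm_segment(u"研究生命起源", toy_vocabulary)
# segments is [u"研究生", u"命", u"起源"]
```

The example also shows the typical weakness of the greedy approach: the longest match 研究生 wins at the first position, so the intended split 研究 / 生命 / 起源 is never considered.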

3 Source code and comments

#! /usr/bin/env python
# -*- coding: utf-8 -*-
   
# Author: minix
# Date:   2013-03-20
 
    
import codecs
import sys
    
# Special characters that are handled by rules
numMath = [u'0', u'1', u'2', u'3', u'4', u'5', u'6', u'7', u'8', u'9']
numMath_suffix = [u'.', u'%', u'亿', u'万', u'千', u'百', u'十', u'个']
numCn = [u'一', u'二', u'三', u'四', u'五', u'六', u'七', u'八', u'九', u'〇', u'零']
numCn_suffix_date = [u'年', u'月', u'日']
numCn_suffix_unit = [u'亿', u'万', u'千', u'百', u'十', u'个']
special_char = [u'(', u')']
    
    
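def proc_num_math(line, start):
    """ Handle Arabic numerals in the sentence; return the length of the match """
    oldstart = start
    # The bounds checks avoid an IndexError when a number ends the text
    while start < len(line) and (line[start] in numMath or line[start] in numMath_suffix):
        start = start + 1
    if start < len(line) and line[start] in numCn_suffix_date:
        start = start + 1
    return start - oldstart

def proc_num_cn(line, start):
    """ Handle Chinese numerals in the sentence; return the length of the match """
    oldstart = start
    while start < len(line) and (line[start] in numCn or line[start] in numCn_suffix_unit):
        start = start + 1
    if start < len(line) and line[start] in numCn_suffix_date:
        start = start + 1
    return start - oldstart

def rules(line, start):
    """ Apply the special rules; return the length of the matched token """
    if line[start] in numMath:
        return proc_num_math(line, start)
    elif line[start] in numCn:
        return proc_num_cn(line, start)
    # No rule applies: fall back to a single character
    return 1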
    
def genDict(path):
    """ Build the dictionary from the training data """
    f = codecs.open(path, 'r', 'utf-8')
    contents = f.read()
    f.close()
    # Turn line breaks into spaces so that words on adjacent lines are not fused
    contents = contents.replace(u'\r', u' ')
    contents = contents.replace(u'\n', u' ')
    # Split the file contents on spaces
    mydict = contents.split(u' ')
    # Remove duplicates from the word list
    newdict = set(mydict)
    # Drop the empty string left by consecutive spaces (discard never raises)
    newdict.discard(u'')

    # Build the dictionary:
    # the key is the first character of a word,
    # the value is the list of words that start with that character
    truedict = {}
    for item in newdict:
        truedict.setdefault(item[0], []).append(item)
    return truedict
    
def print_unicode_list(uni_list):
    """ Print the items separated by spaces (debugging helper) """
    print(u' '.join(uni_list))
    
def divideWords(mydict, sentence):
    """
    Segment the sentence using the dictionary.
    Uses forward matching: scan from left to right, cut off the
    longest word found, and repeat until the whole sentence is segmented.
    """
    result = []
    start = 0
    senlen = len(sentence)
    while start < senlen:
        curword = sentence[start]
        maxlen = 1
        # First check whether a special rule matches
        if curword in numCn or curword in numMath:
            maxlen = rules(sentence, start)
        # Then look for the longest word that starts with the current character
        if curword in mydict:
            words = mydict[curword]
            for item in words:
                itemlen = len(item)
                if sentence[start:start+itemlen] == item and itemlen > maxlen:
                    maxlen = itemlen
        result.append(sentence[start:start+maxlen])
        start = start + maxlen
    return result
    
def main():
    args = sys.argv[1:]
    if len(args) < 3:
        print('Usage: python dw.py dict_path test_path result_path')
        sys.exit(1)
    dict_path = args[0]
    test_path = args[1]
    result_path = args[2]

    dicts = genDict(dict_path)
    fr = codecs.open(test_path, 'r', 'utf-8')
    test = fr.read()
    result = divideWords(dicts, test)
    fr.close()
    fw = codecs.open(result_path, 'w', 'utf-8')
    # Write the segmented words separated by spaces
    for item in result:
        fw.write(item + ' ')
    fw.close()

if __name__ == "__main__":
    main()

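The number rule in the listing above can be tried in isolation. The sketch below follows the same idea (consume a run of digits and numeric suffixes, then an optional date suffix) with explicit bounds checks; the name number_span and the sample strings are assumptions made for this example, not part of the original script.

```python
# -*- coding: utf-8 -*-

DIGITS = set(u"0123456789")
NUMBER_SUFFIXES = set(u".%亿万千百十个")
DATE_SUFFIXES = set(u"年月日")


def number_span(text, start):
    """Return the length of the number token that begins at text[start], or 0."""
    end = start
    # Consume digits and numeric suffixes, stopping safely at the end of the text.
    while end < len(text) and (text[end] in DIGITS or text[end] in NUMBER_SUFFIXES):
        end += 1
    # A single trailing date character (年, 月, 日) is attached to the number.
    if end > start and end < len(text) and text[end] in DATE_SUFFIXES:
        end += 1
    return end - start


year_len = number_span(u"1998年新年", 0)     # covers "1998年"
percent_len = number_span(u"增长12.5%", 2)   # covers "12.5%"
tail_len = number_span(u"共30", 1)           # number at the very end of the text
```

Here year_len and percent_len are both 5 and tail_len is 2. The last case is the one that matters in practice: without the bounds checks, a number at the very end of the input would index past the end of the string.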

4 Testing and scoring results

Run dw.py on the training data and the test data to generate the result file

Run score on the training data, the correct word segmentation results, and the results we generated to compute the score

Use tail to view the overall score in the last few lines of the score file. In addition, score.utf8 contains a large number of comparison results, which help you find where your own segmentation falls short

Note: the entire test was run on Ubuntu

$ python dw.py pku_training.utf8 pku_test.utf8 pku_result.utf8

$ perl score pku_training.utf8 pku_test_gold.utf8 pku_result.utf8 > score.utf8

$ tail -22 score.utf8

INSERTIONS: 0

DELETIONS: 0

SUBSTITUTIONS: 0

NCHANGE: 0

NTRUTH: 27

NTEST: 27

TRUE WORDS RECALL: 1.000

TEST WORDS PRECISION: 1.000

=== SUMMARY:

=== TOTAL INSERTIONS: 4623

=== TOTAL DELETIONS: 1740

=== TOTAL SUBSTITUTIONS: 6650

=== TOTAL NCHANGE: 13013

=== TOTAL TRUE WORD COUNT: 104372

=== TOTAL TEST WORD COUNT: 107255

=== TOTAL TRUE WORDS RECALL: 0.920

=== TOTAL TEST WORDS PRECISION: 0.895

=== F MEASURE: 0.907

=== OOV Rate: 0.940

=== OOV Recall Rate: 0.917

=== IV Recall Rate: 0.966
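The recall, precision and F measure in the summary can be reproduced from the totals. The sketch below is my reading of the counts, not code taken from the score script: it assumes that a gold word is correct unless it was deleted or substituted, and that an output word is correct unless it was inserted or substituted. The function name segmentation_scores is hypothetical.

```python
def segmentation_scores(true_words, test_words, insertions, deletions, substitutions):
    """Recompute recall, precision and F measure from the summary totals."""
    # Assumption: gold words that were neither deleted nor substituted are correct.
    recall = (true_words - deletions - substitutions) / float(true_words)
    # Assumption: output words that were neither inserted nor substituted are correct.
    precision = (test_words - insertions - substitutions) / float(test_words)
    # The F measure is the harmonic mean of precision and recall.
    f_measure = 2 * precision * recall / (precision + recall)
    return recall, precision, f_measure


recall, precision, f_measure = segmentation_scores(
    true_words=104372, test_words=107255,
    insertions=4623, deletions=1740, substitutions=6650)
print("recall=%.3f precision=%.3f F=%.3f" % (recall, precision, f_measure))
# prints: recall=0.920 precision=0.895 F=0.907
```

Under these assumptions both formulas give the same number of correct words (95982), and the three values match the summary lines above.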

Dictionary-based FMM is a very basic word segmentation algorithm. Its accuracy is not that good, but it is simple and easy to get started with. As I learn more, I may implement other word segmentation algorithms in Python as well. Another takeaway: when reading, try to implement as much as you can. Doing so gives you the motivation to pay attention to every detail of the theory, and the material feels much less dry.