After unzipping, take out the following files:
Training data: icwb2-data/training/pku_training.utf8
Test data: icwb2-data/testing/pku_test.utf8
Correct word segmentation result: icwb2-data/gold/pku_test_gold.utf8
Scoring tool: icwb2-data/scripts/score
2 Algorithm description
The algorithm is the simplest forward maximum matching (FMM):
Generate a dictionary from the training data.
Scan the test data from left to right; whenever the longest dictionary word starting at the current position is found, split it off, and continue until the end of the sentence.
Note: This is the original algorithm, and with it alone the code stays within 60 lines. Later, the test results showed that numbers were not handled well, so special handling of numbers was added.
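The two steps above can be sketched on a toy corpus. This is a minimal Python 3 illustration and not the listing from this post: the names build_dictionary and fmm_segment are hypothetical, the dictionary is a plain set instead of an index keyed by first character, and the number rules are left out.

```python
def build_dictionary(segmented_text):
    """Collect the distinct words of a whitespace-segmented corpus."""
    return set(segmented_text.split())


def fmm_segment(dictionary, sentence):
    """Greedy forward maximum matching over `sentence`."""
    # The longest dictionary word bounds the window we need to try.
    max_len = max((len(word) for word in dictionary), default=1)
    result = []
    start = 0
    while start < len(sentence):
        # Try the longest window first and shrink until a dictionary word fits.
        for length in range(min(max_len, len(sentence) - start), 0, -1):
            candidate = sentence[start:start + length]
            # A single character is always accepted as a fallback.
            if length == 1 or candidate in dictionary:
                result.append(candidate)
                start += length
                break
    return result


# Toy training text (assumed for illustration only).
toy_dict = build_dictionary("研究 研究生 生命 命 的 起源")
print(fmm_segment(toy_dict, "研究生命的起源"))
# prints ['研究生', '命', '的', '起源']
```

The toy sentence also shows the typical weakness of greedy matching: the longest word 研究生 is taken first, so the intended split 研究 / 生命 is never considered.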
3 Source code and comments
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Author: minix
# Date: 2013-03-20
from __future__ import print_function

import codecs
import sys

# Special symbols that are handled by rules
numMath = [u'0', u'1', u'2', u'3', u'4', u'5', u'6', u'7', u'8', u'9']
numMath_suffix = [u'.', u'%', u'亿', u'万', u'千', u'百', u'十', u'个']
numCn = [u'一', u'二', u'三', u'四', u'五', u'六', u'七', u'八', u'九', u'〇', u'零']
numCn_suffix_date = [u'年', u'月', u'日']
numCn_suffix_unit = [u'亿', u'万', u'千', u'百', u'十', u'个']
special_char = [u'(', u')']


def proc_num_math(line, start):
    """ Handle Arabic numerals in the sentence; return the matched length """
    oldstart = start
    end = len(line)
    # The bounds checks keep a number at the very end of the text from
    # raising an IndexError
    while start < end and (line[start] in numMath or line[start] in numMath_suffix):
        start = start + 1
    if start < end and line[start] in numCn_suffix_date:
        start = start + 1
    return start - oldstart


def proc_num_cn(line, start):
    """ Handle Chinese numerals in the sentence; return the matched length """
    oldstart = start
    end = len(line)
    while start < end and (line[start] in numCn or line[start] in numCn_suffix_unit):
        start = start + 1
    if start < end and line[start] in numCn_suffix_date:
        start = start + 1
    return start - oldstart


def rules(line, start):
    """ Handle the special rules """
    if line[start] in numMath:
        return proc_num_math(line, start)
    elif line[start] in numCn:
        return proc_num_cn(line, start)


def genDict(path):
    """ Build the dictionary """
    f = codecs.open(path, 'r', 'utf-8')
    contents = f.read()
    f.close()
    contents = contents.replace(u'\r', u'')
    contents = contents.replace(u'\n', u'')
    # Split the file contents on spaces
    mydict = contents.split(u' ')
    # Remove duplicates, and the empty string left by consecutive spaces
    newdict = set(mydict)
    newdict.discard(u'')

    # Build the dictionary:
    # the key is the first character of a word, the value is the list of
    # words that start with that character
    truedict = {}
    for item in newdict:
        if item[0] in truedict:
            truedict[item[0]].append(item)
        else:
            truedict[item[0]] = [item]
    return truedict


def print_unicode_list(uni_list):
    for item in uni_list:
        print(item, end=' ')


def divideWords(mydict, sentence):
    """
    Segment the sentence using the dictionary.
    Uses forward matching: scan from left to right, cut off the longest
    word encountered, and repeat until the whole sentence has been split
    """
    ruleChar = []
    ruleChar.extend(numCn)
    ruleChar.extend(numMath)
    result = []
    start = 0
    senlen = len(sentence)
    while start < senlen:
        curword = sentence[start]
        maxlen = 1
        # First check whether a special rule matches
        if curword in ruleChar:
            maxlen = rules(sentence, start)
        # Find the longest word that starts with the current character
        if curword in mydict:
            words = mydict[curword]
            for item in words:
                itemlen = len(item)
                if sentence[start:start+itemlen] == item and itemlen > maxlen:
                    maxlen = itemlen
        result.append(sentence[start:start+maxlen])
        start = start + maxlen
    return result


def main():
    args = sys.argv[1:]
    if len(args) < 3:
        print('Usage: python dw.py dict_path test_path result_path')
        sys.exit(-1)
    dict_path = args[0]
    test_path = args[1]
    result_path = args[2]

    dicts = genDict(dict_path)
    fr = codecs.open(test_path, 'r', 'utf-8')
    test = fr.read()
    result = divideWords(dicts, test)
    fr.close()
    fw = codecs.open(result_path, 'w', 'utf-8')
    for item in result:
        fw.write(item + ' ')
    fw.close()


if __name__ == "__main__":
    main()
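To see what the number rule does in isolation, the Arabic-digit branch can be tried on a short string. The sketch below is a simplified Python 3 re-implementation for illustration only; the names number_span, DIGITS, DIGIT_SUFFIXES and DATE_SUFFIXES are hypothetical and do not appear in dw.py, and the Chinese-numeral branch is omitted.

```python
DIGITS = set("0123456789")
DIGIT_SUFFIXES = set(".%亿万千百十个")
DATE_SUFFIXES = set("年月日")


def number_span(line, start):
    """Return how many characters from `start` belong to one numeric token."""
    end = start
    # Consume digits and numeric suffixes such as '.', '%' or '万'.
    while end < len(line) and (line[end] in DIGITS or line[end] in DIGIT_SUFFIXES):
        end += 1
    # Optionally attach one trailing date character, e.g. the 年 in 1998年.
    if end < len(line) and line[end] in DATE_SUFFIXES:
        end += 1
    return end - start


text = "1998年增长12.5%"
print(text[:number_span(text, 0)])       # prints 1998年
print(text[7:7 + number_span(text, 7)])  # prints 12.5%
```

The second call ends exactly at the end of the string, which is the case the bounds checks are there for.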
4 Testing and scoring results
Run dw.py with the training data and the test data to generate the result file.
Run score with the training data, the correct word segmentation results, and the results we generated.
Use tail to check the overall score in the last few lines of the score file. In addition, score.utf8 contains a large number of sentence-by-sentence comparisons, which can be used to find out where your own word segmentation results fall short.
Note: The entire testing process was completed on Ubuntu.
$ python dw.py pku_training.utf8 pku_test.utf8 pku_result.utf8
$ perl score pku_training.utf8 pku_test_gold.utf8 pku_result.utf8 > score.utf8
$ tail -22 score.utf8
INSERTIONS: 0
DELETIONS: 0
SUBSTITUTIONS: 0
NCHANGE: 0
NTRUTH: 27
NTEST: 27
TRUE WORDS RECALL: 1.000
TEST WORDS PRECISION: 1.000
=== SUMMARY:
=== TOTAL INSERTIONS: 4623
=== TOTAL DELETIONS: 1740
=== TOTAL SUBSTITUTIONS: 6650
=== TOTAL NCHANGE: 13013
=== TOTAL TRUE WORD COUNT: 104372
=== TOTAL TEST WORD COUNT: 107255
=== TOTAL TRUE WORDS RECALL: 0.920
=== TOTAL TEST WORDS PRECISION: 0.895
=== F MEASURE: 0.907
=== OOV Rate: 0.940
=== OOV Recall Rate: 0.917
=== IV Recall Rate: 0.966
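The summary figures can be cross-checked from the raw counts. The sketch below assumes the usual definitions (correct words = true words minus deletions and substitutions, recall = correct / true word count, precision = correct / test word count, F = harmonic mean of the two). The score script's internals are not shown here, so treat this as a consistency check and not as a description of how the script computes them.

```python
# Counts copied from the summary above.
insertions = 4623
deletions = 1740
substitutions = 6650
true_words = 104372
test_words = 107255

# Words of the gold standard that the segmenter reproduced exactly.
correct = true_words - deletions - substitutions

recall = correct / true_words
precision = correct / test_words
f_measure = 2 * precision * recall / (precision + recall)

print("NCHANGE   %d" % (insertions + deletions + substitutions))
print("recall    %.3f" % recall)
print("precision %.3f" % precision)
print("F measure %.3f" % f_measure)
```

Under these assumptions the values round to 0.920, 0.895 and 0.907, matching the reported recall, precision and F measure, and the three error counts sum to the reported NCHANGE of 13013.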
The dictionary-based FMM algorithm is a very basic word segmentation algorithm. Its results are not that good, but it is simple and easy to get started with. As my study deepens, I may implement other word segmentation algorithms in Python as well. Another takeaway is that when reading, you should try to implement as much as you can. That gives you enough motivation to pay attention to every detail of the theory, and the material feels much less dry.