Three commonly used Python Chinese word segmentation tools
Apr 14, 2018, 11:05 AM
This article shares three commonly used Python tools for Chinese word segmentation. It may be a useful reference for anyone who needs them.
1. jieba word segmentation:
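# -*- coding: UTF-8 -*-
import codecs
import jieba

# jieba.cut returns a generator; turn it into a list so it can be
# printed and then iterated again when writing the file
seg_list = list(jieba.cut('邓超,1979年出生于江西南昌,中国内地男演员、电影导演、投资出品人、互联网投资人。'))
print "/".join(seg_list)

f1 = codecs.open("d2w_ltp.txt", "w")
for i in seg_list:
    f1.write(i.encode("utf-8"))  # tokens are unicode; encode before writing
    f1.write(" ")
f1.close()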
Effect:
邓超/,/1979/年出/生于/江西/南昌/,/中国/内地/男演员/、/电影/导演/、/投资/出品人/、/互联网/投资人/。
The example above covers both segmenting with jieba and writing the result to a file.
Note that the tokens jieba returns are unicode objects (the code here is Python 2), so each one has to be converted from unicode to UTF-8 before it is written to the file.
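The encoding step can also be left to the file object. The sketch below is an alternative under stated assumptions, not the article's own method: it uses io.open with an explicit encoding, which accepts Unicode text on both Python 2 and Python 3, and a hypothetical helper named write_tokens. The token list is copied from the jieba output above instead of calling jieba, so the snippet runs without any third-party package.

```python
# -*- coding: UTF-8 -*-
from __future__ import print_function, unicode_literals

import io
import os
import tempfile


def write_tokens(tokens, path):
    """Write tokens separated by single spaces as UTF-8 (hypothetical helper)."""
    # io.open encodes Unicode text on write, so no manual .encode() is needed.
    with io.open(path, "w", encoding="utf-8") as f:
        f.write(" ".join(tokens))


# First few tokens copied from the jieba output shown above.
tokens = ["邓超", ",", "1979", "年出", "生于", "江西", "南昌"]

# The file name is an arbitrary choice for this demo.
path = os.path.join(tempfile.mkdtemp(), "d2w_demo.txt")
write_tokens(tokens, path)

# Read the raw bytes back to confirm that the file really is UTF-8.
with io.open(path, "rb") as f:
    raw = f.read()
print(raw.decode("utf-8"))
```

Opening the file with an encoding keeps the conversion in one place, so the loop that writes tokens does not need to call encode on every item.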
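2. Second tool, whose seg function returns (word, part-of-speech) pairs:
# -*- coding: UTF-8 -*-
import codecs

# seg() comes from the second tool's own module; its import line is not shown in the original
sentence = '邓超,1979年出生于江西南昌,中国内地男演员、电影导演、投资出品人、互联网投资人。'

# Each item is a (word, part-of-speech tag) pair; i[0] is the word
pairs = list(seg(sentence))
list_senten = []
for i in pairs:
    list_senten.append(i[0])
print "/".join(list_senten)

f1 = codecs.open("d2w_ltp.txt", "w")
for i in pairs:
    f1.write(i[0])
    f1.write(" ")
f1.close()
Effect: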
邓超/,/1979年/出生/于/江西/南昌/,/中国/内地/男/演员/、/电影/导演/、/投资/出品/人/、/互联网/投资人/。
邓超 nr , wd 1979年 t 出生 vi 于 p 江西 ns 南昌 ns , wd 中国 ns 内地 s 男 b 演员 n 、 wn 电影 n 导演 n 、 wn 投资 n 出品 vi 人 n 、 wn 互联网 n 投资人 n 。 wj
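The tagged line above pairs every word with a part-of-speech tag, which is why the code takes i[0] to get the word alone. As a minimal sketch of working with such pairs, the snippet below hard-codes a few (word, tag) tuples copied from that output rather than calling the segmenter, and uses a hypothetical helper named words_with_tags to keep only the words carrying chosen tags. Here it picks ns, the tag attached to 江西 and 南昌 in the output.

```python
# -*- coding: UTF-8 -*-
from __future__ import print_function, unicode_literals

# (word, tag) pairs copied from the tagged output above; in the article they
# come from seg(sentence), which is not called here.
pairs = [
    ("邓超", "nr"), (",", "wd"), ("1979年", "t"), ("出生", "vi"),
    ("于", "p"), ("江西", "ns"), ("南昌", "ns"),
]

# Dropping the tags reproduces the plain segmentation (same idea as i[0]).
words = [word for word, _tag in pairs]


def words_with_tags(pairs, wanted):
    """Return the words whose tag is in `wanted` (hypothetical helper)."""
    return [word for word, tag in pairs if tag in wanted]


places = words_with_tags(pairs, {"ns"})
print("/".join(words))
print("/".join(places))
```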
3. pyltp word segmentation:
# -*- coding: UTF-8 -*-
import codecs
from pyltp import Segmentor

# Word segmentation
def segmentor(sentence):
    model = Segmentor()  # create the instance
    model.load('ltp_data/cws.model')  # load the model
    words = model.segment(sentence)  # segment the sentence
    words_list = list(words)
    model.release()  # release the model
    return words_list

sentence = '邓超,1979年出生于江西南昌,中国内地男演员、电影导演、投资出品人、互联网投资人。'
words = segmentor(sentence)  # segment once and reuse the result
print "/".join(words)

f1 = codecs.open("d2w_ltp.txt", "w")
for i in words:
    f1.write(i)
    f1.write(" ")
f1.close()
Effect:
邓/超/,/1979年/出生/于/江西/南昌/,/中国/内地/男/演员/、/电影/导演/、/投资/出品人/、/互联网/投资人/。
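The three results above disagree in only a few places. One way to locate them, sketched here as an illustration and not as part of the original article, is to turn each "/"-joined output into character spans and list the spans that are not shared by all three tools. The helper name spans and the dictionary keys are assumptions made for this example, and the strings are copied from the three Effect lines. Splitting on "/" only works because the sentence itself contains no slash.

```python
# -*- coding: UTF-8 -*-
from __future__ import print_function, unicode_literals

# Outputs copied from the three "Effect" lines above.
outputs = {
    "jieba": "邓超/,/1979/年出/生于/江西/南昌/,/中国/内地/男演员/、/电影/导演/、/投资/出品人/、/互联网/投资人/。",
    "tool2": "邓超/,/1979年/出生/于/江西/南昌/,/中国/内地/男/演员/、/电影/导演/、/投资/出品/人/、/互联网/投资人/。",
    "pyltp": "邓/超/,/1979年/出生/于/江西/南昌/,/中国/内地/男/演员/、/电影/导演/、/投资/出品人/、/互联网/投资人/。",
}


def spans(segmented):
    """Map a '/'-joined segmentation to a set of (start, end, word) spans."""
    result, pos = set(), 0
    for word in segmented.split("/"):
        result.add((pos, pos + len(word), word))
        pos += len(word)
    return result


all_spans = {name: spans(text) for name, text in outputs.items()}

# Spans that every tool produced identically.
common = set.intersection(*all_spans.values())

# For each tool, the words (in sentence order) that at least one other tool cut differently.
differences = {
    name: [word for _start, _end, word in sorted(found - common)]
    for name, found in all_spans.items()
}

for name in sorted(differences):
    print(name, "/".join(differences[name]))
```

Comparing spans rather than bare words keeps repeated tokens such as 、 from being confused with one another, since each occurrence has its own position.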