Implement a small text classification system using Python

高洛峰
Release: 2017-03-27 15:02:45

Background

Text mining is the process of extracting previously unknown, understandable, and ultimately usable knowledge from large amounts of text data, while also using that knowledge to better organize information for future reference. In short, it is the process of discovering knowledge in unstructured text.

The main areas of text mining currently include:

  • Search and information retrieval (IR)

  • Text clustering: grouping words, fragments, paragraphs, or files with clustering methods

  • Text classification: assigning fragments, paragraphs, or files to predefined categories, using data-mining classification methods trained on labeled instances

  • Information extraction (IE): identifying and extracting relevant facts and relationships from unstructured text; the process of obtaining structured data from unstructured or semi-structured text

  • Natural language processing (NLP): discovering the essential structure of language and the meaning it expresses, from the perspective of grammar and semantics

Text classification system (Python 3.5)

The technology and workflow of Chinese text classification mainly involve the following steps:
1. Preprocessing: remove noise from the text, e.g. strip HTML tags, convert the text format, and detect sentence boundaries

2. Chinese word segmentation: segment the text with a Chinese word segmenter and remove stop words

3. Build the word vector space: count word frequencies in the text and generate the text's word vector space

4. Weighting strategy (TF-IDF): use TF-IDF to discover feature words and extract them as features that reflect the document topic

5. Classifier: train a classifier with a learning algorithm

6. Evaluate the classification results

1. Preprocessing

a. Select the range of text to be processed

b. Build a categorized text corpus:

  • Training set corpus: text resources that have already been assigned to categories

  • Test set corpus: the texts to be classified; they can be a subset of the training set or come from external sources

c. Text format conversion: use Python's lxml library to remove HTML tags (see the sketch after this list)

d. Detect sentence boundaries: mark the end of each sentence
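
Below is a minimal sketch of steps c and d, not taken from the original article: it assumes a hypothetical raw HTML file under raw_small/, uses lxml to strip the tags, and uses a simple regular expression to mark sentence endings.

# -*- coding: utf-8 -*-
import re
import lxml.html

def strip_html(html_text):
    # parse the HTML and keep only the visible text content
    doc = lxml.html.fromstring(html_text)
    return doc.text_content()

def mark_sentences(text):
    # insert a newline after common Chinese sentence-ending punctuation
    return re.sub(r"([。！？])", r"\1\n", text)

# hypothetical input path; adjust to your own corpus layout
raw = open("raw_small/sample.html", "r", encoding="gb2312", errors="ignore").read()
text = strip_html(raw)
print(mark_sentences(text))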

2. Chinese word segmentation

Word segmentation is the process of recombining a continuous character sequence into a sequence of words according to certain rules. Chinese word segmentation divides a sequence of Chinese characters (a sentence) into independent words. It is a hard problem and, to some extent, not a purely algorithmic one; probabilistic approaches ultimately proved effective, notably the conditional random field (CRF), which is based on probabilistic graphical models.

Word segmentation is the most basic, lowest-level module in natural language processing, and its accuracy strongly affects all subsequent modules. The structured representation of text or sentences is a core task in language processing; currently it falls into four categories: word vector space, topic models, tree representations of dependency syntax, and graph representations such as RDF.
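
Before the full corpus-processing script, here is a minimal sketch (assuming the jieba package is installed) of what jieba segmentation produces for a single sentence; the printed result should be roughly "我/ 爱/ 北京/ 天安门".

import jieba

# segment one sentence in jieba's default (accurate) mode and print the word list
words = jieba.cut("我爱北京天安门")
print("/ ".join(words))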

The following is the full sample code for segmenting the corpus:

# -*- coding: utf-8 -*-
import os
import jieba

def savefile(savepath, content):
    fp = open(savepath, "w", encoding='gb2312', errors='ignore')
    fp.write(content)
    fp.close()

def readfile(path):
    fp = open(path, "r", encoding='gb2312', errors='ignore')
    content = fp.read()
    fp.close()
    return content

# corpus_path = "train_small/"  # path of the unsegmented training corpus
# seg_path = "train_seg/"       # path of the segmented training corpus
corpus_path = "test_small/"  # path of the unsegmented corpus
seg_path = "test_seg/"       # path of the segmented corpus

catelist = os.listdir(corpus_path)  # all category subdirectories under the corpus directory
for mydir in catelist:
    class_path = corpus_path + mydir + "/"  # full path of the category subdirectory
    seg_dir = seg_path + mydir + "/"        # output directory for the segmented category
    if not os.path.exists(seg_dir):         # create the output directory if it does not exist
        os.makedirs(seg_dir)
    file_list = os.listdir(class_path)
    for file_path in file_list:
        fullname = class_path + file_path
        content = readfile(fullname).strip()           # read the file content
        content = content.replace("\r\n", "").strip()  # remove newlines and extra whitespace
        content_seg = jieba.cut(content)               # segment the text with jieba
        savefile(seg_dir + file_path, " ".join(content_seg))  # join the words with spaces
print("Word segmentation finished")

To make it easier to generate the word vector space model later, the segmented texts must be converted into text vector information and stored as objects, using the Bunch data structure from the scikit-learn library. The code is as follows:

import os
import pickle
from sklearn.datasets.base import Bunch

# Bunch provides a key/value object container:
# target_name - list of all category names
# label       - category label of each file
# filenames   - file paths
# contents    - segmented text of each file

def readfile(path):
    fp = open(path, "r", encoding='gb2312', errors='ignore')
    content = fp.read()
    fp.close()
    return content

bunch = Bunch(target_name=[], label=[], filenames=[], contents=[])

# wordbag_path = "train_word_bag/train_set.dat"
# seg_path = "train_seg/"
wordbag_path = "test_word_bag/test_set.dat"
seg_path = "test_seg/"

catelist = os.listdir(seg_path)
bunch.target_name.extend(catelist)  # save the category names into the Bunch object
for mydir in catelist:
    class_path = seg_path + mydir + "/"
    file_list = os.listdir(class_path)
    for file_path in file_list:
        fullname = class_path + file_path
        bunch.label.append(mydir)                           # category label of the current file
        bunch.filenames.append(fullname)                    # path of the current file
        bunch.contents.append(readfile(fullname).strip())   # segmented text of the current file

# persist the Bunch object
file_obj = open(wordbag_path, "wb")
pickle.dump(bunch, file_obj)
file_obj.close()
print("Finished building the text Bunch object")
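As a quick sanity check (a small sketch, not in the original article), the persisted Bunch can be loaded back and inspected:

import pickle

# load the persisted Bunch and print a few basic statistics
with open("test_word_bag/test_set.dat", "rb") as file_obj:
    bunch = pickle.load(file_obj)
print("categories:", bunch.target_name)
print("number of documents:", len(bunch.contents))
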
3. Vector space model

Because text stored in a vector space has a very high dimensionality, certain words are automatically filtered out before classification in order to save storage space and improve search efficiency. These words or phrases are called stop words. A ready-made stop-word list can be downloaded; this article uses train_word_bag/hlt_stop_words.txt, and a minimal sketch of stop-word filtering is given below.
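
Here is a minimal sketch of stop-word filtering applied to a segmented sentence, assuming the stop-word file train_word_bag/hlt_stop_words.txt that is used later in this article (one word per line):

# -*- coding: utf-8 -*-
import jieba

# load the stop-word list, one word per line
with open("train_word_bag/hlt_stop_words.txt", "r", encoding='gb2312', errors='ignore') as fp:
    stopwords = set(fp.read().splitlines())

def segment_and_filter(text):
    # segment the text and drop stop words and empty tokens
    return [w for w in jieba.cut(text) if w.strip() and w not in stopwords]

print(segment_and_filter("这是一个关于文本分类的简单例子"))

In the code below, the same list is passed to the TF-IDF vectorizer through its stop_words parameter, so this manual filtering is only an illustration.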

4. Weighting strategy: the TF-IDF method

If a word or phrase appears frequently in one article and rarely in other articles, it is considered to have good discriminating power between categories and to be suitable for classification.

Before giving the code for this part, let's first look at the concepts of term frequency and inverse document frequency.

Term frequency (TF) is the frequency with which a given word appears in a document. It is a normalized word count, which prevents the measure from being biased toward long documents. For a term t_i in a document d_j, its importance can be expressed as

    tf_ij = n_ij / Σ_k n_kj

where the numerator n_ij is the number of times the term appears in the document and the denominator is the total number of occurrences of all terms in that document.

Inverse document frequency (IDF) is a measure of the general importance of a term. The IDF of a term is obtained by dividing the total number of documents by the number of documents that contain the term, and then taking the logarithm of the quotient:

    idf_i = log( |D| / (1 + |{ j : t_i ∈ d_j }|) )

where |D| is the total number of documents in the corpus and |{ j : t_i ∈ d_j }| is the number of documents containing the term; if the term does not appear in the corpus this count would be zero, so 1 is usually added to the denominator.

The TF-IDF weight is the product of term frequency and inverse document frequency: a high term frequency within a particular document combined with a low document frequency across the whole collection yields a high TF-IDF weight, so TF-IDF tends to filter out common words and keep important ones. The code is as follows:

import os
import pickle  # for persistence
from sklearn.datasets.base import Bunch
from sklearn.feature_extraction.text import TfidfVectorizer  # builds TF-IDF vectors

def readbunchobj(path):
    file_obj = open(path, "rb")
    bunch = pickle.load(file_obj)
    file_obj.close()
    return bunch

def writebunchobj(path, bunchobj):
    file_obj = open(path, "wb")
    pickle.dump(bunchobj, file_obj)
    file_obj.close()

def readfile(path):
    fp = open(path, "r", encoding='gb2312', errors='ignore')
    content = fp.read()
    fp.close()
    return content

path = "train_word_bag/train_set.dat"
bunch = readbunchobj(path)

# stop words
stopword_path = "train_word_bag/hlt_stop_words.txt"
stpwrdlst = readfile(stopword_path).splitlines()

# build the TF-IDF word vector space object
tfidfspace = Bunch(target_name=bunch.target_name, label=bunch.label, filenames=bunch.filenames, tdm=[], vocabulary={})

# initialize the vector space model with TfidfVectorizer
vectorizer = TfidfVectorizer(stop_words=stpwrdlst, sublinear_tf=True, max_df=0.5)

# convert the texts to a TF-IDF term-document matrix and keep the vocabulary separately
tfidfspace.tdm = vectorizer.fit_transform(bunch.contents)
tfidfspace.vocabulary = vectorizer.vocabulary_

# persist the word bag
space_path = "train_word_bag/tfidfspace.dat"
writebunchobj(space_path, tfidfspace)
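A short follow-up sketch (reusing readbunchobj from above; not part of the original code) to check the size of the resulting term-document matrix and the vocabulary:

# load the persisted TF-IDF space and print its dimensions
tfidfspace = readbunchobj("train_word_bag/tfidfspace.dat")
print("documents x terms:", tfidfspace.tdm.shape)
print("vocabulary size:", len(tfidfspace.vocabulary))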

5. Using the Naive Bayes classification module

Commonly used text classification methods include the kNN nearest-neighbour method, the Naive Bayes algorithm, and the support vector machine algorithm. Generally speaking:

The kNN algorithm is the simplest in principle and its classification accuracy is acceptable, but it is the slowest.

The Naive Bayes algorithm works best for short-text classification and has high accuracy.

The advantage of the support vector machine algorithm is that it can handle data that is not linearly separable; its accuracy is moderate.

A short sketch showing how each of these classifiers could be plugged in with scikit-learn is given below.
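
The following is a minimal sketch, not part of the original article, assuming the train_set and test_set TF-IDF Bunch objects built in this article; it shows how the three classifiers could be swapped in:

from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# candidate classifiers for the TF-IDF term-document matrices
classifiers = {
    "kNN": KNeighborsClassifier(n_neighbors=5),
    "Naive Bayes": MultinomialNB(alpha=0.001),
    "Linear SVM": LinearSVC(),
}

for name, clf in classifiers.items():
    clf.fit(train_set.tdm, train_set.label)    # train on the training-set matrix
    predicted = clf.predict(test_set.tdm)      # predict labels for the test set
    correct = sum(1 for y, p in zip(test_set.label, predicted) if y == p)
    print(name, "accuracy:", correct / len(predicted))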

The code above all operates on the training set. Next comes the test set (drawn from the training set). The steps are the same as for the training set: first word segmentation, then generating the word-vector file, up to generating the word-vector model. The difference is that when building the test set's vector model, the training set's word bag must be loaded, and the word vectors produced for the test set are mapped into the dictionary of the training set's word bag to generate the vector space model. The code is as follows:

import os
import pickle  # for persistence
from sklearn.datasets.base import Bunch
from sklearn.feature_extraction.text import TfidfVectorizer  # builds TF-IDF vectors

def readbunchobj(path):
    file_obj = open(path, "rb")
    bunch = pickle.load(file_obj)
    file_obj.close()
    return bunch

def writebunchobj(path, bunchobj):
    file_obj = open(path, "wb")
    pickle.dump(bunchobj, file_obj)
    file_obj.close()

def readfile(path):
    fp = open(path, "r", encoding='gb2312', errors='ignore')
    content = fp.read()
    fp.close()
    return content

# load the segmented test-set Bunch object
path = "test_word_bag/test_set.dat"
bunch = readbunchobj(path)

# stop words
stopword_path = "train_word_bag/hlt_stop_words.txt"
stpwrdlst = readfile(stopword_path).splitlines()

# build the test-set TF-IDF vector space
testspace = Bunch(target_name=bunch.target_name, label=bunch.label, filenames=bunch.filenames, tdm=[], vocabulary={})

# load the training-set word bag
trainbunch = readbunchobj("train_word_bag/tfidfspace.dat")

# initialize the vector space with the training-set vocabulary so the test vectors share the same dictionary
vectorizer = TfidfVectorizer(stop_words=stpwrdlst, sublinear_tf=True, max_df=0.5, vocabulary=trainbunch.vocabulary)
testspace.tdm = vectorizer.fit_transform(bunch.contents)
testspace.vocabulary = trainbunch.vocabulary

# persist the test-set word bag
space_path = "test_word_bag/testspace.dat"
writebunchobj(space_path, testspace)

Next, run the multinomial Naive Bayes algorithm to classify the test texts and report the error rate. The code is as follows:

import pickle
from sklearn.naive_bayes import MultinomialNB  # multinomial Naive Bayes

def readbunchobj(path):
    file_obj = open(path, "rb")
    bunch = pickle.load(file_obj)
    file_obj.close()
    return bunch

# load the training-set vector space
trainpath = "train_word_bag/tfidfspace.dat"
train_set = readbunchobj(trainpath)

# load the test-set vector space
testpath = "test_word_bag/testspace.dat"
test_set = readbunchobj(testpath)

# apply the Naive Bayes algorithm
# alpha=0.001: additive (Laplace/Lidstone) smoothing parameter; smaller values apply less smoothing
clf = MultinomialNB(alpha=0.001).fit(train_set.tdm, train_set.label)

# predict the categories of the test documents
predicted = clf.predict(test_set.tdm)
total = len(predicted)
rate = 0
for flabel, file_name, expct_cate in zip(test_set.label, test_set.filenames, predicted):
    if flabel != expct_cate:
        rate += 1
        print(file_name, ": actual category:", flabel, "--> predicted category:", expct_cate)

# error rate
print("error_rate:", float(rate) * 100 / float(total), "%")

6. Evaluating the classification results

Algorithm evaluation in machine learning rests on three basic metrics:

  • Recall: the ratio of relevant documents retrieved to all relevant documents in the collection; it measures how complete the retrieval is.

    Recall = relevant documents retrieved / all relevant documents in the collection

  • Precision: the ratio of relevant documents retrieved to the total number of documents retrieved; it measures how exact the retrieval is.

    Precision = relevant documents retrieved / total number of documents retrieved

Precision and recall influence each other. Ideally both are high, but in general high precision comes with low recall, and high recall comes with low precision.

  • F-Score (F_β), which combines the two:

    F_β = (1 + β²) · P · R / (β² · P + R)

    When β = 1 this is the common F1-Measure.

The evaluation code is as follows:

from sklearn import metrics

# evaluation: precision, recall and F1 (weighted across classes for the multi-class case)
def metrics_result(actual, predict):
    print("precision: {0:.3f}".format(metrics.precision_score(actual, predict, average='weighted')))
    print("recall: {0:0.3f}".format(metrics.recall_score(actual, predict, average='weighted')))
    print("f1-score: {0:.3f}".format(metrics.f1_score(actual, predict, average='weighted')))

metrics_result(test_set.label, predicted)
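
As an optional cross-check (a sketch assuming the test_set and predicted variables from the step above), scikit-learn's classification_report prints per-class precision, recall, and F1:

from sklearn import metrics

# per-class precision / recall / F1 plus support counts
print(metrics.classification_report(test_set.label, predicted))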

