Implement a small text classification system using Python
Background
Text mining is the process of extracting previously unknown, understandable, and ultimately usable knowledge from large amounts of text data, and of using that knowledge to better organize information for future reference. In other words, it is the process of discovering knowledge in unstructured text.
The main areas of text mining include:
· Search and information retrieval (IR)
· Text clustering: use clustering methods to group words, fragments, paragraphs, or files
· Text classification: group and classify fragments, paragraphs, or files, using data mining classification methods trained on labeled instances
· Information extraction (IE): identify and extract relevant facts and relationships from unstructured text; the process of extracting structured data from unstructured or semi-structured text
· Natural language processing (NLP): discover the essential structure of language and the meaning it expresses, from the perspective of grammar and semantics
Text classification system (Python 3.5)
The technology and process of Chinese text classification mainly include the following steps:
1. Preprocessing: handle HTML tags, text format conversion, and sentence boundary detection
2. Chinese word segmentation: segment the text with a Chinese word segmenter and remove stop words
3. Build the word vector space: count word frequencies and generate the text's word vector space
4. Weight strategy (TF-IDF): use TF-IDF to discover feature words and extract them as features that reflect the document's topic
5. Classifier: train a classifier with an algorithm
6. Evaluate the classification results
1. Preprocessing
a. Select the range of text to be processed
b. Establish a classified text corpus
· Training set corpus: text resources that have already been classified into categories
· Test set corpus: the text to be classified; it can be part of the training set or text from an external source
c. Text format conversion: use Python's lxml library to remove HTML tags
d. Sentence boundary detection: mark the end of each sentence
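Steps c and d can be sketched with the standard library alone. The article uses lxml for tag removal; the sketch below swaps in `html.parser` so that it runs without extra packages. The names `TagStripper`, `strip_tags`, and `split_sentences`, as well as the punctuation rule, are illustrative assumptions and not part of the original project.

```python
import re
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Hypothetical helper: keeps only the text nodes of an HTML document."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for text between tags; tags themselves are ignored.
        self.parts.append(data)


def strip_tags(html):
    """Return the text content of `html` with all tags removed."""
    parser = TagStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


# Assumed, simplified rule: a sentence ends at Chinese or Western
# full stop, exclamation mark, or question mark.
SENTENCE = re.compile(r"[^。!?!?]+[。!?!?]?")


def split_sentences(text):
    """Split text into sentences, keeping the closing punctuation mark."""
    return [s.strip() for s in SENTENCE.findall(text) if s.strip()]


raw = "<html><body><p>今天天气很好。</p><p>我们去公园吧!</p></body></html>"
text = strip_tags(raw)
sentences = split_sentences(text)
print(sentences)
```

A real corpus would also need to skip the contents of script and style elements, which this sketch treats as ordinary text.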
2. Chinese word segmentation
Word segmentation is the process of recombining a continuous character sequence into a word sequence according to certain rules. Chinese word segmentation divides a sequence of Chinese characters (a sentence) into individual words. It is very complicated, and to some extent it is not purely an algorithmic problem. In the end, probability theory solved it: the algorithm is the conditional random field (CRF), which is based on probabilistic graphical models.
Word segmentation is the most basic, lowest-level module in natural language processing, and its accuracy has a great influence on the modules that follow it. Structured representation of text or sentences is the core task in language processing. Currently, structured representations of text fall into four categories: word vector space, topic model, dependency-syntax tree representation, and RDF graph representation.
The following is sample code for Chinese word segmentation:
# -*- coding: utf-8 -*-
import os
import jieba

def savefile(savepath, content):
    # Write the segmented text to disk
    with open(savepath, "w", encoding="gb2312", errors="ignore") as fp:
        fp.write(content)

def readfile(path):
    # Read a corpus file, ignoring characters that cannot be decoded
    with open(path, "r", encoding="gb2312", errors="ignore") as fp:
        return fp.read()

# corpus_path = "train_small/"  # Path of the unsegmented training corpus
# seg_path = "train_seg/"       # Path of the training corpus after word segmentation
corpus_path = "test_small/"     # Path of the unsegmented test corpus
seg_path = "test_seg/"          # Path of the test corpus after word segmentation

catelist = os.listdir(corpus_path)  # Get all subdirectories (one per category)
for mydir in catelist:
    class_path = corpus_path + mydir + "/"  # Path of the category subdirectory
    seg_dir = seg_path + mydir + "/"        # Output directory for this category
    if not os.path.exists(seg_dir):         # Create the directory if it does not exist
        os.makedirs(seg_dir)
    file_list = os.listdir(class_path)
    for file_path in file_list:
        fullname = class_path + file_path
        content = readfile(fullname).strip()           # Read the file content
        content = content.replace("\r\n", "").strip()  # Remove newlines and extra spaces
        content_seg = jieba.cut(content)               # Segment the text into words
        # Join the words with spaces so that later steps can split them again
        savefile(seg_dir + file_path, " ".join(content_seg))
print("Word segmentation finished")
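jieba hides the segmentation algorithm itself. For intuition about what dividing a character sequence into words involves, here is a toy dictionary-based segmenter using forward maximum matching. This is a much simpler technique than the CRF approach mentioned above and is not what jieba does; the function name, the dictionary, and `max_len` are assumptions made for the sketch.

```python
def forward_max_match(sentence, dictionary, max_len=4):
    """Greedy left-to-right segmentation.

    At each position, take the longest substring (up to max_len characters)
    that is in the dictionary; fall back to a single character otherwise.
    """
    words = []
    i = 0
    while i < len(sentence):
        # Try the longest candidate first, then shrink.
        for size in range(min(max_len, len(sentence) - i), 0, -1):
            candidate = sentence[i:i + size]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


# Hypothetical toy dictionary
toy_dict = {"中文", "文本", "分类", "文本分类", "系统"}
result = forward_max_match("中文文本分类系统", toy_dict)
print(result)  # ['中文', '文本分类', '系统']
```

Greedy matching fails on ambiguous sentences where the longest match is not the right word, which is one reason statistical models are used in practice.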
3. Build the word vector space
The segmented files are collected into a Bunch object, which is then pickled to disk:
import os
import pickle
from sklearn.utils import Bunch  # found in sklearn.datasets.base in older scikit-learn releases

# Bunch provides a key/value object with attribute access:
#   target_name: list of all category names
#   label:       category label of each file
#   filenames:   path of each file
#   contents:    segmented text of each file

def readfile(path):
    with open(path, "r", encoding="gb2312", errors="ignore") as fp:
        return fp.read()

bunch = Bunch(target_name=[], label=[], filenames=[], contents=[])

# wordbag_path = "train_word_bag/train_set.dat"
# seg_path = "train_seg/"
wordbag_path = "test_word_bag/test_set.dat"
seg_path = "test_seg/"

catelist = os.listdir(seg_path)
bunch.target_name.extend(catelist)  # Save the category names in the Bunch object
for mydir in catelist:
    class_path = seg_path + mydir + "/"
    file_list = os.listdir(class_path)
    for file_path in file_list:
        fullname = class_path + file_path
        bunch.label.append(mydir)                          # Category label of the current file
        bunch.filenames.append(fullname)                   # Path of the current file
        bunch.contents.append(readfile(fullname).strip())  # Segmented text of the current file

# Persist the Bunch object
with open(wordbag_path, "wb") as file_obj:
    pickle.dump(bunch, file_obj)
print("Finished building the text object")
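To show what "counting word frequency and generating a word vector space" means, the following minimal sketch builds a small term-count matrix by hand. The two documents, the one-word stop list, and the helper names `tokenize` and `count_vector` are made up for illustration; the article's own pipeline delegates this work to scikit-learn.

```python
from collections import Counter

# Hypothetical segmented documents: words separated by spaces,
# in the same shape the segmentation step writes to disk.
docs = ["文本 的 分类 系统", "文本 挖掘 的 文本 检索"]
stop_words = {"的"}  # assumed toy stop word list


def tokenize(doc):
    """Split on spaces and drop stop words."""
    return [w for w in doc.split() if w not in stop_words]


# Vocabulary: every distinct word gets a column index (sorted for a stable order).
all_words = sorted({w for d in docs for w in tokenize(d)})
vocabulary = {w: i for i, w in enumerate(all_words)}


def count_vector(doc):
    """Return the word counts of `doc` laid out in vocabulary column order."""
    vector = [0] * len(vocabulary)
    for word, n in Counter(tokenize(doc)).items():
        vector[vocabulary[word]] = n
    return vector


matrix = [count_vector(d) for d in docs]
for row in matrix:
    print(row)
```

Each row is one document and each column one vocabulary word, which is the layout the TF-IDF weighting step operates on.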
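Certain words or phrases are automatically filtered out before text classification. These words or phrases are called stop words. The stop word table is a plain text file with one entry per line; the code in the next step reads it from train_word_bag/hlt_stop_words.txt.
4. Weight strategy: TF-IDF method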
If a word or phrase appears frequently in one article and rarely in other articles, it is considered to have good category-distinguishing ability and is suitable for classification.
Before giving the code for this part, let's first look at the concepts of term frequency and inverse document frequency.
Term frequency (TF) is the frequency with which a given word occurs in a file. The raw count is normalized by the number of words in the file so that the measure is not biased toward long files. For a word t_i in a specific file d_j, its importance can be expressed as:
$$\mathrm{TF}_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}}$$
The numerator n_{i,j} is the number of occurrences of the word in the file. The denominator is the sum of the occurrences of all words in the file.
Inverse document frequency (IDF) is a measure of the general importance of a word. The IDF of a specific word is obtained by dividing the total number of files by the number of files containing the word, and then taking the logarithm of the quotient:
$$\mathrm{IDF}_i = \log \frac{|D|}{|\{j : t_i \in d_j\}|}$$
|D| is the total number of files in the corpus, and |{j : t_i ∈ d_j}| is the number of files containing the word. If the word does not occur in the corpus, the denominator would be zero, so 1 is generally added to the denominator:
$$\mathrm{IDF}_i = \log \frac{|D|}{1 + |\{j : t_i \in d_j\}|}$$
TF-IDF is the product of term frequency and inverse document frequency. A high term frequency within a specific file, combined with a low document frequency across the whole collection, yields a high TF-IDF weight, so TF-IDF tends to filter out common words and keep important ones. The code is as follows:
import os
import pickle  # persistence
from sklearn.utils import Bunch  # found in sklearn.datasets.base in older scikit-learn releases
from sklearn.feature_extraction.text import TfidfTransformer  # TF-IDF transformer class
from sklearn.feature_extraction.text import TfidfVectorizer   # TF-IDF vectorizer class

def readbunchobj(path):
    with open(path, "rb") as file_obj:
        return pickle.load(file_obj)

def writebunchobj(path, bunchobj):
    with open(path, "wb") as file_obj:
        pickle.dump(bunchobj, file_obj)

def readfile(path):
    with open(path, "r", encoding="gb2312", errors="ignore") as fp:
        return fp.read()

path = "train_word_bag/train_set.dat"
bunch = readbunchobj(path)

# Stop words
stopword_path = "train_word_bag/hlt_stop_words.txt"
stpwrdlst = readfile(stopword_path).splitlines()

# Build the TF-IDF vector space object
tfidfspace = Bunch(target_name=bunch.target_name, label=bunch.label,
                   filenames=bunch.filenames, tdm=[], vocabulary={})

# Initialize the vector space model with TfidfVectorizer
vectorizer = TfidfVectorizer(stop_words=stpwrdlst, sublinear_tf=True, max_df=0.5)
transformer = TfidfTransformer()  # Computes TF-IDF weights from a count matrix; not used further here

# Convert the texts into a TF-IDF matrix and keep the vocabulary separately
tfidfspace.tdm = vectorizer.fit_transform(bunch.contents)
tfidfspace.vocabulary = vectorizer.vocabulary_

# Persist the bag of words
space_path = "train_word_bag/tfidfspace.dat"
writebunchobj(space_path, tfidfspace)
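To make the TF and IDF definitions concrete, the following sketch computes them by hand on three tiny, made-up documents, using the variant with 1 added to the IDF denominator. The function names are illustrative. scikit-learn's TfidfVectorizer uses a differently smoothed IDF and normalizes each row, so its numbers will not match these.

```python
import math
from collections import Counter

# Hypothetical corpus: each document is already a list of words.
docs = [
    ["文本", "分类", "系统"],
    ["文本", "挖掘"],
    ["分类", "算法", "分类"],
]


def tf(word, doc):
    """Occurrences of `word` in `doc`, divided by the total word count of `doc`."""
    return Counter(doc)[word] / len(doc)


def idf(word, docs):
    """log(|D| / (1 + number of documents containing `word`))."""
    containing = sum(1 for d in docs if word in d)
    return math.log(len(docs) / (1 + containing))


def tf_idf(word, doc, docs):
    return tf(word, doc) * idf(word, docs)


# "分类" is frequent in the third document but also appears in another one;
# "算法" appears only in the third document, so it gets the higher weight.
for word in ("分类", "算法"):
    print(word, round(tf_idf(word, docs[2], docs), 4))
```

With only three documents, a word found in two of them already gets an IDF of log(3/3) = 0, which shows how the weighting suppresses words that are common across the collection.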
5. Use the Naive Bayes classification module
Commonly used text classification methods include the k-nearest neighbors (kNN) method, the Naive Bayes algorithm, and the support vector machine algorithm. Generally speaking:
The kNN algorithm is the simplest in principle and has acceptable classification accuracy, but it is the slowest at prediction time, because each query has to be compared against the stored training samples.
The Naive Bayes algorithm works best on short text classification, with high accuracy.
The advantage of the support vector machine algorithm is that it handles linearly inseparable cases; its accuracy is in the middle.
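All of the code above operates on the training set. The test set (drawn from the training set) goes through the same steps: first word segmentation, then generation of the word vector file, and finally the word vector model. The difference is that when the test set's vector model is built, the training set's bag of words must be loaded so that the word vectors produced from the test set are mapped onto the training set's vocabulary to generate the vector space model. The code is as follows: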
import os
import pickle  # persistence
from sklearn.utils import Bunch  # found in sklearn.datasets.base in older scikit-learn releases
from sklearn.feature_extraction.text import TfidfTransformer  # TF-IDF transformer class
from sklearn.feature_extraction.text import TfidfVectorizer   # TF-IDF vectorizer class

def readbunchobj(path):
    with open(path, "rb") as file_obj:
        return pickle.load(file_obj)

def writebunchobj(path, bunchobj):
    with open(path, "wb") as file_obj:
        pickle.dump(bunchobj, file_obj)

def readfile(path):
    with open(path, "r", encoding="gb2312", errors="ignore") as fp:
        return fp.read()

# Load the Bunch object holding the segmented test texts
path = "test_word_bag/test_set.dat"
bunch = readbunchobj(path)

# Stop words
stopword_path = "train_word_bag/hlt_stop_words.txt"
stpwrdlst = readfile(stopword_path).splitlines()

# Build the TF-IDF vector space of the test set
testspace = Bunch(target_name=bunch.target_name, label=bunch.label,
                  filenames=bunch.filenames, tdm=[], vocabulary={})

# Load the bag of words of the training set
trainbunch = readbunchobj("train_word_bag/tfidfspace.dat")

# Initialize the vector space with TfidfVectorizer, fixing the columns to the
# training vocabulary. Note that fit_transform still recomputes the IDF weights
# from the test texts; only the vocabulary is shared with the training set.
vectorizer = TfidfVectorizer(stop_words=stpwrdlst, sublinear_tf=True, max_df=0.5,
                             vocabulary=trainbunch.vocabulary)
transformer = TfidfTransformer()  # Not used further here
testspace.tdm = vectorizer.fit_transform(bunch.contents)
testspace.vocabulary = trainbunch.vocabulary

# Persist the bag of words
space_path = "test_word_bag/testspace.dat"
writebunchobj(space_path, testspace)
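The key idea in this step is that test documents are counted against the columns of the training vocabulary, so words never seen in training have no column and are dropped. A minimal sketch of that mapping, with hypothetical documents and helper names (`build_vocabulary`, `project`):

```python
def build_vocabulary(docs):
    """Assign a column index to every distinct word of the training documents."""
    words = sorted({w for doc in docs for w in doc.split()})
    return {w: i for i, w in enumerate(words)}


def project(doc, vocabulary):
    """Count the words of `doc` in the columns of a fixed vocabulary.

    Returns the count vector and the list of words that had no column.
    """
    vector = [0] * len(vocabulary)
    unknown = []
    for w in doc.split():
        if w in vocabulary:
            vector[vocabulary[w]] += 1
        else:
            unknown.append(w)
    return vector, unknown


train_docs = ["文本 分类 系统", "文本 挖掘"]
vocabulary = build_vocabulary(train_docs)

# "聚类" never occurs in the training documents, so it is dropped.
vector, unknown = project("文本 聚类 文本", vocabulary)
print(vector, unknown)
```

Because the test vector has exactly as many columns as the training matrix, a classifier trained on the training matrix can consume it directly.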
Next, run the multinomial Naive Bayes algorithm to classify the test texts and report the error rate. The code is as follows:
import pickle
from sklearn.naive_bayes import MultinomialNB  # multinomial Naive Bayes

def readbunchobj(path):
    with open(path, "rb") as file_obj:
        return pickle.load(file_obj)

# Load the training set vector space
trainpath = "train_word_bag/tfidfspace.dat"
train_set = readbunchobj(trainpath)

# Load the test set vector space
testpath = "test_word_bag/testspace.dat"
test_set = readbunchobj(testpath)

# Apply the Naive Bayes algorithm.
# alpha is the additive (Laplace/Lidstone) smoothing parameter;
# 0.001 applies only very light smoothing.
clf = MultinomialNB(alpha=0.001).fit(train_set.tdm, train_set.label)

# Predict the categories
predicted = clf.predict(test_set.tdm)
total = len(predicted)
rate = 0
for flabel, file_name, expct_cate in zip(test_set.label, test_set.filenames, predicted):
    if flabel != expct_cate:
        rate += 1
        print(file_name, ": actual category:", flabel, "--> predicted category:", expct_cate)

# Error rate
print("error_rate:", float(rate) * 100 / float(total), "%")
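To show what the classifier and its `alpha` parameter do, here is a small from-scratch sketch of multinomial Naive Bayes with additive smoothing on raw word counts. It is an illustration under simplifying assumptions (toy data, hypothetical function names `train_nb` and `predict_nb`), not scikit-learn's implementation.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods per class."""
    vocab = {w for d in docs for w in d}
    class_docs = Counter(labels)
    word_counts = defaultdict(Counter)
    for d, y in zip(docs, labels):
        word_counts[y].update(d)
    model = {}
    for y in class_docs:
        total = sum(word_counts[y].values())
        denom = total + alpha * len(vocab)
        model[y] = {
            "log_prior": math.log(class_docs[y] / len(docs)),
            # alpha keeps unseen (class, word) pairs from getting probability zero
            "log_like": {w: math.log((word_counts[y][w] + alpha) / denom) for w in vocab},
        }
    return model


def predict_nb(model, doc):
    """Return the class with the highest posterior log score."""
    scores = {}
    for y, params in model.items():
        # Words outside the training vocabulary are skipped.
        scores[y] = params["log_prior"] + sum(
            params["log_like"][w] for w in doc if w in params["log_like"]
        )
    return max(scores, key=scores.get)


# Hypothetical toy training data
train_docs = [["股票", "上涨"], ["股票", "市场"], ["球队", "比赛"], ["比赛", "进球"]]
train_labels = ["finance", "finance", "sports", "sports"]
model = train_nb(train_docs, train_labels, alpha=0.001)
print(predict_nb(model, ["股票", "比赛", "市场"]))  # finance
```

A small alpha such as 0.001 penalizes words unseen in a class very heavily, while alpha = 1 (Laplace smoothing) is more forgiving; there is no iteration involved in either case.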
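6. Evaluate classification results
Algorithm evaluation in machine learning has three basic metrics:
· Recall: the ratio of the number of relevant documents retrieved to the number of all relevant documents in the document collection. It measures how completely the retrieval system finds the relevant documents.
Recall = relevant documents retrieved by the system / all relevant documents in the system
· Precision: the ratio of the number of relevant documents retrieved to the total number of documents retrieved. It measures how exact the retrieval system's results are.
Precision = relevant documents retrieved by the system / all documents retrieved by the system
Precision and recall affect each other. Ideally both are high, but in general when precision is high recall is low, and when recall is high precision is low.
· F-Score (F_β): computed from precision P and recall R as:
$$F_\beta = \frac{(1 + \beta^2)\, P R}{\beta^2 P + R}$$
When β = 1 this is the most common F1-Measure.
The relationship among the three is:
$$F_1 = \frac{2 P R}{P + R}$$
The evaluation code is as follows: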
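from sklearn import metrics

# Evaluation
def metrics_result(actual, predict):
    # average="weighted" is required because the labels are multiclass;
    # the default (binary) averaging raises an error for more than two classes.
    print("precision: {0:.3f}".format(metrics.precision_score(actual, predict, average="weighted")))
    print("recall: {0:.3f}".format(metrics.recall_score(actual, predict, average="weighted")))
    print("f1-score: {0:.3f}".format(metrics.f1_score(actual, predict, average="weighted")))

metrics_result(test_set.label, predicted)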
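The same three metrics can be computed by hand, which makes the definitions above explicit. The sketch below treats each category in turn as the "relevant" class and applies the F_β formula; the label lists and the function name `per_class_scores` are made up for illustration.

```python
def per_class_scores(actual, predicted, beta=1.0):
    """Return {category: (precision, recall, f_beta)} for every category."""
    scores = {}
    pairs = list(zip(actual, predicted))
    for c in sorted(set(actual) | set(predicted)):
        tp = sum(1 for a, p in pairs if a == c and p == c)  # correctly assigned to c
        fp = sum(1 for a, p in pairs if a != c and p == c)  # wrongly assigned to c
        fn = sum(1 for a, p in pairs if a == c and p != c)  # members of c that were missed
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = beta ** 2 * precision + recall
        f_beta = (1 + beta ** 2) * precision * recall / denom if denom else 0.0
        scores[c] = (precision, recall, f_beta)
    return scores


# Hypothetical true and predicted categories for five documents
actual = ["sports", "sports", "finance", "finance", "finance"]
predicted = ["sports", "finance", "finance", "finance", "sports"]

for category, (p, r, f) in per_class_scores(actual, predicted).items():
    print("{0}: precision={1:.3f} recall={2:.3f} f1={3:.3f}".format(category, p, r, f))
```

Averaging these per-class values, weighted by how many documents each category actually has, gives the kind of single summary number that a weighted average reports.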