Implement a small text classification system using Python

高洛峰
Release: 2017-03-27 15:02:45

Background

Text mining is the process of extracting previously unknown, understandable, and ultimately usable knowledge from large amounts of text data, while also using that knowledge to better organize information for future reference. In short, it is the process of discovering knowledge in unstructured text.

The main areas of text mining currently include:

  • Search and information retrieval (IR)

  • Text clustering: grouping words, fragments, paragraphs, or files with clustering methods

  • Text classification: assigning fragments, paragraphs, or files to predefined categories, using data-mining classification methods trained on labeled instances

  • Information extraction (IE): identifying and extracting relevant facts and relationships from unstructured text; the process of obtaining structured data from unstructured or semi-structured text

  • Natural language processing (NLP): discovering the essential structure of language and the meaning it expresses, from the perspective of grammar and semantics

Text classification system (Python 3.5)

The technology and workflow of Chinese text classification mainly involve the following steps:
1. Preprocessing: remove noise from the text, e.g. strip HTML tags, convert the text format, and detect sentence boundaries

2. Chinese word segmentation: segment the text with a Chinese word segmenter and remove stop words

3. Build the word vector space: count word frequencies in the text and generate the text's word vector space

4. Weighting strategy (TF-IDF): use TF-IDF to discover feature words and extract them as features that reflect the document topic

5. Classifier: train a classifier with a learning algorithm

6. Evaluate the classification results

1. Preprocessing

a. Select the range of text to be processed

b. Build a categorized text corpus:

  • Training set corpus: text resources that have already been assigned to categories

  • Test set corpus: the texts to be classified; they can be a subset of the training set or come from external sources

c. Text format conversion: use Python's lxml library to remove HTML tags (see the sketch after this list)

d. Detect sentence boundaries: mark the end of each sentence
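
Below is a minimal sketch of steps c and d, not taken from the original article: it assumes a hypothetical raw HTML file under raw_small/, uses lxml to strip the tags, and uses a simple regular expression to mark sentence endings.

# -*- coding: utf-8 -*-
import re
import lxml.html

def strip_html(html_text):
    # parse the HTML and keep only the visible text content
    doc = lxml.html.fromstring(html_text)
    return doc.text_content()

def mark_sentences(text):
    # insert a newline after common Chinese sentence-ending punctuation
    return re.sub(r"([。！？])", r"\1\n", text)

# hypothetical input path; adjust to your own corpus layout
raw = open("raw_small/sample.html", "r", encoding="gb2312", errors="ignore").read()
text = strip_html(raw)
print(mark_sentences(text))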

2. Chinese word segmentation

Word segmentation is the process of recombining a continuous character sequence into a sequence of words according to certain rules. Chinese word segmentation divides a sequence of Chinese characters (a sentence) into independent words. It is a hard problem and, to some extent, not a purely algorithmic one; probabilistic approaches ultimately proved effective, notably the conditional random field (CRF), which is based on probabilistic graphical models.

Word segmentation is the most basic, lowest-level module in natural language processing, and its accuracy strongly affects all subsequent modules. The structured representation of text or sentences is a core task in language processing; currently it falls into four categories: word vector space, topic models, tree representations of dependency syntax, and graph representations such as RDF.
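
Before the full corpus-processing script, here is a minimal sketch (assuming the jieba package is installed) of what jieba segmentation produces for a single sentence; the printed result should be roughly "我/ 爱/ 北京/ 天安门".

import jieba

# segment one sentence in jieba's default (accurate) mode and print the word list
words = jieba.cut("我爱北京天安门")
print("/ ".join(words))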

The following is the full sample code for segmenting the corpus:

# -*- coding: utf-8 -*-
import os
import jieba

def savefile(savepath, content):
    fp = open(savepath, "w", encoding='gb2312', errors='ignore')
    fp.write(content)
    fp.close()

def readfile(path):
    fp = open(path, "r", encoding='gb2312', errors='ignore')
    content = fp.read()
    fp.close()
    return content

# corpus_path = "train_small/"  # path of the unsegmented training corpus
# seg_path = "train_seg/"       # path of the segmented training corpus
corpus_path = "test_small/"  # path of the unsegmented corpus
seg_path = "test_seg/"       # path of the segmented corpus

catelist = os.listdir(corpus_path)  # all category subdirectories under the corpus directory
for mydir in catelist:
    class_path = corpus_path + mydir + "/"  # full path of the category subdirectory
    seg_dir = seg_path + mydir + "/"        # output directory for the segmented category
    if not os.path.exists(seg_dir):         # create the output directory if it does not exist
        os.makedirs(seg_dir)
    file_list = os.listdir(class_path)
    for file_path in file_list:
        fullname = class_path + file_path
        content = readfile(fullname).strip()           # read the file content
        content = content.replace("\r\n", "").strip()  # remove newlines and extra whitespace
        content_seg = jieba.cut(content)               # segment the text with jieba
        savefile(seg_dir + file_path, " ".join(content_seg))  # join the words with spaces
print("Word segmentation finished")

To make it easier to generate the word vector space model later, the segmented texts must be converted into text vector information and stored as objects, using the Bunch data structure from the scikit-learn library. The code is as follows:

import os
import pickle
from sklearn.datasets.base import Bunch

# Bunch provides a key/value object container:
# target_name - list of all category names
# label       - category label of each file
# filenames   - file paths
# contents    - segmented text of each file

def readfile(path):
    fp = open(path, "r", encoding='gb2312', errors='ignore')
    content = fp.read()
    fp.close()
    return content

bunch = Bunch(target_name=[], label=[], filenames=[], contents=[])

# wordbag_path = "train_word_bag/train_set.dat"
# seg_path = "train_seg/"
wordbag_path = "test_word_bag/test_set.dat"
seg_path = "test_seg/"

catelist = os.listdir(seg_path)
bunch.target_name.extend(catelist)  # save the category names into the Bunch object
for mydir in catelist:
    class_path = seg_path + mydir + "/"
    file_list = os.listdir(class_path)
    for file_path in file_list:
        fullname = class_path + file_path
        bunch.label.append(mydir)                           # category label of the current file
        bunch.filenames.append(fullname)                    # path of the current file
        bunch.contents.append(readfile(fullname).strip())   # segmented text of the current file

# persist the Bunch object
file_obj = open(wordbag_path, "wb")
pickle.dump(bunch, file_obj)
file_obj.close()
print("Finished building the text Bunch object")
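As a quick sanity check (a small sketch, not in the original article), the persisted Bunch can be loaded back and inspected:

import pickle

# load the persisted Bunch and print a few basic statistics
with open("test_word_bag/test_set.dat", "rb") as file_obj:
    bunch = pickle.load(file_obj)
print("categories:", bunch.target_name)
print("number of documents:", len(bunch.contents))
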
3. Vector space model

Because text stored in a vector space has a very high dimensionality, certain words are automatically filtered out before classification in order to save storage space and improve search efficiency. These words or phrases are called stop words. A ready-made stop-word list can be downloaded; this article uses train_word_bag/hlt_stop_words.txt, and a minimal sketch of stop-word filtering is given below.
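
Here is a minimal sketch of stop-word filtering applied to a segmented sentence, assuming the stop-word file train_word_bag/hlt_stop_words.txt that is used later in this article (one word per line):

# -*- coding: utf-8 -*-
import jieba

# load the stop-word list, one word per line
with open("train_word_bag/hlt_stop_words.txt", "r", encoding='gb2312', errors='ignore') as fp:
    stopwords = set(fp.read().splitlines())

def segment_and_filter(text):
    # segment the text and drop stop words and empty tokens
    return [w for w in jieba.cut(text) if w.strip() and w not in stopwords]

print(segment_and_filter("这是一个关于文本分类的简单例子"))

In the code below, the same list is passed to the TF-IDF vectorizer through its stop_words parameter, so this manual filtering is only an illustration.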

4. Weighting strategy: the TF-IDF method

If a word or phrase appears frequently in one article and rarely in other articles, it is considered to have good discriminating power between categories and to be suitable for classification.

Before giving the code for this part, let's first look at the concepts of term frequency and inverse document frequency.

Term frequency (TF) is the frequency with which a given word appears in a document. It is a normalized word count, which prevents the measure from being biased toward long documents. For a term t_i in a document d_j, its importance can be expressed as

    tf_ij = n_ij / Σ_k n_kj

where the numerator n_ij is the number of times the term appears in the document and the denominator is the total number of occurrences of all terms in that document.

Inverse document frequency (IDF) is a measure of the general importance of a term. The IDF of a term is obtained by dividing the total number of documents by the number of documents that contain the term, and then taking the logarithm of the quotient:

    idf_i = log( |D| / (1 + |{ j : t_i ∈ d_j }|) )

where |D| is the total number of documents in the corpus and |{ j : t_i ∈ d_j }| is the number of documents containing the term; if the term does not appear in the corpus this count would be zero, so 1 is usually added to the denominator.

The TF-IDF weight is the product of term frequency and inverse document frequency: a high term frequency within a particular document combined with a low document frequency across the whole collection yields a high TF-IDF weight, so TF-IDF tends to filter out common words and keep important ones. The code is as follows:

import os
import pickle  # for persistence
from sklearn.datasets.base import Bunch
from sklearn.feature_extraction.text import TfidfVectorizer  # builds TF-IDF vectors

def readbunchobj(path):
    file_obj = open(path, "rb")
    bunch = pickle.load(file_obj)
    file_obj.close()
    return bunch

def writebunchobj(path, bunchobj):
    file_obj = open(path, "wb")
    pickle.dump(bunchobj, file_obj)
    file_obj.close()

def readfile(path):
    fp = open(path, "r", encoding='gb2312', errors='ignore')
    content = fp.read()
    fp.close()
    return content

path = "train_word_bag/train_set.dat"
bunch = readbunchobj(path)

# stop words
stopword_path = "train_word_bag/hlt_stop_words.txt"
stpwrdlst = readfile(stopword_path).splitlines()

# build the TF-IDF word vector space object
tfidfspace = Bunch(target_name=bunch.target_name, label=bunch.label, filenames=bunch.filenames, tdm=[], vocabulary={})

# initialize the vector space model with TfidfVectorizer
vectorizer = TfidfVectorizer(stop_words=stpwrdlst, sublinear_tf=True, max_df=0.5)

# convert the texts to a TF-IDF term-document matrix and keep the vocabulary separately
tfidfspace.tdm = vectorizer.fit_transform(bunch.contents)
tfidfspace.vocabulary = vectorizer.vocabulary_

# persist the word bag
space_path = "train_word_bag/tfidfspace.dat"
writebunchobj(space_path, tfidfspace)
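A short follow-up sketch (reusing readbunchobj from above; not part of the original code) to check the size of the resulting term-document matrix and the vocabulary:

# load the persisted TF-IDF space and print its dimensions
tfidfspace = readbunchobj("train_word_bag/tfidfspace.dat")
print("documents x terms:", tfidfspace.tdm.shape)
print("vocabulary size:", len(tfidfspace.vocabulary))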

5. Using the Naive Bayes classification module

Commonly used text classification methods include the kNN nearest-neighbour method, the Naive Bayes algorithm, and the support vector machine algorithm. Generally speaking:

The kNN algorithm is the simplest in principle and its classification accuracy is acceptable, but it is the slowest.

The Naive Bayes algorithm works best for short-text classification and has high accuracy.

The advantage of the support vector machine algorithm is that it can handle data that is not linearly separable; its accuracy is moderate.

A short sketch showing how each of these classifiers could be plugged in with scikit-learn is given below.
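
The following is a minimal sketch, not part of the original article, assuming the train_set and test_set TF-IDF Bunch objects built in this article; it shows how the three classifiers could be swapped in:

from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# candidate classifiers for the TF-IDF term-document matrices
classifiers = {
    "kNN": KNeighborsClassifier(n_neighbors=5),
    "Naive Bayes": MultinomialNB(alpha=0.001),
    "Linear SVM": LinearSVC(),
}

for name, clf in classifiers.items():
    clf.fit(train_set.tdm, train_set.label)    # train on the training-set matrix
    predicted = clf.predict(test_set.tdm)      # predict labels for the test set
    correct = sum(1 for y, p in zip(test_set.label, predicted) if y == p)
    print(name, "accuracy:", correct / len(predicted))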

The code above all operates on the training set. Next comes the test set (drawn from the training set). The steps are the same as for the training set: first word segmentation, then generating the word-vector file, up to generating the word-vector model. The difference is that when building the test set's vector model, the training set's word bag must be loaded, and the word vectors produced for the test set are mapped into the dictionary of the training set's word bag to generate the vector space model. The code is as follows:

import os
import pickle  # for persistence
from sklearn.datasets.base import Bunch
from sklearn.feature_extraction.text import TfidfVectorizer  # builds TF-IDF vectors

def readbunchobj(path):
    file_obj = open(path, "rb")
    bunch = pickle.load(file_obj)
    file_obj.close()
    return bunch

def writebunchobj(path, bunchobj):
    file_obj = open(path, "wb")
    pickle.dump(bunchobj, file_obj)
    file_obj.close()

def readfile(path):
    fp = open(path, "r", encoding='gb2312', errors='ignore')
    content = fp.read()
    fp.close()
    return content

# load the segmented test-set Bunch object
path = "test_word_bag/test_set.dat"
bunch = readbunchobj(path)

# stop words
stopword_path = "train_word_bag/hlt_stop_words.txt"
stpwrdlst = readfile(stopword_path).splitlines()

# build the test-set TF-IDF vector space
testspace = Bunch(target_name=bunch.target_name, label=bunch.label, filenames=bunch.filenames, tdm=[], vocabulary={})

# load the training-set word bag
trainbunch = readbunchobj("train_word_bag/tfidfspace.dat")

# initialize the vector space with the training-set vocabulary so the test vectors share the same dictionary
vectorizer = TfidfVectorizer(stop_words=stpwrdlst, sublinear_tf=True, max_df=0.5, vocabulary=trainbunch.vocabulary)
testspace.tdm = vectorizer.fit_transform(bunch.contents)
testspace.vocabulary = trainbunch.vocabulary

# persist the test-set word bag
space_path = "test_word_bag/testspace.dat"
writebunchobj(space_path, testspace)

Next, run the multinomial Naive Bayes algorithm to classify the test texts and report the error rate. The code is as follows:

import pickle
from sklearn.naive_bayes import MultinomialNB  # multinomial Naive Bayes

def readbunchobj(path):
    file_obj = open(path, "rb")
    bunch = pickle.load(file_obj)
    file_obj.close()
    return bunch

# load the training-set vector space
trainpath = "train_word_bag/tfidfspace.dat"
train_set = readbunchobj(trainpath)

# load the test-set vector space
testpath = "test_word_bag/testspace.dat"
test_set = readbunchobj(testpath)

# apply the Naive Bayes algorithm
# alpha=0.001: additive (Laplace/Lidstone) smoothing parameter; smaller values apply less smoothing
clf = MultinomialNB(alpha=0.001).fit(train_set.tdm, train_set.label)

# predict the categories of the test documents
predicted = clf.predict(test_set.tdm)
total = len(predicted)
rate = 0
for flabel, file_name, expct_cate in zip(test_set.label, test_set.filenames, predicted):
    if flabel != expct_cate:
        rate += 1
        print(file_name, ": actual category:", flabel, "--> predicted category:", expct_cate)

# error rate
print("error_rate:", float(rate) * 100 / float(total), "%")

6. Evaluating the classification results

Algorithm evaluation in machine learning rests on three basic metrics:

  • Recall: the ratio of relevant documents retrieved to all relevant documents in the collection; it measures how complete the retrieval is.

    Recall = relevant documents retrieved / all relevant documents in the collection

  • Precision: the ratio of relevant documents retrieved to the total number of documents retrieved; it measures how exact the retrieval is.

    Precision = relevant documents retrieved / total number of documents retrieved

Precision and recall influence each other. Ideally both are high, but in general high precision comes with low recall, and high recall comes with low precision.

  • F-Score (F_β), which combines the two:

    F_β = (1 + β²) · P · R / (β² · P + R)

    When β = 1 this is the common F1-Measure.

The evaluation code is as follows:

from sklearn import metrics

# evaluation: precision, recall and F1 (weighted across classes for the multi-class case)
def metrics_result(actual, predict):
    print("precision: {0:.3f}".format(metrics.precision_score(actual, predict, average='weighted')))
    print("recall: {0:0.3f}".format(metrics.recall_score(actual, predict, average='weighted')))
    print("f1-score: {0:.3f}".format(metrics.f1_score(actual, predict, average='weighted')))

metrics_result(test_set.label, predicted)
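
As an optional cross-check (a sketch assuming the test_set and predicted variables from the step above), scikit-learn's classification_report prints per-class precision, recall, and F1:

from sklearn import metrics

# per-class precision / recall / F1 plus support counts
print(metrics.classification_report(test_set.label, predicted))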

