Text classification is a Natural Language Processing (NLP) task that assigns text to predefined categories. It has many practical applications, such as email filtering, spam detection, sentiment analysis, and question answering systems.
Text classification with Python's NLTK library can be divided into the following steps: loading the data, preprocessing, feature extraction, model training, and model evaluation.
The following is an example of using the Python NLTK library to complete text classification:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from nltk.classify import NaiveBayesClassifier

# One-time resource downloads (newer NLTK releases may need "punkt_tab" instead of "punkt")
nltk.download("punkt")
nltk.download("stopwords")

# Load the data
data = [("I love Beijing", "positive"), ("I hate Beijing", "negative")]

# Preprocess: lowercase, tokenize, remove stop words, stem
stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()
processed_data = []
for text, label in data:
    tokens = word_tokenize(text.lower())
    filtered_tokens = [token for token in tokens if token not in stop_words]
    stemmed_tokens = [stemmer.stem(token) for token in filtered_tokens]
    processed_data.append((stemmed_tokens, label))

# Feature extraction: one boolean "contains(word)" feature per vocabulary word
all_words = [word for sentence, label in processed_data for word in sentence]
word_features = list(set(all_words))

def document_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        features["contains({})".format(word)] = (word in document_words)
    return features

feature_sets = [(document_features(sentence), label) for sentence, label in processed_data]

# Train the model
classifier = NaiveBayesClassifier.train(feature_sets)

# Evaluate the model (here on the training data itself)
print(nltk.classify.accuracy(classifier, feature_sets))
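In the above example, we used the Naive Bayes classifier to classify text. We can see that the accuracy of the classifier reaches 100%. Note that this figure is measured on the same two examples the classifier was trained on, so it says little about how the model would perform on unseen text.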
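A more informative estimate usually comes from holding out part of the labeled data for evaluation. The following is a minimal sketch of such a split, not part of the original example: the helper name `split_train_test`, its parameters, and the toy data are all hypothetical, and it assumes the data is a list of `(features, label)` pairs like `feature_sets`.

```python
import random

def split_train_test(labeled_items, test_fraction=0.25, seed=0):
    """Shuffle labeled items and split them into (train, test) lists.

    Hypothetical helper for illustration; not an NLTK function.
    """
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")
    items = list(labeled_items)          # copy so the caller's list is untouched
    random.Random(seed).shuffle(items)   # seeded shuffle for reproducibility
    n_test = max(1, int(len(items) * test_fraction))
    return items[n_test:], items[:n_test]

# Toy data (assumed) in the same (features, label) shape as feature_sets
toy_feature_sets = [
    ({"contains(love)": i % 2 == 0}, "positive" if i % 2 == 0 else "negative")
    for i in range(8)
]
train_set, test_set = split_train_test(toy_feature_sets)
print(len(train_set), len(test_set))
```

With a split like this, the classifier could be trained with `NaiveBayesClassifier.train(train_set)` and scored with `nltk.classify.accuracy(classifier, test_set)`. A data set as small as the two-sentence example is too small for the split to be meaningful.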
Text classification is a challenging task, but several techniques can improve a classifier's accuracy. For example, we can train the classifier on more features, or we can use more powerful classifiers such as support vector machines or decision trees.
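As one illustration of "more features", word pairs (bigrams) can be added alongside single words, so that a phrase such as "not good" becomes a feature of its own. This is only a sketch under assumptions: the function name `document_features_with_bigrams` and the `contains_bigram(...)` naming are hypothetical, and it records only the features present in a document rather than one entry per vocabulary word.

```python
def document_features_with_bigrams(tokens):
    """Build a feature dict with unigram and bigram presence features.

    Hypothetical helper for illustration; not an NLTK function.
    """
    features = {}
    for word in set(tokens):
        features["contains({})".format(word)] = True
    # Pair each token with its successor to form adjacent-word bigrams
    for first, second in zip(tokens, tokens[1:]):
        features["contains_bigram({} {})".format(first, second)] = True
    return features

example_tokens = ["not", "good", "at", "all"]
example_features = document_features_with_bigrams(example_tokens)
print(sorted(example_features))
```

Dictionaries of this shape can be paired with labels and passed to `NaiveBayesClassifier.train` in the same way as the output of `document_features`. To try a different model, `nltk.classify.DecisionTreeClassifier` exposes a similar `train` method, and `nltk.classify.SklearnClassifier` can wrap scikit-learn estimators such as a linear SVM, assuming scikit-learn is installed.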