Function: Convert text data (sentences, phrases, words, letters) into numerical features; words are generally used as the feature values.
Method 1: CountVectorizer
sklearn.feature_extraction.text.CountVectorizer(stop_words=[])
Returns a word frequency matrix (counts how many times each feature word appears in each sample)
CountVectorizer.fit_transform(X)
X: text or an iterable of text strings
Return value: a sparse matrix
CountVectorizer.inverse_transform(X)
X: array or sparse matrix
Return value: the data in its format before conversion
CountVectorizer.get_feature_names()
Return value: list of feature words
Code example:
from sklearn.feature_extraction.text import CountVectorizer

def count_demo():
    # Text feature extraction
    data = ["life is short, i like like python", "life is too long,i dislike python"]
    # 1. Instantiate a transformer class
    transfer = CountVectorizer()
    # 2. Call fit_transform()
    result = transfer.fit_transform(data)
    print("result:\n", result.toarray())
    # Note: scikit-learn >= 1.2 replaces get_feature_names() with get_feature_names_out()
    print("Feature names:\n", transfer.get_feature_names())
    return None
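The API summary above also mentions the stop_words parameter and inverse_transform(). Here is a minimal sketch of both; the stop-word list is arbitrary, chosen only for illustration:

from sklearn.feature_extraction.text import CountVectorizer

data = ["life is short, i like like python", "life is too long,i dislike python"]
# Exclude "is" and "too" from the vocabulary via stop_words
transfer = CountVectorizer(stop_words=["is", "too"])
result = transfer.fit_transform(data)
print(transfer.get_feature_names())        # vocabulary without the stop words
print(transfer.inverse_transform(result))  # per sample: the words with nonzero counts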
Method 2: TfidfVectorizer
Keywords: words that appear frequently in articles of one category but rarely in articles of other categories are called keywords.
TF-IDF Text Feature Extraction
① The main idea of TF-IDF: if a word or phrase appears with high probability in one article but rarely appears in other articles, it is considered to have good power to distinguish between categories and to be suitable for classification.
② Purpose of TF-IDF: to evaluate how important a word is to a document in a document set or corpus.
Formula
① Term frequency (tf): the frequency with which a given word appears in a document.
② Inverse document frequency (idf): a measure of the general importance of a word. To compute the idf of a term, divide the total number of documents by the number of documents containing the term, then take the base-10 logarithm of the quotient.
tfidf = tf * idf
The resulting value can be read as the degree of importance of a word.
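To make the formula concrete, here is a small worked example; the document length, corpus size, and counts are invented purely for illustration:

import math

# Invented numbers: a word occurs 5 times in a 100-word document,
# and 10 out of 1000 documents in the corpus contain the word.
tf = 5 / 100                   # term frequency = 0.05
idf = math.log10(1000 / 10)    # inverse document frequency = log10(100) = 2.0
tfidf = tf * idf               # 0.05 * 2.0 = 0.1
print(tfidf)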
API
sklearn.feature_extraction.text.TfidfVectorizer(stop_words=None, ...)
Returns the word weight matrix
TfidfVectorizer.fit_transform(X)
X: text or an iterable of text strings
Return value: a sparse matrix
TfidfVectorizer.inverse_transform(X)
X: array or sparse matrix
Return value: the data in its format before conversion
TfidfVectorizer.get_feature_names()
Return value: list of feature words
Chinese word segmentation feature extraction
from sklearn.feature_extraction.text import TfidfVectorizer
import jieba

def cut_word(text):
    # Chinese word segmentation
    # jieba.cut(text) returns a generator; cast it to a list
    word = list(jieba.cut(text))
    # Join the tokens into a space-separated string
    words = " ".join(word)
    return words

def tfidf_demo():
    data = ["今天很残酷,明天更残酷,后天会很美好,但绝大多数人都死在明天晚上,却见不到后天的太阳,所以我们干什么都要坚持",
            "注重自己的名声,努力工作、与人为善、遵守诺言,这样对你们的事业非常有帮助",
            "服务是全世界最贵的产品,所以最佳的服务就是不要服务,最好的服务就是不需要服务"]
    data_new = []
    # Segment each Chinese sentence into words
    for sentence in data:
        data_new.append(cut_word(sentence))
    # 1. Instantiate a transformer class
    transfer = TfidfVectorizer()
    # 2. Call fit_transform() to get the weight matrix (a sparse matrix)
    result = transfer.fit_transform(data_new)
    print("result:\n", result.toarray())  # convert the sparse matrix to a 2-D array
    print("Feature names:\n", transfer.get_feature_names())
    return None
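To see what cut_word() feeds into the vectorizer, it can be called on its own. The exact segmentation depends on the dictionary shipped with the installed jieba version, so the output shown is only indicative:

print(cut_word("人生苦短,我用python"))
# e.g. 人生 苦短 , 我 用 python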