How to Measure the Similarity Between Text Documents?

Determining the Similarity Between Text Documents

DDD

Oct 23, 2024 am 06:55 AM

Measuring Document Similarity

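To determine how similar two text documents are in NLP, the standard approach is to convert each document into a TF-IDF vector and then compute the cosine similarity between those vectors, a measure widely used in information retrieval systems. For a more in-depth treatment, see the online book Introduction to Information Retrieval.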
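
To make the measure concrete, here is a minimal pure-Python sketch of cosine similarity, the dot product of two vectors divided by the product of their lengths. The sample sentences, the helper name `cosine_similarity`, and the use of raw term counts in place of real TF-IDF weights are all assumptions made for illustration.

```python
import math
from collections import Counter


def cosine_similarity(a: dict, b: dict) -> float:
    """Cosine of the angle between two sparse vectors stored as {term: weight}."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for empty vectors
    return dot / (norm_a * norm_b)


# Raw term counts stand in for TF-IDF weights to keep the sketch short.
doc_a = Counter("the cat sat on the mat".split())
doc_b = Counter("the cat lay on the rug".split())
doc_c = Counter("stock prices fell sharply".split())

sim_ab = cosine_similarity(doc_a, doc_b)  # shares "the", "cat", "on": about 0.75
sim_ac = cosine_similarity(doc_a, doc_c)  # no shared terms: 0.0
```

A score near 1 means the two vectors point in almost the same direction, and 0 means the documents share no weighted terms. TF-IDF weighting changes the individual weights but not this calculation.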

Implementation in Python

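Python libraries such as Gensim and scikit-learn provide tools for computing TF-IDF and cosine similarity. In scikit-learn, the cosine similarity between documents is computed from their TF-IDF vectors. Because TfidfVectorizer L2-normalizes each row by default, multiplying the TF-IDF matrix by its transpose yields the pairwise cosine similarities: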

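<code class="python">from pathlib import Path

from sklearn.feature_extraction.text import TfidfVectorizer

# text_files: an iterable of paths to plain-text files, defined elsewhere
documents = [Path(f).read_text() for f in text_files]
# Rows are L2-normalized by default, so the product below is cosine similarity
tfidf = TfidfVectorizer().fit_transform(documents)
pairwise_similarity = tfidf @ tfidf.T  # sparse matrix product</code>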
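
The matrix product works as a shortcut only because the TF-IDF rows already have unit length. As a rough check, the sketch below compares it with scikit-learn's explicit `cosine_similarity` function. The three in-memory sentences are hypothetical stand-ins for documents read from files.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical in-memory documents standing in for files read from disk.
documents = [
    "the quick brown fox",
    "the quick brown dog",
    "an entirely different sentence",
]

tfidf = TfidfVectorizer().fit_transform(documents)

# Shortcut: valid because each row already has unit L2 norm.
via_product = (tfidf @ tfidf.T).toarray()

# Explicit cosine similarity, which normalizes the rows itself.
via_cosine = cosine_similarity(tfidf)

same = np.allclose(via_product, via_cosine)
```

If the vectorizer were created with `norm=None`, the two results would generally differ, and `cosine_similarity` would be the safer choice.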

If the documents are already plain strings, they can be passed in directly:

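<code class="python">from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["I'd like an apple", "An apple a day keeps the doctor away"]
# min_df=1 is the default; stop_words="english" drops common English words
tfidf = TfidfVectorizer(min_df=1, stop_words="english").fit_transform(corpus)
pairwise_similarity = tfidf @ tfidf.T  # sparse matrix product</code>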
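
To see what the vectorizer actually kept, it can help to hold on to the vectorizer object instead of chaining the calls. The following sketch, which reuses the two-sentence corpus and introduces the illustrative names `terms`, `similarity`, and `shared_score`, lists the retained terms and reads off the similarity between the two documents.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["I'd like an apple", "An apple a day keeps the doctor away"]

vectorizer = TfidfVectorizer(min_df=1, stop_words="english")
tfidf = vectorizer.fit_transform(corpus)

# vocabulary_ maps each retained term to its column index in tfidf.
terms = sorted(vectorizer.vocabulary_)

# Dense copy of the similarity matrix; fine for a tiny corpus, costly for a large one.
similarity = (tfidf @ tfidf.T).toarray()
shared_score = similarity[0, 1]  # nonzero because both documents contain "apple"
```

Stop words such as "an" and "the" do not appear in `terms`, so they contribute nothing to the score.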

Interpreting the Results

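The resulting sparse matrix pairwise_similarity is square, with one row and one column per document. To find the document most similar to a given one, mask the diagonal entries (which represent each document's similarity to itself) with NaN and then take the index of the largest remaining value with NumPy's nanargmax function.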

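<code class="python">import numpy as np

# Convert to a dense array and mask self-similarity on the diagonal
arr = pairwise_similarity.toarray()
np.fill_diagonal(arr, np.nan)
# input_doc must be one of the strings in corpus, otherwise index() raises ValueError
input_doc = "I'd like an apple"
input_idx = corpus.index(input_doc)
# Index of the highest similarity in that row, ignoring the NaN diagonal
result_idx = np.nanargmax(arr[input_idx])
most_similar_doc = corpus[result_idx]</code>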
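
Looking up the input with `corpus.index(...)` only works when the input document is itself one of the corpus entries. For a query that is not in the corpus, one possible approach is to fit the vectorizer on the corpus and then project the query into the same vector space with `transform`. The helper name `most_similar`, the three-sentence corpus, and the query below are hypothetical, and this is a sketch rather than a definitive implementation.

```python
from typing import List, Tuple

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer


def most_similar(query: str, corpus: List[str]) -> Tuple[int, float]:
    """Return (index, cosine similarity) of the corpus entry closest to query."""
    vectorizer = TfidfVectorizer(stop_words="english")
    corpus_tfidf = vectorizer.fit_transform(corpus)
    # transform() reuses the fitted vocabulary and IDF weights; unseen words are ignored.
    query_tfidf = vectorizer.transform([query])
    # Rows are unit-length, so the product is a row of cosine similarities.
    scores = (query_tfidf @ corpus_tfidf.T).toarray().ravel()
    best = int(np.argmax(scores))
    return best, float(scores[best])


corpus = [
    "I'd like an apple",
    "An apple a day keeps the doctor away",
    "Never compare an apple to an orange",
]
idx, score = most_similar("The doctor is away today", corpus)
closest = corpus[idx]
```

No diagonal masking is needed here because the query is not a row of the corpus matrix. If the query shares no terms with the corpus, every score is 0 and the returned index is not meaningful, so callers may want to check `score` before trusting the result.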
