How to Measure the Similarity Between Text Documents?

DDD
Release: 2024-10-23 06:55:02

Determining the Similarity Between Text Documents

Measuring Document Similarity

To measure the similarity between two text documents, a common approach in NLP is to transform the documents into TF-IDF vectors and then compute the cosine similarity between those vectors, a metric widely used in information retrieval systems. For a more in-depth treatment, see "Introduction to Information Retrieval," which is available online as an e-book.
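
As a rough illustration of the metric itself, the following sketch computes cosine similarity for plain NumPy vectors. The helper name `cosine_sim` and the three weight vectors are hypothetical and are not taken from any library; returning 0.0 for an all-zero vector is a convention assumed here.

```python
import numpy as np


def cosine_sim(u, v):
    """Cosine of the angle between two 1-D vectors (hypothetical helper)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    norm_product = np.linalg.norm(u) * np.linalg.norm(v)
    if norm_product == 0.0:
        return 0.0  # assumed convention for all-zero vectors
    return float(np.dot(u, v) / norm_product)


# Made-up term weights for three documents over a three-word vocabulary
doc_a = [1.0, 2.0, 0.0]
doc_b = [2.0, 4.0, 0.0]  # same direction as doc_a, different length
doc_c = [0.0, 0.0, 3.0]  # shares no terms with doc_a

sim_ab = cosine_sim(doc_a, doc_b)
sim_ac = cosine_sim(doc_a, doc_c)
```

Because the metric depends only on the direction of the vectors, `doc_a` and `doc_b` count as maximally similar even though one is twice as long, while `doc_a` and `doc_c` share no terms and score zero.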

Implementation in Python

Python libraries such as Gensim and scikit-learn provide TF-IDF and cosine similarity out of the box. In scikit-learn, TfidfVectorizer L2-normalizes each document vector by default, so multiplying the TF-IDF matrix by its transpose yields the pairwise cosine similarities:

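<code class="python">from pathlib import Path

from sklearn.feature_extraction.text import TfidfVectorizer

# text_files is assumed to be a list of paths to plain-text files
documents = [Path(f).read_text() for f in text_files]
tfidf = TfidfVectorizer().fit_transform(documents)
# Rows are L2-normalized by default, so this product is the cosine similarity
pairwise_similarity = tfidf @ tfidf.T</code>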

Strings that are already in memory can be passed in directly:

<code class="python">corpus = ["I'd like an apple", "An apple a day keeps the doctor away"]
tfidf = TfidfVectorizer(min_df=1, stop_words="english").fit_transform(corpus)
pairwise_similarity = tfidf @ tfidf.T</code>
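
To check that the matrix product really does give cosine similarities, it can be compared against scikit-learn's `cosine_similarity` helper, which normalizes explicitly. This is a minimal sketch using the same two-sentence corpus; the variable names `product`, `reference`, `matches` and `diagonal_is_one` are chosen here for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = ["I'd like an apple", "An apple a day keeps the doctor away"]
tfidf = TfidfVectorizer(stop_words="english").fit_transform(corpus)

# Product of the L2-normalized TF-IDF matrix with its transpose
product = (tfidf @ tfidf.T).toarray()

# Reference computation that normalizes the rows explicitly
reference = cosine_similarity(tfidf)

matches = np.allclose(product, reference)
diagonal_is_one = np.allclose(np.diag(product), 1.0)
```

Both checks should hold as long as the vectorizer keeps its default `norm="l2"` setting. With `norm=None`, the plain product would no longer be a cosine similarity, and `cosine_similarity` would be the safer choice.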

Interpreting the Results

The resulting sparse matrix pairwise_similarity is square, with one row and one column per document. To find the document most similar to a given one, mask the diagonal (each document's similarity to itself) with NaN and then apply NumPy's nanargmax function, which ignores the masked entries:

<code class="python">import numpy as np

arr = pairwise_similarity.toarray()
np.fill_diagonal(arr, np.nan)  # mask each document's similarity to itself
input_doc = "I'd like an apple"  # must be one of the strings in corpus
input_idx = corpus.index(input_doc)
result_idx = np.nanargmax(arr[input_idx])  # highest similarity, ignoring NaN
most_similar_doc = corpus[result_idx]</code>
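
The lookup above only works for a document that is already part of the corpus. For a new piece of text, one possible approach is to keep the fitted vectorizer and project the query with `transform`. The sketch below assumes a three-sentence corpus and a made-up query; neither comes from the original example.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Illustrative corpus, extended with a third sentence for this sketch
corpus = [
    "I'd like an apple",
    "An apple a day keeps the doctor away",
    "The doctor prescribed rest",
]
vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(corpus)

# Hypothetical query that is not part of the corpus
query = "Which doctor should I see?"
# transform() reuses the vocabulary and IDF weights learned from the corpus
query_vec = vectorizer.transform([query])

# Both sides are L2-normalized, so the product holds cosine similarities
scores = (query_vec @ tfidf.T).toarray().ravel()
best_idx = int(np.argmax(scores))
most_similar_doc = corpus[best_idx]
```

No diagonal masking is needed here, because the query is not a row of the corpus matrix. Words in the query that the vectorizer never saw during fitting are simply ignored, so a query with no known words scores zero against every document.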
