How to Calculate Similarity Between Text Documents Using TF-IDF and Cosine Similarity?

Mary-Kate Olsen

Released: 2024-10-23 06:47:02

How to Calculate Text Document Similarity

Computing Pairwise Similarities

A common way to measure the similarity between two text documents is to convert them into TF-IDF (Term Frequency-Inverse Document Frequency) vectors and compare those vectors with cosine similarity. The approach is standard in information retrieval textbooks and is covered in detail in "Introduction to Information Retrieval" (Manning, Raghavan, and Schütze).
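To make the comparison step concrete, here is a minimal sketch of cosine similarity on two vectors using NumPy. The function name cosine_sim, the example weights, and the convention of returning 0.0 for an all-zero vector are assumptions made for illustration; they are not part of any library.

```python
import numpy as np

def cosine_sim(u, v):
    """Cosine of the angle between two 1-D vectors (hypothetical helper)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    # Assumed convention: an all-zero vector is similar to nothing
    return float(u @ v / denom) if denom else 0.0

# Made-up term weights for three documents over a three-term vocabulary
doc_a = [1.0, 2.0, 0.0]
doc_b = [2.0, 4.0, 0.0]   # same direction as doc_a, twice the length
doc_c = [0.0, 0.0, 3.0]   # shares no terms with doc_a

sim_ab = cosine_sim(doc_a, doc_b)
sim_ac = cosine_sim(doc_a, doc_c)
print(sim_ab, sim_ac)
```

Because the measure depends only on direction, doc_a and doc_b score as identical even though their lengths differ, while documents with no terms in common score zero.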

Python libraries such as Gensim and scikit-learn implement both the TF-IDF transformation and cosine similarity. With scikit-learn, TfidfVectorizer L2-normalizes each document vector by default, so multiplying the TF-IDF matrix by its transpose yields the pairwise cosine similarities directly:

<code class="python">from pathlib import Path
from sklearn.feature_extraction.text import TfidfVectorizer

# text_files is assumed to be a list of paths to plain-text files
documents = [Path(f).read_text(encoding="utf-8") for f in text_files]

# Fit the vectorizer and transform the documents into a sparse TF-IDF matrix
tfidf = TfidfVectorizer().fit_transform(documents)

# Rows are unit length, so the matrix product gives pairwise cosine similarity
pairwise_similarity = tfidf @ tfidf.T</code>

Alternatively, for documents that are already held in memory as plain strings:

<code class="python">from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["I'd like an apple",
          "An apple a day keeps the doctor away",
          "Never compare an apple to an orange",
          "I prefer scikit-learn to Orange",
          "The scikit-learn docs are Orange and Blue"]

# Keep terms that occur in at least one document (min_df=1 is the default)
# and drop common English stop words
vect = TfidfVectorizer(min_df=1, stop_words="english")

# Learn the vocabulary and build the sparse TF-IDF matrix
tfidf = vect.fit_transform(corpus)

# Calculate pairwise cosine similarity (matrix product of unit-length rows)
pairwise_similarity = tfidf @ tfidf.T</code>


Interpreting the Results

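pairwise_similarity is a sparse square matrix with one row and one column per document in the corpus. After converting it to a NumPy array with toarray(), the cell at row i, column j holds the cosine similarity between documents i and j, so the diagonal (each document compared with itself) is 1 for any document that contains at least one vocabulary term.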
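As a sanity check on that reading of the matrix, the sketch below rebuilds the similarity matrix for the example corpus and compares it with scikit-learn's cosine_similarity helper from sklearn.metrics.pairwise. The variable names dense and reference are chosen here for illustration only.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = ["I'd like an apple",
          "An apple a day keeps the doctor away",
          "Never compare an apple to an orange",
          "I prefer scikit-learn to Orange",
          "The scikit-learn docs are Orange and Blue"]

tfidf = TfidfVectorizer(stop_words="english").fit_transform(corpus)

# Dot-product form: valid because each row is L2-normalized by default
dense = (tfidf @ tfidf.T).toarray()

# Reference computation using scikit-learn's pairwise helper
reference = cosine_similarity(tfidf)

print(dense.shape)                     # one row and one column per document
print(np.allclose(dense, reference))   # do the two computations agree?
print(np.allclose(dense, dense.T))     # is the matrix symmetric?
```

If the vectorizer were created with norm=None, the rows would no longer be unit length and the plain matrix product would stop matching cosine_similarity, which normalizes internally.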

For instance, to determine the document most similar to "The scikit-learn docs are Orange and Blue," locate its index in the corpus and then apply np.nanargmax to the corresponding row after masking out the diagonal (representing self-similarity) with np.fill_diagonal():

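<code class="python">import numpy as np

# Work on a dense copy and blank out the diagonal (self-similarity)
arr = pairwise_similarity.toarray()
np.fill_diagonal(arr, np.nan)

input_doc = "The scikit-learn docs are Orange and Blue"
input_idx = corpus.index(input_doc)

# nanargmax ignores the NaN diagonal and returns the best-matching column
result_idx = np.nanargmax(arr[input_idx])
print(corpus[result_idx])</code>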
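The same idea extends from the single best match to a ranked list. The following is a sketch only: top_k_similar is a hypothetical helper written for this example, not a scikit-learn function, and it re-fits the vectorizer on every call for the sake of being self-contained.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def top_k_similar(corpus, query_idx, k=2):
    """Return (document, score) pairs for the k documents closest to corpus[query_idx]."""
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(corpus)
    scores = (tfidf @ tfidf.T).toarray()[query_idx]
    scores[query_idx] = -np.inf             # exclude the query document itself
    order = np.argsort(scores)[::-1][:k]    # indices sorted by descending score
    return [(corpus[i], float(scores[i])) for i in order]

corpus = ["I'd like an apple",
          "An apple a day keeps the doctor away",
          "Never compare an apple to an orange",
          "I prefer scikit-learn to Orange",
          "The scikit-learn docs are Orange and Blue"]

results = top_k_similar(corpus, query_idx=4, k=2)
for doc, score in results:
    print(f"{score:.3f}  {doc}")
```

Masking with -np.inf rather than NaN lets the ordinary argsort be used; in a real application the TF-IDF matrix would normally be computed once and reused across queries.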


For large corpora, keeping the matrix sparse conserves memory. In that case, skip toarray(): use pairwise_similarity.shape to index the diagonal, overwrite the self-similarities with -1.0, and call argmax() on the sparse row directly:

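<code class="python">n, _ = pairwise_similarity.shape
# Overwrite the diagonal (self-similarity) so it can never be the maximum
pairwise_similarity[np.arange(n), np.arange(n)] = -1.0
# argmax operates on the sparse row without densifying the whole matrix
result_idx = pairwise_similarity[input_idx].argmax()
print(corpus[result_idx])</code>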
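When only one query document matters, a further memory saving may be possible by never forming the full n-by-n matrix. The sketch below multiplies a single TF-IDF row by the transposed matrix; it is offered as one possible variation on the approach above, and the hard-coded input_idx is an assumption for this example.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["I'd like an apple",
          "An apple a day keeps the doctor away",
          "Never compare an apple to an orange",
          "I prefer scikit-learn to Orange",
          "The scikit-learn docs are Orange and Blue"]

tfidf = TfidfVectorizer(stop_words="english").fit_transform(corpus)

input_idx = 4  # position of the query document in the corpus

# 1 x n product: similarities between the query and every document
row = (tfidf[input_idx] @ tfidf.T).toarray().ravel()
row[input_idx] = -1.0          # mask self-similarity
result_idx = int(row.argmax())
print(corpus[result_idx])
```

Only a vector of n scores is densified here, so the cost grows linearly with the corpus size rather than quadratically.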
