You have already described the first step. First segment each article into words (Chinese word segmentation), then compute the TF-IDF weight of every word in the two articles. Finally, compute the cosine similarity between the two articles' TF-IDF vectors; in Python this can be done with gensim.
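To make the steps concrete, here is a minimal sketch in plain standard-library Python rather than gensim. It assumes the articles are already segmented into token lists (for example with jieba). The sample documents and the function names `tfidf_vectors` and `cosine_similarity` are made up for illustration. It also uses a smoothed IDF, which is my own choice: with only a handful of documents, an unsmoothed IDF gives zero weight to any word that appears in all of them.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n = len(docs)
    # Document frequency: in how many documents each term appears
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed IDF (an assumption of this sketch) so shared terms keep a non-zero weight
    idf = {t: math.log((1 + n) / (1 + d)) + 1 for t, d in df.items()}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        total = len(doc)
        vectors.append({t: (c / total) * idf[t] for t, c in tf.items()})
    return vectors

def cosine_similarity(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical, already-segmented articles
doc_a = ["我", "喜欢", "自然", "语言", "处理"]
doc_b = ["我", "喜欢", "机器", "学习"]
doc_c = ["今天", "天气", "很", "好"]

vec_a, vec_b, vec_c = tfidf_vectors([doc_a, doc_b, doc_c])
sim_ab = cosine_similarity(vec_a, vec_b)  # the articles share some words
sim_ac = cosine_similarity(vec_a, vec_c)  # the articles share no words
print(sim_ab, sim_ac)
```

With real data you would probably let gensim build the dictionary and TF-IDF model instead of hand-rolling them; the arithmetic above is only meant to show what is being computed.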
If you have any questions, please continue to ask.
To add to the first answer: when using cosine similarity or TF-IDF, remove stop words first.
The term comes from the English "stop word". English text contains many very frequent words such as "a", "the", and "or", typically articles, prepositions, adverbs, or conjunctions. Because such words contribute little to our judgment of meaning, they are filtered out before computing similarity.
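Filtering stop words is a small step applied to the token list after segmentation. A minimal sketch follows; the `STOP_WORDS` set here is a tiny made-up sample, and a real stop-word list would be much longer and is usually loaded from a file.

```python
# Hypothetical stop-word set for illustration; real lists are far longer
STOP_WORDS = {"的", "了", "是", "和", "在", "a", "the", "or"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Return the tokens that are not in the stop-word set (case-insensitive)."""
    return [t for t in tokens if t.lower() not in stop_words]

# Hypothetical, already-segmented sentence
tokens = ["我", "的", "猫", "在", "沙发", "上", "睡觉", "了"]
filtered = remove_stop_words(tokens)
print(filtered)
```

The filtered list is what you would then feed into the TF-IDF step.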
That said, plain cosine similarity and TF-IDF are not very reliable in some cases. (Shameless plug for my own link here, haha.)
I recommend combining TextRank with the approach above.
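The post does not say how the two should be combined. One possible reading, which is an assumption on my part, is to use TextRank to pick the key words of each article and then compare the articles on those words only. Below is a rough standard-library sketch of TextRank keyword ranking: a co-occurrence graph scored with a PageRank-style iteration. The function name `textrank_keywords` and its default parameters are illustrative and not taken from any library.

```python
from collections import defaultdict

def textrank_keywords(tokens, window=2, damping=0.85, iterations=50, top_k=3):
    """Rank words with a PageRank-style iteration over a co-occurrence graph."""
    # Link each word to the different words that follow it within `window` positions
    neighbors = defaultdict(set)
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + 1 + window]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)

    scores = {w: 1.0 for w in neighbors}
    for _ in range(iterations):
        new_scores = {}
        for w in neighbors:
            # Each neighbor shares its score equally among its own neighbors
            rank = sum(scores[n] / len(neighbors[n]) for n in neighbors[w])
            new_scores[w] = (1 - damping) + damping * rank
        scores = new_scores

    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# Hypothetical segmented text, stop words already removed
tokens = ["猫", "坐", "垫子", "猫", "吃", "鱼", "猫", "睡觉"]
keywords = textrank_keywords(tokens, window=1)
print(keywords)
```

The resulting keywords could then replace the full token list in the TF-IDF and cosine-similarity step, so that incidental words have less influence on the score.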