I have recently been crawling stock-related news. My original idea was that whenever new news is released, the program would email the latest content to my mailbox.
So I want to save the news titles and content into a database. Whenever new content comes in, I compare it against the titles already in the database to see whether it exists. If the title already exists, it is not sent; if it does not exist, it is sent to the mailbox.
But as the number of records grows, scanning the whole title list gets slower and slower. Is there a better method you can teach me?
Deduplication of crawler tasks
Save each crawled link (or title) into a set, then check whether a newly crawled one is already in the set.
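A minimal sketch of that idea, applied to your news titles. `send_email` here is just a placeholder standing in for your real mail-sending code; the point is that the set membership test is O(1) on average, so it stays fast as the number of titles grows:

```python
seen = set()   # titles already processed
sent = []      # stands in for the mailbox (placeholder only)

def send_email(title):
    sent.append(title)   # replace with real email delivery

def process(titles):
    for title in titles:
        if title in seen:
            continue       # duplicate: skip, do not resend
        seen.add(title)    # O(1) average-case membership and insert
        send_email(title)  # only previously unseen titles get sent

process(["Stock A rises", "Stock B falls", "Stock A rises"])
# sent now contains only the two unique titles
```

Note the set lives in memory, so if the program restarts you would need to rebuild it (e.g. by loading the titles from your database once at startup, or persisting the set to Redis).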
There are many ways to deduplicate, such as the set above or a Bloom filter. A Bloom filter uses memory far more efficiently when the number of items gets very large, at the cost of a small false-positive rate (it may occasionally claim an unseen item was seen, but never the reverse).
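For illustration, here is a small pure-Python Bloom filter built on `hashlib` (in practice you might use a library or Redis's bitmaps instead). The sizes `m` and `k` below are arbitrary example values, not tuned recommendations:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k salted hashes over an m-bit array.

    May report false positives, never false negatives, and uses
    only m/8 bytes no matter how many items are added.
    """

    def __init__(self, m=1 << 20, k=5):
        self.m = m                       # number of bits
        self.k = k                       # number of hash functions
        self.bits = bytearray(m // 8)    # bit array, all zeros

    def _positions(self, item):
        # Derive k bit positions from salted SHA-256 digests.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))
```

Usage mirrors a set: `bf.add(title)` to record a title, `title in bf` to check it. Because a positive answer might (rarely) be wrong, a common pattern is to treat "not in filter" as definitely new, and fall back to a database lookup only for the rare "in filter" cases.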