I use MySQL both for the queue and for checking whether a URL has already been visited. Since Redis's persistence guarantees are not great, and I hadn't considered Redis or anything else at the time, MySQL is perfectly fine for now.
The specific approach is to put a unique index on the MD5 of the URL; each lookup is fast and the table structure stays simple.
The queue is implemented as a plain table lookup; the SQL is as follows (the status values are self-defined):
select * from t_down_task where status = 0 order by id limit 1;
Completed tasks are deleted periodically.
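A minimal sketch of how that dedup-on-insert and queue poll could look, assuming a t_down_task table with columns id, url, url_md5 (carrying the UNIQUE index), and status, and the mysql-connector-python client (the column names and the client are my assumptions, not from the original post):

import hashlib

import mysql.connector  # assumed driver; any MySQL client works the same way

conn = mysql.connector.connect(user="crawler", database="crawl")  # placeholder credentials
cur = conn.cursor()

def enqueue(url: str) -> bool:
    """Queue the URL once; the UNIQUE index on url_md5 silently rejects duplicates."""
    url_md5 = hashlib.md5(url.encode("utf-8")).hexdigest()
    cur.execute(
        "INSERT IGNORE INTO t_down_task (url, url_md5, status) VALUES (%s, %s, 0)",
        (url, url_md5),
    )
    conn.commit()
    return cur.rowcount == 1  # 1 = newly queued, 0 = already seen

def next_task():
    """Fetch the oldest pending task, matching the query above."""
    cur.execute("SELECT id, url FROM t_down_task WHERE status = 0 ORDER BY id LIMIT 1")
    return cur.fetchone()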
http://en.wikipedia.org/wiki/Bloom_fi...
4 GB of memory can hold a very large BloomFilter. Each URL needs only a few bits, regardless of how long the URL is. A BloomFilter has a configurable false-positive rate (say one in a thousand or one in a hundred, depending on how it is sized), which means a few pages may be wrongly skipped, but nothing will ever be crawled twice.
If a BloomFilter backed by 4 GB of memory is still not enough, the original poster's real problem becomes how to store all of those crawled pages.
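As a rough check on those numbers, the standard sizing formula gives -ln(p) / ln(2)^2 bits per stored URL for a target false-positive rate p; a small sketch in Python (the numbers are illustrative, not from the original post):

import math

def bits_per_url(p: float) -> float:
    # optimal bloom-filter size: -ln(p) / ln(2)^2 bits per stored element
    return -math.log(p) / (math.log(2) ** 2)

four_gb_bits = 4 * 8 * 1024 ** 3
for p in (0.01, 0.001):
    b = bits_per_url(p)
    print(f"p={p}: {b:.1f} bits/URL, ~{four_gb_bits / b / 1e9:.1f} billion URLs in 4 GB")
# roughly 9.6 bits/URL (~3.6 billion URLs) at 1%, and 14.4 bits/URL (~2.4 billion) at 0.1%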
When the data volume is not very large, storing the MD5 hashes in a KV store is fairly reliable. Once the index grows large, that may no longer be enough, and you have to turn to space-compressing structures, such as the bloom filter someone mentioned above.
Some people have also implemented this kind of filter storage behind the Memcache protocol: http://code.google.com/p/mc-bloom-fil...
Some persistent k/v databases can be considered, and I recommend using leveldb.
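For example, with the plyvel binding for leveldb (the binding, path, and key layout are my own assumptions), a seen-URL check is just a get/put on the URL key:

import plyvel  # assumed Python binding for leveldb

db = plyvel.DB("/data/seen_urls", create_if_missing=True)

def mark_seen(url: str) -> None:
    db.put(url.encode("utf-8"), b"")  # value can be empty, only the key matters

def is_seen(url: str) -> bool:
    return db.get(url.encode("utf-8")) is not None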
It's better not to use Redis for this; just use the file system directly.
Convert each collected URL to a hexadecimal string with MD5, then use every 4 characters as one directory level, with the last 4 characters as the file name. The file content can be empty.
To check whether a URL has already been collected, MD5 the current URL, build the file path by the same rules, and simply test whether that path exists.
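A sketch of that layout, assuming a root directory such as /data/seen (the root path and function names are illustrative):

import hashlib
import os

SEEN_ROOT = "/data/seen"  # assumed root directory for the marker files

def marker_path(url: str) -> str:
    # 32-char MD5 hex split into 4-character groups: 7 directory levels plus a 4-char file name
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()
    parts = [digest[i:i + 4] for i in range(0, 32, 4)]
    return os.path.join(SEEN_ROOT, *parts)

def already_collected(url: str) -> bool:
    return os.path.exists(marker_path(url))

def mark_collected(url: str) -> None:
    path = marker_path(url)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    open(path, "w").close()  # empty marker file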
+1 for the bloomfilter. It can be used like this: store the URLs in leveldb, and on lookup use a bloomfilter to screen out most of the URLs that are not in the store. That should be about enough.
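A hedged sketch of that combination, assuming plyvel for leveldb and the third-party pybloom_live package for the filter (any bloom-filter implementation would do): the filter answers "definitely new" cheaply, and only possible hits fall through to leveldb.

import plyvel                         # assumed leveldb binding
from pybloom_live import BloomFilter  # assumed bloom-filter package

db = plyvel.DB("/data/seen_urls", create_if_missing=True)
bf = BloomFilter(capacity=10_000_000, error_rate=0.001)

def seen_before(url: str) -> bool:
    key = url.encode("utf-8")
    if url not in bf:            # no false negatives: definitely a new URL
        bf.add(url)
        db.put(key, b"")
        return False
    if db.get(key) is None:      # bloom-filter false positive: confirm against leveldb
        bf.add(url)
        db.put(key, b"")
        return False
    return True

One caveat: only leveldb is persistent, so the in-memory filter has to be rebuilt from it after a restart.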
How many URLs does the original poster need to crawl? If there are a lot, say hundreds of millions, and you happen to have Hadoop, you can also use Hadoop to deduplicate the new URLs against the old ones; MapReduce is very fast.
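As a rough illustration of that kind of dedup job (a Hadoop Streaming sketch, not the poster's actual setup): the mapper tags each URL as "new" or "old" depending on which list it came from, and the reducer emits a URL only if it never carries the "old" tag.

#!/usr/bin/env python3
# reducer.py -- input lines look like "url<TAB>new" or "url<TAB>old", grouped by URL
import sys

current_url, seen_old = None, False
for line in sys.stdin:
    url, tag = line.rstrip("\n").split("\t", 1)
    if url != current_url:
        if current_url is not None and not seen_old:
            print(current_url)      # URL only appeared in the new batch
        current_url, seen_old = url, False
    if tag == "old":
        seen_old = True
if current_url is not None and not seen_old:
    print(current_url)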