I use MySQL both for the queue and for checking whether a URL has already been visited. Since Redis's persistence guarantees are not great, and I hadn't considered Redis or anything else at the time, MySQL is perfectly fine for now.
The specific approach is to put a unique index on the MD5 of the URL; each lookup is fast and the table structure stays simple.
The queue is implemented as a plain table lookup; the SQL is as follows (the status values are self-defined):
select * from t_down_task where status = 0 order by id limit 1;
Completed tasks are deleted periodically.
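A minimal sketch of how that dedup-on-insert and queue poll could look, assuming a t_down_task table with columns id, url, url_md5 (carrying the UNIQUE index), and status, and the mysql-connector-python client (the column names and the client are my assumptions, not from the original post):

import hashlib

import mysql.connector  # assumed driver; any MySQL client works the same way

conn = mysql.connector.connect(user="crawler", database="crawl")  # placeholder credentials
cur = conn.cursor()

def enqueue(url: str) -> bool:
    """Queue the URL once; the UNIQUE index on url_md5 silently rejects duplicates."""
    url_md5 = hashlib.md5(url.encode("utf-8")).hexdigest()
    cur.execute(
        "INSERT IGNORE INTO t_down_task (url, url_md5, status) VALUES (%s, %s, 0)",
        (url, url_md5),
    )
    conn.commit()
    return cur.rowcount == 1  # 1 = newly queued, 0 = already seen

def next_task():
    """Fetch the oldest pending task, matching the query above."""
    cur.execute("SELECT id, url FROM t_down_task WHERE status = 0 ORDER BY id LIMIT 1")
    return cur.fetchone()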
http://en.wikipedia.org/wiki/Bloom_fi...
4 GB of memory can hold a very large BloomFilter. Each URL needs only a few bits, regardless of how long the URL is. A BloomFilter has a configurable false-positive rate (say one in a thousand or one in a hundred, depending on how it is sized), which means a few pages may be wrongly skipped, but nothing will ever be crawled twice.
If a BloomFilter backed by 4 GB of memory is still not enough, the original poster's real problem becomes how to store all of those crawled pages.
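As a rough check on those numbers, the standard sizing formula gives -ln(p) / ln(2)^2 bits per stored URL for a target false-positive rate p; a small sketch in Python (the numbers are illustrative, not from the original post):

import math

def bits_per_url(p: float) -> float:
    # optimal bloom-filter size: -ln(p) / ln(2)^2 bits per stored element
    return -math.log(p) / (math.log(2) ** 2)

four_gb_bits = 4 * 8 * 1024 ** 3
for p in (0.01, 0.001):
    b = bits_per_url(p)
    print(f"p={p}: {b:.1f} bits/URL, ~{four_gb_bits / b / 1e9:.1f} billion URLs in 4 GB")
# roughly 9.6 bits/URL (~3.6 billion URLs) at 1%, and 14.4 bits/URL (~2.4 billion) at 0.1%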
When the data volume is not very large, storing the MD5 hashes in a KV store is fairly reliable. Once the index grows large, that may no longer be enough, and you have to turn to space-compressing structures, such as the bloom filter someone mentioned above.
Some people have also implemented this kind of filter storage behind the Memcache protocol: http://code.google.com/p/mc-bloom-fil...
Some persistent k/v databases can be considered, and I recommend using leveldb.
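For example, with the plyvel binding for leveldb (the binding, path, and key layout are my own assumptions), a seen-URL check is just a get/put on the URL key:

import plyvel  # assumed Python binding for leveldb

db = plyvel.DB("/data/seen_urls", create_if_missing=True)

def mark_seen(url: str) -> None:
    db.put(url.encode("utf-8"), b"")  # value can be empty, only the key matters

def is_seen(url: str) -> bool:
    return db.get(url.encode("utf-8")) is not None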
It's better not to use Redis for this; just use the file system directly.
Convert each collected URL to a hexadecimal string with MD5, then use every 4 characters as one directory level, with the last 4 characters as the file name. The file content can be empty.
To check whether a URL has already been collected, MD5 the current URL, build the file path by the same rules, and simply test whether that path exists.
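A sketch of that layout, assuming a root directory such as /data/seen (the root path and function names are illustrative):

import hashlib
import os

SEEN_ROOT = "/data/seen"  # assumed root directory for the marker files

def marker_path(url: str) -> str:
    # 32-char MD5 hex split into 4-character groups: 7 directory levels plus a 4-char file name
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()
    parts = [digest[i:i + 4] for i in range(0, 32, 4)]
    return os.path.join(SEEN_ROOT, *parts)

def already_collected(url: str) -> bool:
    return os.path.exists(marker_path(url))

def mark_collected(url: str) -> None:
    path = marker_path(url)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    open(path, "w").close()  # empty marker file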
+1 for the bloomfilter. It can be used like this: store the URLs in leveldb, and on lookup use a bloomfilter to screen out most of the URLs that are not in the store. That should be about enough.
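A hedged sketch of that combination, assuming plyvel for leveldb and the third-party pybloom_live package for the filter (any bloom-filter implementation would do): the filter answers "definitely new" cheaply, and only possible hits fall through to leveldb.

import plyvel                         # assumed leveldb binding
from pybloom_live import BloomFilter  # assumed bloom-filter package

db = plyvel.DB("/data/seen_urls", create_if_missing=True)
bf = BloomFilter(capacity=10_000_000, error_rate=0.001)

def seen_before(url: str) -> bool:
    key = url.encode("utf-8")
    if url not in bf:            # no false negatives: definitely a new URL
        bf.add(url)
        db.put(key, b"")
        return False
    if db.get(key) is None:      # bloom-filter false positive: confirm against leveldb
        bf.add(url)
        db.put(key, b"")
        return False
    return True

One caveat: only leveldb is persistent, so the in-memory filter has to be rebuilt from it after a restart.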
How many URLs does the original poster need to crawl? If there are a lot, say hundreds of millions, and you happen to have Hadoop, you can also use Hadoop to deduplicate the new URLs against the old ones; MapReduce is very fast.
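As a rough illustration of that kind of dedup job (a Hadoop Streaming sketch, not the poster's actual setup): the mapper tags each URL as "new" or "old" depending on which list it came from, and the reducer emits a URL only if it never carries the "old" tag.

#!/usr/bin/env python3
# reducer.py -- input lines look like "url<TAB>new" or "url<TAB>old", grouped by URL
import sys

current_url, seen_old = None, False
for line in sys.stdin:
    url, tag = line.rstrip("\n").split("\t", 1)
    if url != current_url:
        if current_url is not None and not seen_old:
            print(current_url)      # URL only appeared in the new batch
        current_url, seen_old = url, False
    if tag == "old":
        seen_old = True
if current_url is not None and not seen_old:
    print(current_url)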