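I am crawling data from a few fixed websites. For URL deduplication, should I store each URL as a string (SET / GET), or store them all in a set (SADD / SMEMBERS)?
The number of URLs to store is roughly 100k to 1,000k in the initial phase.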
Use a Redis set. Set members are unique by definition, which is exactly what URL deduplication needs.
    $key = 'URL_HASH';
    if (!$redis->hGet($key, md5($url))) {
        // do something ...
        // after fetching $url, mark it as seen
        $redis->hSet($key, md5($url), true);
    }
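Note that if the crawler runs in multiple threads or processes, you have to account for the other workers: one of them may pick up the same URL between the check and the write. To handle this, you can change the boolean value to an enumeration (status) value.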
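One possible way to apply the status-value idea is sketched below. It assumes `$redis` is a connected phpredis `Redis` instance and uses `hSetNx`, which writes a field only if it does not exist yet, so the check and the write happen in a single command. The function names `claimUrl` and `markDone` and the status strings `pending` and `done` are hypothetical, chosen only for illustration.

```php
<?php
// Hypothetical status values; the answer above only says "an enumeration value".
const STATUS_PENDING = 'pending';
const STATUS_DONE = 'done';

/**
 * Try to claim a URL for crawling.
 * Returns true only for the first caller; later callers get false.
 * $redis is assumed to be a connected phpredis Redis instance.
 */
function claimUrl($redis, string $key, string $url): bool
{
    // HSETNX sets the field only when it is absent, so two workers
    // cannot both claim the same URL.
    return (bool) $redis->hSetNx($key, md5($url), STATUS_PENDING);
}

/**
 * Record that the URL has been fetched.
 */
function markDone($redis, string $key, string $url): void
{
    $redis->hSet($key, md5($url), STATUS_DONE);
}
```

A worker would call `claimUrl($redis, 'URL_HASH', $url)`, fetch the page only when it returns true, and call `markDone` afterwards. A worker that crashes mid-fetch leaves the field at `pending`, so a real crawler would probably need some way to retry such entries. If no status is needed, the return value of `sAdd` on a set (the number of members actually added) can serve as the same kind of single-command check.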