Saying that this is a crawler is a bit exaggerated, but the name is just right, so I added the word "simple" in front to show that
This is a castrated crawler, it can be used simply or played with.
The company recently has a new business to capture the data of competing products. After reading the capture system written by a previous classmate, there are certain problems.
The rules are too strong, whether it is scalability or versatility. The surface is a little weak. The previous system had to make a list,
and then crawl from this list. There was no concept of depth, which was a flaw for crawlers. Therefore, I decided to make a
slightly more general crawler, add the concept of depth, and improve the scalability and generality.
We have agreed here that the content to be processed (may be URL, user name, etc.) we will call it entity.
Considering scalability, the concept of queue is adopted here. All entities to be processed are stored in the queue. Each time it is processed,
takes out an entity from the queue. After the processing is completed, Store and store the newly captured entities in the queue. Of course, here
also needs to be stored and deduplicated, and queued and deduplicated to prevent the processor from doing useless work.
+--------+ +-----------+ +----------+ | entity | | enqueue | | result | | list | | uniq list | | uniq list| | | | | | | | | | | | | | | | | | | | | | | | | +--------+ +-----------+ +----------+
When each entity enters the queueEnqueues and requeues
Set the enqueue entity flag to one and no longer enter the queue. When the
entity is processed, the result data is obtained , after processing the result data, mark the result verses such as result data reorder list
, of course
, you can also do update processing here, and the code can be compatible.
+-------+ | 开始 | +---+---+ | v +-------+ enqueue deep为1的实体 | init |--------------------------------> +---+---+ set 已经入过队列 flag | v +---------+ empty queue +------+ +------>| dequeue +------------->| 结束 | | +----+----+ +------+ | | | | | | | v | +---------------+ enqueue deep为deep+1的实体 | | handle entity |------------------------------> | +-------+-------+ set 已经入过队列 flag | | | | | v | +---------------+ set 已经处理过结果 flag | | handle result |--------------------------> | +-------+-------+ | | +------------+
In order to crawl some websites, the most feared thing is to block the IP. Have you ever blocked the IP? The agent can only hehehe. Therefore, crawling
strategy is still very important.
Before crawling, you can search the Internet for relevant information about the website to be crawled, see if anyone has crawled it before, and absorb their experience. Then I have to carefully analyze the website requests myself to see if their website requests will bring special parameters? Not logged in
Status
Will there be related cookie? Finally, I tried to set a capture frequency as high as possible. If you must log in to the website to be crawled, you can register a batch of accounts, then simulate a successful login, and request in turn.
It will be even more troublesome if login requires a
, you can try to log in manually and then save the cookie (of course
, you can try OCR recognition if you have the ability). Of course, you still need to consider the issues mentioned in the previous paragraph after logging in. It does not mean that everything will be fine after logging in. Some websites will have their accounts blocked if their crawling frequency is too fast after logging in. Therefore, try to find a method that does not require logging in. It is troublesome to log in to blocked accounts, apply for accounts, and change accounts.
crawl data source and depth
Initial data source selection is also important. What I want to do is to crawl once a day, so I am looking for a place where the crawling website is updated daily, so that the initialization action can be fully automatic, and I basically don’t need to manage it. The crawling will start from Daily
updates occur automatically. The crawling depth is also very important. This should be determined based on the specific website, needs, and content that has been crawled, and capture as much of the website data as possible.
Optimization
The first is the queue, which has been changed to a stack-like structure. Because in the previous queue, entities with small depth were always executed first.
entity, and then process the next entity. For example, for the initial 10 entities (deep=1), the maximum crawling depth
is 3, and there are 10 sub-entities under each entity, and then the maximum length of their queues are:队列(lpush,rpop) => 1000个 修改之后的队列(lpush,lpop) => 28个
The following is a long and boring code. I originally wanted to post it on
hub, but I thought the project was a bit small, so I thought I’d post it directly. I hope my friends will speak out about the bad things, whether it is code or design.
abstract class SpiderBase { /** * @var 处理队列中数据的休息时间开始区间 */ public $startMS = 1000000; /** * @var 处理队列中数据的休息时间结束区间 */ public $endMS = 3000000; /** * @var 最大爬取深度 */ public $maxDeep = 1; /** * @var 队列最大长度,默认1w */ public $maxQueueLen = 10000; /** * @desc 给队列中插入一个待处理的实体 * 插入之前调用 @see isEnqueu 判断是否已经如果队列 * 直插入没如果队列的 * * @param $deep 插入实体在爬虫中的深度 * @param $entity 插入的实体内容 * @return bool 是否插入成功 */ abstract public function enqueue($deep, $entity); /** * @desc 从队列中取出一个待处理的实体 * 返回值示例,实体内容格式可自行定义 * [ * "deep" => 3, * "entity" => "balabala" * ] * * @return array */ abstract public function dequeue(); /** * @desc 获取待处理队列长度 * * @return int */ abstract public function queueLen(); /** * @desc 判断队列是否可以继续入队 * * @param $params mixed * @return bool */ abstract public function canEnqueue($params); /** * @desc 判断一个待处理实体是否已经进入队列 * * @param $entity 实体 * @return bool 是否已经进入队列 */ abstract public function isEnqueue($entity); /** * @desc 设置一个实体已经进入队列标志 * * @param $entity 实体 * @return bool 是否插入成功 */ abstract public function setEnqueue($entity); /** * @desc 判断一个唯一的抓取到的信息是否已经保存过 * * @param $entity mixed 用于判断的信息 * @return bool 是否已经保存过 */ abstract public function isSaved($entity); /** * @desc 设置一个对象已经保存 * * @param $entity mixed 是否保存的一句 * @return bool 是否设置成功 */ abstract public function setSaved($entity); /** * @desc 保存抓取到的内容 * 这里保存之前会判断是否保存过,如果保存过就不保存了 * 如果设置了更新,则会更新 * * @param $uniqInfo mixed 抓取到的要保存的信息 * @param $update bool 保存过的话是否更新 * @return bool */ abstract public function save($uniqInfo, $update); /** * @desc 处理实体的内容 * 这里会调用enqueue * * @param $item 实体数组,@see dequeue 的返回值 * @return */ abstract public function handle($item); /** * @desc 随机停顿时间 * * @param $startMs 随机区间开始微妙 * @param $endMs 随机区间结束微妙 * @return bool */ public function randomSleep($startMS, $endMS) { $rand = rand($startMS, $endMS); usleep($rand); return true; } /** * @desc 修改默认停顿时间开始区间值 * * @param $ms int 微妙 * @return obj $this */ public function setStartMS($ms) { $this->startMS = $ms; return $this; } /** * @desc 修改默认停顿时间结束区间值 * * @param $ms int 微妙 * @return obj $this */ public function setEndMS($ms) { $this->endMS = $ms; return $this; } /** * @desc 设置队列最长长度,溢出后丢弃 * * @param $len int 队列最大长度 */ public function setMaxQueueLen($len) { $this->maxQueueLen = $len; return $this; } /** * @desc 设置爬取最深层级 * 入队列的时候判断层级,如果超过层级不做入队操作 * * @param $maxDeep 爬取最深层级 * @return obj */ public function setMaxDeep($maxDeep) { $this->maxDeep = $maxDeep; return $this; } public function run() { while ($this->queueLen()) { $item = $this->dequeue(); if (empty($item)) continue; $item = json_decode($item, true); if (empty($item) || empty($item["deep"]) || empty($item["entity"])) continue; $this->handle($item); $this->randomSleep($this->startMS, $this->endMS); } } /** * @desc 通过curl获取链接内容 * * @param $url string 链接地址 * @param $curlOptions array curl配置信息 * @return mixed */ public function getContent($url, $curlOptions = []) { $ch = curl_init(); curl_setopt_array($ch, $curlOptions); curl_setopt($ch, CURLOPT_URL, $url); if (!isset($curlOptions[CURLOPT_HEADER])) curl_setopt($ch, CURLOPT_HEADER, 0); if (!isset($curlOptions[CURLOPT_RETURNTRANSFER])) curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); if (!isset($curlOptions[CURLOPT_USERAGENT])) curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Macintosh; Intel Mac"); $content = curl_exec($ch); if ($errorNo = curl_errno($ch)) { $errorInfo = curl_error($ch); echo "curl error : errorNo[{$errorNo}], errorInfo[{$errorInfo}]\n"; curl_close($ch); return false; } $httpCode = curl_getinfo($ch,CURLINFO_HTTP_CODE); curl_close($ch); if (200 != $httpCode) { echo "http code error : {$httpCode}, $url, [$content]\n"; return false; } return $content; } } abstract class RedisDbSpider extends SpiderBase { protected $queueName = ""; protected $isQueueName = ""; protected $isSaved = ""; public function construct($objRedis = null, $objDb = null, $configs = []) { $this->objRedis = $objRedis; $this->objDb = $objDb; foreach ($configs as $name => $value) { if (isset($this->$name)) { $this->$name = $value; } } } public function enqueue($deep, $entities) { if (!$this->canEnqueue(["deep"=>$deep])) return true; if (is_string($entities)) { if ($this->isEnqueue($entities)) return true; $item = [ "deep" => $deep, "entity" => $entities ]; $this->objRedis->lpush($this->queueName, json_encode($item)); $this->setEnqueue($entities); } else if(is_array($entities)) { foreach ($entities as $key => $entity) { if ($this->isEnqueue($entity)) continue; $item = [ "deep" => $deep, "entity" => $entity ]; $this->objRedis->lpush($this->queueName, json_encode($item)); $this->setEnqueue($entity); } } return true; } public function dequeue() { $item = $this->objRedis->lpop($this->queueName); return $item; } public function isEnqueue($entity) { $ret = $this->objRedis->hexists($this->isQueueName, $entity); return $ret ? true : false; } public function canEnqueue($params) { $deep = $params["deep"]; if ($deep > $this->maxDeep) { return false; } $len = $this->objRedis->llen($this->queueName); return $len < $this->maxQueueLen ? true : false; } public function setEnqueue($entity) { $ret = $this->objRedis->hset($this->isQueueName, $entity, 1); return $ret ? true : false; } public function queueLen() { $ret = $this->objRedis->llen($this->queueName); return intval($ret); } public function isSaved($entity) { $ret = $this->objRedis->hexists($this->isSaved, $entity); return $ret ? true : false; } public function setSaved($entity) { $ret = $this->objRedis->hset($this->isSaved, $entity, 1); return $ret ? true : false; } } class Test extends RedisDbSpider { /** * @desc 构造函数,设置redis、db实例,以及队列相关参数 */ public function construct($redis, $db) { $configs = [ "queueName" => "spider_queue:zhihu", "isQueueName" => "spider_is_queue:zhihu", "isSaved" => "spider_is_saved:zhihu", "maxQueueLen" => 10000 ]; parent::construct($redis, $db, $configs); } public function handle($item) { $deep = $item["deep"]; $entity = $item["entity"]; echo "开始抓取用户[{$entity}]\n"; echo "数据内容入库\n"; echo "下一层深度如队列\n"; echo "抓取用户[{$entity}]结束\n"; } public function save($addUsers, $update) { echo "保存成功\n"; } }
The above is the detailed content of php simple crawler. For more information, please follow other related articles on the PHP Chinese website!