In the scrapy-redis framework, the xxx:requests queue stored in Redis has already been crawled completely, but the program keeps running. How can the program be stopped automatically instead of idling like this?
2017-07-03 09:17:06 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-07-03 09:18:06 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
You can stop the program by calling engine.close_spider(spider, 'reason').
def next_request(self):
    # Based on scrapy-redis's Scheduler.next_request; the `request is None`
    # branch is the addition that closes the spider when the queue is empty.
    block_pop_timeout = self.idle_before_close
    request = self.queue.pop(block_pop_timeout)
    if request and self.stats:
        self.stats.inc_value('scheduler/dequeued/redis', spider=self.spider)
    if request is None:
        self.spider.crawler.engine.close_spider(self.spider, 'queue is empty')
    return request
There is another thing I don't understand:
When the spider is closed through engine.close_spider(spider, 'reason'), several errors appear before it actually shuts down.
# Normal shutdown
2017-07-03 18:02:38 [scrapy.core.engine] INFO: Closing spider (queue is empty)
2017-07-03 18:02:38 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'finish_reason': 'queue is empty',
'finish_time': datetime.datetime(2017, 7, 3, 10, 2, 38, 616021),
'log_count/INFO': 8,
'start_time': datetime.datetime(2017, 7, 3, 10, 2, 38, 600382)}
2017-07-03 18:02:38 [scrapy.core.engine] INFO: Spider closed (queue is empty)
# A few more errors still appear before the spider actually closes. Does the spider start
# several threads crawling together at startup, so that after one thread closes the spider
# the others can no longer find it and raise these errors?
Unhandled Error
Traceback (most recent call last):
File "D:/papp/project/launch.py", line 37, in <module>
process.start()
File "D:\Program Files\python3\lib\site-packages\scrapy\crawler.py", line 285, in start
reactor.run(installSignalHandlers=False) # blocking call
File "D:\Program Files\python3\lib\site-packages\twisted\internet\base.py", line 1243, in run
self.mainLoop()
File "D:\Program Files\python3\lib\site-packages\twisted\internet\base.py", line 1252, in mainLoop
self.runUntilCurrent()
--- <exception caught here> ---
File "D:\Program Files\python3\lib\site-packages\twisted\internet\base.py", line 878, in runUntilCurrent
call.func(*call.args, **call.kw)
File "D:\Program Files\python3\lib\site-packages\scrapy\utils\reactor.py", line 41, in __call__
return self._func(*self._a, **self._kw)
File "D:\Program Files\python3\lib\site-packages\scrapy\core\engine.py", line 137, in _next_request
if self.spider_is_idle(spider) and slot.close_if_idle:
File "D:\Program Files\python3\lib\site-packages\scrapy\core\engine.py", line 189, in spider_is_idle
if self.slot.start_requests is not None:
builtins.AttributeError: 'NoneType' object has no attribute 'start_requests'
How do you know that all the queued requests have been crawled? You have to define that yourself, for example by treating the spider as finished once it has stayed idle for a while; a rough sketch of that idea follows.
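The sketch below is only an illustration, not code from scrapy or scrapy-redis: it subclasses RedisSpider and overrides the spider_idle handler so the crawl ends after the Redis queue has looked empty for several consecutive idle signals. It assumes the RedisMixin helper schedule_next_requests exists in your scrapy-redis version; MySpider, max_idle_times, the redis_key value and the threshold are invented names.

from scrapy.exceptions import DontCloseSpider
from scrapy_redis.spiders import RedisSpider


class MySpider(RedisSpider):
    name = 'myspider'
    redis_key = 'myspider:start_urls'

    max_idle_times = 5   # consecutive idle signals to tolerate before closing
    idle_times = 0

    def spider_idle(self):
        # scrapy-redis's default idle handler always raises DontCloseSpider so
        # the spider keeps waiting for new URLs; here we stop waiting once the
        # queue has looked empty max_idle_times times in a row.
        self.schedule_next_requests()
        self.idle_times += 1
        if self.idle_times < self.max_idle_times:
            raise DontCloseSpider
        # falling through lets Scrapy close the spider normally

    def parse(self, response):
        self.idle_times = 0   # got a response, so the queue was not empty
        yield {'url': response.url}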
If your needs are not complicated, you can instead use the built-in CloseSpider extension to shut it down:
scrapy.contrib.closespider.CloseSpider
CLOSESPIDER_TIMEOUT
CLOSESPIDER_ITEMCOUNT
CLOSESPIDER_PAGECOUNT
CLOSESPIDER_ERRORCOUNT
http://scrapy-chs.readthedocs...
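For reference, a minimal settings.py sketch using those options (the threshold values below are arbitrary placeholders; set only the ones you need):

CLOSESPIDER_TIMEOUT = 3600      # close the spider after one hour of running
CLOSESPIDER_ITEMCOUNT = 10000   # ...or after scraping this many items
CLOSESPIDER_PAGECOUNT = 5000    # ...or after downloading this many pages
CLOSESPIDER_ERRORCOUNT = 10     # ...or after this many errors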