python - Scrapy automatic pagination: after following the link to the next page, the spider finishes immediately
伊谢尔伦 2017-04-17 16:11:17
# -*- coding: utf-8 -*-
import scrapy
from weather.items import WeatherItem
from scrapy.http import Request


class WeatherSpider(scrapy.Spider):
    name = "myweather"
    allowed_domains = ["http://xjh.haitou.cc/nj/uni-21"]
    start_urls = ["http://xjh.haitou.cc/nj/uni-21/page-2"]

    url="http://xjh.haitou.cc"

    def parse(self, response):
        item = WeatherItem()
        preachs=response.xpath('//table[@id="mainInfoTable"]/tbody/tr')
        for preach in preachs:
            item['corp']=preach.xpath('.//p[@class="text-success company"]/text()').extract()
            item['date']=preach.xpath('.//span[@class="hold-ymd"]/text()').extract()
            item['location']=preach.xpath('.//td[@class="text-ellipsis"]/span/text()').extract()
            item['click']=preach.xpath('.//td[@class="text-right"]/text()').extract()
            yield item

        nextlink=response.xpath('//li[@class="next"]/a/@href').extract()

        if nextlink:
            link=nextlink[0]
            print "##############"
            print self.url+link
            print "##############"

            yield Request(self.url+link,callback=self.parse )
##############
http://xjh.haitou.cc/nj/uni-21/page-3
##############
2015-10-23 22:05:57 [scrapy] DEBUG: Filtered offsite request to 'xjh.haitou.cc': <GET http://xjh.haitou.cc/nj/uni-21/page-3>
2015-10-23 22:05:57 [scrapy] INFO: Closing spider (finished)
2015-10-23 22:05:57 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 261,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 10508,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2015, 10, 23, 14, 5, 57, 9032),
 'item_scraped_count': 20,
 'log_count/DEBUG': 23,
 'log_count/INFO': 7,
 'offsite/domains': 1,
 'offsite/filtered': 1,
 'request_depth_max': 1,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2015, 10, 23, 14, 5, 56, 662979)}
2015-10-23 22:05:57 [scrapy] INFO: Spider closed (finished)
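The `Filtered offsite request` DEBUG line above points at the cause: Scrapy's OffsiteMiddleware compares each request's host name against the entries in `allowed_domains`, and a full URL such as `"http://xjh.haitou.cc/nj/uni-21"` can never equal a host name, so every follow-up request is dropped and the spider closes. A minimal standard-library sketch of that comparison (a hypothetical `is_offsite` helper for illustration, not Scrapy's actual implementation):

```python
from urllib.parse import urlparse

def is_offsite(url, allowed_domains):
    """Return True if url's host matches none of the allowed domains."""
    host = urlparse(url).hostname or ""
    # a request is on-site if its host equals an entry or is a subdomain of it
    return not any(host == d or host.endswith("." + d) for d in allowed_domains)

# With a full URL in allowed_domains, the host never matches -> filtered:
print(is_offsite("http://xjh.haitou.cc/nj/uni-21/page-3",
                 ["http://xjh.haitou.cc/nj/uni-21"]))   # True
# With a bare domain, the request passes:
print(is_offsite("http://xjh.haitou.cc/nj/uni-21/page-3",
                 ["xjh.haitou.cc"]))                    # False
```

This is why `allowed_domains` must list bare domain names (`"xjh.haitou.cc"`), not URLs.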

All replies (2)
刘奇

Just fix your allowed_domains and start_urls (for brevity I also removed the url="http://xjh.haitou.cc" attribute, which is unnecessary).
After that, whenever a next-page link exists, keep crawling with:
yield scrapy.Request(response.urljoin(nextlink[0]), callback=self.parse)

The corrected code is below; I won't go into the reasons here — see the official documentation.
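For reference, `response.urljoin(href)` resolves the extracted href against `response.url`, the same way the standard library's `urljoin` does, so both relative and absolute hrefs work without the hand-built `self.url + link` concatenation:

```python
from urllib.parse import urljoin

# response.urljoin(href) is equivalent to urljoin(response.url, href);
# an absolute path like "/nj/uni-21/page-2" replaces everything after the host
print(urljoin("http://xjh.haitou.cc/nj/uni-21", "/nj/uni-21/page-2"))
# -> http://xjh.haitou.cc/nj/uni-21/page-2
```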

class WeatherSpider(scrapy.Spider):
    name = "myweather"
    allowed_domains = ["xjh.haitou.cc"]  # bare domain name, not a full URL
    start_urls = ["http://xjh.haitou.cc/nj/uni-21"]

    def parse(self, response):
        preachs = response.xpath('//table[@id="mainInfoTable"]/tbody/tr')
        for preach in preachs:
            # create a fresh item per row instead of reusing one instance
            item = WeatherItem()
            item['corp'] = preach.xpath('.//p[@class="text-success company"]/text()').extract()
            item['date'] = preach.xpath('.//span[@class="hold-ymd"]/text()').extract()
            item['location'] = preach.xpath('.//td[@class="text-ellipsis"]/span/text()').extract()
            item['click'] = preach.xpath('.//td[@class="text-right"]/text()').extract()
            yield item

        nextlink = response.xpath('//li[@class="next"]/a/@href').extract()
        if nextlink:
            yield scrapy.Request(response.urljoin(nextlink[0]), callback=self.parse)
2015-10-26 15:59:58 [scrapy] INFO: Closing spider (finished)
2015-10-26 15:59:58 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 2247,
 'downloader/request_count': 7,
 'downloader/request_method_count/GET': 7,
 'downloader/response_bytes': 71771,
 'downloader/response_count': 7,
 'downloader/response_status_count/200': 7,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2015, 10, 26, 7, 59, 58, 975394),
 'item_scraped_count': 132,
 'log_count/DEBUG': 139,
 'log_count/INFO': 7,
 'request_depth_max': 6,
 'response_received_count': 7,
 'scheduler/dequeued': 7,
 'scheduler/dequeued/memory': 7,
 'scheduler/enqueued': 7,
 'scheduler/enqueued/memory': 7,
 'start_time': datetime.datetime(2015, 10, 26, 7, 59, 56, 500595)}
2015-10-26 15:59:58 [scrapy] INFO: Spider closed (finished)

Partial output:
1:{"date": ["2015-10-26 12:00"], "corp": ["大通證券股份有限公司"], "location": ["教一-508"], "click": ["159"]}
2:{"date": ["2015-10-26 14:00"], "corp": ["Goa大象設計"], "location": ["四牌樓校區中大院309"], "click": ["497"]}
3:{"date": ["2015-10-26 14:00"], "corp": ["中國建築西南勘察設計研究院有限公司"], "location": ["四牌樓校區中山院111"], "click": ["403"]}
4:{"date": ["2015-10-26 14:00"], "corp": ["蘇州桑泰海洋儀器研發有限責任公司"], "location": ["四牌樓校區中山院201"], "click": ["624"]}
5:{"date": ["2015-10-26 14:00"], "corp": ["大唐電信科技股份有限公司"], "location": ["四牌樓校區致知堂"], "click": ["1031"]}
6:{"date": ["2015-10-26 14:00"], "corp": ["華信顧問設計研究院有限公司"], "location": ["教六403"], "click": ["373"]}
7:{"date": ["2015-10-26 14:00"], "corp": ["山石網科通訊技術有限公司"], "location": ["九龍湖校區教四302"], "click": ["573"]}
8:{"date": ["2015-10-26 18:30"], "corp": ["北京凱晨置業有限公司"], "location": ["四牌樓校區榴園賓館逸夫科技館"], "click": ["254"]}
9:{"date": ["2015-10-26 18:30"], "corp": ["中國建築國際集團有限公司"], "location": ["四牌樓校區禮東101"], "click": ["237"]}
10:{"date": ["2015-10-26 18:30"], "corp": ["無錫華潤微電子有限公司"], "location": ["四牌樓校區群賢樓三樓報告廳"], "click": ["607"]}
11:{"date": ["2015-10-26 19:00"], "corp": ["上海斐訊資料通訊技術有限公司"], "location": ["教一208"], "click": ["461"]}
.....
.....
129:{"date": ["2015-11-16 14:00"], "corp": ["人本集團有限公司"], "location": ["大學生活動中心322多功能廳"], "click": ["26"]}
130:{"date": ["2015-11-17 18:30"], "corp": ["仲量聯行測量師事務所(上海)有限公司"], "location": ["九龍湖大學生活動中心324報"], "click": ["19"]}
131:{"date": ["2015-11-18 15:30"], "corp": ["廈門中駿集團有限公司"], "location": ["四牌樓榴園新華廳"], "click": ["63"]}
132:{"date": ["2015-11-19 14:00"], "corp": ["理士國際技術有限公司"], "location": ["九龍湖大學生活動中心322報"], "click": ["22"]}

迷茫

Here is a reference you may find helpful:

Reference link
