python - Scrapy pagination crawl: the spider stops on its own after the jump from the second page
伊谢尔伦 2017-04-17 16:11:17
# -*- coding: utf-8 -*-
import scrapy
from weather.items import WeatherItem
from scrapy.http import Request


class WeatherSpider(scrapy.Spider):
    name = "myweather"
    allowed_domains = ["http://xjh.haitou.cc/nj/uni-21"]
    start_urls = ["http://xjh.haitou.cc/nj/uni-21/page-2"]

    url="http://xjh.haitou.cc"

    def parse(self, response):
        item = WeatherItem()
        preachs=response.xpath('//table[@id="mainInfoTable"]/tbody/tr')
        for preach in preachs:
            item['corp']=preach.xpath('.//p[@class="text-success company"]/text()').extract()
            item['date']=preach.xpath('.//span[@class="hold-ymd"]/text()').extract()
            item['location']=preach.xpath('.//td[@class="text-ellipsis"]/span/text()').extract()
            item['click']=preach.xpath('.//td[@class="text-right"]/text()').extract()
            yield item

        nextlink=response.xpath('//li[@class="next"]/a/@href').extract()

        if nextlink:
            link=nextlink[0]
            print "##############"
            print self.url+link
            print "##############"

            yield Request(self.url+link,callback=self.parse )
##############
http://xjh.haitou.cc/nj/uni-21/page-3
##############
2015-10-23 22:05:57 [scrapy] DEBUG: Filtered offsite request to 'xjh.haitou.cc': <GET http://xjh.haitou.cc/nj/uni-21/page-3>
2015-10-23 22:05:57 [scrapy] INFO: Closing spider (finished)
2015-10-23 22:05:57 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 261,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 10508,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2015, 10, 23, 14, 5, 57, 9032),
 'item_scraped_count': 20,
 'log_count/DEBUG': 23,
 'log_count/INFO': 7,
 'offsite/domains': 1,
 'offsite/filtered': 1,
 'request_depth_max': 1,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2015, 10, 23, 14, 5, 56, 662979)}
2015-10-23 22:05:57 [scrapy] INFO: Spider closed (finished)

All replies (2)
刘奇

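You only need to change allowed_domains and start_urls. allowed_domains must list bare domain names such as xjh.haitou.cc, not full URLs; because it holds a URL here, the page-3 request is dropped as offsite (the "Filtered offsite request" line in the log). For simplicity, also delete the url="http://xjh.haitou.cc" definition, which is not needed.
After the change, to keep crawling the following pages, yield the next request like this:
yield scrapy.Request(response.urljoin(nextlink[0]),callback=self.parse)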
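
To see why the original request was filtered, the host check and the link resolution can be imitated with the standard library alone. This is a minimal Python 3 sketch: host_allowed is a hypothetical helper that only approximates an offsite check and is not Scrapy's actual code, and urllib.parse.urljoin stands in for response.urljoin on the assumption that the page declares no <base> tag.

```python
import re
from urllib.parse import urljoin, urlparse


def host_allowed(url, allowed_domains):
    """Return True if the URL's host equals, or is a subdomain of, an allowed entry.

    Hypothetical helper: a simplified stand-in for an offsite check,
    not Scrapy's actual implementation.
    """
    host = urlparse(url).hostname or ""
    pattern = r"^(.*\.)?(%s)$" % "|".join(re.escape(d) for d in allowed_domains)
    return re.match(pattern, host) is not None


page = "http://xjh.haitou.cc/nj/uni-21/page-2"
# The next-page href on the site is root-relative, e.g. "/nj/uni-21/page-3".
next_url = urljoin(page, "/nj/uni-21/page-3")
print(next_url)  # http://xjh.haitou.cc/nj/uni-21/page-3

# A URL-style entry can never equal a bare host name, so the request is rejected.
print(host_allowed(next_url, ["http://xjh.haitou.cc/nj/uni-21"]))  # False
# A bare domain matches the host, so the request passes.
print(host_allowed(next_url, ["xjh.haitou.cc"]))  # True
```

With the URL-style entry the page-3 request is rejected, which is consistent with the log in the question; with the bare domain it passes.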

The code was modified accordingly. For details, it is best to consult the official documentation.

Part of the data is as follows:
1:{"date": ["2015-10-26 12:00"], "corp": ["Datong Securities Co., Ltd."], "location": ["Jiaoyi-508"], "click": ["159"]}
2:{"date": ["2015-10-26 14:00"], "corp": ["Goa Elephant Design"], "location": ["309 Zhongyuan, Sipailou Campus"], "click": ["497"]}
3:{"date": ["2015-10-26 14:00"], "corp": ["China Southwest Architectural Survey and Design Institute Co., Ltd."], "location": ["111, Sun Yat-sen University, Sipailou Campus"], "click": ["403"]}
4:{"date": ["2015-10-26 14:00"], "corp": ["Suzhou Suntai Marine Instrument R&D Co., Ltd."], "location": ["201, Sun Yat-sen University, Sipailou Campus"], "click": ["624"]}
5:{"date": ["2015-10-26 14:00"], "corp": ["Datang Telecom Technology Co., Ltd."], "location": ["Zhizhi Hall, Sipailou Campus"], "click": ["1031"]}
6:{"date": ["2015-10-26 14:00"], "corp": ["Huaxin Consulting and Design Institute Co., Ltd."], "location": ["Jiaoliu 403"], "click": ["373"]}
7:{"date": ["2015-10-26 14:00"], "corp": ["Shanshi Network Communication Technology Co., Ltd."], "location": ["Jiulong Lake Campus Teaching 4 302"], "click": ["573"]}
8:{"date": ["2015-10-26 18:30"], "corp": ["Beijing Kaichen Real Estate Co., Ltd."], "location": ["Yifu Science and Technology Museum, Liuyuan Hotel, Sipailou Campus"], "click": ["254"]}
9:{"date": ["2015-10-26 18:30"], "corp": ["China Construction International Group Co., Ltd."], "location": ["Lidong 101, Sipailou Campus"], "click": ["237"]}
10:{"date": ["2015-10-26 18:30"], "corp": ["Wuxi China Resources Microelectronics Co., Ltd."], "location": ["Lecture hall, 3rd floor, Qunxian Building, Sipailou Campus"], "click": ["607"]}
11:{"date": ["2015-10-26 19:00"], "corp": ["Shanghai Feixun Data Communication Technology Co., Ltd."], "location": ["Jiaoyi 208"], "click": ["461"]}
....
....
129:{"date": ["2015-11-16 14:00"], "corp": ["Renben Group Co., Ltd."], "location": ["Room 322, Multi-purpose Hall, University Student Activity Center"], "click": ["26"]}
130:{"date": ["2015-11-17 18:30"], "corp": ["Jones Lang LaSalle Surveyors (Shanghai) Co., Ltd."], "location": ["Jiulong Lake Student Activity Center 324 Report"], "click": ["19"]}
131:{"date": ["2015-11-18 15:30"], "corp": ["Xiamen Zhongjun Group Co., Ltd."], "location": ["Sipailou Liuyuan Xinhua Hall"], "click": ["63"]}
132:{"date": ["2015-11-19 14:00"], "corp": ["Leoch International Technology Co., Ltd."], "location": ["Jiulong Lake Student Activity Center 322 Newspaper"], "click": ["22"]}

迷茫
