class job51():
def __init__(self):
self.headers={
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
'Accept-Encoding':'gzip, deflate, sdch',
'Accept-Language': 'zh-CN,zh;q=0.8',
'Cache-Control': 'max-age=0',
'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36',
'Cookie':''
}
def start(self):
html=session.get("http://my.51job.com/cv/CResume/CV_CResumeManage.php",headers=self.headers)
self.parse(html)
def parse(self,response):
tree=lxml.etree.HTML(response.text)
resume_url=tree.xpath('//tbody/tr[@class="resumeName"]/td[1]/a/@href')
print (resume_url[0]
能爬到我想要的结果,就是简历的url,但是用scrapy,同样的headers,页面好像停留在登录页面?
class job51(Spider):
name = "job51"
#allowed_domains = ["my.51job.com"]
start_urls = ["http://my.51job.com/cv/CResume/CV_CResumeManage.php"]
headers={
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
'Accept-Encoding':'gzip, deflate, sdch',
'Accept-Language': 'zh-CN,zh;q=0.8',
'Cache-Control': 'max-age=0',
'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36',
'Cookie':''
}
def start_requests(self):
yield Request(url=self.start_urls[0],headers=self.headers,callback=self.parse)
def parse(self,response):
#tree=lxml.etree.HTML(text)
selector=Selector(response)
print ("<<<<<<<<<<<<<<<<<<<<<",response.text)
resume_url=selector.xpath('//tr[@class="resumeName"]/td[1]/a/@href')
print (">>>>>>>>>>>>",resume_url)
输出的结果:
scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'job51', 'SPIDER_MODULES': ['job51.spiders'], 'ROBOTSTXT_OBEY': True, 'NEWSPIDER_MODULE': 'job51.spiders'}
2017-04-11 10:58:31 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.logstats.LogStats',
'scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole']
2017-04-11 10:58:32 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-04-11 10:58:32 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-04-11 10:58:32 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2017-04-11 10:58:32 [scrapy.core.engine] INFO: Spider opened
2017-04-11 10:58:32 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-04-11 10:58:32 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-04-11 10:58:33 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://my.51job.com/robots.txt> (referer: None)
2017-04-11 10:58:33 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://my.51job.com/cv/CResume/CV_CResumeManage.php> (referer: None)
<<<<<<<<<<<<<<<<<<<<< <script>window.location='https://login.51job.com/login.php?url=http://my.51job.com%2Fcv%2FCResume%2FCV_CResumeManage.php%3F7087';</script>
>>>>>>>>>>>> []
2017-04-11 10:58:33 [scrapy.core.scraper] ERROR: Spider error processing <GET http://my.51job.com/cv/CResume/CV_CResumeManage.php> (referer: None)
Traceback (most recent call last):
File "d:\python35\lib\site-packages\scrapy\utils\defer.py", line 102, in iter_errback
yield next(it)
File "d:\python35\lib\site-packages\scrapy\spidermiddlewares\offsite.py", line 29, in process_spider_output
for x in result:
File "d:\python35\lib\site-packages\scrapy\spidermiddlewares\referer.py", line 22, in <genexpr>
return (_set_referer(r) for r in result or ())
File "d:\python35\lib\site-packages\scrapy\spidermiddlewares\urllength.py", line 37, in <genexpr>
return (r for r in result or () if _filter(r))
File "d:\python35\lib\site-packages\scrapy\spidermiddlewares\depth.py", line 58, in <genexpr>
return (r for r in result or () if _filter(r))
File "E:\WorkGitResp\spider\job51\job51\spiders\51job_resume.py", line 43, in parse
yield Request(resume_url[0],headers=self.headers,callback=self.getResume)
File "d:\python35\lib\site-packages\parsel\selector.py", line 58, in __getitem__
o = super(SelectorList, self).__getitem__(pos)
IndexError: list index out of range
2017-04-11 10:58:33 [scrapy.core.engine] INFO: Closing spider (finished)
2017-04-11 10:58:33 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 628,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 5743,
'downloader/response_count': 2,
'downloader/response_status_count/200': 1,
'downloader/response_status_count/404': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2017, 4, 11, 2, 58, 33, 275634),
'log_count/DEBUG': 3,
'log_count/ERROR': 1,
'log_count/INFO': 7,
'response_received_count': 2,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'spider_exceptions/IndexError': 1,
'start_time': datetime.datetime(2017, 4, 11, 2, 58, 32, 731603)}
2017-04-11 10:58:33 [scrapy.core.engine] INFO: Spider closed (finished)
看log是404了,你看看scrapy设定那里有没把重定向禁止了。
从这里可以看到你用scrapy写的爬虫被重定向到登陆页面了。所以会报错。建议你在用requests和用scrapy请求的时候抓一下包,看看它的响应内容,并且看看它们的request headers是不是完全相同。我怀疑可能是cookie过期了,要么scrapy可能不是这样传cookie.我对scrapy不是特别熟悉,不过看问题应该是出在cookie这块了
你用的session请求的,实际的request header估计已经戴上了cookie了,所以还是像楼上说的对比请求header吧