import requests
import lxml.etree

class job51():
    def __init__(self):
        # A Session carries cookies across requests, like a browser tab;
        # the bare `session` used below was never defined in the snippet.
        self.session = requests.Session()
        self.headers = {
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
            'Accept-Encoding': 'gzip, deflate, sdch',
            'Accept-Language': 'zh-CN,zh;q=0.8',
            'Cache-Control': 'max-age=0',
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36',
            'Cookie': ''  # must hold a fresh logged-in cookie before running
        }

    def start(self):
        html = self.session.get("http://my.51job.com/cv/CResume/CV_CResumeManage.php",
                                headers=self.headers)
        self.parse(html)

    def parse(self, response):
        tree = lxml.etree.HTML(response.text)
        resume_url = tree.xpath('//tbody/tr[@class="resumeName"]/td[1]/a/@href')
        print(resume_url[0])  # first resume link on the management page
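For reference, the class above runs like this, assuming the empty 'Cookie' header has first been filled in from a logged-in browser session:

    if __name__ == '__main__':
        spider = job51()
        spider.start()  # prints the first resume URL when the cookie is valid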
This crawls exactly what I want, the resume URL. But with Scrapy and the very same headers, the page seems to be stuck at the login page. Why?
from scrapy import Spider, Request
from scrapy.selector import Selector

class job51(Spider):
    name = "job51"
    # allowed_domains = ["my.51job.com"]
    start_urls = ["http://my.51job.com/cv/CResume/CV_CResumeManage.php"]
    headers = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        'Accept-Encoding': 'gzip, deflate, sdch',
        'Accept-Language': 'zh-CN,zh;q=0.8',
        'Cache-Control': 'max-age=0',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36',
        'Cookie': ''  # same cookie string as in the requests version
    }

    def start_requests(self):
        yield Request(url=self.start_urls[0], headers=self.headers, callback=self.parse)

    def parse(self, response):
        selector = Selector(response)
        print("<<<<<<<<<<<<<<<<<<<<<", response.text)
        resume_url = selector.xpath('//tr[@class="resumeName"]/td[1]/a/@href')
        print(">>>>>>>>>>>>", resume_url)
The output:
[scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'job51', 'SPIDER_MODULES': ['job51.spiders'], 'ROBOTSTXT_OBEY': True, 'NEWSPIDER_MODULE': 'job51.spiders'}
2017-04-11 10:58:31 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.logstats.LogStats',
'scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole']
2017-04-11 10:58:32 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-04-11 10:58:32 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-04-11 10:58:32 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2017-04-11 10:58:32 [scrapy.core.engine] INFO: Spider opened
2017-04-11 10:58:32 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-04-11 10:58:32 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-04-11 10:58:33 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://my.51job.com/robots.txt> (referer: None)
2017-04-11 10:58:33 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://my.51job.com/cv/CResume/CV_CResumeManage.php> (referer: None)
<<<<<<<<<<<<<<<<<<<<< <script>window.location='https://login.51job.com/login.php?url=http://my.51job.com%2Fcv%2FCResume%2FCV_CResumeManage.php%3F7087';</script>
>>>>>>>>>>>> []
2017-04-11 10:58:33 [scrapy.core.scraper] ERROR: Spider error processing <GET http://my.51job.com/cv/CResume/CV_CResumeManage.php> (referer: None)
Traceback (most recent call last):
File "d:\python35\lib\site-packages\scrapy\utils\defer.py", line 102, in iter_errback
yield next(it)
File "d:\python35\lib\site-packages\scrapy\spidermiddlewares\offsite.py", line 29, in process_spider_output
for x in result:
File "d:\python35\lib\site-packages\scrapy\spidermiddlewares\referer.py", line 22, in <genexpr>
return (_set_referer(r) for r in result or ())
File "d:\python35\lib\site-packages\scrapy\spidermiddlewares\urllength.py", line 37, in <genexpr>
return (r for r in result or () if _filter(r))
File "d:\python35\lib\site-packages\scrapy\spidermiddlewares\depth.py", line 58, in <genexpr>
return (r for r in result or () if _filter(r))
File "E:\WorkGitResp\spider\job51\job51\spiders\51job_resume.py", line 43, in parse
yield Request(resume_url[0],headers=self.headers,callback=self.getResume)
File "d:\python35\lib\site-packages\parsel\selector.py", line 58, in __getitem__
o = super(SelectorList, self).__getitem__(pos)
IndexError: list index out of range
2017-04-11 10:58:33 [scrapy.core.engine] INFO: Closing spider (finished)
2017-04-11 10:58:33 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 628,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 5743,
'downloader/response_count': 2,
'downloader/response_status_count/200': 1,
'downloader/response_status_count/404': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2017, 4, 11, 2, 58, 33, 275634),
'log_count/DEBUG': 3,
'log_count/ERROR': 1,
'log_count/INFO': 7,
'response_received_count': 2,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'spider_exceptions/IndexError': 1,
'start_time': datetime.datetime(2017, 4, 11, 2, 58, 32, 731603)}
2017-04-11 10:58:33 [scrapy.core.engine] INFO: Spider closed (finished)
The log shows a 404, but that is only for robots.txt; the resume page itself came back 200. Check whether redirects are disabled in your Scrapy settings, though note that the redirect here happens in JavaScript inside a 200 body, not via an HTTP status code.
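Scrapy's RedirectMiddleware only follows HTTP 3xx responses, and MetaRefreshMiddleware only handles meta-refresh tags, so a window.location jump inside a 200 body, like the one in the log above, is never followed automatically. To confirm where the page is sending you, you can pull the target out of the script yourself; a minimal sketch, with the regex written against the exact body printed in the log:

    import re

    # Matches a JavaScript redirect of the form:
    #   <script>window.location='https://login.51job.com/login.php?...';</script>
    JS_REDIRECT = re.compile(r"window\.location\s*=\s*'([^']+)'")

    def js_redirect_target(body_text):
        """Return the URL of a JS redirect in the body, or None."""
        match = JS_REDIRECT.search(body_text)
        return match.group(1) if match else None

A target pointing at login.51job.com means the session is not authenticated, so the fix lies with the cookies, not the redirect handling.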
From this output you can see that the spider you wrote with Scrapy is being redirected to the login page, which is why the error is raised. I'd recommend capturing the traffic for both the requests run and the Scrapy run, looking at what each response contains, and checking whether the two request headers are exactly the same. I suspect the cookie has expired, or that Scrapy does not pass cookies along this way. I'm not especially familiar with Scrapy, but the problem should be the cookies: the session you used with requests almost certainly had cookies loaded into its real request headers already, so comparing the request headers as suggested above is the best place to start.
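If the cookies are indeed the problem, a common fix is to hand them to Scrapy through the cookies= argument of Request instead of a raw 'Cookie' header, so that CookiesMiddleware manages them on every request. A minimal sketch, assuming a fresh cookie string copied from a logged-in browser session; the cookie names are placeholders, not the real 51job ones:

    from scrapy import Spider, Request

    # Paste a real logged-in cookie string here; these names are placeholders.
    RAW_COOKIE = 'guid=PLACEHOLDER; sessionid=PLACEHOLDER'

    def cookie_str_to_dict(raw):
        """Turn 'k1=v1; k2=v2' into a dict for Request(cookies=...)."""
        return dict(item.split('=', 1) for item in raw.split('; ') if '=' in item)

    class Job51WithCookies(Spider):
        name = "job51_cookies"

        def start_requests(self):
            yield Request(
                "http://my.51job.com/cv/CResume/CV_CResumeManage.php",
                cookies=cookie_str_to_dict(RAW_COOKIE),
                callback=self.parse)

        def parse(self, response):
            if 'login.51job.com' in response.text:
                # Still bounced to the login page: cookie is stale or missing.
                self.logger.warning('not logged in; refresh the cookie')
                return
            for href in response.xpath('//tr[@class="resumeName"]/td[1]/a/@href').extract():
                self.logger.info('resume url: %s', href)

To compare what actually goes out against what requests sent, response.request.headers inside parse() shows the headers Scrapy used for that request.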