python - 为什么我直接用requests爬网页可以,但用scrapy不行?
PHPz
PHPz 2017-04-18 10:33:18
0
3
1001
class job51():
    def __init__(self):
        self.headers={
   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Encoding':'gzip, deflate, sdch',
   'Accept-Language': 'zh-CN,zh;q=0.8',
    'Cache-Control': 'max-age=0',
   'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36',
    'Cookie':''
}

    def start(self):
        html=session.get("http://my.51job.com/cv/CResume/CV_CResumeManage.php",headers=self.headers)
        self.parse(html)

    def parse(self,response):
        tree=lxml.etree.HTML(response.text)
        resume_url=tree.xpath('//tbody/tr[@class="resumeName"]/td[1]/a/@href')
        print (resume_url[0]

能爬到我想要的结果,就是简历的url,但是用scrapy,同样的headers,页面好像停留在登录页面?

class job51(Spider):
    name = "job51"
    #allowed_domains = ["my.51job.com"]
    start_urls = ["http://my.51job.com/cv/CResume/CV_CResumeManage.php"]
    headers={
   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Encoding':'gzip, deflate, sdch',
   'Accept-Language': 'zh-CN,zh;q=0.8',
    'Cache-Control': 'max-age=0',
   'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36',
    'Cookie':''
}

    def start_requests(self):
        yield  Request(url=self.start_urls[0],headers=self.headers,callback=self.parse)

    def parse(self,response):
        #tree=lxml.etree.HTML(text)
        selector=Selector(response)
        print ("<<<<<<<<<<<<<<<<<<<<<",response.text)
        resume_url=selector.xpath('//tr[@class="resumeName"]/td[1]/a/@href')
        print (">>>>>>>>>>>>",resume_url)

输出的结果:

scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'job51', 'SPIDER_MODULES': ['job51.spiders'], 'ROBOTSTXT_OBEY': True, 'NEWSPIDER_MODULE': 'job51.spiders'}
2017-04-11 10:58:31 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole']
2017-04-11 10:58:32 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-04-11 10:58:32 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-04-11 10:58:32 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2017-04-11 10:58:32 [scrapy.core.engine] INFO: Spider opened
2017-04-11 10:58:32 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-04-11 10:58:32 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-04-11 10:58:33 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://my.51job.com/robots.txt> (referer: None)
2017-04-11 10:58:33 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://my.51job.com/cv/CResume/CV_CResumeManage.php> (referer: None)
<<<<<<<<<<<<<<<<<<<<< <script>window.location='https://login.51job.com/login.php?url=http://my.51job.com%2Fcv%2FCResume%2FCV_CResumeManage.php%3F7087';</script>
>>>>>>>>>>>> []
2017-04-11 10:58:33 [scrapy.core.scraper] ERROR: Spider error processing <GET http://my.51job.com/cv/CResume/CV_CResumeManage.php> (referer: None)
Traceback (most recent call last):
  File "d:\python35\lib\site-packages\scrapy\utils\defer.py", line 102, in iter_errback
    yield next(it)
  File "d:\python35\lib\site-packages\scrapy\spidermiddlewares\offsite.py", line 29, in process_spider_output
    for x in result:
  File "d:\python35\lib\site-packages\scrapy\spidermiddlewares\referer.py", line 22, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "d:\python35\lib\site-packages\scrapy\spidermiddlewares\urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "d:\python35\lib\site-packages\scrapy\spidermiddlewares\depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "E:\WorkGitResp\spider\job51\job51\spiders\51job_resume.py", line 43, in parse
    yield Request(resume_url[0],headers=self.headers,callback=self.getResume)
  File "d:\python35\lib\site-packages\parsel\selector.py", line 58, in __getitem__
    o = super(SelectorList, self).__getitem__(pos)
IndexError: list index out of range
2017-04-11 10:58:33 [scrapy.core.engine] INFO: Closing spider (finished)
2017-04-11 10:58:33 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 628,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 5743,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 1,
 'downloader/response_status_count/404': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2017, 4, 11, 2, 58, 33, 275634),
 'log_count/DEBUG': 3,
 'log_count/ERROR': 1,
 'log_count/INFO': 7,
 'response_received_count': 2,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'spider_exceptions/IndexError': 1,
 'start_time': datetime.datetime(2017, 4, 11, 2, 58, 32, 731603)}
2017-04-11 10:58:33 [scrapy.core.engine] INFO: Spider closed (finished)

PHPz
PHPz

学习是最好的投资!

reply all(3)
阿神

The log shows 404. Check whether redirection is disabled in scrapy settings.

Ty80
<script>window.location='https://login.51job.com/login.php?url=http://my.51job.com%2Fcv%2FCResume%2FCV_CResumeManage.php%3F7087';</script>

From here you can see that the crawler you wrote using scrapy is redirected to the login page. So an error will be reported. It is recommended that you capture the packet when using requests and scrapy to check its response content, and see if their request headers are exactly the same. I suspect that the cookie may have expired, or scrapy may not transfer cookies in this way. I am not particularly familiar with scrapy, but the problem should be with cookies

迷茫

The actual request header you used for the session request is probably already loaded with cookies, so it’s better to compare the request header as mentioned above

Latest Downloads
More>
Web Effects
Website Source Code
Website Materials
Front End Template
About us Disclaimer Sitemap
php.cn:Public welfare online PHP training,Help PHP learners grow quickly!