python - Why can I scrape the page directly with requests, but not with scrapy?
PHPz 2017-04-18 10:33:18
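import lxml.etree
import requests

# Assumed: `session` was not defined in the original post; a requests.Session
# keeps cookies set by the server across requests.
session = requests.Session()


class job51:
    def __init__(self):
        self.headers = {
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
            'Accept-Encoding': 'gzip, deflate, sdch',
            'Accept-Language': 'zh-CN,zh;q=0.8',
            'Cache-Control': 'max-age=0',
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36',
            'Cookie': '',  # value left empty in the original post
        }

    def start(self):
        # Fetch the résumé management page with the browser-like headers.
        html = session.get("http://my.51job.com/cv/CResume/CV_CResumeManage.php", headers=self.headers)
        self.parse(html)

    def parse(self, response):
        # Extract the link to the first résumé in the table.
        tree = lxml.etree.HTML(response.text)
        resume_url = tree.xpath('//tbody/tr[@class="resumeName"]/td[1]/a/@href')
        print(resume_url[0])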

This gets the result I want, namely the résumé URL. But with scrapy and the same headers, the request seems to get stuck at the login page?

from scrapy import Request, Selector, Spider


class job51(Spider):
    name = "job51"
    # allowed_domains = ["my.51job.com"]
    start_urls = ["http://my.51job.com/cv/CResume/CV_CResumeManage.php"]
    headers = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        'Accept-Encoding': 'gzip, deflate, sdch',
        'Accept-Language': 'zh-CN,zh;q=0.8',
        'Cache-Control': 'max-age=0',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36',
        'Cookie': '',  # value left empty in the original post
    }

    def start_requests(self):
        # Send the same headers that worked with requests.
        yield Request(url=self.start_urls[0], headers=self.headers, callback=self.parse)

    def parse(self, response):
        # tree = lxml.etree.HTML(text)
        selector = Selector(response)
        print("<<<<<<<<<<<<<<<<<<<<<", response.text)
        resume_url = selector.xpath('//tr[@class="resumeName"]/td[1]/a/@href')
        print(">>>>>>>>>>>>", resume_url)

Output:

scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'job51', 'SPIDER_MODULES': ['job51.spiders'], 'ROBOTSTXT_OBEY': True, 'NEWSPIDER_MODULE': 'job51.spiders'}
2017-04-11 10:58:31 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole']
2017-04-11 10:58:32 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-04-11 10:58:32 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-04-11 10:58:32 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2017-04-11 10:58:32 [scrapy.core.engine] INFO: Spider opened
2017-04-11 10:58:32 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-04-11 10:58:32 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-04-11 10:58:33 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://my.51job.com/robots.txt> (referer: None)
2017-04-11 10:58:33 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://my.51job.com/cv/CResume/CV_CResumeManage.php> (referer: None)
<<<<<<<<<<<<<<<<<<<<< <script>window.location='https://login.51job.com/login.php?url=http://my.51job.com%2Fcv%2FCResume%2FCV_CResumeManage.php%3F7087';</script>
>>>>>>>>>>>> []
2017-04-11 10:58:33 [scrapy.core.scraper] ERROR: Spider error processing <GET http://my.51job.com/cv/CResume/CV_CResumeManage.php> (referer: None)
Traceback (most recent call last):
  File "d:\python35\lib\site-packages\scrapy\utils\defer.py", line 102, in iter_errback
    yield next(it)
  File "d:\python35\lib\site-packages\scrapy\spidermiddlewares\offsite.py", line 29, in process_spider_output
    for x in result:
  File "d:\python35\lib\site-packages\scrapy\spidermiddlewares\referer.py", line 22, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "d:\python35\lib\site-packages\scrapy\spidermiddlewares\urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "d:\python35\lib\site-packages\scrapy\spidermiddlewares\depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "E:\WorkGitResp\spider\job51\job51\spiders\51job_resume.py", line 43, in parse
    yield Request(resume_url[0],headers=self.headers,callback=self.getResume)
  File "d:\python35\lib\site-packages\parsel\selector.py", line 58, in __getitem__
    o = super(SelectorList, self).__getitem__(pos)
IndexError: list index out of range
2017-04-11 10:58:33 [scrapy.core.engine] INFO: Closing spider (finished)
2017-04-11 10:58:33 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 628,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 5743,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 1,
 'downloader/response_status_count/404': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2017, 4, 11, 2, 58, 33, 275634),
 'log_count/DEBUG': 3,
 'log_count/ERROR': 1,
 'log_count/INFO': 7,
 'response_received_count': 2,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'spider_exceptions/IndexError': 1,
 'start_time': datetime.datetime(2017, 4, 11, 2, 58, 32, 731603)}
2017-04-11 10:58:33 [scrapy.core.engine] INFO: Spider closed (finished)


All replies (3)
阿神

The log shows a 404. Check whether redirects are disabled in your Scrapy settings.
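
One detail from the log in the question is worth separating out: the 404 is for robots.txt, while the résumé page itself came back with status 200 and a body consisting only of a `window.location` script. That is a JavaScript redirect rather than an HTTP 3xx response, so redirect settings would not be expected to affect it. A minimal sketch for spotting this case inside a callback, assuming the page uses the same `window.location='...'` form seen in the log (the helper name `find_js_redirect` is hypothetical):

```python
import re

# Matches window.location='...' or window.location.href="..." (an assumed
# pattern, based on the single response body shown in the question's log).
_JS_REDIRECT = re.compile(r"""window\.location(?:\.href)?\s*=\s*(['"])(.+?)\1""")


def find_js_redirect(body):
    """Return the target URL of a JavaScript redirect in `body`, or None."""
    match = _JS_REDIRECT.search(body)
    return match.group(2) if match else None


# The response body copied from the log in the question.
sample = ("<script>window.location='https://login.51job.com/login.php"
          "?url=http://my.51job.com%2Fcv%2FCResume%2FCV_CResumeManage.php%3F7087';</script>")
target = find_js_redirect(sample)
print(target)
print("login page" if target and "login" in target else "not a login redirect")
```

Checking for this before indexing into the XPath result would turn the `IndexError` into a clear "not logged in" message.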

Ty80
<script>window.location='https://login.51job.com/login.php?url=http://my.51job.com%2Fcv%2FCResume%2FCV_CResumeManage.php%3F7087';</script>

From this you can see that the crawler you wrote with scrapy was redirected to the login page, which is why the error is reported. I suggest capturing the traffic for both the requests version and the scrapy version, inspecting the responses, and checking whether the request headers are exactly the same. I suspect the cookie may have expired, or that cookies cannot be passed this way. I'm not very familiar with cookies, but the problem lies with the cookie.
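
On the point that the cookie may not be passed this way: Scrapy's documentation for `Request` notes that cookies set through a raw `Cookie` header are not considered by `CookiesMiddleware`, and recommends the `cookies` argument instead. Whether that is the cause here cannot be confirmed from the post, but it is cheap to rule out. A minimal sketch, assuming the cookie string is copied from a logged-in browser session (the helper name `cookie_header_to_dict` and the sample cookie names are hypothetical):

```python
def cookie_header_to_dict(cookie_header):
    """Split a raw 'Cookie' header value into a {name: value} dict."""
    cookies = {}
    for pair in cookie_header.split(";"):
        name, sep, value = pair.strip().partition("=")
        if sep and name:  # skip empty or malformed fragments
            cookies[name] = value
    return cookies


# Placeholder values; the real string would come from a logged-in browser session.
raw = "guid=abc123; lang=zh; token=a=b"
cookies = cookie_header_to_dict(raw)
print(cookies)

# In the spider, drop 'Cookie' from the headers dict and pass the dict instead:
#     yield Request(url, headers=headers_without_cookie, cookies=cookies,
#                   callback=self.parse)
```

Only the first `=` in each pair is treated as the separator, so values that themselves contain `=` are kept intact.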

迷茫

You are making the request through a session, so the headers actually sent may already include cookies loaded by that session. It is best to compare the request headers, as suggested above.
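
The comparison suggested here can be scripted once both sets of sent headers are available as plain dicts. With requests, `response.request.headers` typically holds what was actually sent, including any `Cookie` header built from the session's cookie jar. A minimal sketch of the diff itself, using made-up header values (`diff_headers` is a hypothetical helper):

```python
def diff_headers(sent_a, sent_b):
    """Return {header: (value_a, value_b)} for headers that differ.

    Header names are compared case-insensitively; a missing header is None.
    """
    a = {k.lower(): v for k, v in sent_a.items()}
    b = {k.lower(): v for k, v in sent_b.items()}
    return {
        name: (a.get(name), b.get(name))
        for name in sorted(set(a) | set(b))
        if a.get(name) != b.get(name)
    }


# Made-up values for illustration only.
from_requests = {"User-Agent": "Mozilla/5.0", "Cookie": "guid=abc123"}
from_scrapy = {"user-agent": "Mozilla/5.0", "Accept-Language": "en"}

for name, (left, right) in diff_headers(from_requests, from_scrapy).items():
    print(f"{name}: requests={left!r} scrapy={right!r}")
```

A `Cookie` entry that appears on one side only would point to the session's cookie jar as the difference between the two crawlers.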
