Collected from Baidu Tieba
python 2.7.11
scrapy 1.3.3
As long as a user agent is enabled in settings.py, nothing gets scraped, no matter which of the following methods is used.
But if I disable the user agent, everything is collected normally. Isn't that strange? Does anyone know why?
USER_AGENT = 'xxxxxxxxxxxxxxxxxxxxxx'
Or write a middleware class RotateUserAgentMiddleware(UserAgentMiddleware) and register it in settings.py:
DOWNLOADER_MIDDLEWARES = {
#'tbtest.middlewares.MyCustomDownloaderMiddleware': 543,
'tbtest.useragent.RotateUserAgentMiddleware': 400,
}
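For reference, a minimal settings.py combining the options above might look like this. The project name tbtest and the values for ROBOTSTXT_OBEY, COOKIES_ENABLED and DOWNLOAD_DELAY are taken from the log below; everything else is a sketch, not the poster's actual file:

```python
# settings.py (sketch): enable EITHER a fixed USER_AGENT
# OR the rotating middleware, not both at once.
BOT_NAME = 'tbtest'
SPIDER_MODULES = ['tbtest.spiders']
NEWSPIDER_MODULE = 'tbtest.spiders'

ROBOTSTXT_OBEY = True
COOKIES_ENABLED = False
DOWNLOAD_DELAY = 2

# Option 1: a single fixed user agent (placeholder value).
# USER_AGENT = 'Mozilla/5.0 (placeholder)'

# Option 2: the rotating middleware from tbtest/useragent.py.
DOWNLOADER_MIDDLEWARES = {
    'tbtest.useragent.RotateUserAgentMiddleware': 400,
}
```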
As long as the user agent is enabled, nothing is collected. After running, it prints the following output:
E:\pypro\tbtest>scrapy crawl tbs
2017-05-11 12:20:23 [scrapy.utils.log] INFO: Scrapy 1.3.3 started (bot: tbtest)
2017-05-11 12:20:23 [scrapy.utils.log] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'tbtest.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['tbtest.spiders'], 'BOT_NAME': 'tbtest', 'COOKIES_ENABLED': False, 'DOWNLOAD_DELAY': 2}
2017-05-11 12:20:24 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.logstats.LogStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.corestats.CoreStats']
2017-05-11 12:20:26 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'tbtest.useragent.RotateUserAgentMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-05-11 12:20:26 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-05-11 12:20:27 [scrapy.middleware] INFO: Enabled item pipelines:
['tbtest.pipelines.TbtestPipeline']
2017-05-11 12:20:27 [scrapy.core.engine] INFO: Spider opened
2017-05-11 12:20:27 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-05-11 12:20:27 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
********Current UserAgent:Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3************
2017-05-11 12:20:27 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://tieba.baidu.com/robots.txt> (referer: None)
********Current UserAgent:Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5************
2017-05-11 12:20:31 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://tieba.baidu.com/f?kw=%E5%B1%B1%E4%B8%9C%E7%90%86%E5%B7%A5%E5%A4%A7%E5%AD%A6&ie=utf-8> (referer: None)
2017-05-11 12:20:31 [scrapy.core.engine] INFO: Closing spider (finished)
2017-05-11 12:20:31 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 655,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 87876,
'downloader/response_count': 2,
'downloader/response_status_count/200': 2,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2017, 5, 11, 4, 20, 31, 375000),
'log_count/DEBUG': 3,
'log_count/INFO': 7,
'response_received_count': 2,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2017, 5, 11, 4, 20, 27, 250000)}
2017-05-11 12:20:31 [scrapy.core.engine] INFO: Spider closed (finished)
# -*- coding:utf-8 -*-
import logging
# One anti-ban strategy: use a user-agent pool.
# Note: the corresponding settings must be made in settings.py.
import random
from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware

class RotateUserAgentMiddleware(UserAgentMiddleware):
    def __init__(self, user_agent=''):
        self.user_agent = user_agent

    def process_request(self, request, spider):
        ua = random.choice(self.user_agent_list)
        if ua:
            # Show the user agent currently in use
            print "********Current UserAgent:%s************" % ua
            # Log it
            ##logging.log(logging.WARNING, 'Current UserAgent: ' + ua)
            request.headers.setdefault('User-Agent', ua)

    # The default user_agent_list contains Chrome, IE, Firefox, Mozilla,
    # Opera and Netscape strings; for more user agent strings, see
    # http://www.useragentstring.com/pages/useragentstring.php
    user_agent_list = [
        "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
        "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)",
        "Mozilla/4.0 (compatible; MSIE 7.0; AOL 9.5; AOLBuild 4337.35; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
        "Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)",
        "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
        "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
        "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
        "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
        "Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
        "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
        "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
        "Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5",
        "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 Safari/535.20",
        "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; fr) Presto/2.9.168 Version/11.52",
    ]
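Outside of Scrapy, the rotation logic in process_request can be checked on its own with a minimal stand-in for the request object. The FakeRequest class below is hypothetical, just enough to demonstrate the setdefault behavior (Python 3 syntax here for convenience):

```python
import random

# A small pool; any browser strings would do.
user_agent_list = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
    "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; fr) Presto/2.9.168 Version/11.52",
]

class FakeRequest(object):
    """Hypothetical stand-in for scrapy.Request: only carries headers."""
    def __init__(self):
        self.headers = {}

def process_request(request):
    ua = random.choice(user_agent_list)
    if ua:
        # setdefault mirrors what the middleware does: it will NOT
        # overwrite a User-Agent that is already set on the request.
        request.headers.setdefault('User-Agent', ua)
    return request

req = process_request(FakeRequest())
print(req.headers['User-Agent'])
```

Note that because setdefault is used, a User-Agent already present on the request (for example one set by an earlier middleware) wins over the rotated one.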
The site you're crawling probably has some anti-crawler measures.
Scrapy has its own default user agent; once the setting is enabled it gets added to the request headers. That value may be empty, or it may be one the site's anti-crawler rules reject.
It's recommended to build a user-agent pool that mimics real browsers and to swap entries regularly or at random; that is the safest approach.
It's User-Agent, not User_Agent. I had this exact problem before; it worked as soon as I changed it.
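To illustrate that last point: HTTP header names use hyphens, so a key spelled with an underscore is just an unrecognized custom header that servers will ignore when looking for the real User-Agent. A tiny sketch with a plain dict:

```python
headers = {}

# Wrong: 'User_Agent' with an underscore is an unknown custom header;
# a server checking the real User-Agent header never sees this value.
headers['User_Agent'] = 'Mozilla/5.0 (placeholder)'

# Right: the standard header name is hyphenated.
headers['User-Agent'] = 'Mozilla/5.0 (placeholder)'

print('User-Agent' in headers)
```

(The Scrapy *setting* is still spelled USER_AGENT with an underscore; it is only the header key on the request that must be 'User-Agent'.)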