My spider code is below. The rules never pick anything up, and I can't figure out what the problem is:
#encoding: utf-8
import re
import requests
import time
from bs4 import BeautifulSoup
import scrapy
from scrapy.http import Request
from craler.items import CralerItem
import urllib2
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
class MoyanSpider(CrawlSpider):
    try:
        name = 'maoyan'
        allowed_domains = ["http://maoyan.com"]
        start_urls = ['http://maoyan.com/films']
        rules = (
            Rule(LinkExtractor(allow=(r"films/\d+.*")), callback='parse_item', follow=True),
        )
    except Exception, e:
        print e.message

    # def start_requests(self):
    #     for i in range(22863):
    #         url = self.start_urls + str(i*30)
    #         yield Request(url, self.parse, headers=self.headers)

    def parse_item(self, response):
        item = CralerItem()
        # time.sleep(2)
        # moveis = BeautifulSoup(response.text, 'lxml').find("p", class_="movies-list").find_all("dd")
        try:
            time.sleep(2)
            item['name'] = response.find("p", class_="movie-brief-container").find("h3", class_="name").get_text()
            item['score'] = response.find("p", class_="movie-index-content score normal-score").find("span", class_="stonefont").get_text()
            url = "http://maoyan.com" + response.find("p", class_="channel-detail movie-item-title").find("a")["href"]
            # item['url'] = url
            item['id'] = response.url.split("/")[-1]
            # html = requests.get(url).content
            # soup = BeautifulSoup(html, 'lxml')
            temp = response.find("p", "movie-brief-container").find("ul").get_text()
            temp = temp.split('\n')
            # item['cover'] = soup.find("p", "avater-shadow").find("img")["src"]
            item['tags'] = temp[1]
            item['countries'] = temp[3].strip()
            item['duration'] = temp[4].split('/')[-1]
            item['time'] = temp[6]
            # print item['name']
            return item
        except Exception, e:
            print e.message
Log output from the run:
C:\Python27\python.exe "C:\Program Files (x86)\JetBrains\PyCharm Community Edition 2016.2.2\helpers\pydev\pydevd.py" --multiproc --qt-support --client 127.0.0.1 --port 12779 --file D:/scrapy/craler/entrypoint.py
pydev debugger: process 30468 is connecting
Connected to pydev debugger (build 162.1967.10)
D:/scrapy/craler\craler\spiders\maoyan.py:12: ScrapyDeprecationWarning: Module `scrapy.contrib.linkextractors` is deprecated, use `scrapy.linkextractors` instead
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
D:/scrapy/craler\craler\spiders\maoyan.py:12: ScrapyDeprecationWarning: Module `scrapy.contrib.linkextractors.sgml` is deprecated, use `scrapy.linkextractors.sgml` instead
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
2017-05-08 21:58:14 [scrapy.utils.log] INFO: Scrapy 1.3.3 started (bot: craler)
2017-05-08 21:58:14 [scrapy.utils.log] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'craler.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['craler.spiders'], 'HTTPCACHE_ENABLED': True, 'BOT_NAME': 'craler', 'COOKIES_ENABLED': False, 'DOWNLOAD_DELAY': 3}
2017-05-08 21:58:14 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.logstats.LogStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.corestats.CoreStats']
2017-05-08 21:58:14 [py.warnings] WARNING: D:/scrapy/craler\craler\middlewares.py:11: ScrapyDeprecationWarning: Module `scrapy.contrib.downloadermiddleware.useragent` is deprecated, use `scrapy.downloadermiddlewares.useragent` instead
from scrapy.contrib.downloadermiddleware.useragent import UserAgentMiddleware
2017-05-08 21:58:14 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'craler.middlewares.RotateUserAgentMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats',
'scrapy.downloadermiddlewares.httpcache.HttpCacheMiddleware']
2017-05-08 21:58:15 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-05-08 21:58:15 [scrapy.middleware] INFO: Enabled item pipelines:
['craler.pipelines.CralerPipeline']
2017-05-08 21:58:15 [scrapy.core.engine] INFO: Spider opened
2017-05-08 21:58:15 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-05-08 21:58:15 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-05-08 21:58:15 [root] INFO: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)
2017-05-08 21:58:15 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://maoyan.com/robots.txt> (referer: None) ['cached']
2017-05-08 21:58:15 [root] INFO: Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50
2017-05-08 21:58:15 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://maoyan.com/films> (referer: None) ['cached']
2017-05-08 21:58:15 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'maoyan.com': <GET http://maoyan.com/films/248683>
2017-05-08 21:58:15 [scrapy.core.engine] INFO: Closing spider (finished)
2017-05-08 21:58:15 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 534,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 6913,
'downloader/response_count': 2,
'downloader/response_status_count/200': 2,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2017, 5, 8, 13, 58, 15, 357000),
'httpcache/hit': 2,
'log_count/DEBUG': 4,
'log_count/INFO': 9,
'log_count/WARNING': 1,
'offsite/domains': 1,
'offsite/filtered': 30,
'request_depth_max': 1,
'response_received_count': 2,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2017, 5, 8, 13, 58, 15, 140000)}
2017-05-08 21:58:15 [scrapy.core.engine] INFO: Spider closed (finished)
Process finished with exit code 0
The problem is mainly with allowed_domains; your extraction rules are fine. The key point is that allowed_domains must not include the http:// scheme, only the bare domain. Written as below, the spider will pick up the links.
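A minimal sketch of that fix, keeping your class otherwise unchanged (parse_item omitted; the try/except around the class attributes does nothing useful, so it is dropped here):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class MoyanSpider(CrawlSpider):
    name = 'maoyan'
    # Bare domain only -- no scheme. With "http://maoyan.com" the
    # OffsiteMiddleware can never match the request host, which is why
    # your log shows "Filtered offsite request to 'maoyan.com'" and
    # 'offsite/filtered': 30 before the spider closes.
    allowed_domains = ["maoyan.com"]
    start_urls = ['http://maoyan.com/films']

    rules = (
        Rule(LinkExtractor(allow=(r"films/\d+.*")), callback='parse_item', follow=True),
    )

With that one change the film pages should actually be downloaded and parse_item should be called, instead of every extracted link being filtered as offsite.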
Also, several of the modules you import are deprecated; the warnings tell you to swap them for their newer counterparts. Those are only warnings, though, not errors. Beyond that, the site you are crawling may have anti-crawler measures in place that keep you from fetching pages normally.
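For example, the replacements the deprecation warnings in your log point at (the SgmlLinkExtractor import in the spider is unused anyway, since your rule already uses LinkExtractor, so it can simply be deleted):

# Deprecated imports flagged in the log (maoyan.py line 12, middlewares.py line 11):
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.downloadermiddleware.useragent import UserAgentMiddleware

# Their current locations in Scrapy 1.3:
from scrapy.linkextractors import LinkExtractor
from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware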