Scrapy crawler introductory tutorial series:
Python version management: pyenv and pyenv-virtualenv
Scrapy crawler introductory tutorial 1 Installation and basic use
Scrapy crawler introductory tutorial 2 Officially provided demo
Scrapy crawler introductory tutorial 3 Command line tool introduction and examples
Scrapy crawler introductory tutorial 4 Spider (crawler)
Scrapy crawler introductory tutorial 5 Selectors (selectors)
Scrapy crawler introductory tutorial 6 Items (items)
Scrapy crawler introductory tutorial 7 Item Loaders (item loaders)
Scrapy crawler introductory tutorial 8 Interactive shell for convenient debugging
Scrapy crawler introductory tutorial Request and Response (request and response)
Scrapy crawler introductory tutorial 12 Link Extractors (link extractor)
[toc]
Development environment: Scrapy 1.3.2 (the latest version at the time of writing)

Spider
A spider is a class that defines how a certain website (or a group of websites) will be scraped, including how to perform the crawl (i.e. follow links) and how to extract structured data from its pages (i.e. scrape items). In other words, a spider is where you define the custom behaviour for crawling and parsing pages for a particular site (or, in some cases, a group of sites).

For spiders, the scraping cycle goes through something like this:

You start by generating the initial requests to crawl the first URLs, and specify a callback function to be called with the responses downloaded from those requests. By default the first requests are generated for the URLs listed in start_urls, with the parse method as the callback function for those requests.

In the callback function, you parse the response (web page) and return dicts with extracted data, Item objects, Request objects, or an iterable of these objects. Those requests will also contain a callback (maybe the same one); they are then downloaded by Scrapy and their responses handled by the specified callback.

In callback functions, you typically parse the page content using selectors (but you can also use BeautifulSoup, lxml or whatever mechanism you prefer) and generate items with the parsed data.

Finally, the items returned from the spider are typically persisted to a database (in some item pipeline) or written to a file using a feed export.
class scrapy.spiders.Spider

This is the simplest spider, and the one from which every other spider must inherit (including the spiders bundled with Scrapy, as well as the spiders you write yourself). It doesn't provide any special functionality. It simply provides a default start_requests() implementation that sends requests generated from the start_urls attribute and calls the spider's parse method for each of the resulting responses.

name

A string that defines the name of this spider. The spider name is how the spider is located (and instantiated) by Scrapy, so it must be unique. However, nothing prevents you from instantiating more than one instance of the same spider. This is the most important spider attribute and it is required.

If the spider scrapes a single domain, the common practice is to name the spider after the domain. So, for example, a spider that crawls mywebsite.com would typically be called mywebsite.

Note: in Python 2 the spider name must be ASCII only.
allowed_domains
An optional list of strings containing the domains that this spider is allowed to crawl. Requests for URLs not belonging to the domain names in this list will not be followed when the offsite middleware is enabled (which it is by default).
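A minimal sketch (the domain and URLs are placeholders) showing how allowed_domains interacts with requests yielded from a callback; off-site links are silently dropped by the offsite middleware:

import scrapy

class FilteredSpider(scrapy.Spider):
    name = 'filtered'
    # Only requests to example.com (and its subdomains) will be followed.
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/']

    def parse(self, response):
        # Links pointing to other domains are filtered out before download.
        for href in response.xpath('//a/@href').extract():
            yield scrapy.Request(response.urljoin(href), callback=self.parse)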
start_urls
A list of URLs where the spider will begin to crawl from, when no particular URLs are specified.
custom_settings

A dictionary of settings that will be overridden from the project-wide configuration when running this spider. It must be defined as a class attribute, since the settings are updated before instantiation.
For a list of available built-in settings, see: Built-in Settings Reference.
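For instance, a short sketch (the setting values are arbitrary choices for illustration) that slows down a single spider without touching the project-wide settings.py:

import scrapy

class ThrottledSpider(scrapy.Spider):
    name = 'throttled'
    start_urls = ['http://www.example.com/']

    # Overrides apply to this spider only and are read before __init__ runs,
    # which is why custom_settings must be a class attribute.
    custom_settings = {
        'DOWNLOAD_DELAY': 2,
        'CONCURRENT_REQUESTS_PER_DOMAIN': 1,
    }

    def parse(self, response):
        pass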
crawler
This attribute is set by the from_crawler() class method after the class is initialized, and links to the Crawler object this spider instance is bound to.

Crawlers encapsulate a lot of components in the project for single-entry access (such as extensions, middlewares, signal managers, etc.). See the Crawler API for details.
settings
Configuration for running this spider. This is a Settings instance; see the Settings topic for a detailed introduction to this subject.
logger
Python logger created with the spider's name. You can use it to send log messages through it, as described in Logging from Spiders.
from_crawler(crawler, *args, **kwargs)
This is the class method used by Scrapy to create your spiders.
You probably won't need to override this directly, because the default implementation acts as a proxy to the __init__() method, calling it with the given arguments args and named arguments kwargs.

Nonetheless, this method sets the crawler and settings attributes in the new instance, so they can be accessed later inside the spider's code.
Parameters:
crawler (Crawler instance) - the crawler to which the spider will be bound
args (list) - arguments passed to the __init__() method
kwargs (dict) - keyword arguments passed to the __init__() method
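As a sketch of a common pattern (the spider name is made up), from_crawler() can be overridden to hook into crawler facilities such as signals, provided the base implementation is still called so that crawler and settings get set:

import scrapy
from scrapy import signals

class SignalSpider(scrapy.Spider):
    name = 'signalspider'
    start_urls = ['http://www.example.com/']

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        # Let the default implementation build the spider and set the
        # .crawler and .settings attributes, then register our own hook.
        spider = super(SignalSpider, cls).from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.spider_closed, signal=signals.spider_closed)
        return spider

    def spider_closed(self, spider):
        self.logger.info('Spider closed: %s', spider.name)

    def parse(self, response):
        pass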
start_requests()
This method must return an iterable with the first requests to crawl for this spider.
If you override start_requests(), the start_urls attribute is ignored, so there is no need to define it. The default implementation simply generates a request for each URL in start_urls; override start_requests() when you need to change how crawling begins.
For example, if you need to start by logging in using a POST request, you could do:
class MySpider(scrapy.Spider):
    name = 'myspider'

    def start_requests(self):
        return [scrapy.FormRequest("http://www.example.com/login",
                                   formdata={'user': 'john', 'pass': 'secret'},
                                   callback=self.logged_in)]

    def logged_in(self, response):
        # here you would extract links to follow and return Requests for
        # each of them, with another callback
        pass
make_requests_from_url(url)
A method that receives a URL and returns a Request object (or a list of Request objects) to scrape. It is used to construct the initial requests in the start_requests() method, and is typically used to convert URLs into requests.
Unless overridden, this method returns Requests that have the parse() method as their callback function and the dont_filter parameter enabled (see the Request class for more information).
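As a small sketch of why you might override it (the extra header below is a hypothetical example), you can keep the default callback and dont_filter behaviour while customizing how each initial request is built:

import scrapy

class HeaderSpider(scrapy.Spider):
    name = 'headerspider'
    start_urls = ['http://www.example.com/']

    def make_requests_from_url(self, url):
        # Same defaults as the stock implementation (parse() as callback,
        # dont_filter=True), plus an extra request header.
        return scrapy.Request(url, callback=self.parse, dont_filter=True,
                              headers={'X-Requested-With': 'XMLHttpRequest'})

    def parse(self, response):
        pass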
parse(response)
This is Scrapy's default callback for handling downloaded responses when their request does not specify a callback.
The parse method is in charge of processing the response and returning scraped data and/or more URLs to follow. Other request callbacks have the same requirements as the Spider class.

This method, as well as any other request callback, must return an iterable of Request and/or dicts or Item objects.
Parameters:
response (Response) - the parsed response
log(message[, level, component])
Wrapper that sends a log message through the spider's logger, kept for backward compatibility. For more information see Logging from Spiders.
closed(reason)
Called when the spider closes. This method provides a shortcut to signals.connect() for the spider_closed signal.
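A brief sketch of a teardown hook (what you do with the reason is up to you; here it is only logged):

import scrapy

class CleanupSpider(scrapy.Spider):
    name = 'cleanup'
    start_urls = ['http://www.example.com/']

    def parse(self, response):
        pass

    def closed(self, reason):
        # Called once when the spider finishes; 'reason' is a string such as
        # 'finished', 'cancelled' or 'shutdown'.
        self.logger.info('Spider closed (%s)', reason)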
Now let's look at a complete example:
import scrapy

class MySpider(scrapy.Spider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = [
        'http://www.example.com/1.html',
        'http://www.example.com/2.html',
        'http://www.example.com/3.html',
    ]

    def parse(self, response):
        self.logger.info('A response from %s just arrived!', response.url)
Returning multiple requests and items from a single callback:
import scrapy

class MySpider(scrapy.Spider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = [
        'http://www.example.com/1.html',
        'http://www.example.com/2.html',
        'http://www.example.com/3.html',
    ]

    def parse(self, response):
        for h3 in response.xpath('//h3').extract():
            yield {"title": h3}

        for url in response.xpath('//a/@href').extract():
            yield scrapy.Request(url, callback=self.parse)
Instead of start_urls you can use start_requests() directly; and to give the scraped data more structure you can use Items:
import scrapy
from myproject.items import MyItem

class MySpider(scrapy.Spider):
    name = 'example.com'
    allowed_domains = ['example.com']

    def start_requests(self):
        yield scrapy.Request('http://www.example.com/1.html', self.parse)
        yield scrapy.Request('http://www.example.com/2.html', self.parse)
        yield scrapy.Request('http://www.example.com/3.html', self.parse)

    def parse(self, response):
        for h3 in response.xpath('//h3').extract():
            yield MyItem(title=h3)

        for url in response.xpath('//a/@href').extract():
            yield scrapy.Request(url, callback=self.parse)
Spiders can receive arguments that modify their behaviour. Some common uses of spider arguments are to define the start URLs or to restrict the crawl to certain sections of the site, but they can be used to configure any functionality of the spider.

Spider arguments are passed through the crawl command using the -a option. For example:
scrapy crawl myspider -a category=electronics
Spiders can access the arguments in their __init__ methods:
import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'

    def __init__(self, category=None, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        self.start_urls = ['http://www.example.com/categories/%s' % category]
        # ...
The default __init__ method will take any spider arguments and copy them to the spider as attributes. The above example can also be written as follows:
import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'

    def start_requests(self):
        yield scrapy.Request('http://www.example.com/categories/%s' % self.category)
Keep in mind that spider arguments are only strings; the spider will not do any parsing on its own. If you want to set the start_urls attribute from the command line, you have to parse it into a list yourself, using something like ast.literal_eval or json.loads, and then set it as an attribute. Otherwise you would end up iterating over a start_urls string (a very common Python pitfall), so that every character is treated as a separate URL.
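As a minimal sketch of that advice (the JSON-encoded argument format is an assumption chosen for the example):

import json

import scrapy

class ArgSpider(scrapy.Spider):
    name = 'argspider'

    def __init__(self, start_urls=None, *args, **kwargs):
        super(ArgSpider, self).__init__(*args, **kwargs)
        if start_urls:
            # Invoked as:
            # scrapy crawl argspider -a start_urls='["http://www.example.com/1.html", "http://www.example.com/2.html"]'
            self.start_urls = json.loads(start_urls)

    def parse(self, response):
        pass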
A valid use case is to set the HTTP auth credentials used by HttpAuthMiddleware, or the user agent used by UserAgentMiddleware:

scrapy crawl myspider -a http_user=myuser -a http_pass=mypassword -a user_agent=mybot

Spider arguments can also be passed through the Scrapyd schedule.json API. See the Scrapyd documentation.
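As a sketch, assuming a Scrapyd instance listening on the default port 6800 with a project called myproject already deployed, spider arguments are simply extra POST fields of schedule.json:

import requests

# Fields other than 'project', 'spider' and the Scrapyd options
# are forwarded to the spider as -a style arguments.
response = requests.post('http://localhost:6800/schedule.json', data={
    'project': 'myproject',
    'spider': 'myspider',
    'category': 'electronics',
})
print(response.json())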
Scrapy comes with some useful generic spiders that you can use to subclass your own spiders. Their aim is to provide convenient functionality for a few common scraping cases, like following all links on a site based on certain rules, crawling from sitemaps, or parsing an XML/CSV feed.

For the examples used in the following spiders, we'll assume you have a project with a TestItem declared in the myproject.items module:
import scrapy

class TestItem(scrapy.Item):
    id = scrapy.Field()
    name = scrapy.Field()
    description = scrapy.Field()
class scrapy.spiders.CrawlSpider

This is the most commonly used spider for crawling regular websites, as it provides a convenient mechanism for following links by defining a set of rules. It may not be the best fit for your particular website or project, but it is generic enough for several cases, so you can start from it and override it as needed for more custom functionality, or just implement your own spider.

Apart from the attributes inherited from Spider (which you must specify), this class supports a new attribute:
rules
A list of one (or more) Rule objects. Each Rule defines a certain behaviour for crawling the site. Rule objects are described below. If multiple rules match the same link, the first one will be used, according to the order in which they are defined in this attribute.
This spider also exposes an overridable method:
parse_start_url(response)
This method is called for the start_urls responses. It allows you to parse the initial responses, and it must return an Item object, a Request object, or an iterable containing either of them.
class scrapy.spiders.Rule(link_extractor, callback=None, cb_kwargs=None, follow=None, process_links=None, process_request=None)
link_extractor
is a Link Extractor object which defines how links will be extracted from each crawled page.
callback
is a callable or a string (in which case a method from the spider object with that name will be used) to be called for each link extracted with the specified link_extractor. This callback receives a response as its first argument and must return a list containing Item and/or Request objects (or any of their subclasses).
Warning

When writing crawl spider rules, avoid using parse as the callback, since the CrawlSpider uses the parse method itself to implement its logic. So if you override the parse method, the crawl spider will no longer work.
cb_kwargs
is a dict containing the keyword arguments to be passed to the callback function.
follow
is a boolean which specifies whether links should be followed from each response extracted with this rule. If callback is None, follow defaults to True; otherwise it defaults to False.
process_links
is a callable or a string (in which case a method from the spider object with that name will be used) which will be called for each list of links extracted from each response using the specified link_extractor. This is mainly used for filtering purposes (see the sketch after the Rule parameters below).
process_request
is a callable or a string (in which case a method from the spider object with that name will be used) which will be called with every request extracted by this rule, and must return a request or None (to filter out the request).
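As a small sketch of process_links (the filtering condition and names are made up for illustration), a rule can drop unwanted links before they are turned into requests:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class FilteringSpider(CrawlSpider):
    name = 'filtering'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/']

    rules = (
        Rule(LinkExtractor(allow=('/item/', )),
             callback='parse_item',
             process_links='drop_session_links'),
    )

    def drop_session_links(self, links):
        # 'links' is the list of Link objects extracted from one response;
        # return only the ones that should actually be crawled.
        return [link for link in links if 'sessionid' not in link.url]

    def parse_item(self, response):
        self.logger.info('Item page: %s', response.url)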
Now let's take a look at an example CrawlSpider:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from myproject.items import TestItem

class MySpider(CrawlSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']

    rules = (
        # Extract links matching 'category.php' (but not matching 'subsection.php')
        # and follow links from them (since no callback means follow=True by default).
        Rule(LinkExtractor(allow=('category\.php', ), deny=('subsection\.php', ))),

        # Extract links matching 'item.php' and parse them with the spider's method parse_item
        Rule(LinkExtractor(allow=('item\.php', )), callback='parse_item'),
    )

    def parse_item(self, response):
        self.logger.info('Hi, this is an item page! %s', response.url)
        item = TestItem()
        item['id'] = response.xpath('//td[@id="item_id"]/text()').re(r'ID: (\d+)')
        item['name'] = response.xpath('//td[@id="item_name"]/text()').extract()
        item['description'] = response.xpath('//td[@id="item_description"]/text()').extract()
        return item
This spider would start crawling example.com's home page, collecting category links and item links, and parsing the latter with the parse_item method. For each item response, some data is extracted from the HTML using XPath and an Item is filled with it.
class scrapy.spiders.XMLFeedSpider
XMLFeedSpider is designed for parsing XML feeds by iterating through them by a certain node name. The iterator can be chosen from: iternodes, xml, and html. It is recommended to use the iternodes iterator for performance reasons, since the xml and html iterators generate the whole DOM at once in order to parse it. However, using html as the iterator may be useful when parsing XML with bad markup.

To set the iterator and the tag name, you must define the following class attributes:
iterator
A string which defines the iterator to use. It can be:

'iternodes' - a fast iterator based on regular expressions
'html' - an iterator which uses Selector. Keep in mind this uses DOM parsing and must load the whole DOM in memory, which could be a problem for big feeds
'xml' - an iterator which uses Selector. Keep in mind this uses DOM parsing and must load the whole DOM in memory, which could be a problem for big feeds

It defaults to 'iternodes'.
itertag
A string with the name of the node (or element) to iterate on. Example: itertag = 'product'
namespaces
A list of (prefix, uri) tuples which define the namespaces available in the document that will be processed with this spider. The prefix and uri will be used to automatically register namespaces using the register_namespace() method.

You can then specify nodes with namespaces in the itertag attribute.
Example:
class YourSpider(XMLFeedSpider):
    namespaces = [('n', 'http://www.sitemaps.org/schemas/sitemap/0.9')]
    itertag = 'n:url'
    # ...
Apart from these new attributes, this spider also has the following overridable methods:
adapt_response(response)
A method that receives the response as soon as it arrives from the spider middleware, before the spider starts parsing it. It can be used to modify the response body before parsing. This method receives a response and returns a response (which could be the same one or another one).
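A sketch of cleaning a feed before it is parsed (the junk prefix being stripped and the feed structure are assumptions for the example):

from scrapy.spiders import XMLFeedSpider

class CleaningFeedSpider(XMLFeedSpider):
    name = 'cleaningfeed'
    start_urls = ['http://www.example.com/feed.xml']
    itertag = 'item'

    def adapt_response(self, response):
        # Strip a stray UTF-8 BOM from the body so the iterator can parse it cleanly.
        body = response.body.replace(b'\xef\xbb\xbf', b'')
        return response.replace(body=body)

    def parse_node(self, response, node):
        yield {'name': node.xpath('name/text()').extract_first()}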
parse_node(response, selector)
This method is called for the nodes matching the provided tag name (itertag). It receives the response and a Selector for each node. Overriding this method is mandatory; otherwise your spider won't work. This method must return an Item object, a Request object, or an iterable containing either of them.
process_results(response, results)
This method is called for each result (Item or Request) returned by the spider, and it is intended to perform any last-time processing required before returning the results to the framework core, for example setting item IDs. It receives a list of results and the response which originated those results. It must return a list of results (Items or Requests).
These spiders are pretty easy to use; let's have a look at an example:
from scrapy.spiders import XMLFeedSpider
from myproject.items import TestItem

class MySpider(XMLFeedSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/feed.xml']
    iterator = 'iternodes'  # This is actually unnecessary, since it's the default value
    itertag = 'item'

    def parse_node(self, response, node):
        self.logger.info('Hi, this is a <%s> node!: %s', self.itertag, ''.join(node.extract()))

        item = TestItem()
        item['id'] = node.xpath('@id').extract()
        item['name'] = node.xpath('name').extract()
        item['description'] = node.xpath('description').extract()
        return item
Basically what we did there was create a spider that downloads a feed from the given start_urls, iterates through each of its item tags, prints them out, and stores some random data in an Item.
class scrapy.spiders.CSVFeedSpider
This spider is very similar to XMLFeedSpider, except that it iterates over rows instead of nodes. The method that gets called on each iteration is parse_row().
delimiter
A string with the separator character for each field in the CSV file. Defaults to ',' (comma).
quotechar
A string with the enclosure character for each field in the CSV file. Defaults to '"' (quotation mark).
headers

A list of the field names contained in the CSV feed file, used to extract fields from it.
parse_row(response, row)
Receives a response and a dict (representing each row) with a key for each provided (or detected) header of the CSV file. This spider also gives the opportunity to override the adapt_response and process_results methods for pre- and post-processing purposes.
Let's see an example similar to the previous one, but using a CSVFeedSpider:
from scrapy.spiders import CSVFeedSpider
from myproject.items import TestItem

class MySpider(CSVFeedSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/feed.csv']
    delimiter = ';'
    quotechar = "'"
    headers = ['id', 'name', 'description']

    def parse_row(self, response, row):
        self.logger.info('Hi, this is a row!: %r', row)

        item = TestItem()
        item['id'] = row['id']
        item['name'] = row['name']
        item['description'] = row['description']
        return item
class scrapy.spiders.SitemapSpider
SitemapSpider allows you to crawl a site by discovering the URLs using Sitemaps.

It supports nested sitemaps and discovering sitemap URLs from robots.txt.
sitemap_urls
A list of URLs pointing to the sitemaps whose URLs you want to crawl.

You can also point to a robots.txt, and it will be parsed to extract sitemap URLs from it.
sitemap_rules
A list of tuples (regex, callback) where:

regex is a regular expression to match URLs extracted from sitemaps. regex can be either a str or a compiled regex object.

callback is the callback to use for processing the URLs that match the regular expression. callback can be a string (indicating the name of a spider method) or a callable.

For example: sitemap_rules = [('/product/', 'parse_product')]

Rules are applied in order, and only the first one that matches will be used.

If you omit this attribute, all URLs found in sitemaps will be processed with the parse callback.
sitemap_follow
A list of regexes of sitemaps whose URLs should be followed. This is only useful for sites that use Sitemap index files pointing to other sitemap files.

By default, all sitemaps are followed.
sitemap_alternate_links
Specifies whether alternate links for one url should be followed. These are links for the same website in another language, passed within the same url block.
For example:
<url>
    <loc>http://example.com/</loc>
    <xhtml:link rel="alternate" hreflang="de" href="http://example.com/de"/>
</url>
With sitemap_alternate_links set, both URLs would be retrieved. With sitemap_alternate_links disabled, only http://example.com/ would be retrieved.

The default is sitemap_alternate_links disabled.
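A short sketch of enabling it (the sitemap URL is a placeholder):

from scrapy.spiders import SitemapSpider

class AlternateSpider(SitemapSpider):
    name = 'alternate'
    sitemap_urls = ['http://www.example.com/sitemap.xml']
    # Also follow the hreflang alternates declared in each url block.
    sitemap_alternate_links = True

    def parse(self, response):
        pass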
Simplest example: process all URLs discovered through sitemaps using the parse callback:
from scrapy.spiders import SitemapSpider

class MySpider(SitemapSpider):
    sitemap_urls = ['http://www.example.com/sitemap.xml']

    def parse(self, response):
        pass  # ... scrape item here ...
Process some URLs with a certain callback and other URLs with a different callback:
from scrapy.spiders import SitemapSpider

class MySpider(SitemapSpider):
    sitemap_urls = ['http://www.example.com/sitemap.xml']
    sitemap_rules = [
        ('/product/', 'parse_product'),
        ('/category/', 'parse_category'),
    ]

    def parse_product(self, response):
        pass  # ... scrape product ...

    def parse_category(self, response):
        pass  # ... scrape category ...
Follow the sitemaps defined in the robots.txt file, and only follow sitemaps whose URL contains /sitemap_shop:
from scrapy.spiders import SitemapSpider

class MySpider(SitemapSpider):
    sitemap_urls = ['http://www.example.com/robots.txt']
    sitemap_rules = [
        ('/shop/', 'parse_shop'),
    ]
    sitemap_follow = ['/sitemap_shops']

    def parse_shop(self, response):
        pass  # ... scrape shop here ...
Combine SitemapSpider with other sources of URLs:
import scrapy
from scrapy.spiders import SitemapSpider

class MySpider(SitemapSpider):
    sitemap_urls = ['http://www.example.com/robots.txt']
    sitemap_rules = [
        ('/shop/', 'parse_shop'),
    ]

    other_urls = ['http://www.example.com/about']

    def start_requests(self):
        requests = list(super(MySpider, self).start_requests())
        requests += [scrapy.Request(x, self.parse_other) for x in self.other_urls]
        return requests

    def parse_shop(self, response):
        pass  # ... scrape shop here ...

    def parse_other(self, response):
        pass  # ... scrape other here ...