I have recently been learning Python, including how to use it to scrape data from the web, which is how I discovered Scrapy, a very popular Python crawling framework. Let's walk through Scrapy's architecture so we can make better use of this tool.
1. Overview
The following figure shows the general architecture of Scrapy, including its main components and the system's data processing flow (the green arrows indicate the direction of data flow). Let's explain the role of each component and the data processing flow one by one.
2. Components
1. Scrapy Engine
The Scrapy engine controls the data flow of the entire system and triggers events as processing moves along. For more details, see the data processing flow below.
2. Scheduler
The scheduler accepts requests from the Scrapy engine and sorts them into its queue, handing them back to the engine later when the engine asks for the next request (see the sketch below).
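As a small, hedged illustration of how requests end up ordered in the scheduler's queue, the hypothetical spider below assigns different `priority` values to its requests; the scheduler generally hands higher-priority requests to the engine first. The spider name and URLs are made up.

```python
import scrapy

class PrioritySpider(scrapy.Spider):
    # Hypothetical spider, only to illustrate request priority in the scheduler.
    name = "priority_demo"

    def start_requests(self):
        # Both requests go into the scheduler's queue; the one with the higher
        # priority value is normally dispatched to the downloader first.
        yield scrapy.Request("http://example.com/important", priority=10,
                             callback=self.parse)
        yield scrapy.Request("http://example.com/normal", priority=0,
                             callback=self.parse)

    def parse(self, response):
        self.logger.info("Got %s", response.url)
```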
3. Downloader
The main responsibility of the downloader is to fetch web pages and hand the downloaded content back to the engine, which passes it on to the spiders. A few settings that shape its behaviour are shown below.
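In everyday use the downloader is configured rather than subclassed. The snippet below lists a few real Scrapy settings that influence how it fetches pages; the values (and the user-agent string) are only illustrative, not recommendations.

```python
# settings.py -- illustrative values for a few downloader-related settings
CONCURRENT_REQUESTS = 16   # how many requests the downloader handles in parallel
DOWNLOAD_DELAY = 0.5       # seconds to wait between requests to the same site
DOWNLOAD_TIMEOUT = 30      # give up on a response after this many seconds
USER_AGENT = "my-crawler (+http://example.com)"  # hypothetical identification string
```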
4. Spiders
Spiders are classes defined by Scrapy users themselves to crawl the content returned from specified URLs and parse the resulting pages. Each spider can handle one domain or a group of domains; in other words, a spider defines the crawling and parsing rules for a specific website.
The crawling cycle of a spider works roughly as follows (a minimal spider sketch follows the list):
1). The spider starts with initial requests for the first URLs and registers a callback to be invoked when each request returns. These first requests are produced by the start_requests() method, which by default generates a request for each URL in start_urls and uses the parse() method as the callback.
2). In the callback function, you parse the response and return item objects, request objects, or an iterable of both. The returned requests also carry callbacks; Scrapy downloads them and then handles their responses with the specified callback.
3). In the callback function, you parse the content of the page, typically using XPath selectors (although you can also use BeautifulSoup, lxml, or any other parser you prefer), and generate parsed items from it.
4). Finally, the items returned from the spider are usually passed to the item pipeline.
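Here is a minimal sketch of that cycle. The fictional books.example.com site and the XPath expressions are assumptions made up for illustration; `scrapy.Spider`, `start_urls`, the selectors, and `response.follow()` are standard Scrapy pieces.

```python
import scrapy

class BooksSpider(scrapy.Spider):
    # Hypothetical spider for a fictional site, sketching the cycle above.
    name = "books"
    start_urls = ["http://books.example.com/page/1"]  # start_requests() turns these into requests

    def parse(self, response):
        # Parse the response with XPath selectors and yield parsed items...
        for title in response.xpath("//h3/a/@title").getall():
            yield {"title": title}

        # ...as well as new requests, each handled by a callback of its own.
        next_page = response.xpath("//li[@class='next']/a/@href").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```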
5. Item Pipeline
The main responsibility of the item pipeline is to process the items extracted from web pages by the spiders; its main tasks are to clean, validate, and store the data. When a page has been parsed by a spider, the resulting items are sent to the item pipeline and processed by several components in a specific order. Each item pipeline component is a Python class with a simple method: it receives an item, acts on it, and decides whether the item should continue to the next stage of the pipeline or be dropped without further processing.
An item pipeline typically performs the following steps (a sketch of such a pipeline follows the list):
1). Clean HTML data
2). Validate the parsed data (check whether the item contains the necessary fields)
3). Check for duplicate data (and drop duplicates)
4). Store the parsed data in the database
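The sketch below shows what such a pipeline component can look like, assuming items are dicts with a `title` field (an assumption of this example); it cleans, validates, and de-duplicates items before passing them on. A real project would also enable the class via the `ITEM_PIPELINES` setting.

```python
from scrapy.exceptions import DropItem

class CleanAndDedupePipeline:
    """Sketch of an item pipeline: clean a field, validate it, drop duplicates."""

    def __init__(self):
        self.seen_titles = set()

    def process_item(self, item, spider):
        title = item.get("title")
        if not title:                      # 2) validation: required field missing
            raise DropItem("missing title")
        title = title.strip()              # 1) cleaning
        if title in self.seen_titles:      # 3) de-duplication
            raise DropItem(f"duplicate item: {title}")
        self.seen_titles.add(title)
        item["title"] = title
        return item                        # 4) pass on, e.g. to a storage pipeline
```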
6. Downloader middlewares
Downloader middleware is a hook framework that sits between the Scrapy engine and the downloader and processes the requests and responses passing between them. It provides a way to extend Scrapy's functionality with custom code in a lightweight, low-level system with global control over requests and responses. A small sketch follows.
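As an illustration, here is a sketch of a downloader middleware that sets a user-agent from a made-up list on the way out and logs responses on the way back; it would be enabled through the `DOWNLOADER_MIDDLEWARES` setting.

```python
import random

class RandomUserAgentMiddleware:
    """Sketch of a downloader middleware; the user-agent list is made up."""

    USER_AGENTS = ["bot/1.0", "bot/2.0"]

    def process_request(self, request, spider):
        # Called for every request on its way to the downloader.
        request.headers["User-Agent"] = random.choice(self.USER_AGENTS)
        return None  # None means: continue processing this request normally

    def process_response(self, request, response, spider):
        # Called for every response on its way back to the engine.
        spider.logger.debug("Downloaded %s (%s)", response.url, response.status)
        return response
```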
7. Spider middlewares
Spider middleware is a hook framework between the Scrapy engine and the spiders; its main job is to process the spiders' response input and their request and item output. It provides a way to extend Scrapy's functionality with custom code that hooks into Scrapy's spider processing mechanism: you can insert code to handle the responses sent to spiders as well as the requests and items that spiders return. A minimal sketch follows.
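Here is a minimal sketch of a spider middleware using only the standard `process_spider_input()` and `process_spider_output()` hooks; it would be enabled through the `SPIDER_MIDDLEWARES` setting, and the logging is just illustrative.

```python
class LoggingSpiderMiddleware:
    """Sketch of a spider middleware watching what goes into and out of spiders."""

    def process_spider_input(self, response, spider):
        # Runs before the response reaches the spider callback.
        spider.logger.debug("Response %s entering spider", response.url)
        return None  # None means: hand the response to the spider as usual

    def process_spider_output(self, response, result, spider):
        # Runs on whatever the spider callback returns (items and requests).
        for element in result:
            yield element  # items and requests could be filtered or modified here
```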
8. Scheduler middlewares
Scheduler middleware sits between the Scrapy engine and the scheduler; its main job is to process the requests passed from the engine to the scheduler and back. It provides another way to extend Scrapy's functionality with custom code.
3. Data processing flow
The entire data processing flow of Scrapy is controlled by the Scrapy engine, and it mainly runs as follows (a runnable end-to-end sketch follows the list):
1). The engine opens a domain, locates the spider that handles that domain, and asks the spider for the first URLs to crawl.
2). The engine receives the first URLs to crawl from the spider and schedules them as requests in the scheduler.
3). The engine asks the scheduler for the next URL to crawl.
4). The scheduler returns the next URL to be crawled, and the engine sends it to the downloader through the downloader middleware.
5). Once the downloader has fetched the page, it sends the response back to the engine through the downloader middleware.
6). The engine receives the response from the downloader and sends it to the spider through the spider middleware for processing.
7). The spider processes the response and returns scraped items and new requests to the engine.
8). The engine sends the scraped items to the item pipeline and the new requests to the scheduler.
9). The system repeats the steps after step 2 until there are no requests left in the scheduler, and then the engine closes the domain.
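To see this loop end to end, the sketch below runs a small spider against the public quotes.toscrape.com practice site using `CrawlerProcess`; the engine opens the spider, schedules its `start_urls`, downloads the pages, calls `parse()`, and stops once the scheduler has no requests left. The CSS selector is an assumption about that site's markup.

```python
from scrapy import Spider
from scrapy.crawler import CrawlerProcess

class QuotesSpider(Spider):
    # Small spider against the quotes.toscrape.com practice site.
    name = "quotes"
    start_urls = ["http://quotes.toscrape.com/"]

    def parse(self, response):
        for text in response.css("div.quote span.text::text").getall():
            yield {"text": text}

if __name__ == "__main__":
    # CrawlerProcess starts the engine, which drives the whole loop described above.
    process = CrawlerProcess(settings={"LOG_LEVEL": "INFO"})
    process.crawl(QuotesSpider)
    process.start()
```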
4. Driver
Scrapy is built on Twisted, a popular event-driven networking framework written in Python, so it handles network I/O with non-blocking, asynchronous processing.
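The toy Twisted example below (independent of Scrapy, with a made-up "fetch") shows the non-blocking style this implies: work is scheduled on the reactor and completed through callbacks on a `Deferred` rather than by blocking calls.

```python
from twisted.internet import defer, reactor

def fetch_later(url):
    # Return a Deferred that fires with a fake "response" one second later,
    # instead of blocking the calling thread while waiting.
    d = defer.Deferred()
    reactor.callLater(1, d.callback, "response for %s" % url)
    return d

def on_response(body):
    print(body)
    reactor.stop()

fetch_later("http://example.com").addCallback(on_response)
reactor.run()  # the event loop drives all pending work, as it does inside Scrapy
```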
That is all for this article. I hope it helps you in your learning.