


A brief introduction to the Python crawler framework Scrapy
This article gives a brief introduction to the Python crawler framework Scrapy. It should be a useful reference for anyone getting started with the framework.
Scrapy Framework
Scrapy is an application framework written in pure Python for crawling website data and extracting structured data. It has a wide range of uses.
Thanks to the framework, users only need to customize and develop a few modules to easily implement a crawler that scrapes web content and images, which is very convenient.
Scrapy uses the Twisted asynchronous networking framework to handle network communication, which speeds up downloads without requiring us to implement an asynchronous framework ourselves. It also provides various middleware interfaces, so it can flexibly accommodate a wide range of needs.
Scrapy architecture diagram (the green line is the data flow direction):
Scrapy Engine: responsible for communication among the Spider, Item Pipeline, Downloader, and Scheduler, including passing signals and data.
Scheduler: accepts the Requests sent by the engine, arranges them into a queue in a certain order, and returns them to the engine when the engine asks for them.
Downloader: downloads all the Requests sent by the Scrapy Engine and returns the resulting Responses to the engine, which hands them over to the Spider for processing.
Spider: processes all Responses, analyzes and extracts data to fill the fields of an Item, and submits any URLs that need to be followed up to the engine, where they enter the Scheduler again.
Item Pipeline: processes the Items obtained from the Spider and performs post-processing on them (detailed analysis, filtering, storage, etc.).
Downloader Middlewares: components that you can customize to extend the download functionality (a minimal sketch follows this list).
Spider Middlewares: components that customize and extend the communication between the engine and the Spider (for example, Responses entering the Spider and Requests leaving the Spider).
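As an illustration of downloader middleware, here is a minimal sketch that attaches a User-Agent header to every outgoing request. The class and module names are assumptions made for this example; in a real project it would typically live in the project's middlewares.py and be enabled through the DOWNLOADER_MIDDLEWARES setting.

# middlewares.py -- a minimal downloader middleware sketch (names are illustrative)
import random

class RandomUserAgentMiddleware:
    # A small pool of User-Agent strings; extend as needed.
    USER_AGENTS = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
        "Mozilla/5.0 (X11; Linux x86_64)",
    ]

    def process_request(self, request, spider):
        # Called for every Request on its way to the Downloader.
        request.headers["User-Agent"] = random.choice(self.USER_AGENTS)
        # Returning None tells Scrapy to continue handling the request normally.
        return None

To enable it, the middleware would be registered in settings.py, for example DOWNLOADER_MIDDLEWARES = {"myproject.middlewares.RandomUserAgentMiddleware": 543}, where "myproject" is an assumed project name and the number is the middleware's priority.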
How Scrapy operates
The code is written and the program starts to run...
Engine: Hi! Spider, which website are you working on?
Spider: The boss wants me to handle xxxx.com.
Engine: Give me the first URL that needs to be processed.
Spider: Here you go, the first URL is xxxxxxx.com.
Engine: Hi! Scheduler, I have a request here; please help me sort it into your queue.
Scheduler: OK, processing. Please wait.
Engine: Hi! Scheduler, give me the request you processed.
Scheduler: Here you go, this is the request I have processed.
Engine: Hi! Downloader, please download this request for me according to the boss's downloader middleware settings.
Downloader: OK! Here you go, here's the download. (If it fails: sorry, this request failed to download. The engine then tells the scheduler that the request failed; record it, and we will download it again later.)
Engine: Hi! Spider, here is something that has been downloaded and already processed according to the boss's downloader middleware. Handle it yourself. (Note: the Responses here are handled by the spider's parse() method by default; a minimal spider sketch follows this walkthrough.)
Spider: (after processing the data, for the URLs that need to be followed up) Hi! Engine, I have two results here: this is the URL I need to follow up, and this is the Item data I obtained.
Engine: Hi! Pipeline, I have an Item here; please handle it for me! Scheduler! Here is a URL that needs to be followed up; please handle it for me. Then the loop starts again from step 4, until all the information the boss needs has been obtained.
Pipeline and Scheduler: OK, on it now!
Note: the whole program stops only when there are no more requests left in the scheduler (which also means that Scrapy will re-download URLs that failed to download).
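To make the walkthrough above concrete, here is a minimal spider sketch. The site, selectors, and names are assumptions chosen for illustration: the parse() method receives each downloaded Response, yields items that go on to the Item Pipeline, and yields follow-up Requests that go back to the Scheduler.

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"  # used when running: scrapy crawl quotes
    start_urls = ["http://quotes.toscrape.com/"]  # example site, assumed for illustration

    def parse(self, response):
        # Responses are handled here by default, as described above.
        for quote in response.css("div.quote"):
            # Extracted data is yielded as an item and handed to the Item Pipeline.
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }

        # Follow-up URLs are yielded as new Requests and re-enter the Scheduler.
        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)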
Making a Scrapy crawler takes 4 steps (example sketches follow the list):
Create a new project (scrapy startproject xxx): create a new crawler project.
Clarify the target (write items.py): define the targets you want to crawl.
Make a crawler (spiders/xxspider.py): write a spider to start crawling web pages.
Store the content (pipelines.py): design pipelines to store the crawled content.
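The following sketch walks through these four steps with an assumed project name (myproject) and the same example fields as the spider above; treat the file contents as illustrative rather than definitive.

# Step 1: create a new crawler project (run in a shell)
#   scrapy startproject myproject

# Step 2: myproject/items.py -- define the data you want to crawl
import scrapy

class QuoteItem(scrapy.Item):
    text = scrapy.Field()
    author = scrapy.Field()

# Step 3: myproject/spiders/quotes_spider.py -- the spider sketched above

# Step 4: myproject/pipelines.py -- design a pipeline to store the crawled content
import json

class JsonWriterPipeline:
    def open_spider(self, spider):
        # Open the output file once when the spider starts.
        self.file = open("items.jsonl", "w", encoding="utf-8")

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        # Write each item as one JSON line, then pass it on to any later pipelines.
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item

The pipeline would then be enabled in settings.py, for example ITEM_PIPELINES = {"myproject.pipelines.JsonWriterPipeline": 300}, and the crawl started with scrapy crawl quotes.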
