
A brief introduction to the Python crawler framework Scrapy

Oct 19, 2018, 05:04 PM
python

This article gives a brief introduction to Scrapy, a Python crawler framework. It is intended as a reference for readers who are getting started with it.

Scrapy Framework

Scrapy is an application framework written in pure Python for crawling websites and extracting structured data. It has a wide range of uses.

Because the framework does most of the work, users only need to customize a few modules to build a crawler that fetches web content and images.

Scrapy uses the Twisted (pronounced /'twɪstɪd/) asynchronous networking framework to handle network communication, which speeds up downloads without requiring us to implement asynchronous handling ourselves. It also provides various middleware interfaces, so it can be adapted flexibly to different needs.

[Figure: Scrapy architecture diagram; the green lines show the direction of data flow]

Scrapy Engine: Responsible for communication among the Spider, Item Pipeline, Downloader, and Scheduler, including passing signals and data between them.

Scheduler: Accepts the Requests sent by the engine, organizes them in a certain order, puts them in a queue, and returns them to the engine when the engine asks for them.

Downloader: Downloads all Requests sent by the Scrapy Engine and returns the resulting Responses to the engine, which hands them over to the Spider for processing.

Spider: Processes all Responses, parsing them to extract the data needed for the Item fields, and submits any URLs that need to be followed to the engine, which passes them to the Scheduler again.

Item Pipeline: Processes the Items obtained from the Spider and performs post-processing (detailed analysis, filtering, storage, etc.).

Downloader Middlewares: Can be regarded as components you can customize to extend the download functionality.

Spider Middlewares: Can be understood as components you can customize to extend and operate on the communication between the engine and the Spider (such as Responses entering the Spider and Requests leaving the Spider).
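To make the Scheduler's role concrete, here is a minimal Python sketch of a request queue. It is not Scrapy's actual implementation: the class name ToyScheduler and its methods are hypothetical, and skipping URLs that were already queued is an assumption about what organizing requests can involve. The real scheduler works on Request objects and is more elaborate.

```python
from collections import deque


class ToyScheduler:
    """Hypothetical stand-in for a scheduler: a FIFO queue that skips repeated URLs."""

    def __init__(self):
        self._queue = deque()  # requests waiting to be handed back to the engine
        self._seen = set()     # every URL that has ever been accepted

    def enqueue(self, url):
        """Queue a URL unless it was accepted before; return True if accepted."""
        if url in self._seen:
            return False
        self._seen.add(url)
        self._queue.append(url)
        return True

    def next_request(self):
        """Return the oldest queued URL, or None when the queue is empty."""
        return self._queue.popleft() if self._queue else None

    def __len__(self):
        return len(self._queue)


# The engine hands over three requests; the repeated one is dropped.
scheduler = ToyScheduler()
for url in ["http://example.com/a", "http://example.com/b", "http://example.com/a"]:
    scheduler.enqueue(url)
```

The engine would then call next_request() repeatedly and stop once it returns None.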


The Scrapy operation process

Once the code is written, the program starts to run...

Engine: Hi! Spider, which website are you working on?

Spider: The boss wants me to handle xxxx.com.

Engine: Give me the first URL that needs to be processed.

Spider: Here you go, the first URL is xxxxxxx.com.

Engine: Hi! Scheduler, I have a request here; please sort it and put it in the queue for me.

Scheduler: OK, processing. Please wait.

Engine: Hi! Scheduler, give me the request you processed.

Scheduler: Here you go, this is the request I have processed.

Engine: Hi! Downloader, please download this request for me according to the boss's downloader middleware settings.

Downloader: OK! Here you go, here is the download. (If it fails: sorry, the download of this request failed. The engine then tells the scheduler: the download of this request failed, so record it and we will download it again later.)

Engine: Hi! Spider, here is the downloaded result, already processed according to the boss's downloader middleware. Handle it yourself. (Note: responses are handled by the parse() method by default.)

Spider: (after processing the data, for the URLs that need to be followed up) Hi! Engine, I have two results here: this is the URL I need to follow up, and this is the Item data I obtained.

Engine: Hi! Pipeline, I have an item here; please handle it for me. Scheduler! This is a URL that needs to be followed; please handle it for me. Then the loop starts again from step 4 and continues until all the information the boss needs has been obtained.

Pipeline and Scheduler: OK, doing it now!

Note: the whole program stops only when there are no requests left in the scheduler. (This means Scrapy will also re-download URLs that failed to download.)
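The conversation above can be sketched as a small loop. This is a simplified model, not Scrapy's engine: the function crawl, the fake downloader and parser, the PAGES table, and the max_retries limit are all hypothetical names and assumptions introduced for illustration.

```python
from collections import deque


def crawl(start_url, download, parse, max_retries=2):
    """Run a toy engine loop until the scheduler queue is empty."""
    queue = deque([(start_url, 0)])  # scheduler: (url, failed attempts so far)
    seen = {start_url}
    items = []                       # stands in for the item pipeline
    while queue:
        url, attempts = queue.popleft()
        try:
            body = download(url)     # downloader
        except OSError:
            # Download failed: record it and try again later (assumed retry cap).
            if attempts < max_retries:
                queue.append((url, attempts + 1))
            continue
        for result in parse(url, body):  # spider
            if isinstance(result, dict):
                items.append(result)       # item goes to the pipeline
            elif result not in seen:
                seen.add(result)
                queue.append((result, 0))  # followed URL goes to the scheduler
    return items


# A fake three-page site: each page maps to the links it contains.
PAGES = {"/": ["/a", "/b"], "/a": [], "/b": ["/a"]}
failed_once = set()


def fake_download(url):
    # Fail the first attempt at "/b" to exercise the retry path.
    if url == "/b" and url not in failed_once:
        failed_once.add(url)
        raise OSError("simulated download failure")
    return PAGES[url]


def fake_parse(url, links):
    yield {"url": url, "link_count": len(links)}  # item data
    yield from links                              # URLs to follow up


items = crawl("/", fake_download, fake_parse)
```

The loop ends only when the queue is empty, so the page that failed once is still crawled after being queued again.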

Building a Scrapy crawler takes 4 steps:

Create a project (scrapy startproject xxx): create a new crawler project

Define the target (write items.py): specify the data you want to crawl

Write the spider (spiders/xxspider.py): write the spider that crawls the web pages

Store the content (pipelines.py): design pipelines to store the crawled content
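As a rough sketch of the last step only, a pipeline can be a plain class with a process_item method, plus optional open_spider and close_spider hooks. The class name JsonLinesPipeline and the output file items.jl are assumptions made for this example; it writes each item as one JSON line.

```python
import json


class JsonLinesPipeline:
    """Hypothetical pipeline that writes each item to a JSON Lines file."""

    def __init__(self, path="items.jl"):
        self.path = path  # assumed output path, not a Scrapy default
        self.file = None

    def open_spider(self, spider):
        # Called when the spider starts: open the output file.
        self.file = open(self.path, "w", encoding="utf-8")

    def close_spider(self, spider):
        # Called when the spider finishes: close the output file.
        self.file.close()

    def process_item(self, item, spider):
        # Store the item, then return it so later pipelines can use it too.
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item
```

To have Scrapy use such a pipeline, it would be listed in the ITEM_PIPELINES setting in settings.py, for example under a path like myproject.pipelines.JsonLinesPipeline (myproject is a placeholder project name).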

