
How to use Python to make a crawler

Nov 23, 2016, 01:23 PM
python

Getting Started" is a good motivation, but it may be slow. If you have a project in your hands or in your mind, then in practice you will be driven by the goal, instead of learning slowly like a learning module.

Besides, if every knowledge point in a body of knowledge is a node in a graph and every dependency is an edge, then that graph is certainly not a directed acyclic graph, because the experience of learning A can help you learn B. Therefore, you do not need to learn how to "get started", because such a "starting point" does not exist! What you need to learn is how to build something reasonably large, and in the process you will quickly pick up whatever you need to know. Of course, you could argue that you need to know Python first, otherwise how could you use Python to write a crawler? But in fact, you can perfectly well learn Python in the process of building this crawler :D

I saw the "technique" mentioned in many previous answers - what to use? How does the software crawl? Let me talk about the "Tao" and "Technology" - how the crawler works and how to implement it in python

To summarize briefly, you need to learn:

The basic working principles of a crawler

A basic HTTP scraping tool: Scrapy (see the minimal spider sketch after this list)

Bloom Filter: Bloom Filters by Example

If you need to crawl web pages at scale, you will need the concept of a distributed crawler. It is actually not that mysterious: you only need to learn how to maintain a distributed queue that all the machines in a cluster can share effectively. The simplest implementation is python-rq: https://github.com/nvie/rq

The combination of rq and Scrapy: darkrho/scrapy-redis · GitHub

Follow-up processing: web page content extraction (grangier/python-goose · GitHub) and storage (MongoDB)
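
To give a concrete feel for the Scrapy item above, here is a minimal spider sketch. Treat it as an illustration under assumptions: the spider name and output fields are made up, and it assumes a reasonably recent Scrapy version rather than anything from the original answer.

# minimal_spider.py -- illustrative sketch only
import scrapy

class CrawlEverythingSpider(scrapy.Spider):
    name = "crawl_everything"                    # hypothetical spider name
    start_urls = ["http://www.renminribao.com"]  # the example starting page used later in this article

    def parse(self, response):
        # Store the raw page (Scrapy can write this out to a feed file for you).
        yield {"url": response.url, "html": response.text}
        # Follow every link found on the page.
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)

Running it with something like scrapy runspider minimal_spider.py -o pages.jl already gives you a tiny recursive crawler with Scrapy's built-in duplicate filtering.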


The following is the long version of the story:

Let me talk about my experience of once writing a cluster that crawled the whole of Douban.

1) First, you have to understand how crawlers work.
Imagine you are a spider, and you have just been dropped onto the Internet. What should you do? No problem, just start crawling somewhere, for example the home page of the People's Daily. This is called the initial page; let's denote it by $.

On the People's Daily home page you see various links pointing away from that page, so you happily crawl over to the "Domestic News" page. Great, you have now crawled two pages (the home page and the domestic news page)! For the moment, don't worry about how to process the pages you have crawled; just imagine that you copy each page in full as an HTML document and store it on your own machine.

Suddenly you notice that the domestic news page has a link back to the home page. As a smart spider, you surely know you don't have to crawl back, because you have already seen it. So you need to use your brain to remember the addresses of the pages you have already viewed. Then, every time you see a new link that might need to be crawled, you first check whether that address is already in your head. If you have been there, don't go again.

Okay, in theory, if every page can be reached from the initial page, then it can be proved that you will definitely crawl all the web pages.

So how do you implement this in Python?
Quite simply:
import queue

initial_page = "http://www.renminribao.com"

url_queue = queue.Queue()
seen = set()

seen.add(initial_page)
url_queue.put(initial_page)

while True:  # keep going until the seas run dry and the rocks crumble
    if url_queue.qsize() > 0:
        current_url = url_queue.get()       # take the first url out of the queue
        store(current_url)                  # store the web page this url points to
        for next_url in extract_urls(current_url):  # extract the urls this page links to
            if next_url not in seen:
                seen.add(next_url)
                url_queue.put(next_url)
    else:
        break

This is already very much pseudocode as written.
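
The pseudocode leaves store() and extract_urls() undefined. As a rough illustration only (these helpers are my own assumption, not part of the original answer), they could be filled in with nothing but the standard library, roughly like this:

from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkParser(HTMLParser):
    # Collect the href attribute of every <a> tag on a page.
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def store(url):
    # Naive storage: dump the raw HTML to a local file named after the url.
    html = urlopen(url, timeout=10).read()
    with open(url.replace("/", "_") + ".html", "wb") as f:
        f.write(html)

def extract_urls(url):
    # Download the page and return the absolute urls of all links on it.
    html = urlopen(url, timeout=10).read().decode("utf-8", errors="ignore")
    parser = LinkParser()
    parser.feed(html)
    return [urljoin(url, link) for link in parser.links]

A real crawler would of course download each page only once and reuse the bytes for both storage and link extraction; the point here is just to show where the two placeholders fit.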

That is the backbone of every crawler. Now let's analyze why crawling is nevertheless a very complicated business: search engine companies usually have an entire team just to maintain and develop their crawlers.

2) Efficiency
If you just polished up the code above and ran it directly, it would take you a whole year to crawl all of Douban's content, never mind that a search engine like Google needs to crawl the entire web.

What is the problem? There are simply too many pages to crawl, and the code above is far too slow. Suppose the whole web has N websites; then the complexity of the duplicate checking is N*log(N), because every page has to be traversed once, and each membership check against the set costs log(N). OK, OK, I know Python's set is implemented with a hash table, but it is still too slow this way, and at the very least the memory usage is not efficient.

What is the usual way to do duplicate checking? The Bloom Filter. Simply put, it is still a hashing approach, but its special feature is that it uses a fixed amount of memory (which does not grow with the number of URLs) to decide in O(1) time whether a URL is already in the set. Unfortunately, there is no free lunch. The catch is that if the URL is not in the set, the Bloom Filter can tell you with 100% certainty that it has not been seen. But if the URL is in the set, it can only tell you: this URL should have appeared already, but I have a 2% uncertainty. Note that this uncertainty can become very small when the memory you allocate is large enough. A simple tutorial: Bloom Filters by Example
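
For intuition, here is a minimal hand-rolled sketch of the idea (my own toy illustration, not production code; a real crawler would use a tuned library):

import hashlib

class SimpleBloomFilter:
    # Fixed memory, O(1) membership checks, a small false-positive rate,
    # and no false negatives -- exactly the trade-off described above.
    def __init__(self, size_in_bits=8 * 1024 * 1024, num_hashes=7):
        self.size = size_in_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_in_bits // 8 + 1)

    def _positions(self, url):
        # Derive several bit positions by hashing the url with different salts.
        for i in range(self.num_hashes):
            digest = hashlib.md5((str(i) + url).encode("utf-8")).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, url):
        for pos in self._positions(url):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, url):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(url))

seen = SimpleBloomFilter()
seen.add("http://www.renminribao.com")
print("http://www.renminribao.com" in seen)    # True
print("http://example.com/never-seen" in seen)  # almost certainly False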

Note this property: if a URL has already been viewed, there is a small probability it will be viewed again (which does not matter much; you will not wear yourself out by looking at it once more). But if a URL has not been viewed, it will definitely be viewed (this is very important, otherwise we would miss some web pages!). [IMPORTANT: there is a problem with this paragraph, please skip it for now]


Okay, now we are close to the fastest way of handling duplicate checking. On to the next bottleneck: you only have one machine. No matter how much bandwidth you have, as long as the speed at which your machine downloads web pages is the bottleneck, all you can do is speed that up. If one machine is not enough, use many! Of course, we assume each machine is already running at maximum efficiency, using multiple threads (or, for Python, multiple processes).
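
As a sketch of what "maximum efficiency on one machine" could look like, here is a small multi-process downloader using only the standard library. The URL list is hypothetical and this is my own illustration, not the author's code:

from multiprocessing import Pool
from urllib.request import urlopen

def download(url):
    # Fetch one page; return the url together with its raw bytes (or None on error).
    try:
        return url, urlopen(url, timeout=10).read()
    except OSError:
        return url, None

if __name__ == "__main__":
    urls = ["http://www.renminribao.com", "http://example.com"]  # hypothetical work list
    with Pool(processes=8) as pool:
        for url, body in pool.imap_unordered(download, urls):
            if body is not None:
                print(url, len(body), "bytes")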

3) Cluster crawling
When I crawled Douban, I used more than 100 machines running around the clock for a whole month. Imagine if you only used one machine: you would have to keep it running for 100 months...

So, suppose you now have 100 machines available. How do you implement a distributed crawling algorithm in Python?

We call the 99 machines with less computing power slaves, and the one larger machine the master. Looking back at the url_queue in the code above: if we can put that queue on the master, then all the slaves can talk to the master over the network. Whenever a slave finishes downloading a web page, it asks the master for a new page to crawl, and every time a slave captures a new page, it sends all the links on that page to the master's queue. Likewise, the Bloom Filter lives on the master, and the master only hands out URLs that have not been visited. The Bloom Filter sits in the master's memory, while the visited URLs are kept in Redis running on the master, which ensures that all operations are O(1). (At least amortized O(1); for the access efficiency of Redis, see: LINSERT – Redis)

Consider how to implement this in Python:

Install Scrapy on each slave, so that every machine becomes a slave capable of crawling; on the master, install Redis and rq to use as the distributed queue.


The code would then be written roughly as:

#slave.py
current_url = request_from_master()
to_send = []
for next_url in extract_urls(current_url):
    to_send.append(next_url)
store(current_url)
send_to_master(to_send)

#master.py
distributed_queue = DistributedQueue()
bf = BloomFilter()
initial_pages = "www.renmingribao.com"
while True:
    if request == 'GET':
        if distributed_queue.size() > 0:
            send(distributed_queue.get())
        else:
            break
    elif request == 'POST':
        bf.put(request.url)
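
To make that pseudocode a little more concrete, here is one way the master's queue could be backed by Redis using the redis-py client. This is my own hedged sketch (the host and key names are assumptions, and it uses a plain Redis set for deduplication instead of the in-memory Bloom Filter), not the code the author actually ran:

import redis

r = redis.Redis(host="master-host", port=6379)   # hypothetical master address

def master_add_url(url):
    # Enqueue a url only if the cluster has never seen it before.
    if r.sadd("seen_urls", url):        # sadd returns 1 when the member is new
        r.lpush("url_queue", url)

def slave_get_url():
    # Block until the master's queue has work, then pop one url for this slave.
    _key, url = r.brpop("url_queue")
    return url.decode("utf-8")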

Okay, in fact, as you can imagine, someone has already written what you need: darkrho/scrapy-redis · GitHub
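
If you go the scrapy-redis route, most of the work is configuration in your project's settings.py. The snippet below is a rough sketch based on the settings scrapy-redis documents; treat the exact names and values as assumptions to check against the version you install:

# settings.py (sketch)
SCHEDULER = "scrapy_redis.scheduler.Scheduler"               # queue requests in Redis
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"   # deduplicate requests in Redis
SCHEDULER_PERSIST = True                                     # keep the queue across restarts
REDIS_URL = "redis://master-host:6379"                       # hypothetical master address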

4) Outlook and post-processing

Although the word "simple" was used many times above, actually building a crawler at commercial scale is far from easy. The code above can crawl an entire website without much trouble.

But if you also need follow-up processing such as:

Effective storage (how should the database be laid out?)

Effective duplicate detection (here this means deduplicating page content; we don't want to crawl both the People's Daily and the Damin Daily that plagiarizes it)

Effective information extraction (for example, how to extract all the street addresses on a web page, such as "Zhonghua Road, Fenjin Road, Chaoyang District"); a search engine usually does not need to store all the information, for instance why would I keep the images...

Timely updates (predicting how often a page will be updated)

...then each of these is a topic in its own right.


