How to use Python to make a crawler

高洛峰
Release: 2016-11-23 13:23:37

"Getting started" is a good motivation, but it can be slow. If you have a project in hand or in mind, then in practice you will be driven by the goal, instead of learning slowly, one study module at a time.

Besides, if every piece of knowledge in a field is a node in a graph and every dependency is an edge, then that graph is certainly not a directed acyclic graph, because the experience of learning A can help you learn B. So you do not need to learn how to "get started", because no such "getting started" point exists! What you need to learn is how to build something bigger; along the way you will quickly pick up whatever you need. Of course, you could argue that you need to know Python first, otherwise how could you use Python to write a crawler? But in fact you can learn Python in the process of writing the crawler :D

Many of the earlier answers talk about "technique": what software to use and how to crawl. Let me talk about both the "Tao" and the "technique": how a crawler works and how to implement it in Python.

To summarize briefly, you need to learn:

The basic working principles of a crawler

A basic HTTP scraping tool: Scrapy

Bloom filters: Bloom Filters by Example

If you need to crawl web pages on a large scale, the concept of a distributed crawler. It is not that mysterious: you only need to learn how to maintain a queue that a cluster of machines can share. The simplest implementation is python-rq: https://github.com/nvie/rq

Combining rq with Scrapy: darkrho/scrapy-redis · GitHub

Follow-up processing: web page content extraction (grangier/python-goose · GitHub), storage (MongoDB)


Now for the longer version:

Here is what I learned when I wrote a cluster to crawl all of Douban.

1) First, you need to understand how a crawler works.
Imagine you are a spider that has just been placed on the Internet. What should you do? No problem, just start somewhere, for example the home page of the People's Daily. These are called the initial pages, represented by $.

On the home page of the People's Daily you see various links leading away from that page, so you happily crawl over to the "Domestic News" page. Great, you have now crawled two pages (the home page and domestic news)! For now, don't worry about how to process a page once you have crawled it. Just imagine that you copy the whole page as HTML and carry it with you.

Suddenly you notice that the domestic news page has a link back to the home page. You surely know that you don't have to crawl back, because you have already seen it. So you need to use your brain to remember the addresses of the pages you have already viewed. Then, every time you see a new link that might need crawling, you first check in your head whether you have already visited that address. If you have, don't go.

Okay, in theory, if every page can be reached from the initial page, then it can be proved that you will crawl all of them.

So how do you implement it in Python?
Very simple:

import queue

# store() and extract_urls() are placeholders for your own storage and link-extraction code
initial_page = "http://www.renminribao.com"

url_queue = queue.Queue()
seen = set()

seen.add(initial_page)
url_queue.put(initial_page)

while True:  # keep going until the queue is exhausted
    if not url_queue.empty():
        current_url = url_queue.get()  # take the first URL in the queue
        store(current_url)             # store the page this URL points to
        for next_url in extract_urls(current_url):  # the URLs this page links to
            if next_url not in seen:
                seen.add(next_url)
                url_queue.put(next_url)
    else:
        break

This is already pretty much pseudocode.
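To make the loop above concrete, here is a minimal runnable sketch. It is only an illustration under stated assumptions: instead of downloading anything, it crawls a made-up three-page site held in a dictionary (FAKE_WEB, with example.com URLs that are not from the original), and extract_urls here takes the page's HTML as a second argument and is built on the standard library's html.parser.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical three-page site held in memory, standing in for real HTTP fetches.
FAKE_WEB = {
    "http://example.com/": '<a href="/news">News</a> <a href="/about">About</a>',
    "http://example.com/news": '<a href="/">Home</a> <a href="/about">About</a>',
    "http://example.com/about": '<a href="/">Home</a>',
}


class LinkParser(HTMLParser):
    """Collects the href of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_urls(base_url, html):
    """Return the absolute URLs linked from the given HTML."""
    parser = LinkParser()
    parser.feed(html)
    return [urljoin(base_url, link) for link in parser.links]


def crawl(initial_page, fetch):
    """Breadth-first crawl; returns a dict mapping each visited URL to its HTML."""
    stored = {}
    seen = {initial_page}
    url_queue = deque([initial_page])
    while url_queue:
        current_url = url_queue.popleft()
        html = fetch(current_url)
        if html is None:  # page missing or unreachable
            continue
        stored[current_url] = html
        for next_url in extract_urls(current_url, html):
            if next_url not in seen:  # the "have I been here?" check
                seen.add(next_url)
                url_queue.append(next_url)
    return stored


pages = crawl("http://example.com/", FAKE_WEB.get)
print(sorted(pages))
```

Swapping FAKE_WEB.get for a function that really downloads a URL would turn this into a single-machine crawler; everything else stays the same.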

The backbone of every crawler is right here. Now let's look at why crawlers are actually very complicated: search engine companies usually have a whole team to maintain and develop them.

2) Efficiency
If you take the code above and run it as is, it will take you a whole year to crawl all of Douban's content, not to mention that search engines like Google need to crawl the entire web.

What's the problem? There are too many pages to crawl, and the code above is too slow. Suppose there are N pages on the whole web; then the cost of duplicate checking is N*log(N), because every page has to be visited once, and each check against the set costs log(N). OK, OK, I know Python's set is implemented with a hash table, but this is still too slow, or at least not memory-efficient.

What is the usual way to check for duplicates? A Bloom filter. Simply put, it is still a hashing method, but its distinguishing feature is that it uses a fixed amount of memory (which does not grow with the number of URLs) to decide in O(1) whether a URL is already in the set. Unfortunately, there is no free lunch. The catch is this: when the BF says a URL is not in the set, it is 100% certain that the URL has not been seen. But when it says the URL is in the set, what it really means is: this URL should have appeared already, but I have, say, 2% uncertainty. Note that this uncertainty can be made very small when you allocate enough memory. A simple tutorial: Bloom Filters by Example
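For a feel of how this works, here is a minimal Bloom filter sketch. It is not a production implementation: the bit-array size, the number of hash positions, and the trick of deriving the positions from one SHA-256 digest are all assumptions chosen for illustration.

```python
import hashlib


class BloomFilter:
    """Fixed-size bit array with num_hashes bit positions per item."""

    def __init__(self, num_bits=8 * 1024, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive the positions from two halves of one SHA-256 digest (double hashing).
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True means "probably seen"; False means "definitely not seen".
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


bf = BloomFilter()
bf.add("http://example.com/")
print("http://example.com/" in bf)  # True: an added URL is always reported as present
print(len(bf.bits))                 # 1024 bytes, no matter how many URLs are added
```

The memory stays at 1024 bytes however many URLs go in; the price is that the more URLs you add, the more often a URL that was never added gets reported as present.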

Note this property. If a URL has been seen, it may with small probability be crawled again (no harm done; looking at a page a few more times won't wear you out). But if it has not been seen, it will definitely be crawled (this is very important, otherwise we would miss some pages!). [IMPORTANT: There is a problem with this paragraph, please skip it for now]


Okay, now we are close to the fastest way to handle duplicate checking. On to the next bottleneck: you only have one machine. No matter how much bandwidth you have, as long as the speed at which your machine downloads pages is the bottleneck, the only option is to raise that speed. If one machine isn't enough, use many! Of course, we assume each machine is already running at maximum efficiency, using multi-threading (for Python, multi-processing).
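Before reaching for more machines, here is a rough sketch of getting more out of one machine by downloading several pages at once. The fetch function is a hypothetical stand-in that sleeps instead of touching the network, and the pool size of 10 is an arbitrary assumption.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch(url):
    """Hypothetical stand-in for an HTTP download: sleeps instead of using the network."""
    time.sleep(0.05)
    return url, "<html>content of %s</html>" % url


urls = ["http://example.com/page/%d" % i for i in range(20)]

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=10) as pool:
    # map() runs fetch on up to 10 URLs at a time and keeps the input order
    results = dict(pool.map(fetch, urls))
elapsed = time.perf_counter() - start

print(len(results), "pages fetched in", round(elapsed, 2), "seconds")
```

Threads are used here because the simulated work is pure waiting; concurrent.futures also provides ProcessPoolExecutor with the same interface if you would rather follow the multi-process route.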

3) Cluster crawling
When crawling Douban, I used more than 100 machines in total, running around the clock for a month. Imagine using only one machine: you would have to run it for 100 months...

So, assuming you now have 100 machines available, how do you implement a distributed crawling algorithm in Python?

We call the 99 machines with less computing power slaves, and the remaining, larger machine the master. Now look back at url_queue in the code above: if we put this queue on the master, every slave can talk to the master over the network. Whenever a slave finishes downloading a page, it asks the master for a new page to crawl. And each time a slave fetches a new page, it sends all the links on that page to the master's queue. Likewise, the Bloom filter lives on the master, and the master now only sends slaves URLs that have not been visited. The Bloom filter is kept in the master's memory, while the visited URLs are kept in Redis running on the master, which ensures that all operations are O(1). (At least amortized O(1). For Redis access efficiency, see: LINSERT – Redis)

Now consider how to implement it in Python:

Install Scrapy on every slave, so that each machine becomes a slave capable of crawling; install Redis and rq on the master to serve as the distributed queue.


The code then looks like this (pseudocode again; the helper names are placeholders):

# slave.py
current_url = request_from_master()    # ask the master for a URL to crawl
to_send = []
for next_url in extract_urls(current_url):
    to_send.append(next_url)
store(current_url)
send_to_master(to_send)                # report the links found on this page

# master.py
distributed_queue = DistributedQueue()
bf = BloomFilter()
initial_pages = "www.renmingribao.com"
while True:
    if request == 'GET':               # a slave asks for work
        if distributed_queue.size() > 0:
            send(distributed_queue.get())
        else:
            break
    elif request == 'POST':            # a slave reports a URL
        bf.put(request.url)
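To watch the master/slave exchange actually run, here is a single-process simulation. Everything in it is a simplifying assumption: threads play the slaves, a plain set stands in for the Bloom filter, queue.Queue stands in for the Redis-backed distributed queue, and LINKS is a made-up link graph rather than real pages.

```python
import queue
import threading

# Hypothetical link graph standing in for real pages: URL -> URLs it links to.
LINKS = {
    "a": ["b", "c"],
    "b": ["a", "d"],
    "c": ["d"],
    "d": [],
}


class Master:
    """Owns the URL queue and the record of URLs already handed out."""

    def __init__(self, initial_page):
        self.url_queue = queue.Queue()
        self.seen = {initial_page}  # a plain set standing in for the Bloom filter
        self.lock = threading.Lock()
        self.url_queue.put(initial_page)

    def get_url(self):
        """The 'GET' side: hand out the next URL to crawl (blocks until one exists)."""
        return self.url_queue.get()

    def post_urls(self, urls):
        """The 'POST' side: enqueue only the URLs that have not been seen before."""
        with self.lock:
            for url in urls:
                if url not in self.seen:
                    self.seen.add(url)
                    self.url_queue.put(url)


def slave(master, stored):
    while True:
        current_url = master.get_url()
        if current_url is None:  # sentinel: no more work
            master.url_queue.task_done()
            return
        stored.append(current_url)  # "store" the page
        master.post_urls(LINKS.get(current_url, []))  # report its links
        master.url_queue.task_done()


master = Master("a")
stored = []
workers = [threading.Thread(target=slave, args=(master, stored)) for _ in range(3)]
for w in workers:
    w.start()
master.url_queue.join()         # wait until every queued URL has been processed
for _ in workers:
    master.url_queue.put(None)  # tell each slave to stop
for w in workers:
    w.join()
print(sorted(stored))
```

Because the master checks and updates its record of seen URLs under a lock, each page is handed out exactly once even though three slaves report links at the same time.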

Okay, as you might have guessed, someone has already written what you need: darkrho/scrapy-redis · GitHub

4) Outlook and post-processing

Although the word "simple" was used a lot above, actually implementing a commercial-scale crawler is not easy. The code above can crawl an entire website without much trouble.

But on top of that you may need follow-up processing such as:

Effective storage (how the database should be laid out)

Effective duplicate detection (here meaning page-level deduplication; we don't want to crawl both the People's Daily and the Damin Daily that plagiarized it)

Effective information extraction (for example, how to extract every address on a page, such as "Zhonghua Road, Fenjin Road, Chaoyang District"); search engines usually do not need to store everything, for instance, why would I keep the images...

Timely updates (predicting how often a page will be updated)
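As one small illustration of page-level duplicate detection, here is a sketch that compares two texts by their overlapping word windows ("shingles") and the Jaccard similarity of those sets. This is just one common approach swapped in for illustration, not what the cluster described above used; the sample sentences and the window size of 3 are made up.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word windows) in text."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


original = "the city opened a new subway line on monday morning"
copied = "the city opened a new subway line on monday evening"
unrelated = "stock markets fell sharply after the announcement"

print(jaccard(shingles(original), shingles(copied)))     # 7 shared shingles out of 9
print(jaccard(shingles(original), shingles(unrelated)))  # 0.0: nothing in common
```

A page whose similarity to an already stored page is above some threshold (which you would have to tune) can then be treated as a copy and skipped.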


source:php.cn