This article explains what a crawler is and introduces the basic concepts behind Python web crawlers. It should be a useful reference for readers who need it; I hope you find it helpful.
Introduction to crawler-related concepts
a) What is a crawler:
A crawler is a program that simulates the way a browser browses the Internet, and then crawls data from it.
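To make this concrete, here is a minimal sketch in Python using the third-party requests library (the URL is a placeholder, not something from the original article):

```python
# Minimal sketch of what a crawler does: send the same kind of HTTP request
# a browser would send, then keep the page data the server returns.
import requests  # third-party: pip install requests

url = "https://www.example.com"  # placeholder target page
response = requests.get(url)     # fetch the page, just as a browser would
response.encoding = response.apparent_encoding  # guess the right text encoding
print(response.text[:200])       # the first part of the HTML we "crawled"
```

Everything a browser does behind the scenes (DNS lookup, TCP connection, HTTP request) is handled by the library; the crawler's job is deciding what to fetch and what to keep.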
b) Which languages can implement crawlers:
1.php: Crawlers can be implemented in PHP. PHP is known as "the best language in the world" (a title it awarded itself, rather like the idiom about Wang Po praising her own melons), but PHP's support for the multi-threading and multi-processing that crawlers rely on is weak.
2.java: Crawlers can be implemented in Java, and Java handles them very well. It is the only language that keeps pace with Python here, Python's number one rival. However, Java crawler code tends to be bloated, and the cost of refactoring it is high.
3.c/c++: Crawlers can be implemented in C or C++, but doing so is mostly a demonstration of skill by certain experts; it is not a wise or reasonable choice.
4.python: Crawlers can be implemented in Python. Python implements and processes crawlers with simple syntax and beautiful code, supports a huge range of modules, has a low learning cost, and offers very powerful frameworks (Scrapy, among others). Indescribably good, and no buts! A minimal Scrapy spider is sketched after this list.
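As a taste of the Scrapy framework mentioned in item 4, here is a bare-bones spider. It targets quotes.toscrape.com, a public practice site for scraping tutorials; the CSS selectors match that site's markup and are otherwise illustrative:

```python
# A bare-bones Scrapy spider. Run with: scrapy runspider quotes_spider.py
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"                                # identifier used by `scrapy crawl quotes`
    start_urls = ["https://quotes.toscrape.com/"]  # public scraping practice site

    def parse(self, response):
        # Pick out each quote block with a CSS selector and yield its text.
        for quote in response.css("div.quote"):
            yield {"text": quote.css("span.text::text").get()}
```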
c) Classification of crawlers: According to usage scenarios, they can be divided into the following two categories
1. General-purpose crawlers: a general-purpose crawler is an important part of the crawling system of a search engine (Baidu, Google, Yahoo, etc.). Its main purpose is to download web pages from the Internet to local storage, forming a mirror backup of Internet content.
1) How do search engines crawl website data on the Internet?
a) The portal website actively submits its URL to the search engine company
b) The search engine company cooperates with DNS service providers to obtain website URLs
c) The portal website sets up friendly links on some well-known websites, so crawlers discover it by following links
2. Focused crawlers: a focused crawler crawls specified data from the web according to specified needs. For example: getting only the movie names and reviews from Douban, instead of all the data values on the entire page. A sketch of this idea follows.
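A focused crawler in sketch form: fetch one specific Douban movie page and keep only the title, ignoring the rest of the page. The movie URL here is illustrative, and the User-Agent header is an assumption based on Douban commonly rejecting bare script requests:

```python
# Focused-crawler sketch: grab one targeted field from one specific page.
import requests
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

url = "https://movie.douban.com/subject/1292052/"  # illustrative movie page
headers = {"User-Agent": "Mozilla/5.0"}  # Douban tends to refuse requests without a browser UA
response = requests.get(url, headers=headers)

soup = BeautifulSoup(response.text, "html.parser")
# Keep only the data we came for: the <title> tag, which contains the movie name.
print(soup.title.string.strip() if soup.title else "title not found")
```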
d) robots.txt protocol:
If you do not want certain pages on your portal website to be crawled, you can write a robots.txt file to constrain crawler programs' data gathering. To see what the format of the robots protocol looks like, take a look at Taobao's robots file (just visit www.taobao.com/robots.txt). Note, however, that this protocol is only the equivalent of a verbal agreement: no technical means enforce it, so it guards against gentlemen, not against villains. The crawler programs we write at the learning stage can ignore the robots protocol for now, though the sketch below shows how easy it is to honor it.
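Python's standard library can read and apply robots.txt for you, which is a polite habit even at the learning stage. A small sketch using urllib.robotparser against the Taobao file mentioned above:

```python
# Check a site's robots.txt before crawling, standard library only.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://www.taobao.com/robots.txt")  # the file mentioned above
rp.read()  # download and parse the rules

# can_fetch(user_agent, url) -> True if that agent may crawl the URL
print(rp.can_fetch("*", "https://www.taobao.com/"))
```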
e) Anti-crawler:
The portal website uses corresponding strategies and technical means to prevent crawler programs from crawling its data.
f) Anti-anti-crawler:
The crawler program uses corresponding strategies and technical means to crack the portal website's anti-crawler measures, so that it can crawl the data it wants. One common example is sketched below.
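One of the most common counter-anti-crawler moves is simply sending a browser-like User-Agent, because many portals refuse requests that identify themselves as scripts. A hedged sketch (the URL is a placeholder, and the UA string is just one example of a desktop browser identity):

```python
# Counter-anti-crawler sketch: disguise the request as a normal browser visit.
import requests

headers = {
    # A desktop-browser UA string; without one, some portals answer 403 or garbage.
    "User-Agent": ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                   "AppleWebKit/537.36 (KHTML, like Gecko) "
                   "Chrome/120.0.0.0 Safari/537.36"),
}
response = requests.get("https://www.example.com", headers=headers)  # placeholder URL
print(response.status_code)  # 200 if the disguise was accepted
```

More sophisticated anti-crawler measures (rate limiting, captchas, JavaScript challenges) call for correspondingly heavier tools, but the header trick above defeats the most basic checks.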