Background
With the rapid development of the Internet, the World Wide Web has become a carrier of vast amounts of information, and extracting and using this information effectively has become a huge challenge. Search engines such as the traditional general-purpose engines AltaVista, Yahoo!, and Google assist people in retrieving information and have become the entrance and guide through which users access the World Wide Web. However, these general-purpose search engines also have certain limitations:
(1) Users in different fields and with different backgrounds often have different search purposes and needs, and the results returned by a general-purpose search engine contain a large number of web pages the user does not care about.
(2) The goal of a general-purpose search engine is to maximize network coverage, so the contradiction between limited search engine server resources and unlimited network data resources will continue to deepen.
(3) As data forms on the World Wide Web become richer and network technology continues to develop, large amounts of different data such as pictures, databases, audio, and video multimedia have appeared. General-purpose search engines are often unable to discover and acquire such information-dense, partially structured data well.
(4) Most general-purpose search engines provide keyword-based retrieval and have difficulty supporting queries based on semantic information.
In order to solve the above problems, focused crawlers that crawl relevant web page resources in a targeted way came into being. A focused crawler is a program that automatically downloads web pages; it selectively accesses pages and related links on the World Wide Web according to established crawling goals in order to obtain the required information. Unlike a general-purpose web crawler, a focused crawler does not pursue broad coverage; its goal is to crawl web pages related to a specific topic and to prepare data resources for topic-oriented user queries.
1 Overview of the working principle and key technologies of focused crawlers
A web crawler is a program that automatically extracts web pages; it downloads pages from the World Wide Web for a search engine and is an important component of one. A traditional crawler starts from the URLs of one or several initial web pages, obtains the URLs on those initial pages, and, while crawling, continuously extracts new URLs from the current page and puts them into a queue until certain stopping conditions of the system are met. The workflow of a focused crawler is more complex: it must filter out links unrelated to the topic according to some web page analysis algorithm, keep the useful links, and put them into the queue of URLs waiting to be crawled. It then selects the next URL to crawl from the queue according to a certain search strategy and repeats this process until a stopping condition of the system is reached. In addition, all web pages crawled are stored by the system and undergo certain analysis, filtering, and indexing for subsequent query and retrieval; for a focused crawler, the analysis results obtained in this process may also provide feedback and guidance for future crawling.
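To make this workflow concrete, here is a minimal sketch of the queue-driven loop described above: it starts from a seed URL, extracts links from each fetched page, and only enqueues links whose anchor text passes a simple topic-keyword filter. The seed URL, the keyword list, and the naive relevance test are illustrative assumptions, not details taken from the text.

import re
from collections import deque

import requests

def focused_crawl(seed_url, topic_keywords, max_pages=50):
    # Queue-driven focused crawler sketch: fetch a page, filter its links by
    # topic relevance, and put the useful ones back into the URL queue.
    queue = deque([seed_url])      # URLs waiting to be crawled
    visited = set()
    results = {}                   # url -> page text, for later analysis and indexing

    while queue and len(visited) < max_pages:
        url = queue.popleft()      # breadth-first search strategy
        if url in visited:
            continue
        visited.add(url)
        try:
            html = requests.get(url, timeout=10).text
        except requests.RequestException:
            continue
        results[url] = html

        # Extract candidate links and keep only those whose anchor text
        # looks relevant to the topic (a deliberately naive filter).
        for href, anchor in re.findall(r'href="(https?://[^"]+)"[^>]*>([^<]*)</a>', html):
            if any(kw.lower() in anchor.lower() for kw in topic_keywords):
                queue.append(href)

    return results

# Hypothetical usage: crawl up to 50 pages related to crawlers from a seed page.
# pages = focused_crawl('https://example.com/', ['crawler', 'spider'])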
Compared with general-purpose web crawlers, focused crawlers also need to solve three main problems:
(1) Description or definition of the crawling target;
(2) Analysis and filtering of web pages or data;
(3) Search strategy for URLs.
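To illustrate problem (3), the search strategy, one common alternative to the first-in-first-out queue used in the sketch above is a best-first strategy: each candidate URL is scored against the topic, and the highest-scoring URL is crawled next. The scoring function below, which simply counts topic keywords in a link's anchor text, is only an illustrative assumption.

import heapq

def topic_score(anchor_text, topic_keywords):
    # Toy relevance score: count occurrences of topic keywords in the anchor text.
    text = anchor_text.lower()
    return sum(text.count(kw.lower()) for kw in topic_keywords)

class BestFirstFrontier:
    # URL frontier that always yields the most topic-relevant URL next.
    def __init__(self):
        self._heap = []        # (negative score, url) so that higher scores pop first
        self._seen = set()

    def push(self, url, score):
        if url not in self._seen:
            self._seen.add(url)
            heapq.heappush(self._heap, (-score, url))

    def pop(self):
        return heapq.heappop(self._heap)[1] if self._heap else None

# Hypothetical usage inside a crawl loop:
# frontier = BestFirstFrontier()
# frontier.push('https://example.com/python-crawler', topic_score('Python crawler tutorial', ['crawler']))
# next_url = frontier.pop()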
The website crawler below crawls the titles and content of all articles under a blog and saves them in the data directory. The details are as follows:
import requests
import re
import os

url = ''  # blog homepage URL (left empty in the original)

def get_html(url):
    # Fetch the given URL and get all of its HTML
    html_content = requests.get(url).text
    # From the HTML, match the hyperlink address and title of every article
    href_list = re.findall(r'href="(.*?)">(.*?)</a>', html_content)
    # Make sure the data directory exists before saving articles
    os.makedirs('data', exist_ok=True)
    for href, title in href_list:
        # Fetch each article page
        line_html = requests.get(href)
        line_content = line_html.text
        line_encoding = line_html.encoding
        print('Article title: %s, article encoding: %s' % (title, line_encoding))
        # Save the article content under the data directory, using a filename-safe title
        safe_title = re.sub(r'[\\/:*?"<>|]', '_', title)
        with open(os.path.join('data', '%s.html' % safe_title), 'w', encoding='utf-8') as f:
            f.write(line_content)

get_html(url)