
What does crawling data mean?


Crawling data means using a web crawler program to obtain the content you need from websites, such as text, videos, images, and other data. A web crawler (also called a web spider) is a program or script that automatically fetches information from the World Wide Web according to certain rules.


What is the use of learning how to crawl data?

Take, for example, the search engines everyone uses (Google, Sogou).

When a user searches for a keyword on Google, Google analyzes that keyword and picks, from the web pages it has "indexed", the entries most likely to suit the user, then presents them. Obtaining those web pages in the first place is the crawler's job; deciding which pages are the most valuable to show the user additionally requires suitable ranking algorithms, which is where data-mining knowledge comes in.

For smaller applications, consider measuring the workload of a testing team: that requires counting the number of change requests per week or month, along with the number and details of the defects recorded in Jira.

Or take the recent World Cup: suppose you want to compile statistics on each player or country and store that data for other purposes.

Alternatively, you can do analysis driven by your own interests (say, statistics on the popularity of a book or movie). That requires crawling the data from existing web pages and then carrying out the specific analysis or statistical work on what you obtain.

What basic knowledge is needed to learn a simple crawler?

I divide the basic knowledge into two parts:

1. Front-end basic knowledge

HTML, CSS, JSON; Ajax

Reference materials:

http://www.w3school.com.cn/h.asp

http://www.w3school.com.cn/ajax/

http://www.w3school.com.cn/json/

https://www.php.cn/course/list/1.html

https://www.php.cn/course/list/2.html

https://www.html.cn/

2. Python programming knowledge

(1) Python basics

Basic syntax, dictionaries, lists, functions, regular expressions, JSON, etc.

Reference materials:

http://www.runoob.com/python3/python3-tutorial.html

https://www.py.cn/

https://www.php.cn/course/list/30.html

(2) Commonly used Python libraries:

Python's urllib library (I mostly use the urlretrieve function from this module, mainly to save downloaded resources such as documents, images, MP3s, and videos).
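
As a quick illustration, here is a minimal sketch of saving a resource with urlretrieve; the URL and filename are placeholders:

```python
from urllib.request import urlretrieve

# Download a remote resource and save it to a local file.
# The URL is a placeholder; point it at a real document/image/video.
url = "https://example.com/sample.jpg"
local_path, headers = urlretrieve(url, "sample.jpg")

print(local_path)  # path of the saved file
print(headers)     # HTTP response headers
```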

Python's PyMySQL library (database connections plus insert, delete, update, and query operations).
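
A minimal PyMySQL sketch, assuming a local MySQL server and a hypothetical `pages` table with `id` and `url` columns:

```python
import pymysql

# Connection parameters are placeholders; adjust them for your setup.
conn = pymysql.connect(host="localhost", user="root",
                       password="secret", database="test",
                       charset="utf8mb4")
try:
    with conn.cursor() as cur:
        # Insert a row, then read it back.
        cur.execute("INSERT INTO pages (url) VALUES (%s)",
                    ("https://example.com",))
        conn.commit()
        cur.execute("SELECT id, url FROM pages")
        for row in cur.fetchall():
            print(row)
finally:
    conn.close()
```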

Python's bs4 module (requires knowledge of CSS selectors, the HTML tree structure / DOM tree, etc.; it locates the content we need by CSS selector, HTML tag, or attribute).
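
A small bs4 example on an inline HTML snippet, locating content by tag/attribute and by CSS selector:

```python
from bs4 import BeautifulSoup

html = """
<html><body>
  <div class="article">
    <h1 id="title">Hello</h1>
    <a href="https://example.com">a link</a>
  </div>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Locate content by tag and attribute, or by CSS selector.
print(soup.find("h1", id="title").get_text())    # -> Hello
print(soup.select_one("div.article a")["href"])  # -> https://example.com
```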

Python's requests module (as the name suggests, it is used to send requests such as POST/GET and obtain a Response object).
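
A minimal requests sketch; the URLs and form fields are placeholders:

```python
import requests

# GET request: fetch a page and inspect the Response object.
resp = requests.get("https://example.com", timeout=10)
print(resp.status_code)  # e.g. 200
print(resp.text[:200])   # first 200 characters of the body

# A POST request with form data works the same way.
resp = requests.post("https://example.com/login",
                     data={"user": "demo", "password": "demo"},
                     timeout=10)
```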

Python's os module (this module provides a very rich set of methods for working with files and directories; os.path.join and os.path.exists are among the most commonly used).
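
A short example of the two os.path helpers mentioned above; the directory and file names are arbitrary:

```python
import os

# Build a path in a platform-independent way.
save_dir = os.path.join("data", "images")

# Create the directory only if it does not already exist.
if not os.path.exists(save_dir):
    os.makedirs(save_dir)

file_path = os.path.join(save_dir, "photo.jpg")
print(file_path)                  # data/images/photo.jpg on Linux/macOS
print(os.path.exists(file_path))  # False until the file is written
```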

Reference materials: for this part, refer to the API documentation of the relevant modules.

Extended information:

A web crawler is a program that automatically extracts web pages. It downloads pages from the World Wide Web for a search engine and is an important component of search engines.

A traditional crawler starts from the URLs of one or several initial web pages and obtains the URLs found on those pages. While crawling, it continuously extracts new URLs from the current page and puts them into a queue, until certain stopping conditions of the system are met.
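
To make this loop concrete, here is a rough breadth-first sketch (not any particular engine's implementation) using the requests and bs4 modules introduced above; a simple page limit stands in for a real stopping condition:

```python
from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def crawl(seed_urls, max_pages=50):
    """Start from seed URLs; keep extracting new links from each
    fetched page until the stop condition (a page limit) is met."""
    queue = deque(seed_urls)
    seen = set(seed_urls)
    pages = {}

    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            resp = requests.get(url, timeout=10)
        except requests.RequestException:
            continue  # skip unreachable pages
        pages[url] = resp.text

        # Extract new URLs from the current page and enqueue them.
        soup = BeautifulSoup(resp.text, "html.parser")
        for a in soup.find_all("a", href=True):
            link = urljoin(url, a["href"])
            if link.startswith("http") and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```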

The workflow of a focused crawler is more complicated. According to some web-page analysis algorithm, it filters out links unrelated to its topic, keeps the useful links, and puts them into the queue of URLs waiting to be crawled. It then selects the next URL to crawl from the queue according to some search strategy, and repeats this process until a stopping condition of the system is reached.
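
To make the filtering step concrete, here is a deliberately simplistic stand-in for a web-page analysis algorithm: it keeps only links whose anchor text mentions a topic keyword. Real focused crawlers use far more sophisticated relevance models:

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def filter_links(html, base_url, topic_keywords):
    """Keep links whose anchor text mentions a topic keyword;
    discard the rest instead of adding them to the URL queue."""
    soup = BeautifulSoup(html, "html.parser")
    relevant = []
    for a in soup.find_all("a", href=True):
        anchor = a.get_text().lower()
        if any(kw in anchor for kw in topic_keywords):
            relevant.append(urljoin(base_url, a["href"]))
    return relevant

# Example: keep only World-Cup-related links from a fetched page.
# links = filter_links(page_html, "https://example.com", ["world cup"])
```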

In addition, all web pages fetched by the crawler are stored by the system and subjected to a certain amount of analysis, filtering, and indexing for later query and retrieval. For a focused crawler, the analysis results obtained in this process may also feed back into and guide subsequent crawling.

Compared with a general-purpose web crawler, a focused crawler must also solve three main problems:

(1) Description or definition of the crawling target;

(2) Analysis and filtering of web pages or data;

(3) Search strategy for URLs.

Recommended tutorial: "python tutorial"
