
What is the puppeteer crawler? How crawlers work

Nov 19, 2018 pm 05:58 PM
javascript web crawler

This article introduces what a puppeteer crawler is and how crawlers work. It may be a useful reference for readers who need it.

What is a crawler?

A crawler is also called a web robot. You probably use search engines every day; crawlers are an important part of them, fetching content so that it can be indexed. Big data and data analysis are very popular nowadays, so where does the data come from? Much of it can be collected by web crawlers. Let's take a look at how they work.

[Figure: crawler flow chart]

The working principle of the crawler

As shown in the figure, this is the flow chart of a crawler. Crawling starts from a seed URL. The crawler downloads the web page, then parses and stores its content. At the same time, the URLs found in the parsed page are de-duplicated and added to the queue of URLs waiting to be crawled. The crawler then takes the next URL from the queue and repeats the steps above. Simple, isn't it?
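The loop described above can be sketched roughly as follows. This is only a minimal illustration, not a real crawler: the in-memory fakeWeb object and the fetchLinks helper are made-up stand-ins for downloading and parsing a page, and the site.test URLs are hypothetical.

```javascript
// Hypothetical in-memory "web": each URL maps to the links found on that page.
// A real crawler would download and parse HTML here instead.
const fakeWeb = {
  'http://site.test/': ['http://site.test/a', 'http://site.test/b'],
  'http://site.test/a': ['http://site.test/b', 'http://site.test/'],
  'http://site.test/b': ['http://site.test/c'],
  'http://site.test/c': [],
};

// Stand-in for "download the page and parse out its links".
function fetchLinks(url) {
  return fakeWeb[url] || [];
}

function crawl(seedUrl) {
  const queue = [seedUrl];          // URLs waiting to be crawled
  const seen = new Set([seedUrl]);  // every URL ever queued, used for de-duplication
  const stored = [];                // stands in for the storage step

  while (queue.length > 0) {
    const url = queue.shift();      // take the next URL from the queue
    const links = fetchLinks(url);  // download and parse
    stored.push(url);               // store the result
    for (const link of links) {
      if (!seen.has(link)) {        // skip URLs that were already queued
        seen.add(link);
        queue.push(link);
      }
    }
  }
  return stored;
}

console.log(crawl('http://site.test/'));
```

Keeping the "seen" set separate from the queue means a URL is never queued twice, even if many pages link to it.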

Breadth-first (BFS) or depth-first (DFS) strategy

As mentioned above, after crawling a web page the crawler picks a URL from the waiting queue to crawl next. How should it choose? Should it pick a URL found in the page it has just crawled, or continue with a URL at the same level as the current one? Same-level URLs here are URLs that come from the same web page. This choice is what distinguishes the crawling strategies.

[Figure: relationship diagram between web pages]

Breadth First Strategy (BFS)

The breadth-first strategy first crawls all the URLs found in the current web page, and only then crawls the URLs found in those pages. This is BFS. If the relationship diagram above represents the links between web pages, the BFS crawling order is: (A->(B,D,F,G)->(C,E));

Depth First Strategy (DFS)

The depth-first strategy crawls a web page and then keeps following the URLs parsed from that page until that branch is exhausted, before moving on to the next one:
(A->B->C->D->E->F->G)
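To make the difference concrete, here is a small sketch of both strategies. The graph object is an assumption: the original diagram is not available, so the links are reconstructed from the traversal orders given in the text (A links to B, D, F and G; B links to C; D links to E).

```javascript
// Assumed link graph, reconstructed from the traversal orders in the text.
const graph = {
  A: ['B', 'D', 'F', 'G'],
  B: ['C'],
  C: [],
  D: ['E'],
  E: [],
  F: [],
  G: [],
};

// Breadth first: a FIFO queue, so all links of a page are visited before going deeper.
function bfs(graph, start) {
  const order = [];
  const seen = new Set([start]);
  const queue = [start];
  while (queue.length > 0) {
    const node = queue.shift();
    order.push(node);
    for (const next of graph[node]) {
      if (!seen.has(next)) {
        seen.add(next);
        queue.push(next);
      }
    }
  }
  return order;
}

// Depth first: follow each link as far as it goes before returning to its siblings.
function dfs(graph, node, seen = new Set(), order = []) {
  seen.add(node);
  order.push(node);
  for (const next of graph[node]) {
    if (!seen.has(next)) {
      dfs(graph, next, seen, order);
    }
  }
  return order;
}

console.log(bfs(graph, 'A').join('->'));
console.log(dfs(graph, 'A').join('->'));
```

The only real difference is the order in which pending URLs are taken: BFS takes the oldest one first, DFS the most recently discovered one.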

Downloading the page

Downloading a web page seems very simple: it is just like entering a link in the browser, which displays the page once the download is complete. In practice, of course, it is not that simple.

Simulated login

Some web pages only show their content after you log in. How does a crawler log in? The login process is really just a way of obtaining access credentials (cookie, token, ...).

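const request = require('request');

let cookie = '';           // cached cookie string from a successful login
const j = request.jar();   // cookie jar that collects the cookies set by the login response

// Log in once and resolve with the cookie string; later calls reuse the cached value.
async function login() {
    if (cookie) {
        return cookie;
    }
    return new Promise((resolve, reject) => {
        request.post({
            url: 'url',
            form: {
                m: 'username',
                p: 'password',
            },
            jar: j
        }, function(err, res, body) {
            if (err) {
                reject(err);
                return;
            }
            // Read back the cookies stored for the login URL.
            cookie = j.getCookieString('url');
            resolve(cookie);
        });
    });
}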
This is a simple example: log in to obtain the cookie, then send the cookie along with every request.
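Sending the cookie along with each request can be as simple as setting the Cookie header. The helper below is a hypothetical sketch: the name withCookie, the example URL and the cookie value are made up for illustration and are not part of the original code.

```javascript
// Return a copy of the request options with the login cookie attached.
// The original options object is left untouched.
function withCookie(options, cookie) {
  return {
    ...options,
    headers: { ...(options.headers || {}), Cookie: cookie },
  };
}

// Hypothetical URL and cookie value, for illustration only.
const options = withCookie(
  { url: 'https://example.cn/profile', headers: { Accept: 'text/html' } },
  'sid=abc123'
);
console.log(options.headers);
```

The resulting object can be passed to whichever HTTP client you use, with the cookie string coming from the login step above.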

Get web content

Some web content is rendered on the server side: there is no CGI (data interface) to request, so the content can only be parsed out of the HTML. For other websites, getting the content is not that simple. On sites like LinkedIn, downloading the page is not enough; the page has to be executed in a browser before the final HTML structure exists. So how do we solve this? Browser execution was just mentioned, but is there a programmable browser? There is: Puppeteer, an open-source Node.js library from the Google Chrome team that controls headless Chrome. With a headless browser you can simulate a user's visit, obtain the final page content, and crawl it.
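As a rough sketch of this idea, the function below loads a page in headless Chrome and returns the HTML after the page's scripts have run. The function name fetchRenderedHtml and the choice of the 'networkidle2' wait condition are assumptions made for this example; a real site may need a different wait condition.

```javascript
// Sketch: load a page in headless Chrome and return the HTML after scripts have run.
async function fetchRenderedHtml(url) {
  // Required here rather than at the top so this file loads even where
  // puppeteer is not installed (a choice made only for this sketch).
  const puppeteer = require('puppeteer');
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    // 'networkidle2' waits until network activity has mostly stopped.
    await page.goto(url, { waitUntil: 'networkidle2' });
    // Serialized DOM of the page as it stands after rendering.
    return await page.content();
  } finally {
    // Always close the browser, even if navigation fails.
    await browser.close();
  }
}
```

It could be called as fetchRenderedHtml('https://example.cn/').then(html => console.log(html.length)), after which the returned HTML can be parsed like any server-rendered page.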

Use puppeteer to simulate login

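const puppeteer = require('puppeteer');

let page; // shared with the crawl functions below

// Log in through a headless browser and return the logged-in page.
async function login(username, password) {
    const browser = await puppeteer.launch();
    page = await browser.newPage();
    await page.setViewport({
        width: 1400,
        height: 1000
    });
    await page.goto('https://example.cn/login');
    console.log(page.url());
    // page.type(selector, text, options) focuses the element and types into it.
    await page.type('input[type=text]', username, { delay: 100 });
    await page.type('input[type=password]', password, { delay: 100 });
    // Start waiting for the navigation before clicking, so it is not missed.
    await Promise.all([
        page.waitForNavigation(),
        page.$eval('input[type=submit]', el => el.click()),
    ]);
    return page;
}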
After executing login(), you can read the content from the HTML just as if you had logged in through the browser. Of course, you can also request the CGI (data interface) directly:

// `page` is the logged-in page from login(); `cinfo` (company id and encoded name) is defined elsewhere.
async function crawlData(index, data) {
    let dataUrl = `https://example.cn/company/contacts?count=20&page=${index}&query=&dist=0&cid=${cinfo.cid}&company=${cinfo.encodename}&forcomp=1&searchTokens=&highlight=false&school=&me=&webcname=&webcid=&jsononly=1`;
    await page.goto(dataUrl);
    // ...
}
On some websites, cookie handling is a concern on every crawl. You can use a headless browser for these too: it keeps track of cookies itself, so you do not have to worry about them each time you crawl.

Closing remarks

Of course, there is more to crawlers than this. You also need to analyze the website and find a suitable crawling strategy. As for puppeteer, it is not only useful for crawlers: because it is a programmable headless browser, it can also be used for automated testing and so on.
