
Example of a Simple Web Crawling Function Implemented in Node.js

May 16, 2016, 04:28 PM
Tags: node.js, web scraping

Web scraping is a well-known technique by now, but plenty of complexity remains: a simple crawler still struggles with modern websites built on AJAX polling, XMLHttpRequest, WebSockets, Flash sockets, and similar technologies.

Take our basic needs on the Hubdoc project as an example. There we scrape bill amounts, due dates, account numbers and, most importantly, PDFs of recent bills from the websites of banks, utilities, and credit card companies. For this project I started with a very simple solution (setting aside, for the moment, the expensive commercial products we were evaluating): a simple crawler project I had done before in Perl at MessageLabs/Symantec. The results were disastrous: spammers build websites that are far simpler than those of banks and utility companies.

So how do we solve this problem? We started with the excellent request library maintained by mikeal. Make the request in a browser, check in the Network panel which request headers are sent, and then copy those headers into the code. The process is simple: trace the flow from logging in to downloading the PDF, then simulate every request along the way. To make this kind of work easier, and to let web developers write crawlers more sanely, I exposed the extracted HTML through a jQuery-style interface (using the lightweight cheerio library), which makes such tasks easy and makes it simple to select page elements with CSS selectors. The whole process is wrapped into a framework that also does extra work, such as fetching credentials from the database, loading individual robots, and communicating with the UI through socket.io.
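To make this concrete, here is a minimal sketch of the request-plus-cheerio pattern described above; the URL, headers, and selectors are placeholders, not Hubdoc's actual values:

var request = require('request');
var cheerio = require('cheerio');

request.get({
    url: 'https://example-bank.com/statements',
    headers: {
        // Copied from the browser's Network panel so the server sees a "real" browser
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36',
        'Accept': 'text/html,application/xhtml+xml'
    }
}, function (err, res, body) {
    if (err) throw err;
    var $ = cheerio.load(body);             // jQuery-style API over the raw HTML
    var amount = $('.bill-amount').text();  // CSS selectors pick out the fields
    var dueDate = $('#due-date').text();
    console.log(amount, dueDate);
});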

This worked for some websites, but the obstacle was the JavaScript these companies run on their own sites, not my Node.js code. They have layered complexity on top of legacy issues, making it very hard to figure out exactly what has to happen to reach the login endpoint. For some sites I tried for several days with the request() library alone, in vain.

After nearly going mad, I discovered node-phantomjs, a library that lets me control the PhantomJS headless WebKit browser from Node (translator's note: "headless" means the page is rendered in the background, without a display device). This seemed like a simple solution, but PhantomJS has some unavoidable problems that still needed solving:

1. PhantomJS can only tell you whether the page has loaded; you cannot tell whether a redirect happened along the way via JavaScript or meta tags, especially when JavaScript uses setTimeout() to delay the call.

2. PhantomJS gives you a pageLoadStarted hook that lets you handle the issue above, but it only works if you track the number of pages still expected to load, decrement that count as each page finishes, and handle possible timeouts (since a load does not always complete), so that your callback fires when the count reaches 0. This approach works, but it always feels a bit like a hack.
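As an illustration, here is a minimal sketch of that counting approach, written as a script run inside PhantomJS itself using its onLoadStarted/onLoadFinished callbacks; the expected load count and the timeout value are illustrative, not from the article:

var page = require('webpage').create();
var expectedLoads = 3;      // known number of loads in this login flow
var timer = null;

function done() {
    clearTimeout(timer);
    console.log('all expected pages loaded');
    phantom.exit();
}

page.onLoadStarted = function () {
    clearTimeout(timer);    // a new load began; stop the stall timer
};

page.onLoadFinished = function () {
    expectedLoads -= 1;
    if (expectedLoads <= 0) { done(); return; }
    // If no further load starts within 5s, assume the flow has stalled out
    timer = setTimeout(done, 5000);
};

page.open('https://example-bank.com/login');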

3. PhantomJS needs a completely separate process for each page it scrapes, because otherwise the cookies of different pages cannot be kept apart. With a single phantomjs process, the session from one logged-in page leaks into the pages of another.
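One way to get that isolation, sketched here with a hypothetical scrape.js worker script, is to spawn a fresh phantomjs process per job from Node:

var spawn = require('child_process').spawn;

function scrapeInOwnProcess(url, callback) {
    // A brand-new phantomjs process means a brand-new, empty cookie store
    var phantom = spawn('phantomjs', ['scrape.js', url]);
    phantom.stdout.on('data', function (chunk) {
        process.stdout.write(chunk);    // relay the robot's output
    });
    phantom.on('exit', function (code) {
        callback(code === 0 ? null : new Error('phantomjs exited with ' + code));
    });
}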

4. PhantomJS cannot download resources; it can only save a page as PNG or PDF. That is useful, but it means we have to fall back on request() to download the PDF.

5. Because of the above, I had to find a way to hand the cookies from PhantomJS's session over to request()'s session. Just pass over the document.cookie string, parse it, and inject it into request()'s cookie jar.
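A minimal sketch of that hand-off, assuming cookieString was read out of the page via page.evaluate returning document.cookie:

var request = require('request');

function jarFromDocumentCookie(cookieString, url) {
    var jar = request.jar();
    // document.cookie looks like "SESSIONID=abc123; TOKEN=xyz"
    cookieString.split('; ').forEach(function (pair) {
        jar.setCookie(request.cookie(pair), url);
    });
    return jar;
}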

6. Injecting variables into the browser session is not easy. To do it, I need to build up a string that creates a JavaScript function:


Robot.prototype.add_page_data = function (page, name, data) {
    // Build the function body as a string: declare `name` inside the page
    // and mirror it onto window so the page's own scripts can reach it
    page.evaluate(
        "function () { var " + name + " = window." + name + " = " + JSON.stringify(data) + "; }"
    );
};
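A hypothetical call, with the variable name and payload made up for illustration, would make window.credentials visible to the page's own scripts:

robot.add_page_data(page, 'credentials', { username: 'alice', account: '12345' });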

7. Some sites are littered with calls like console.log(), so it needs to be defined and its output routed where we want it. To accomplish that, I do this:

if (!console.log) {
    // Borrow a working console from a fresh iframe's window
    var iframe = document.createElement("iframe");
    document.body.appendChild(iframe);
    console = window.frames[0].console;
}


8. It is not easy to tell the browser that I clicked an <a> tag. To make that work, I added the following code:

var clickElement = window.clickElement = function (id) {
    var a = document.getElementById(id);
    // Synthesize a real mouse event so the page's own handlers fire
    var e = document.createEvent("MouseEvents");
    e.initMouseEvent("click", true, true, window, 0, 0, 0, 0, 0, false, false, false, false, 0, null);
    a.dispatchEvent(e);
};
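From the Node side, triggering the click might then look like this (the element id is hypothetical):

page.evaluate("function () { clickElement('download-pdf-link'); }");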

9. I also needed to cap the maximum number of concurrent browser sessions to make sure we would not blow up the server. That said, this cap is still far higher than what the expensive commercial solutions can provide.
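One simple way to enforce such a cap, sketched here with the async library's queue and the scrapeInOwnProcess helper from earlier (the concurrency value is illustrative):

var async = require('async');
var MAX_SESSIONS = 4;

var queue = async.queue(function (job, done) {
    scrapeInOwnProcess(job.url, done);  // one phantomjs process per job, as above
}, MAX_SESSIONS);

queue.push({ url: 'https://example-bank.com/login' });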

With all of that done, I had a decent PhantomJS + request crawler solution. You must log in with PhantomJS before falling back to request(); request() then uses the cookies set in PhantomJS to authenticate the logged-in session. This is a huge win, because we can use request()'s streams to download the PDF files.
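Putting the pieces together, the final download step might look like this sketch, reusing the hypothetical jarFromDocumentCookie helper from earlier; the URLs and filename are placeholders:

var fs = require('fs');
var request = require('request');

// cookieString was extracted from PhantomJS after the login succeeded
var jar = jarFromDocumentCookie(cookieString, 'https://example-bank.com');
request.get({ url: 'https://example-bank.com/bills/latest.pdf', jar: jar })
    .pipe(fs.createWriteStream('latest-bill.pdf'));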

The whole plan is to make it relatively easy for web developers who understand jQuery and CSS selectors to build crawlers for different websites. I have not yet proven that this idea is feasible, but I believe it soon will be.
