


Example of a simple web crawling function implemented in Node.js
Nowadays web crawling is a well-known technique, but there is still plenty of complexity in it. A simple crawler still struggles with modern websites built on technologies such as Ajax polling, XMLHttpRequest, WebSockets, Flash sockets, and so on.
Take our basic needs on the Hubdoc project as an example. In this project we scrape bill amounts, due dates, account numbers and, most importantly, PDFs of recent bills from the websites of banks, utilities, and credit card companies. For this project I started with a very simple solution (holding off on the expensive commercial products we were evaluating), building on a simple crawler I had previously written in Perl at MessageLabs/Symantec. The results were disastrous: spammers build websites that are far simpler than those of banks and utility companies.
So how do we solve this problem? We started with Mikeal's excellent request library: make the request in the browser, check in the Network panel which request headers were sent, and copy those headers into the code. The process is simple: trace the flow from logging in to downloading the PDF, then simulate every request along the way. To make this kind of work easier and let web developers write crawlers more sanely, I exposed the HTML results through a jQuery-style interface (using the lightweight cheerio library), which makes tasks like this easy and makes it straightforward to select page elements with CSS selectors. The whole process is wrapped in a framework that also does extra work, such as fetching credentials from the database, loading individual robots, and communicating with the UI via socket.io.
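As a rough illustration of that approach (a minimal sketch, not Hubdoc's actual framework; the URL, headers, and selectors are placeholders), fetching a page with request and querying it with cheerio's jQuery-style CSS selectors looks roughly like this:

var request = require("request");
var cheerio = require("cheerio");

request({
    url: "https://example.com/account/bills",    // placeholder page
    headers: { "User-Agent": "Mozilla/5.0" },    // headers copied from the browser's Network window
    jar: true                                    // keep session cookies between requests
}, function (err, res, body) {
    if (err) { return console.error(err); }

    // Load the HTML into cheerio and query it with CSS selectors, jQuery-style.
    var $ = cheerio.load(body);
    $("table.bills tr").each(function () {
        var amount = $(this).find("td.amount").text().trim();
        var dueDate = $(this).find("td.due-date").text().trim();
        console.log(amount, dueDate);
    });
});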
This worked for some sites, but the problem was the JavaScript these companies put on their sites, not my Node.js code. They had layered legacy on top of complexity, which made it very hard to figure out what had to happen to get to the point where the login information was available. I struggled for several days trying to crack some sites with the request() library alone, but in vain.
After nearly going crazy, I discovered node-phantomjs, a library that lets me control the PhantomJS headless WebKit browser from Node ("headless" here means the page is rendered in the background, without a display device). This seemed like a simple solution, but PhantomJS still has some unavoidable problems that need to be solved:
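To give a sense of the shape of the code, controlling PhantomJS from Node looks roughly like the sketch below; the module name and callback signatures differ between the various PhantomJS bridges, so treat this as an approximation of the older phantomjs-node style API rather than a definitive example:

var phantom = require("phantom");

phantom.create(function (ph) {
    ph.createPage(function (page) {
        // Open a (placeholder) login page in the headless browser.
        page.open("https://example.com/login", function (status) {
            console.log("page opened:", status);
            // Run code inside the page and pass the result back to Node.
            page.evaluate(function () { return document.title; }, function (title) {
                console.log("title:", title);
                ph.exit();
            });
        });
    });
});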
1. PhantomJS can only tell you whether the page has loaded; it cannot tell you whether a redirect happened along the way via JavaScript or meta tags, especially when JavaScript delays the redirect with setTimeout().
2. PhantomJS gives you a pageLoadStarted hook that lets you deal with the issue above, but it only works if you keep a count of the pages being loaded, decrement it as each page finishes, handle possible timeouts (since page loads don't always complete), and call your callback once the count reaches 0. This works, but it always feels a bit like a hack. A rough sketch of the counting trick follows below.
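Here is a minimal sketch of that counting approach as a plain PhantomJS script; onLoadStarted and onLoadFinished are PhantomJS's own page hooks, while the settle timeout and URL are arbitrary assumptions:

var page = require("webpage").create();
var pagesPending = 0;   // loads (including redirects) currently in flight
var timer = null;

page.onLoadStarted = function () {
    pagesPending++;     // a navigation or redirect just began
};

page.onLoadFinished = function () {
    pagesPending--;
    clearTimeout(timer);
    // Wait a moment in case a JS or meta redirect kicks off another load.
    timer = setTimeout(function () {
        if (pagesPending <= 0) {
            onReallyLoaded();
        }
    }, 2000);           // arbitrary settle timeout
};

function onReallyLoaded() {
    console.log("Finished loading: " + page.url);
    phantom.exit();
}

page.open("https://example.com/login");   // placeholder URL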
3. PhantomJS needs a completely separate process for each page it crawls, because otherwise cookies cannot be isolated between pages. If you use the same PhantomJS process, the session from one logged-in page bleeds into another. One way to do this is sketched below.
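A minimal sketch of one-PhantomJS-process-per-site using Node's child_process module (robot.js and the site argument are hypothetical):

var spawn = require("child_process").spawn;

// One PhantomJS process per site means cookies and sessions never leak
// between crawls.
function crawlSite(site, done) {
    var phantom = spawn("phantomjs", ["robot.js", site]);

    phantom.stdout.on("data", function (chunk) {
        process.stdout.write("[" + site + "] " + chunk);
    });

    phantom.on("exit", function (code) {
        done(code === 0 ? null : new Error("phantomjs exited with code " + code));
    });
}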
4. You cannot download resources with PhantomJS - you can only render the page as a PNG or PDF (as in the snippet below). That is useful, but it means we have to fall back on request() to download the actual PDF.
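For reference, this is roughly what PhantomJS's built-in capture looks like inside a PhantomJS script; page.render() produces a picture of the page, not the underlying file (the URL and filename are placeholders):

var page = require("webpage").create();

page.open("https://example.com/account/bill", function (status) {
    if (status === "success") {
        // Renders what the browser sees; there is no API to fetch the raw resource.
        page.render("bill.png");   // or "bill.pdf"
    }
    phantom.exit();
});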
5. Because of the above, I had to find a way to pass the cookies from the PhantomJS session into request()'s session: take the document.cookie string, parse it, and inject it into the request() cookie jar, roughly as sketched below.
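A minimal sketch of that hand-off, assuming documentCookie is the raw document.cookie string already pulled out of the PhantomJS session (the cookie values and URLs are placeholders):

var fs = require("fs");
var request = require("request");

// e.g. documentCookie = "SESSIONID=abc123; auth_token=xyz789"
function buildJarFromPhantomCookies(documentCookie, siteUrl) {
    var jar = request.jar();
    documentCookie.split(";").forEach(function (pair) {
        var cookie = pair.trim();
        if (cookie) {
            jar.setCookie(request.cookie(cookie), siteUrl);
        }
    });
    return jar;
}

// The authenticated session can then be reused to stream the PDF down.
var jar = buildJarFromPhantomCookies("SESSIONID=abc123; auth_token=xyz789", "https://example.com");
request({ url: "https://example.com/bills/latest.pdf", jar: jar })
    .pipe(fs.createWriteStream("latest-bill.pdf"));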
6. Injecting variables into the browser session is not easy. To do it, I have to build up a string that defines a JavaScript function:
Robot.prototype.add_page_data = function (page, name, data) {
    // Build the function source as a string so the injected value ends up as
    // a global (window[name]) inside the PhantomJS page.
    page.evaluate(
        "function () { var " + name + " = window." + name + " = " + JSON.stringify(data) + "}"
    );
};
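Hypothetical usage (the robot instance, variable name, and data are made up): after this call, scripts running inside the PhantomJS page can read window.hubdocCreds:

// Page scripts can now access window.hubdocCreds.account.
robot.add_page_data(page, "hubdocCreds", { username: "alice", account: "12345" });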
7. Some sites clobber things like console.log(), so it has to be redefined so output goes where we want it. To accomplish this, I do this:
if (!console.log) {
    // The page has clobbered console; borrow a fresh one from a new iframe.
    var iframe = document.createElement("iframe");
    document.body.appendChild(iframe);
    console = window.frames[0].console;
}
8. Telling the browser that I clicked an a tag is not easy either. To do that, I added the following code:
var clickElement = window.clickElement = function (id) {
    var a = document.getElementById(id);
    // Build and dispatch a synthetic mouse click on the element with this id.
    var e = document.createEvent("MouseEvents");
    e.initMouseEvent("click", true, true, window, 0, 0, 0, 0, 0,
                     false, false, false, false, 0, null);
    a.dispatchEvent(e);
};
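Hypothetical usage inside a PhantomJS script, once the helper has been injected into the page (the element id is a placeholder):

page.evaluate(function () {
    // Trigger the link that starts the PDF download.
    window.clickElement("download-pdf-link");
});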
9. I also have to cap the maximum number of concurrent browser sessions so that we don't blow up the server. That said, the cap is still far higher than what the expensive commercial solutions can offer. One way to enforce it is sketched below.
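A minimal sketch of capping concurrency with the async module's queue (MAX_BROWSERS and runRobot are hypothetical names; the real framework may handle this differently):

var async = require("async");

var MAX_BROWSERS = 5;   // arbitrary cap on simultaneous PhantomJS sessions

// Each task runs one robot in its own PhantomJS session; the queue ensures
// no more than MAX_BROWSERS run at the same time.
var crawlQueue = async.queue(function (task, done) {
    runRobot(task.site, task.credentials, done);   // hypothetical robot runner
}, MAX_BROWSERS);

// Enqueue work as it arrives; excess tasks wait their turn.
crawlQueue.push({ site: "examplebank.com", credentials: { /* ... */ } });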
With all of that done, I had a decent PhantomJS + request() crawler solution. You have to log in with PhantomJS before falling back to request(), which uses the cookies set in PhantomJS to authenticate the logged-in session. This is a huge win, because we can use request()'s streams to download the PDF files.
The whole point of the design is to make it relatively easy for web developers who know jQuery and CSS selectors to build crawlers for different websites. I have not yet proven that this idea works, but I believe it soon will.
