This article introduces the web request module of a Node.js crawler, nodegrass, and shares it for reference. The details are as follows:
Note: If you download the latest version of nodegrass, some methods have been updated and the examples in this article no longer apply. Please see the examples at the open-source repository for up-to-date usage.
1. Why write such a module?
The author wanted to write a crawler in Node.js. The methods the official Node.js API provides for requesting remote resources are quite simple; see http://nodejs.org/api/http.html. Two methods are provided for HTTP requests: http.get(options, callback) and http.request(options, callback). As the names suggest, the get method is for GET requests, while the request method exposes more parameters, such as other request methods and the port of the target host. HTTPS requests work much like HTTP. The simplest example:
var https = require('https');

https.get('https://encrypted.google.com/', function (res) {
  console.log("statusCode: ", res.statusCode);
  console.log("headers: ", res.headers);

  res.on('data', function (d) {
    process.stdout.write(d);
  });
}).on('error', function (e) {
  console.error(e);
});
In the code above, all we really want is to request the remote host and get the response information: the status, the headers, and the body. The second parameter of the get method is a callback through which we obtain the response asynchronously. Inside that callback, the res object must then listen for 'data' events with yet another callback, and processing the chunks that callback receives often introduces still more callbacks, layer upon layer, until you lose track. Developers used to writing synchronous code find this style disorienting, and some excellent synchronization libraries have appeared at home and abroad to smooth it over, such as Lao Zhao's Wind.js, but that is beside the point here. What we ultimately want from get is simply the response; we do not care about the res.on('data', func) monitoring step and would rather not repeat it every time. That is why the nodegrass module introduced here was born.
2. Nodegrass requests resources like jQuery's $.get(url, func)
The simplest example:
var nodegrass = require('nodegrass');

nodegrass.get("http://www.baidu.com", function (data, status, headers) {
  console.log(status);
  console.log(headers);
  console.log(data);
}, 'gbk').on('error', function (e) {
  console.log("Got error: " + e.message);
});
At first glance it looks no different from the official get, and indeed it is almost the same, except that the extra layer of res.on('data', func) event-listening callbacks is gone, which, believe it or not, feels much more comfortable. The second parameter is again a callback; its data argument is the response body, status is the response status, and headers are the response headers. Once we have the response content, we can extract whatever information interests us from it; in this example it is simply printed to the console. The third parameter is the character encoding. Node.js does not currently support GBK, so nodegrass uses iconv-lite internally to handle it. If the page you request is GBK-encoded, as Baidu's is, just add this parameter.
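Once the callback hands us the body as a single string, ordinary string or regex work extracts whatever we need. Here is a hypothetical helper (not part of nodegrass) that pulls out a page title; the commented-out usage shows where it would sit inside the callback:

```javascript
// Hypothetical helper: extract the <title> text from an HTML string.
function extractTitle(html) {
  var m = /<title>([\s\S]*?)<\/title>/i.exec(html);
  return m ? m[1].trim() : null;
}

// Inside the nodegrass callback it would be used like this:
// ng.get("http://www.baidu.com", function (data, status, headers) {
//   console.log(extractTitle(data));
// }, 'gbk');

console.log(extractTitle('<html><title>Baidu</title></html>')); // prints "Baidu"
```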
So what about HTTPS requests? With the official API you have to require the https module separately, but since its get method works much like http's, nodegrass integrates the two. Look at the example:
var nodegrass = require('nodegrass');

nodegrass.get("https://github.com", function (data, status, headers) {
  console.log(status);
  console.log(headers);
  console.log(data);
}, 'utf8').on('error', function (e) {
  console.log("Got error: " + e.message);
});
nodegrass automatically identifies whether the request is http or https from the URL, so the URL must include the protocol: you cannot write just www.baidu.com/, it has to be http://www.baidu.com/.
For POST requests, nodegrass provides the post method. See the example:
var ng = require('nodegrass');

ng.post("https://api.weibo.com/oauth2/access_token", function (data, status, headers) {
  var accessToken = JSON.parse(data);
  var err = null;
  if (accessToken.error) {
    err = accessToken;
  }
  callback(err, accessToken); // callback, headers, options come from the surrounding code
}, headers, options, 'utf8');
The above is part of a Sina Weibo OAuth 2.0 flow that requests an accessToken, using nodegrass's post method to call the access_token API.
Compared with the get method, the post method takes two extra parameters: headers, the request headers, and options, the POST data. Both are object literals:
var headers = {
  'Content-Type': 'application/x-www-form-urlencoded',
  'Content-Length': data.length
};

var options = {
  client_id: 'id',
  client_secret: 'cs',
  grant_type: 'authorization_code',
  redirect_uri: 'your callback url',
  code: acode
};
3. Using nodegrass as a proxy server?
Look at the example:
var ng = require('nodegrass'),
    http = require('http'),
    url = require('url');

http.createServer(function (req, res) {
  var pathname = url.parse(req.url).pathname;
  if (pathname === '/') {
    ng.get('http://www.cnblogs.com/', function (data) {
      res.writeHeader(200, {'Content-Type': 'text/html;charset=utf-8'});
      res.write(data + "\n");
      res.end();
    }, 'utf8');
  }
}).listen(8088);

console.log('server listening 8088...');
It's that simple. A real proxy server is of course much more complicated and this hardly counts as one, but at least if you visit port 8088 locally, you will see the cnblogs home page.
The open source address of nodegrass: https://github.com/scottkiss/nodegrass
The above is what I have compiled; I hope it is helpful.