
The whole process of making a crawler with Node.js

WBOY
Release: 2016-05-16 16:25:03

Today we'll work through alsotang's crawler tutorial and follow along by writing a simple crawler for CNode.

Create the project crawler-demo
We first create an Express project and delete everything in app.js, because for now we do not need to display any content on the Web. Alternatively, we can run npm install express in an empty folder and use only the Express functions we need.

Target website analysis
The picture shows part of the div markup on the CNode homepage. We use its ids and classes to locate the information we need.

Use superagent to obtain source data

superagent is an HTTP library with an ajax-style API, and its usage is similar to jQuery's. We send a GET request with it and print the result in the callback function.

The code is as follows:

var express = require('express');
var url = require('url'); // Parses and resolves URLs
// Don't forget to npm install these three external dependencies
var superagent = require('superagent');
var cheerio = require('cheerio');
var eventproxy = require('eventproxy');

var targetUrl = 'https://cnodejs.org/';
superagent.get(targetUrl)
    .end(function (err, res) {
        console.log(res);
    });

The res result is an object describing the response from the target URL. The page content itself is in its text property, which is a string.

Use cheerio to parse

cheerio works like a server-side jQuery. We first load the HTML with its .load() function, and then filter elements with CSS selectors.

The code is as follows:

var $ = cheerio.load(res.text);
// Filter data through a CSS selector
$('#topic_list .topic_title').each(function (idx, element) {
    console.log(element);
});

The selection result is an object. Calling .each(function (index, element)) on it traverses the matches one by one, and each element passed to the callback is an HTML DOM element.

After wrapping an element with var $element = $(element), console.log($element.attr('title')); outputs a title such as 广州 2014年12月06日 NodeParty 之 UC 场
and console.log($element.attr('href')); outputs a relative URL such as /topic/545c395becbcb78265856eb2. We then use Node.js's url.resolve() function to turn it into a complete URL.

The code is as follows:

superagent.get(targetUrl)
    .end(function (err, res) {
        if (err) {
            return console.error(err);
        }
        var topicUrls = [];
        var $ = cheerio.load(res.text);
        // Get all links on the homepage
        $('#topic_list .topic_title').each(function (idx, element) {
            var $element = $(element);
            var href = url.resolve(targetUrl, $element.attr('href'));
            console.log(href);
            // Collect the links for the next step
            topicUrls.push(href);
        });
    });
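As a small illustration of the URL completion step, the sketch below uses only Node's built-in url module to turn a relative topic link into an absolute one. The base URL and the relative path are the ones quoted in this article; the variable names are chosen just for the example. url.resolve() belongs to Node's legacy URL API, so the WHATWG URL class is shown next to it; for this input both should give the same string.

```javascript
var url = require('url');

var baseUrl = 'https://cnodejs.org/';
var relativeHref = '/topic/545c395becbcb78265856eb2';

// Legacy API, as used in this article
var resolved = url.resolve(baseUrl, relativeHref);

// WHATWG URL API: the second argument is the base
var viaUrlClass = new URL(relativeHref, baseUrl).href;

console.log(resolved);
console.log(viaUrlClass);
```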

Use eventproxy to concurrently crawl the content of each topic
The tutorial first shows two other approaches: deeply nested callbacks (serial) and a manual counter. eventproxy instead takes an event-based (parallel) approach: once every page has been fetched, it receives the last event message and automatically calls the handler function for you.
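To make the idea concrete, here is a rough sketch of the "call me after N events" pattern built on Node's own events module. The after helper below is hypothetical and written only for this illustration; it is not eventproxy's implementation, and the topic paths are made up. The emits are synchronous here, whereas in a real crawler each one would happen inside a request callback.

```javascript
var EventEmitter = require('events');

// Hypothetical helper: call done(results) once eventName has fired `times` times.
function after(emitter, eventName, times, done) {
    var results = [];
    emitter.on(eventName, function onEvent(data) {
        results.push(data);
        if (results.length === times) {
            // Stop listening once the expected number of events has arrived
            emitter.removeListener(eventName, onEvent);
            done(results);
        }
    });
}

var emitter = new EventEmitter();
var collected = null;

after(emitter, 'topic_html', 3, function (pairs) {
    collected = pairs;
    console.log('all pages received:', pairs.length);
});

// Stand-in for three finished requests, each emitting a [url, html] pair
['/topic/a', '/topic/b', '/topic/c'].forEach(function (topicUrl) {
    emitter.emit('topic_html', [topicUrl, '<html>' + topicUrl + '</html>']);
});
```

The handler runs only once, after the third emit, which is the behaviour the eventproxy code relies on.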

The code is as follows:

// Step 1: Get an eventproxy instance
var ep = new eventproxy();
// Step 2: Define the callback for the event being listened to.
// after() listens for the same event repeatedly.
// params: eventname (String) event name, times (Number) number of times to listen, callback callback function
ep.after('topic_html', topicUrls.length, function (topics) {
    // topics is an array containing the 40 pairs passed to
    // ep.emit('topic_html', pair) over 40 calls
    // .map
    topics = topics.map(function (topicPair) {
        // Use cheerio
        var topicUrl = topicPair[0];
        var topicHtml = topicPair[1];
        var $ = cheerio.load(topicHtml);
        return ({
            title: $('.topic_full_title').text().trim(),
            href: topicUrl,
            comment1: $('.reply_content').eq(0).text().trim()
        });
    });
    // Outcome
    console.log('outcome:');
    console.log(topics);
});
// Step 3: Emit the event messages
topicUrls.forEach(function (topicUrl) {
    superagent.get(topicUrl)
        .end(function (err, res) {
            console.log('fetch ' + topicUrl + ' successful');
            ep.emit('topic_html', [topicUrl, res.text]);
        });
});

The results are as follows

Extended Exercise (Challenge)

Get the commenter's username and points

In the source code of a topic page, find the class name of the element for the user who commented. The class name is reply_author. Logging the first match with console.log($('.reply_author').get(0)) shows that everything we need is there.

First, let’s crawl an article and get everything we need at once.

The code is as follows:

var userHref = url.resolve(targetUrl, $('.reply_author').get(0).attribs.href);
console.log(userHref);
console.log($('.reply_author').get(0).children[0].data);
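To show why .attribs.href and .children[0].data work, the sketch below uses a hand-written plain object in place of the node that .get(0) returns. It assumes the reply_author element is a link whose text is the username, which is what the two lookups above imply; the username someuser is made up for the example, and a real node carries more fields than shown here.

```javascript
var url = require('url');

// Hand-written stand-in for the node returned by $('.reply_author').get(0).
// The username 'someuser' is hypothetical.
var node = {
    type: 'tag',
    name: 'a',
    attribs: { class: 'reply_author', href: '/user/someuser' },
    children: [{ type: 'text', data: 'someuser' }]
};

// The attributes live in attribs; the link text is the data of the first child node
var userHref = url.resolve('https://cnodejs.org/', node.attribs.href);
var userName = node.children[0].data;

console.log(userHref);
console.log(userName);
```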

We can capture points information through https://cnodejs.org/user/username

The code is as follows:

$('.reply_author').each(function (idx, element) {
    var $element = $(element);
    console.log($element.attr('href'));
});

On the user information page $('.big').text().trim() is the points information.

Use cheerio’s function .get(0) to get the first element.

The code is as follows:

var userHref = url.resolve(targetUrl, $('.reply_author').get(0).attribs.href);
console.log(userHref);

This captures only a single article; the code still needs to be adapted to handle all 40.
