The whole process of making a crawler with NodeJS (continued)


Released: 2016-05-16 16:25:02


Continuing from the previous chapter, we need to modify the program so that it crawls 40 pages in succession. In other words, for each article we need to output the title, the link, the first comment, the user who posted that comment, and that user's forum points.

As shown in the figure, the value obtained by $('.reply_author').eq(0).text().trim(); is the correct user who made the first comment.

{<1>}

After eventproxy has collected the comment and the username, we need to follow the username to that user's profile page and continue by grabbing the user's points.

The code is as follows:

var $ = cheerio.load(topicHtml);
// Profile URL of the first commenter; this is the target URL for the next step
var userHref = 'https://cnodejs.org' + $('.reply_author').eq(0).attr('href');
userHref = url.resolve(tUrl, userHref);
// Strip newlines from the title
var title = $('.topic_full_title').text().trim().replace(/\n/g, "");
var href = topicUrl;
var comment1 = $('.reply_content').eq(0).text().trim();
var author1 = $('.reply_author').eq(0).text().trim();
// Pass the collected fields to the next concurrent crawl
ep.emit('user_html', [userHref, title, href, comment1, author1]);
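The URL-joining step can be checked on its own. The sketch below uses only Node's built-in url module. The topic ID and the user name are made-up placeholders, and it assumes the href on .reply_author is root-relative, as the string concatenation above implies. Under that assumption, url.resolve alone already produces the full profile URL, so the concatenation is harmless but optional.

```javascript
var url = require('url');

// Hypothetical values: a topic page URL and the href read from .reply_author
var topicUrl = 'https://cnodejs.org/topic/0123456789abcdef';
var relativeHref = '/user/example_user';

// url.resolve joins a base URL and a relative href the way a browser would
var resolved = url.resolve(topicUrl, relativeHref);

// An href that is already absolute passes through unchanged
var absolute = url.resolve(topicUrl, 'https://cnodejs.org' + relativeHref);

// The WHATWG URL class in newer Node versions gives the same result
var viaUrlClass = new URL(relativeHref, topicUrl).href;

console.log(resolved);
console.log(absolute);
console.log(viaUrlClass);
```

All three values should be the same profile URL, which is why the two-step version in the article still works.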

In this eventproxy step, we need to find the element on the user page that holds the score (class="big").

{<2>}

The class name is easy to find. Let's try to output the result first.

The code is as follows:

// Fragment from inside users.map(function (user) { ... }), where userUrl is user[0]
var outcome = superagent.get(userUrl)
    .end(function (err, res) {
        if (err) {
            return console.error(err);
        }
        var $ = cheerio.load(res.text);
        var score = $('.big').text().trim();
        console.log(user[1]);
        console.log(user[2]);
        console.log(user[3]);
        console.log(user[4]);
        console.log($('.big').text().trim());
        return ({
            title: user[1],
            href: user[2],
            comment1: user[3],
            author1: user[4],
            score1: score
        });
    });

Running the program gives the following output.

{<3>}

But here is the problem: the result prints correctly inside the .end() callback, yet outcome does not contain it. On closer inspection, outcome is a Request object. This was a careless mistake: the value returned from the .end() callback is not passed on to outcome, so the result has to be returned one level up, from the .map() callback that builds users.
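To see why outcome ends up as a request object, here is a minimal sketch that does not use superagent at all. fakeGet is a hypothetical stand-in that only mimics the behaviour described above: .end() schedules the callback for later and returns the request-like object itself.

```javascript
// Hypothetical stand-in for superagent.get(): returns a request-like object
// whose .end() runs the callback on a later tick and returns the object itself.
function fakeGet(userUrl) {
    var request = {
        url: userUrl,
        end: function (callback) {
            setImmediate(function () {
                // Whatever the callback returns is dropped right here
                callback(null, { text: '42' });
            });
            return request;
        }
    };
    return request;
}

var callbackResult = null;

var outcome = fakeGet('https://cnodejs.org/user/example_user')
    .end(function (err, res) {
        callbackResult = { score1: res.text };
        return callbackResult; // no caller receives this value
    });

// outcome is the request-like object, not the object built in the callback
console.log(outcome.url);
console.log(outcome.score1); // undefined
```

The only ways to get the data out of the callback are to act on it inside the callback or to hand it to something else from there, which is where the rest of the article is heading.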

The revised code, which moves the return statement out into the .map() callback, is as follows:

//find userDetails
ep.after('user_html', topicUrls.length, function (users) {
    users = users.map(function (user) {
        var userUrl = user[0];
        var score;
        superagent.get(userUrl)
            .end(function (err, res) {
                if (err) {
                    return console.error(err);
                }
                //console.log(res.text);
                var $ = cheerio.load(res.text);
                score = $('.big').text().trim();
            });
        return ({
            title: user[1],
            href: user[2],
            comment1: user[3],
            author1: user[4],
            score1: score
        });
    });
    console.log(users);
});

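Printing users shows that every field except score1 holds the correct value. Careful debugging shows that console.log() runs before the work started inside .map() has finished. More precisely, inside the .map() function the return statement executes before the .get() callback has assigned score. This is the asynchronous nature of callbacks: the outer synchronous code does not wait for a callback to finish its work.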

{<4>}
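The ordering problem can be reproduced without any network access. In the sketch below, fetchScore is a hypothetical function that delivers a score through a callback after a short timer, standing in for the HTTP request; the user names and the score are invented.

```javascript
var order = [];

// Hypothetical async lookup: delivers a score through a callback on a later tick
function fetchScore(userName, callback) {
    setTimeout(function () {
        callback(null, 100);
    }, 10);
}

var users = ['alice', 'bob'].map(function (name) {
    var score;
    fetchScore(name, function (err, value) {
        score = value;                 // runs later, after .map() is done
        order.push('callback ' + name);
    });
    order.push('return ' + name);      // runs first
    return { author1: name, score1: score };
});

// score1 is still undefined for every user at this point
console.log(users);

// Printed later: both 'return' entries come before any 'callback' entry
setTimeout(function () {
    console.log(order);
}, 50);
```

Each object is built while score is still unassigned, and assigning score later does not update the object that was already returned.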

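My approach is to have eventproxy emit one more layer of events, passing the required data along with each event to the receiving .after() handler. Only once all the events have been received are the passed parameters (the results) printed.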

The code is as follows:

score = $('.big').text().trim();
// newly added
ep.emit('got_score', [user[1], user[2], user[3], user[4], score]);
.....
ep.after('got_score', 10, function (users) {
    console.log(users);
});

{<6>}
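The idea behind waiting for a fixed number of events can be sketched with Node's built-in EventEmitter. This is not eventproxy's implementation. The after helper below is a hypothetical illustration of the counting pattern, and the payloads are invented.

```javascript
var EventEmitter = require('events').EventEmitter;

// Hypothetical helper: call `done` once `eventName` has fired `times` times,
// passing along every payload received up to that point.
function after(emitter, eventName, times, done) {
    var collected = [];
    emitter.on(eventName, function onEvent(payload) {
        collected.push(payload);
        if (collected.length === times) {
            emitter.removeListener(eventName, onEvent);
            done(collected);
        }
    });
}

var emitter = new EventEmitter();
var finalUsers = null;

after(emitter, 'got_score', 2, function (users) {
    finalUsers = users;
    console.log(users);
});

// In the crawler, each emit would happen inside a request callback
emitter.emit('got_score', ['title A', 'alice', '10']);
emitter.emit('got_score', ['title B', 'bob', '20']);
```

Because the handler fires only after the last expected event, every payload already carries its score by the time the list is printed.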

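That problem is solved, but the value of score1 seems a little too large. On a second look, there are two elements with class='big': the user's count of collected topics also uses this class. We can take just the first element with cheerio's .slice( start, [end] ) and .eq(), that is, change score to score = $('.big').slice(0).eq(0).text().trim();. (Strictly speaking, .slice(0) keeps every match; it is the .eq(0) call that picks the first one.) The correct result is shown in the figure.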

{<7>}
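The oversized score can be illustrated with plain arrays standing in for a cheerio selection. This sketch assumes two things about cheerio: that .text() returns the combined text of every matched element, and that .slice() follows the same start/end rules as Array.prototype.slice. The two values are invented.

```javascript
// Hypothetical text of the two elements matched by $('.big')
var matchedTexts = ['15', '3'];

// Reading the text of every match at once runs the values together
var combined = matchedTexts.join('');

// slice(0) keeps every element, so on its own it does not narrow the match
var sliceFromZero = matchedTexts.slice(0);

// slice(0, 1) keeps only the first element, and index 0 reads it directly
var firstOnly = matchedTexts.slice(0, 1);
var score = matchedTexts[0];

console.log(combined);
console.log(sliceFromZero.length);
console.log(firstOnly);
console.log(score);
```

On those assumptions, $('.big').eq(0) or $('.big').slice(0, 1) would each be enough to select the score element.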