This article introduces a crawler implemented in Node.js, walking through the steps and related techniques with a concrete example. It is shared for your reference; the details are as follows.
Node.js is a server-side runtime, so it can crawl websites just as Python can. Below, we will use Node.js to crawl the cnblogs (博客园) homepage and extract the information of the articles listed there.
Step 1: Create a folder for the crawler and run npm init inside it to generate a package.json.
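For example, assuming the folder is named crawl:

mkdir crawl
cd crawl
npm init -y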
Step 2: Create a crawl.js file. A simple script that fetches the entire page looks like this:
var http = require("http"); var url = "http://www.cnblogs.com"; http.get(url, function (res) { var html = ""; res.on("data", function (data) { html += data; }); res.on("end", function () { console.log(html); }); }).on("error", function () { console.log("获取课程结果错误!"); });
We require the http module and call its get method. When the script runs, the Node process sends a GET request for the page, and the response comes back through res. Binding the data event with on lets us receive the body chunk by chunk; on the end event we print the accumulated HTML to the console.
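One detail worth making explicit: the data event actually delivers raw Buffer chunks, and html += data converts each chunk to a string implicitly, which can garble a multi-byte character that happens to straddle two chunks. A minimal sketch of the safer pattern, calling res.setEncoding before reading (same URL as above):

var http = require("http");
var url = "http://www.cnblogs.com";

http.get(url, function (res) {
    // Decode incoming Buffers as UTF-8 so multi-byte characters
    // are never split across chunk boundaries
    res.setEncoding("utf8");
    var html = "";
    res.on("data", function (chunk) {
        html += chunk;
    });
    res.on("end", function () {
        console.log("Received " + html.length + " characters");
    });
});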
What gets printed is the raw HTML of the page. If we inspect the elements in the browser, we can confirm that the markup is indeed the same as what we fetched.
Out of all that markup, we only need the title and author information of each article.
Step 3: Install the cheerio module as follows. (I installed it from Git Bash; cmd kept having problems. cnpm is a client for the Chinese npm registry mirror; plain npm install works just as well.)

cnpm install cheerio --save-dev
Introducing this module lets us operate on the fetched HTML as a DOM, with a jQuery-like API.
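As a quick illustration of what cheerio gives us (a minimal standalone sketch with made-up markup, not part of the crawler itself):

var cheerio = require("cheerio");

// Load an HTML string, then query it with jQuery-style selectors
var $ = cheerio.load('<ul><li class="item">first</li><li class="item">second</li></ul>');

$(".item").each(function (index, element) {
    console.log($(element).text()); // prints "first", then "second"
});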
Step 4: Operate on the DOM and extract the useful information.
var http = require("http"); var cheerio = require("cheerio"); var url = "http://www.cnblogs.com"; function filterData(html) { var $ = cheerio.load(html); var items = $(".post_item"); var result = []; items.each(function (item) { var tit = $(this).find(".titlelnk").text(); var aut = $(this).find(".lightblue").text(); var one = { title: tit, author: aut }; result.push(one); }); return result; } function printInfos(allInfos) { allInfos.forEach(function (item) { console.log("文章题目 " + item["title"] + '\n' + "文章作者 " + item["author"] + '\n'+ '\n'); }); } http.get(url, function (res) { var html = ""; res.on("data", function (data) { html += data; }); res.on("end", function (data) { var allInfos = filterData(html); printInfos(allInfos); }); }).on("error", function () { console.log("爬取博客园首页失败") });
In short, the code above crawls the cnblogs homepage and extracts the title and author of each article.
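To try it yourself, run the script with node (assuming the file is named crawl.js as in Step 2):

node crawl.js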
The final console output lists the title and author of each article, consistent with the content of the cnblogs homepage.