This article shows how to use Node.js to crawl images from a web page, with code. It may be useful as a reference.
Install Node and download the dependencies
Build the service
Request the page we want to crawl and return JSON
Start by installing Node. You can download it from the official website, https://nodejs.org/zh-cn/. After installing, run the following to check the version:
node -v
After successful installation, the version number you installed will appear.
Next, use Node to print hello world. Create a new file named index.js and enter:
console.log('hello world')
Run this file
node index.js
and hello world will be printed to the console.
Create a new folder named node.
First you need to download the express dependency
npm install express
Then create a new file named demo.js with the directory structure as shown below:
In demo.js, import the downloaded express:
const express = require('express');
const app = express();

// Respond to GET /index with a fixed body
app.get('/index', function(req, res) {
    res.end('111');
});

// Start listening on port 8081 and print the bound address
var server = app.listen(8081, function() {
    var host = server.address().address;
    var port = server.address().port;
    console.log("App running at http://%s:%s", host, port);
});
Run node demo.js to start a simple service, as shown in the figure:
Request the page we want to crawl
npm install superagent
npm install superagent-charset
npm install cheerio
superagent is used to send requests. It is a lightweight, progressive Ajax API that is readable, easy to learn, and built on Node.js's native request API, which makes it a good fit for a Node.js environment. You could also use the built-in http module to send the request.
superagent-charset lets superagent decode a response with a specified character set, which prevents the crawled data from being garbled.
cheerio is a fast, flexible implementation of core jQuery designed specifically for the server. After installing the dependencies, you can import them:
var superagent = require('superagent');
var charset = require('superagent-charset');
charset(superagent);
const cheerio = require('cheerio');
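As noted above, the request could also be sent with Node's built-in modules instead of superagent. The following is only a rough sketch of that alternative, not the approach used in this article. fetchBuffer and decodeBody are hypothetical helper names, and decoding gb2312 with TextDecoder assumes a Node build with full ICU support:

```javascript
const http = require('http');
const https = require('https');

// Hypothetical helper: GET a URL and resolve with the raw body as a Buffer.
function fetchBuffer(url) {
  const client = url.startsWith('https:') ? https : http;
  return new Promise((resolve, reject) => {
    client
      .get(url, (res) => {
        if (res.statusCode !== 200) {
          res.resume(); // drain the response so the socket is released
          reject(new Error('Unexpected status code: ' + res.statusCode));
          return;
        }
        const chunks = [];
        res.on('data', (chunk) => chunks.push(chunk));
        res.on('end', () => resolve(Buffer.concat(chunks)));
      })
      .on('error', reject);
  });
}

// Hypothetical helper: decode raw bytes using a named charset.
// Assumption: this Node build ships full ICU, so 'gb2312' is accepted.
function decodeBody(buffer, charset) {
  return new TextDecoder(charset).decode(buffer);
}

// "中文" encoded as GB2312 (sample bytes, not crawled data).
const sample = Buffer.from([0xd6, 0xd0, 0xce, 0xc4]);
console.log(decodeBody(sample, 'gb2312')); // 中文

// Usage sketch (performs a real network request, so it is commented out):
// fetchBuffer('https://www.qqtn.com/tx/weixintx_1.html')
//   .then((buf) => console.log(decodeBody(buf, 'gb2312').length))
//   .catch((err) => console.error('ERR: ' + err.message));
```

Decoding the same bytes as UTF-8 would produce replacement characters, which is the garbling that superagent-charset is installed to avoid.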
After importing, request the target address, https://www.qqtn.com/tx/weixintx_1.html, as shown in the picture:
Declare the address variable:
const baseUrl = 'https://www.qqtn.com/'
With these in place, the request can be sent. The complete demo.js follows:
var superagent = require('superagent');
var charset = require('superagent-charset');
charset(superagent);
var express = require('express');
var baseUrl = 'https://www.qqtn.com/'; // any site URL can be used here
const cheerio = require('cheerio');
var app = express();

app.get('/index', function(req, res) {
    // Set the CORS response headers
    res.header('Access-Control-Allow-Origin', '*');
    res.header('Access-Control-Allow-Methods', 'PUT, GET, POST, DELETE, OPTIONS');
    // Both allowed request headers go in one call; a second call with the same name would overwrite the first
    res.header('Access-Control-Allow-Headers', 'X-Requested-With, Content-Type');
    // Category
    var type = req.query.type;
    // Page number
    var page = req.query.page;
    type = type || 'weixin';
    page = page || '1';
    var route = `tx/${type}tx_${page}.html`;
    // This page is encoded as gb2312, so use .charset('gb2312').
    // Most pages are utf-8, in which case .charset('utf-8') can be used directly.
    superagent.get(baseUrl + route)
        .charset('gb2312')
        .end(function(err, sres) {
            var items = [];
            if (err) {
                console.log('ERR: ' + err);
                // Send the error as a string so it always serializes cleanly
                res.json({ code: 400, msg: String(err), data: items });
                return;
            }
            // Parse the decoded HTML and collect every linked thumbnail in the list
            var $ = cheerio.load(sres.text);
            $('div.g-main-bg ul.g-gxlist-imgbox li a').each(function(idx, element) {
                var $element = $(element);
                var $subElement = $element.find('img');
                var thumbImgSrc = $subElement.attr('src');
                items.push({
                    title: $element.attr('title'),
                    href: $element.attr('href'),
                    thumbSrc: thumbImgSrc
                });
            });
            res.json({ code: 200, msg: "", data: items });
        });
});

var server = app.listen(8081, function() {
    var host = server.address().address;
    var port = server.address().port;
    console.log("App running at http://%s:%s", host, port);
});
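The route in demo.js is built directly from the type and page query parameters, falling back to weixin and 1. Here is a minimal sketch of the same construction as a standalone function. buildRoute is a hypothetical name, and the input validation is an added assumption (the original only applies the defaults):

```javascript
// Hypothetical helper (not part of the original demo.js): build the page
// path from the `type` and `page` query values.
function buildRoute(type, page) {
  // Assumption: category names consist of lowercase letters only.
  const safeType = /^[a-z]+$/.test(type || '') ? type : 'weixin';
  // Accept only positive integers as page numbers; otherwise use page 1.
  const pageNumber = Number.parseInt(page, 10);
  const safePage = Number.isInteger(pageNumber) && pageNumber > 0 ? pageNumber : 1;
  return `tx/${safeType}tx_${safePage}.html`;
}

console.log(buildRoute());              // tx/weixintx_1.html
console.log(buildRoute('weixin', '3')); // tx/weixintx_3.html
console.log(buildRoute('../etc', 'x')); // tx/weixintx_1.html
```

Rejecting unexpected values keeps a caller from steering the crawler to arbitrary paths on the target site through the query string.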
Running demo.js and requesting http://localhost:8081/index returns the crawled data as JSON, as shown in the figure.
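The href and thumbSrc values are taken straight from the page's attributes, so they may be relative or protocol-relative. Assuming that is the case (the source does not show the actual values), a small sketch using Node's built-in URL class could resolve them against baseUrl. toAbsolute and the sample item are hypothetical:

```javascript
const baseUrl = 'https://www.qqtn.com/';

// Hypothetical helper: resolve a possibly relative link against baseUrl.
// Returns null when the attribute is missing or cannot be parsed.
function toAbsolute(link) {
  if (!link) return null;
  try {
    return new URL(link, baseUrl).href;
  } catch (err) {
    return null;
  }
}

// Made-up item in the shape produced by demo.js (not real crawled data).
const item = {
  title: 'sample',
  href: '/tx/12345.html',
  thumbSrc: '//pic.example.com/a.jpg'
};

console.log(toAbsolute(item.href));     // https://www.qqtn.com/tx/12345.html
console.log(toAbsolute(item.thumbSrc)); // https://pic.example.com/a.jpg
console.log(toAbsolute(undefined));     // null
```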
A simple Node crawler is now complete.