This time I will bring you a detailed explanation of the steps for using the nodeJs crawler. What are the precautions when using the nodeJs crawler? Here are practical cases, let’s take a look.
Background
Recently I plan to review the nodeJs-related content I have seen before, and write a few crawlers to kill the boredom, and I discovered some during the crawling process Questions, record them for future reference.
Dependencies
The cheerio library that is widely available on the Internet is used to process the crawled content, superagent is used to process requests, and log4js is used to record logs.
Log configuration
Without further ado, let’s go directly to the code:
const log4js = require('log4js'); log4js.configure({ appenders: { cheese: { type: 'dateFile', filename: 'cheese.log', pattern: '-yyyy-MM-dd.log', // 包含模型 alwaysIncludePattern: true, maxLogSize: 1024, backups: 3 } }, categories: { default: { appenders: ['cheese'], level: 'info' } } }); const logger = log4js.getLogger('cheese'); logger.level = 'INFO'; module.exports = logger;
The above directly exports a logger object and directly calls the logger in the business file. Just use .info() and other functions to add log information, and logs will be generated on a daily basis. There is a lot of relevant information on the Internet.
Crawling content and processing
superagent.get(cityItemUrl).end((err, res) => { if (err) { return console.error(err); } const $ = cheerio.load(res.text); // 解析当前页面,获取当前页面的城市链接地址 const cityInfoEle = $('.newslist1 li a'); cityInfoEle.each((idx, element) => { const $element = $(element); const sceneURL = $element.attr('href'); // 页面地址 const sceneName = $element.attr('title'); // 城市名称 if (!sceneName) { return; } logger.info(`当前解析到的目的地是: ${sceneName}, 对应的地址为: ${sceneURL}`); getDesInfos(sceneURL, sceneName); // 获取城市详细信息 ep.after('getDirInfoComplete', cityInfoEle.length, (dirInfos) => { const content = JSON.parse(fs.readFileSync(path.join(dirname, './imgs.json'))); dirInfos.forEach((element) => { logger.info(`本条数据为:${JSON.stringify(element)}`); Object.assign(content, element); }); fs.writeFileSync(path.join(dirname, './imgs.json'), JSON.stringify(content)); }); }); });
Use superagent to request the page. After the request is successful, use cheerio to load the page content, and then use matching rules similar to Jquery to find the target resource. .
Multiple resources are loaded, use eventproxy to proxy events, process one resource and punish one event, and process the data after all events are triggered.
The above is the most basic crawler. Next are some areas that may cause problems or require special attention. . .
Read and write local files
Create folder
function mkdirSync(dirname) { if (fs.existsSync(dirname)) { return true; } if (mkdirSync(path.dirname(dirname))) { fs.mkdirSync(dirname); return true; } return false; }
Read and write files
const content = JSON.parse(fs.readFileSync(path.join(dirname, './dir.json'))); dirInfos.forEach((element) => { logger.info(`本条数据为:${JSON.stringify(element)}`); Object.assign(content, element); }); fs.writeFileSync(path.join(dirname, './dir.json'), JSON.stringify(content));
Batch download resources
Downloaded resources may include pictures, audio, etc.
Use Bagpipe to handle asynchronous concurrency. Refer to
const Bagpipe = require('bagpipe'); const bagpipe = new Bagpipe(10); bagpipe.push(downloadImage, url, dstpath, (err, data) => { if (err) { console.log(err); return; } console.log(`[${dstpath}]: ${data}`); });
to download resources and use stream to complete file writing.
function downloadImage(src, dest, callback) { request.head(src, (err, res, body) => { if (src && src.indexOf('http') > -1 || src.indexOf('https') > -1) { request(src).pipe(fs.createWriteStream(dest)).on('close', () => { callback(null, dest); }); } }); }
Encoding
Sometimes the web page content processed directly using cheerio.load is found to be encoded text after writing to the file. You can use
const $ = cheerio.load(buf, { decodeEntities: false });
to disable encoding,
ps: The encoding library and iconv-lite failed to convert utf-8 encoded characters into Chinese. It may be that you are not familiar with the API. You can pay attention to it later.
Finally, attach a regular pattern that matches all dom tags
const reg = /<.*?>/g;
I believe you have mastered the method after reading the case in this article. For more exciting information, please pay attention to other related articles on the PHP Chinese website!
Recommended reading:
Detailed explanation of how to use jQuery class name selector (.class)
js encapsulates ajax function function Detailed explanation of implementation steps
The above is the detailed content of Detailed explanation of steps to use nodeJs crawler. For more information, please follow other related articles on the PHP Chinese website!