How to use nodeJs crawler-JS Tutorial-php.cn

Home

Web Front-end

JS Tutorial

How to use nodeJs crawler

php中世界最好的语言

May 30, 2018 am 09:56 AM

javascript nodejs use

This time I will show you how to use the nodeJs crawler, and what are the precautions for using the nodeJs crawler. The following is a practical case, let's take a look.

Background

Recently I plan to review the nodeJs-related content I have seen before, and write a few crawlers to kill the boredom, and I discovered some during the crawling process Questions, record them for future reference.

Dependencies

The cheerio library that is widely available on the Internet is used to process the crawled content, superagent is used to process requests, and log4js is used to record logs.

Log configuration

Without further ado, let’s go directly to the code:

const log4js = require('log4js');
log4js.configure({
 appenders: {
  cheese: {
   type: 'dateFile',
   filename: 'cheese.log',
   pattern: '-yyyy-MM-dd.log',
   // 包含模型
   alwaysIncludePattern: true,
   maxLogSize: 1024,
   backups: 3 }
 },
 categories: { default: { appenders: ['cheese'], level: 'info' } }
});
const logger = log4js.getLogger('cheese');
logger.level = 'INFO';
module.exports = logger;

Copy after login

The above directly exports a logger object and directly calls the logger in the business file. Just use .info() and other functions to add log information, and logs will be generated on a daily basis. There is a lot of relevant information on the Internet.

Crawling content and processing

 superagent.get(cityItemUrl).end((err, res) => {
  if (err) {
   return console.error(err);
  }
  const $ = cheerio.load(res.text);
  // 解析当前页面,获取当前页面的城市链接地址
  const cityInfoEle = $('.newslist1 li a');
  cityInfoEle.each((idx, element) => {
   const $element = $(element);
   const sceneURL = $element.attr('href'); // 页面地址
   const sceneName = $element.attr('title'); // 城市名称
   if (!sceneName) {
    return;
   }
   logger.info(`当前解析到的目的地是: ${sceneName}, 对应的地址为: ${sceneURL}`);
   getDesInfos(sceneURL, sceneName); // 获取城市详细信息
   ep.after('getDirInfoComplete', cityInfoEle.length, (dirInfos) => {
    const content = JSON.parse(fs.readFileSync(path.join(dirname, './imgs.json')));
    dirInfos.forEach((element) => {
     logger.info(`本条数据为:${JSON.stringify(element)}`);
     Object.assign(content, element);
    });
    fs.writeFileSync(path.join(dirname, './imgs.json'), JSON.stringify(content));
   });
  });
 });

Copy after login

Use superagent to request the page. After the request is successful, use cheerio to load the page content, and then use matching rules similar to Jquery to find the target resource. .

Multiple resources are loaded, use eventproxy to proxy events, process one resource and punish one event, and process the data after all events are triggered.

The above is the most basic crawler. Next are some areas that may cause problems or require special attention. . .

Read and write local files

Create folder

function mkdirSync(dirname) {
 if (fs.existsSync(dirname)) {
  return true;
 }
 if (mkdirSync(path.dirname(dirname))) {
  fs.mkdirSync(dirname);
  return true;
 }
 return false;
}

Copy after login

Read and write files

   const content = JSON.parse(fs.readFileSync(path.join(dirname, './dir.json')));
   dirInfos.forEach((element) => {
    logger.info(`本条数据为:${JSON.stringify(element)}`);
    Object.assign(content, element);
   });
   fs.writeFileSync(path.join(dirname, './dir.json'), JSON.stringify(content));

Copy after login

Batch download resources

Downloaded resources may include pictures, audio, etc.

Use Bagpipe to handle asynchronous concurrency. Refer to

const Bagpipe = require('bagpipe');
const bagpipe = new Bagpipe(10);
  bagpipe.push(downloadImage, url, dstpath, (err, data) => {
   if (err) {
    console.log(err);
    return;
   }
   console.log(`[${dstpath}]: ${data}`);
  });

Copy after login

to download resources and use stream to complete file writing.

function downloadImage(src, dest, callback) {
 request.head(src, (err, res, body) => {
  if (src && src.indexOf('http') > -1 || src.indexOf('https') > -1) {
   request(src).pipe(fs.createWriteStream(dest)).on('close', () => {
    callback(null, dest);
   });
  }
 });
}

Copy after login

Encoding

Sometimes the web page content processed directly using cheerio.load is found to be encoded text after writing to the file. You can use

const $ = cheerio.load(buf, { decodeEntities: false });

Copy after login

to disable encoding,

ps: The encoding library and iconv-lite failed to convert utf-8 encoded characters into Chinese. It may be that you are not familiar with the API. You can pay attention to it later.

Finally, attach a regular pattern that matches all dom tags

const reg = /<.*?>/g;

Copy after login

I believe you have mastered the method after reading the case in this article. For more exciting information, please pay attention to other related articles on the PHP Chinese website!

Hot AI Tools

Undresser.AI Undress

AI-powered app for creating realistic nude photos

AI Clothes Remover

Online AI tool for removing clothes from photos.

Undress AI Tool

Undress images for free

Clothoff.io

AI clothes remover

Video Face Swap

Swap faces in any video effortlessly with our completely free AI face swap tool!

Hot Article

How to fix KB5055612 fails to install in Windows 10?

4 weeks ago By DDD

Roblox: Bubble Gum Simulator Infinity - How To Get And Use Royal Keys

4 weeks ago By 尊渡假赌尊渡假赌尊渡假赌

Roblox: Grow A Garden - Complete Mutation Guide

3 weeks ago By DDD

Nordhold: Fusion System, Explained

4 weeks ago By 尊渡假赌尊渡假赌尊渡假赌

Mandragora: Whispers Of The Witch Tree - How To Unlock The Grappling Hook

3 weeks ago By 尊渡假赌尊渡假赌尊渡假赌

Hot Tools

Notepad++7.3.1

Easy-to-use and free code editor

SublimeText3 Chinese version

Chinese version, very easy to use

Zend Studio 13.0.1

Powerful PHP integrated development environment

Dreamweaver CS6

Visual web development tools

SublimeText3 Mac version

God-level code editing software (SublimeText3)

Hot Topics

Java Tutorial

1672

CakePHP Tutorial

1428

Laravel Tutorial

1332

PHP Tutorial

1276

C# Tutorial

1256

Related knowledge

BTCC tutorial: How to bind and use MetaMask wallet on BTCC exchange? Apr 26, 2024 am 09:40 AM

MetaMask (also called Little Fox Wallet in Chinese) is a free and well-received encryption wallet software. Currently, BTCC supports binding to the MetaMask wallet. After binding, you can use the MetaMask wallet to quickly log in, store value, buy coins, etc., and you can also get 20 USDT trial bonus for the first time binding. In the BTCCMetaMask wallet tutorial, we will introduce in detail how to register and use MetaMask, and how to bind and use the Little Fox wallet in BTCC. What is MetaMask wallet? With over 30 million users, MetaMask Little Fox Wallet is one of the most popular cryptocurrency wallets today. It is free to use and can be installed on the network as an extension

Is nodejs a backend framework? Apr 21, 2024 am 05:09 AM

Node.js can be used as a backend framework as it offers features such as high performance, scalability, cross-platform support, rich ecosystem, and ease of development.

What is the difference between npm and npm.cmd files in the nodejs installation directory? Apr 21, 2024 am 05:18 AM

There are two npm-related files in the Node.js installation directory: npm and npm.cmd. The differences are as follows: different extensions: npm is an executable file, and npm.cmd is a command window shortcut. Windows users: npm.cmd can be used from the command prompt, npm can only be run from the command line. Compatibility: npm.cmd is specific to Windows systems, npm is available cross-platform. Usage recommendations: Windows users use npm.cmd, other operating systems use npm.

What are the global variables in nodejs Apr 21, 2024 am 04:54 AM

The following global variables exist in Node.js: Global object: global Core module: process, console, require Runtime environment variables: __dirname, __filename, __line, __column Constants: undefined, null, NaN, Infinity, -Infinity

How to connect nodejs to mysql database Apr 21, 2024 am 06:13 AM

To connect to a MySQL database, you need to follow these steps: Install the mysql2 driver. Use mysql2.createConnection() to create a connection object that contains the host address, port, username, password, and database name. Use connection.query() to perform queries. Finally use connection.end() to end the connection.

Is nodejs a back-end development language? Apr 21, 2024 am 05:09 AM

Yes, Node.js is a backend development language. It is used for back-end development, including handling server-side business logic, managing database connections, and providing APIs.

Can nodejs write front-end? Apr 21, 2024 am 05:00 AM

Yes, Node.js can be used for front-end development, and key advantages include high performance, rich ecosystem, and cross-platform compatibility. Considerations to consider are learning curve, tool support, and small community size.

Is there a big difference between nodejs and java? Apr 21, 2024 am 06:12 AM

The main differences between Node.js and Java are design and features: Event-driven vs. thread-driven: Node.js is event-driven and Java is thread-driven. Single-threaded vs. multi-threaded: Node.js uses a single-threaded event loop, and Java uses a multi-threaded architecture. Runtime environment: Node.js runs on the V8 JavaScript engine, while Java runs on the JVM. Syntax: Node.js uses JavaScript syntax, while Java uses Java syntax. Purpose: Node.js is suitable for I/O-intensive tasks, while Java is suitable for large enterprise applications.

See all articles