Home Web Front-end JS Tutorial Let's talk about how to use third-party open source libraries to implement website crawling function in node

Let's talk about how to use third-party open source libraries to implement website crawling function in node

Dec 17, 2021 pm 07:11 PM
node

This article will introduce to you how to easily implement the website crawling function with the help of third-party open source libraries in node. I hope it will be helpful to everyone!

Let's talk about how to use third-party open source libraries to implement website crawling function in node

nodejsImplementing website crawling function

Introduction to third-party libraries

  • request encapsulation of network requests

  • cheerio node version of jQuery

  • mkdirp creates multiple layers Folder directory

Implementation ideas

  • Through request Get the content of the specified url

  • ThroughcheerioFind the jump path in the page (remove duplicates)

  • ThroughmkdirpCreate directory

  • Create a file through fs and write the read content to

  • Get Repeat the above steps to the path that is not accessed

Code implementation

const fs = require("fs");
const path = require("path");
const request = require("request");
const cheerio = require("cheerio");
const mkdirp = require("mkdirp");
// 定义入口url
const homeUrl = "https://www.baidu.com";
// 定义set存储已经访问过的路径,避免重复访问
const set = new Set([homeUrl]);
function grab(url) {
  // 校验url规范性
  if (!url) return;
  // 去空格
  url = url.trim();
  // 自动补全url路径
  if (url.endsWith("/")) {
    url += "index.html";
  }
  const chunks = [];
  // url可能存在一些符号或者中文,可以通过encodeURI编码
  request(encodeURI(url))
    .on("error", (e) => {
      // 打印错误信息
      console.log(e);
    })
    .on("data", (chunk) => {
      // 接收响应内容
      chunks.push(chunk);
    })
    .on("end", () => {
      // 将相应内容转换成文本
      const html = Buffer.concat(chunks).toString();
      // 没有获取到内容
      if (!html) return;
      // 解析url
      let { host, origin, pathname } = new URL(url);
      pathname = decodeURI(pathname);
      // 通过cheerio解析html
      const $ = cheerio.load(html);
      // 将路径作为目录
      const dir = path.dirname(pathname);
      // 创建目录
      mkdirp.sync(path.join(__dirname, dir));
      // 往文件写入内容
      fs.writeFile(path.join(__dirname, pathname), html, "utf-8", (err) => {
        // 打印错误信息
        if (err) {
          console.log(err);
          return;
        }
        console.log(`[${url}]保存成功`);
      });
      // 获取到页面中所有a元素
      const aTags = $("a");
      Array.from(aTags).forEach((aTag) => {
        // 获取到a标签中的路径
        const href = $(aTag).attr("href");
        // 此处可以校验href的合法或者控制爬去的网站范围,比如必须都是某个域名下的
        // 排除空标签
        if (!href) return;
        // 排除锚点连接
        if (href.startsWith("#")) return;
        if (href.startsWith("mailto:")) return;
        // 如果不想要保存图片可以过滤掉
        // if (/\.(jpg|jpeg|png|gif|bit)$/.test(href)) return;
        // href必须是入口url域名
        let reg = new RegExp(`^https?:\/\/${host}`);
        if (/^https?:\/\//.test(href) && !reg.test(href)) return;
        // 可以根据情况增加更多逻辑
        let newUrl = "";
        if (/^https?:\/\//.test(href)) {
          // 处理绝对路径
          newUrl = href;
        } else {
          // 处理相对路径
          newUrl = origin + path.join(dir, href);
        }
        // 判断是否访问过
        if (set.has(newUrl)) return;
        if (newUrl.endsWith("/") && set.has(newUrl + "index.html")) return;
        if (newUrl.endsWith("/")) newUrl += "index.html";
        set.add(newUrl);
        grab(newUrl);
      });
    });
}
// 开始抓取
grab(homeUrl);
Copy after login

Summary

The simple web crawler is complete. You can try changing homeUrl to the website you want to crawl.

For more node-related knowledge, please visit: nodejs tutorial! !

The above is the detailed content of Let's talk about how to use third-party open source libraries to implement website crawling function in node. For more information, please follow other related articles on the PHP Chinese website!

Statement of this Website
The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn

Hot AI Tools

Undresser.AI Undress

Undresser.AI Undress

AI-powered app for creating realistic nude photos

AI Clothes Remover

AI Clothes Remover

Online AI tool for removing clothes from photos.

Undress AI Tool

Undress AI Tool

Undress images for free

Clothoff.io

Clothoff.io

AI clothes remover

AI Hentai Generator

AI Hentai Generator

Generate AI Hentai for free.

Hot Tools

Notepad++7.3.1

Notepad++7.3.1

Easy-to-use and free code editor

SublimeText3 Chinese version

SublimeText3 Chinese version

Chinese version, very easy to use

Zend Studio 13.0.1

Zend Studio 13.0.1

Powerful PHP integrated development environment

Dreamweaver CS6

Dreamweaver CS6

Visual web development tools

SublimeText3 Mac version

SublimeText3 Mac version

God-level code editing software (SublimeText3)

How to delete node in nvm How to delete node in nvm Dec 29, 2022 am 10:07 AM

How to delete node with nvm: 1. Download "nvm-setup.zip" and install it on the C drive; 2. Configure environment variables and check the version number through the "nvm -v" command; 3. Use the "nvm install" command Install node; 4. Delete the installed node through the "nvm uninstall" command.

How to use express to handle file upload in node project How to use express to handle file upload in node project Mar 28, 2023 pm 07:28 PM

How to handle file upload? The following article will introduce to you how to use express to handle file uploads in the node project. I hope it will be helpful to you!

How to do Docker mirroring of Node service? Detailed explanation of extreme optimization How to do Docker mirroring of Node service? Detailed explanation of extreme optimization Oct 19, 2022 pm 07:38 PM

During this period, I was developing a HTML dynamic service that is common to all categories of Tencent documents. In order to facilitate the generation and deployment of access to various categories, and to follow the trend of cloud migration, I considered using Docker to fix service content and manage product versions in a unified manner. . This article will share the optimization experience I accumulated in the process of serving Docker for your reference.

An in-depth analysis of Node's process management tool 'pm2” An in-depth analysis of Node's process management tool 'pm2” Apr 03, 2023 pm 06:02 PM

This article will share with you Node's process management tool "pm2", and talk about why pm2 is needed, how to install and use pm2, I hope it will be helpful to everyone!

Pi Node Teaching: What is a Pi Node? How to install and set up Pi Node? Pi Node Teaching: What is a Pi Node? How to install and set up Pi Node? Mar 05, 2025 pm 05:57 PM

Detailed explanation and installation guide for PiNetwork nodes This article will introduce the PiNetwork ecosystem in detail - Pi nodes, a key role in the PiNetwork ecosystem, and provide complete steps for installation and configuration. After the launch of the PiNetwork blockchain test network, Pi nodes have become an important part of many pioneers actively participating in the testing, preparing for the upcoming main network release. If you don’t know PiNetwork yet, please refer to what is Picoin? What is the price for listing? Pi usage, mining and security analysis. What is PiNetwork? The PiNetwork project started in 2019 and owns its exclusive cryptocurrency Pi Coin. The project aims to create a one that everyone can participate

Let's talk about how to use pkg to package Node.js projects into executable files. Let's talk about how to use pkg to package Node.js projects into executable files. Dec 02, 2022 pm 09:06 PM

How to package nodejs executable file with pkg? The following article will introduce to you how to use pkg to package a Node project into an executable file. I hope it will be helpful to you!

What to do if npm node gyp fails What to do if npm node gyp fails Dec 29, 2022 pm 02:42 PM

npm node gyp fails because "node-gyp.js" does not match the version of "Node.js". The solution is: 1. Clear the node cache through "npm cache clean -f"; 2. Through "npm install -g n" Install the n module; 3. Install the "node v12.21.0" version through the "n v12.21.0" command.

Token-based authentication with Angular and Node Token-based authentication with Angular and Node Sep 01, 2023 pm 02:01 PM

Authentication is one of the most important parts of any web application. This tutorial discusses token-based authentication systems and how they differ from traditional login systems. By the end of this tutorial, you will see a fully working demo written in Angular and Node.js. Traditional Authentication Systems Before moving on to token-based authentication systems, let’s take a look at traditional authentication systems. The user provides their username and password in the login form and clicks Login. After making the request, authenticate the user on the backend by querying the database. If the request is valid, a session is created using the user information obtained from the database, and the session information is returned in the response header so that the session ID is stored in the browser. Provides access to applications subject to

See all articles