1. Environment configuration
1) Set up a server. Any Linux distribution will do; I use CentOS 6.5;
2) Install a MySQL database, version 5.5 or 5.6. To save trouble you can install it as part of an lnmp or lamp stack, which also gives you a web server for reading the logs in a browser later;
3) Install a Node.js environment. I use 0.12.7 and have not tried later versions;
4) Execute npm -g install forever to install forever so that the crawler can run in the background;
5) Get all the code onto the server (in other words, git clone the repository);
6) Execute npm install in the project directory to install dependent libraries;
7) Create two empty folders, json and avatar, in the project directory;
8) Create an empty MySQL database and a user with full permissions on it, then execute setup.sql and startusers.sql from the code, in that order, to create the table structure and import the initial seed users (a sample command sequence is sketched at the end of this section);
9) Edit config.js. The configuration items marked (required) must be filled in or modified; the remaining items can be left unchanged for now:
exports.jsonPath = "./json/"; // path where the generated JSON files are written
exports.avatarPath = "./avatar/"; // path where the avatar files are saved
exports.dbconfig = {
    host: 'localhost', // database server (required)
    user: 'dbuser', // database user name (required)
    password: 'dbpassword', // database password (required)
    database: 'dbname', // database name (required)
    port: 3306, // database server port
    poolSize: 20,
    acquireTimeout: 30000
};
exports.urlpre = "http://www.jb51.net/"; // site URL
exports.urlzhuanlanpre = "http://www.jb51.net/list/index_96.htm/"; // column list URL
exports.WPurl = "www.xxx.com"; // address of the WordPress site where articles will be published
exports.WPusername = "publishuser"; // user name used to publish articles
exports.WPpassword = "publishpassword"; // password of the publishing user
exports.WPurlavatarpre = "http://www.xxx.com/avatar/"; // URL prefix that replaces the original avatar addresses in published articles
exports.mailservice = "QQ"; // mail notification service type; Gmail also works, provided you can reach Gmail (required)
exports.mailuser = "12345@qq.com"; // mailbox user name (required)
exports.mailpass = "qqpassword"; // mailbox password (required)
exports.mailfrom = "12345@qq.com"; // sender address (required, usually the same mailbox as the user name)
exports.mailto = "12345@qq.com"; // address that receives the notification mails (required)
Save and proceed to the next step.
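For reference, the folder and database preparation from steps 7) and 8) might look roughly like the sketch below. The database name dbname, user dbuser, password dbpassword, and project path are placeholders that must match whatever you put into config.js.
cd /usr/zhihuspider                      # project directory (use your own path)
mkdir json avatar                        # empty folders for the generated JSON files and avatars
mysql -u root -p -e "CREATE DATABASE dbname DEFAULT CHARACTER SET utf8;"
mysql -u root -p -e "GRANT ALL PRIVILEGES ON dbname.* TO 'dbuser'@'localhost' IDENTIFIED BY 'dbpassword';"
mysql -u dbuser -p dbname < setup.sql         # create the table structure
mysql -u dbuser -p dbname < startusers.sql    # import the initial seed users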
2. Crawler users
The principle of the crawler is essentially to simulate a real Zhihu user clicking around the site and collecting data, so we need a real Zhihu account. You can use your own account for testing, but in the long run it is better to register a dedicated one. One account is enough; the current crawler only supports one. Our simulation does not have to log in from the homepage like a real user; it simply borrows the cookie value directly:
After registering, activating, and logging in, go to your homepage and use any browser with developer tools or a cookie plug-in to open your Zhihu cookies. You may see a very long list, but we only need one part of it, namely "z_c0". Copy the z_c0 portion of your cookie, keeping the equal sign, quotation marks, and semicolon. The final format looks roughly like this:
z_c0="LA8kJIJFdDSOA883wkUGJIRE8jVNKSOQfB9430=|1420113988|a6ea18bc1b23ea469e3b5fb2e33c2828439cb";
Insert a row into the cookies table of the MySQL database. The exact column layout is defined by setup.sql; the z_c0 value you just copied goes into the cookie field:
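Purely as an illustration (columns other than cookie are not shown, so consult setup.sql for the real layout), the insert could look like this, reusing the example value from above:
-- Illustrative sketch only: check setup.sql for the actual column list
-- and fill in any additional columns the table defines.
INSERT INTO cookies (cookie)
VALUES ('z_c0="LA8kJIJFdDSOA883wkUGJIRE8jVNKSOQfB9430=|1420113988|a6ea18bc1b23ea469e3b5fb2e33c2828439cb";');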
Then it can officially start running. If the cookie expires or the user is blocked, just modify the cookie field in this row of records.
3. Operation
It is recommended to run the crawler with forever, which not only makes background running and logging easy but also restarts it automatically after a crash. Example:
forever -l /var/www/log.txt start index.js
The path after -l is where the log is written. If it is placed under the web server's document root, the log can be checked directly in a browser at http://www.xxx.com/log.txt. Add parameters (separated by spaces) after index.js to issue different crawler instructions (see the sketch after this list):
1) -i executes immediately. If this parameter is not added, it will be executed at the next specified time by default, such as 0:05 every morning;
2) -ng skips the phase of fetching new users, that is, getnewuser;
3) -ns skips the snapshot phase, that is, usersnapshot;
4) -nf skips the data file generation stage, that is, saveviewfile;
5) -db displays debugging logs.
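Inside index.js these switches arrive through the command line. The following is a minimal sketch of how such flags can be picked up, not necessarily how the original code does it:
// Conceptual sketch: read the crawler switches from process.argv.
var args = process.argv.slice(2);
var options = {
    runImmediately: args.indexOf('-i') !== -1,    // -i: run now instead of waiting for the scheduled time
    skipGetNewUser: args.indexOf('-ng') !== -1,   // -ng: skip the getnewuser stage
    skipSnapshot: args.indexOf('-ns') !== -1,     // -ns: skip the usersnapshot stage
    skipSaveViewFile: args.indexOf('-nf') !== -1, // -nf: skip the saveviewfile stage
    debug: args.indexOf('-db') !== -1             // -db: print debugging logs
};
console.log(options);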
The functions of each stage are introduced in the next section. To make operation easier, you can wrap this command in an sh script, for example:
#!/bin/bash
cd /usr/zhihuspider
rm -f /var/www/log.txt
forever -l /var/www/log.txt start index.js $*
Replace the specific paths with your own. You can then start the crawler by passing parameters to ./zhihuspider.sh. For example, ./zhihuspider.sh -i -ng -nf starts the task immediately and skips the new-user and file-saving stages. To stop the crawler, run forever stopall (or forever stop with the process number).
4. Overview of principles
The entry file of the Zhihu crawler is index.js. It runs in a loop and executes the crawler tasks at a specified time every day. Three tasks are executed in sequence each day, namely:
1) getnewuser.js: captures new user information by comparing the follower lists of the users already in the library; relying on this mechanism, new Zhihu users worth following are automatically added to the library (a conceptual sketch follows this list);
2) usersnapshot.js: loops over the users in the current library, captures their information and answer lists, and saves them as a daily snapshot.
3) saveviewfile.js: generates a user analysis list from the latest snapshots and filters out yesterday's, recent, and all-time best answers for publication to the "Kanzhihu" website.
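To illustrate the idea behind getnewuser.js only (this is not the actual implementation; the function name and the data shape are assumptions), the comparison between two consecutive snapshots could be sketched like this:
// Conceptual sketch: pick the users whose follow counts changed between the
// two most recent snapshots; their follower/followee lists may contain
// people we have not seen yet and are therefore worth fetching again.
// Each snapshot is assumed to be an array of { id, followerCount, followeeCount }.
function findUsersToExpand(previousSnapshot, latestSnapshot) {
    var prevById = {};
    previousSnapshot.forEach(function (u) { prevById[u.id] = u; });
    return latestSnapshot.filter(function (u) {
        var prev = prevById[u.id];
        return !prev ||
            prev.followerCount !== u.followerCount ||
            prev.followeeCount !== u.followeeCount;
    });
}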
After the above three tasks are completed, the main thread refreshes the Zhihu homepage every few minutes to verify whether the current cookie is still valid. If it is invalid (i.e., the request is redirected to the logged-out page), a notification email is sent to the specified mailbox reminding you to replace the cookie in time. The method of replacing the cookie is the same as during initialization: log in manually once and copy out the cookie value. If you are interested in the specific code implementation, read the comments carefully, adjust some of the configuration, or even try to restructure the entire crawler yourself.
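For illustration, a minimal sketch of such a periodic cookie check might look like the following. This is not the code from index.js; the notify callback, the ten-minute interval, and the redirect heuristic are assumptions.
var http = require('http');

// Conceptual sketch: request the homepage with the stored cookie; a redirect
// instead of the normal home feed suggests the cookie has expired.
function checkCookie(cookieValue, notify) {
    var options = {
        host: 'www.zhihu.com',
        path: '/',
        headers: { 'Cookie': cookieValue }
    };
    http.get(options, function (res) {
        if (res.statusCode >= 300 && res.statusCode < 400) {
            notify('Cookie seems to have expired, please replace it.');
        }
        res.resume(); // discard the response body
    }).on('error', function (err) {
        notify('Cookie check failed: ' + err.message);
    });
}

// e.g. check every 10 minutes and notify by mail (both are assumptions):
// setInterval(function () { checkCookie(currentCookie, sendMail); }, 10 * 60 * 1000);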
Tips
1) The principle of getnewuser is to decide what to capture by comparing each user's follow counts between the two most recent daily snapshots, so at least two snapshots must exist before it can run; if it is executed earlier, it is skipped automatically.
2) A half-finished snapshot can be resumed. If the program crashes because of an error, stop it with forever stop, then add the parameters -i -ng to execute immediately and skip the new-user phase, so that the crawl continues from the half-captured snapshot.
3) Do not casually increase the number of (pseudo) threads used when taking snapshots, i.e. the maxthreadcount attribute in usersnapshots. Too many threads cause 429 errors, and the large amount of captured data may not be written to the database in time, causing a memory overflow. So unless your database sits on an SSD, do not use more than 10 threads.
4) saveviewfile needs snapshots from at least the past 7 days to generate the analysis results; if there are fewer than 7 days of snapshots, it reports an error and is skipped. Earlier analysis work can be done by querying the database manually.
5) Considering that most people do not need to duplicate a "Kanzhihu", the entry point of the automatic WordPress publishing function has been commented out. If you have set up WordPress, remember to enable XML-RPC, create a user dedicated to publishing articles, configure the corresponding parameters in config.js, and uncomment the relevant code in saveviewfile.
6) Since Zhihu hotlink-protects avatars, the avatars are also downloaded and saved locally while user information is being captured, and published articles use the local avatar addresses. You need to point a URL path on the HTTP server to the folder where the avatars are saved, or place that folder directly inside the website directory.
7) The code may not be easy to read. Apart from the inherently convoluted callback structure of node.js, part of the reason is that when I first wrote the program I had only just started learning node.js, and many unfamiliar areas left the structure confusing with no time to fix it; another part is that many ugly conditionals and retry rules accumulated over repeated patching, and removing them all might cut the code volume by two thirds. But there is no way around it: they are needed to keep the system running stably.
8) This crawler's source code is released under the WTFPL license, which places no restrictions on modification or distribution.
That is the entire content of this article; I hope it is helpful to everyone's study.