1. Environment configuration
1) Set up a server. Any Linux distribution will do; I use CentOS 6.5;
2) Install a MySQL database, version 5.5 or 5.6. To save trouble you can install it as part of an lnmp or lamp stack, which also gives you a web server for reading the logs in a browser later;
3) Install a Node.js environment. I use 0.12.7 and have not tried later versions;
4) Execute npm -g install forever to install forever so that the crawler can run in the background;
5) Get all the code onto the server (in other words, git clone the repository);
6) Execute npm install in the project directory to install dependent libraries;
7) Create two empty folders, json and avatar, in the project directory;
8) Create an empty MySQL database and a user with full permissions on it, then execute setup.sql and startusers.sql from the code, in that order, to create the table structure and import the initial seed users (a sample command sequence is sketched at the end of this section);
9) Edit config.js. The configuration items marked (required) must be filled in or modified; the remaining items can be left unchanged for now:
exports.jsonPath = "./json/"; // path where the generated JSON files are written
exports.avatarPath = "./avatar/"; // path where the avatar files are saved
exports.dbconfig = {
    host: 'localhost', // database server (required)
    user: 'dbuser', // database user name (required)
    password: 'dbpassword', // database password (required)
    database: 'dbname', // database name (required)
    port: 3306, // database server port
    poolSize: 20,
    acquireTimeout: 30000
};
exports.urlpre = "http://www.jb51.net/"; // site URL
exports.urlzhuanlanpre = "http://www.jb51.net/list/index_96.htm/"; // column list URL
exports.WPurl = "www.xxx.com"; // address of the WordPress site where articles will be published
exports.WPusername = "publishuser"; // user name used to publish articles
exports.WPpassword = "publishpassword"; // password of the publishing user
exports.WPurlavatarpre = "http://www.xxx.com/avatar/"; // URL prefix that replaces the original avatar addresses in published articles
exports.mailservice = "QQ"; // mail notification service type; Gmail also works, provided you can reach Gmail (required)
exports.mailuser = "12345@qq.com"; // mailbox user name (required)
exports.mailpass = "qqpassword"; // mailbox password (required)
exports.mailfrom = "12345@qq.com"; // sender address (required, usually the same mailbox as the user name)
exports.mailto = "12345@qq.com"; // address that receives the notification mails (required)
Save and proceed to the next step.
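For reference, the folder and database preparation from steps 7) and 8) might look roughly like the sketch below. The database name dbname, user dbuser, password dbpassword, and project path are placeholders that must match whatever you put into config.js.
cd /usr/zhihuspider                      # project directory (use your own path)
mkdir json avatar                        # empty folders for the generated JSON files and avatars
mysql -u root -p -e "CREATE DATABASE dbname DEFAULT CHARACTER SET utf8;"
mysql -u root -p -e "GRANT ALL PRIVILEGES ON dbname.* TO 'dbuser'@'localhost' IDENTIFIED BY 'dbpassword';"
mysql -u dbuser -p dbname < setup.sql         # create the table structure
mysql -u dbuser -p dbname < startusers.sql    # import the initial seed users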
2. Crawler users
The principle of the crawler is essentially to simulate a real Zhihu user clicking around the site and collecting data, so we need a real Zhihu account. You can use your own account for testing, but in the long run it is better to register a dedicated one. One account is enough; the current crawler only supports one. Our simulation does not have to log in from the homepage like a real user; it simply borrows the cookie value directly:
After registering, activating, and logging in, go to your homepage and use any browser with developer tools or a cookie plug-in to open your Zhihu cookies. You may see a very long list, but we only need one part of it, namely "z_c0". Copy the z_c0 portion of your cookie, keeping the equal sign, quotation marks, and semicolon. The final format looks roughly like this:
z_c0="LA8kJIJFdDSOA883wkUGJIRE8jVNKSOQfB9430=|1420113988|a6ea18bc1b23ea469e3b5fb2e33c2828439cb";
Insert a row into the cookies table of the MySQL database. The exact column layout is defined by setup.sql; the z_c0 value you just copied goes into the cookie field:
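Purely as an illustration (columns other than cookie are not shown, so consult setup.sql for the real layout), the insert could look like this, reusing the example value from above:
-- Illustrative sketch only: check setup.sql for the actual column list
-- and fill in any additional columns the table defines.
INSERT INTO cookies (cookie)
VALUES ('z_c0="LA8kJIJFdDSOA883wkUGJIRE8jVNKSOQfB9430=|1420113988|a6ea18bc1b23ea469e3b5fb2e33c2828439cb";');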
Then it can officially start running. If the cookie expires or the user is blocked, just modify the cookie field in this row of records.
3. Operation
It is recommended to run the crawler with forever, which not only makes background running and logging easy but also restarts it automatically after a crash. Example:
forever -l /var/www/log.txt start index.js
The path after -l is where the log is written. If it is placed under the web server's document root, the log can be checked directly in a browser at http://www.xxx.com/log.txt. Add parameters (separated by spaces) after index.js to issue different crawler instructions (see the sketch after this list):
1) -i executes immediately. If this parameter is not added, it will be executed at the next specified time by default, such as 0:05 every morning;
2) -ng skips the phase of fetching new users, that is, getnewuser;
3) -ns skips the snapshot phase, that is, usersnapshot;
4) -nf skips the data file generation stage, that is, saveviewfile;
5) -db displays debugging logs.
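Inside index.js these switches arrive through the command line. The following is a minimal sketch of how such flags can be picked up, not necessarily how the original code does it:
// Conceptual sketch: read the crawler switches from process.argv.
var args = process.argv.slice(2);
var options = {
    runImmediately: args.indexOf('-i') !== -1,    // -i: run now instead of waiting for the scheduled time
    skipGetNewUser: args.indexOf('-ng') !== -1,   // -ng: skip the getnewuser stage
    skipSnapshot: args.indexOf('-ns') !== -1,     // -ns: skip the usersnapshot stage
    skipSaveViewFile: args.indexOf('-nf') !== -1, // -nf: skip the saveviewfile stage
    debug: args.indexOf('-db') !== -1             // -db: print debugging logs
};
console.log(options);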
The functions of each stage are introduced in the next section. To make operation easier, you can wrap this command in an sh script, for example:
#!/bin/bash
cd /usr/zhihuspider
rm -f /var/www/log.txt
forever -l /var/www/log.txt start index.js $*
Replace the specific paths with your own. You can then start the crawler by passing parameters to ./zhihuspider.sh. For example, ./zhihuspider.sh -i -ng -nf starts the task immediately and skips the new-user and file-saving stages. To stop the crawler, run forever stopall (or forever stop with the process number).
4. Overview of principles
The entry file of the Zhihu crawler is index.js. It runs in a loop and executes the crawler tasks at a specified time every day. Three tasks are executed in sequence each day, namely:
1) getnewuser.js: captures new user information by comparing the follower lists of the users already in the library; relying on this mechanism, new Zhihu users worth following are automatically added to the library (a conceptual sketch follows this list);
2) usersnapshot.js: loops over the users in the current library, captures their information and answer lists, and saves them as a daily snapshot.
3) saveviewfile.js: generates a user analysis list from the latest snapshots and filters out yesterday's, recent, and all-time best answers for publication to the "Kanzhihu" website.
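To illustrate the idea behind getnewuser.js only (this is not the actual implementation; the function name and the data shape are assumptions), the comparison between two consecutive snapshots could be sketched like this:
// Conceptual sketch: pick the users whose follow counts changed between the
// two most recent snapshots; their follower/followee lists may contain
// people we have not seen yet and are therefore worth fetching again.
// Each snapshot is assumed to be an array of { id, followerCount, followeeCount }.
function findUsersToExpand(previousSnapshot, latestSnapshot) {
    var prevById = {};
    previousSnapshot.forEach(function (u) { prevById[u.id] = u; });
    return latestSnapshot.filter(function (u) {
        var prev = prevById[u.id];
        return !prev ||
            prev.followerCount !== u.followerCount ||
            prev.followeeCount !== u.followeeCount;
    });
}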
After the above three tasks are completed, the main thread refreshes the Zhihu homepage every few minutes to verify whether the current cookie is still valid. If it is invalid (i.e., the request is redirected to the logged-out page), a notification email is sent to the specified mailbox reminding you to replace the cookie in time. The method of replacing the cookie is the same as during initialization: log in manually once and copy out the cookie value. If you are interested in the specific code implementation, read the comments carefully, adjust some of the configuration, or even try to restructure the entire crawler yourself.
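For illustration, a minimal sketch of such a periodic cookie check might look like the following. This is not the code from index.js; the notify callback, the ten-minute interval, and the redirect heuristic are assumptions.
var http = require('http');

// Conceptual sketch: request the homepage with the stored cookie; a redirect
// instead of the normal home feed suggests the cookie has expired.
function checkCookie(cookieValue, notify) {
    var options = {
        host: 'www.zhihu.com',
        path: '/',
        headers: { 'Cookie': cookieValue }
    };
    http.get(options, function (res) {
        if (res.statusCode >= 300 && res.statusCode < 400) {
            notify('Cookie seems to have expired, please replace it.');
        }
        res.resume(); // discard the response body
    }).on('error', function (err) {
        notify('Cookie check failed: ' + err.message);
    });
}

// e.g. check every 10 minutes and notify by mail (both are assumptions):
// setInterval(function () { checkCookie(currentCookie, sendMail); }, 10 * 60 * 1000);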
Tips
1) The principle of getnewuser is to decide what to capture by comparing each user's follow counts between the two most recent daily snapshots, so at least two snapshots must exist before it can run; if it is executed earlier, it is skipped automatically.
2) A half-finished snapshot can be resumed. If the program crashes because of an error, stop it with forever stop, then add the parameters -i -ng to execute immediately and skip the new-user phase, so that the crawl continues from the half-captured snapshot.
3) Do not casually increase the number of (pseudo) threads used when taking snapshots, i.e. the maxthreadcount attribute in usersnapshots. Too many threads cause 429 errors, and the large amount of captured data may not be written to the database in time, causing a memory overflow. So unless your database sits on an SSD, do not use more than 10 threads.
4) saveviewfile needs snapshots from at least the past 7 days to generate the analysis results; if there are fewer than 7 days of snapshots, it reports an error and is skipped. Earlier analysis work can be done by querying the database manually.
5) Considering that most people do not need to duplicate a "Kanzhihu", the entry point of the automatic WordPress publishing function has been commented out. If you have set up WordPress, remember to enable XML-RPC, create a user dedicated to publishing articles, configure the corresponding parameters in config.js, and uncomment the relevant code in saveviewfile.
6) Since Zhihu hotlink-protects avatars, the avatars are also downloaded and saved locally while user information is being captured, and published articles use the local avatar addresses. You need to point a URL path on the HTTP server to the folder where the avatars are saved, or place that folder directly inside the website directory.
7) The code may not be easy to read. Apart from the inherently convoluted callback structure of node.js, part of the reason is that when I first wrote the program I had only just started learning node.js, and many unfamiliar areas left the structure confusing with no time to fix it; another part is that many ugly conditionals and retry rules accumulated over repeated patching, and removing them all might cut the code volume by two thirds. But there is no way around it: they are needed to keep the system running stably.
8) This crawler's source code is released under the WTFPL license, which places no restrictions on modification or distribution.
That is the entire content of this article; I hope it is helpful to everyone's study.