
Powerful web crawler system: pyspider

May 12, 2017, 10:35 AM

PySpider: a powerful web crawler system with a full-featured WebUI, written in Python by a Chinese developer. It has a distributed architecture and supports multiple database backends, and its WebUI provides a script editor, task monitor, project manager, and result viewer.

1. Build environment:

System version: Linux centos-linux.shared 3.10.0-123.el7.x86_64 #1 SMP Mon Jun 30 12:09:22 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

Python version: Python 3.5.1

1.1. Build python3 environment:

After trying both approaches below, I chose the integrated environment Anaconda.

1.1.1. Compile


# Install build dependencies
yum install -y ncurses-devel openssl openssl-devel zlib-devel gcc make glibc-devel libffi-devel glibc-static glibc-utils sqlite-devel readline-devel tk-devel gdbm-devel db4-devel libpcap-devel xz-deve
# Download the Python source
wget https://www.python.org/ftp/python/3.5.1/Python-3.5.1.tgz
# Or use a mirror in China
wget http://mirrors.sohu.com/python/3.5.1/Python-3.5.1.tgz
mv Python-3.5.1.tgz /usr/local/src;cd /usr/local/src
# Extract
tar -zxf Python-3.5.1.tgz;cd Python-3.5.1
# Compile and install
./configure --prefix=/usr/local/python3.5 --enable-shared
make && make install
# Create a symlink
ln -s /usr/local/python3.5/bin/python3 /usr/bin/python3
echo "/usr/local/python3.5/lib" > /etc/ld.so.conf.d/python3.5.conf
ldconfig
# Verify python3
python3
# Python 3.5.1 (default, Oct 9 2016, 11:44:24)
# [GCC 4.8.5 20150623 (Red Hat 4.8.5-4)] on linux
# Type "help", "copyright", "credits" or "license" for more information.
# >>>
# pip
/usr/local/python3.5/bin/pip3 install --upgrade pip
ln -s /usr/local/python3.5/bin/pip /usr/bin/pip
# I ran into a problem during installation, so I reinstalled pip
wget https://bootstrap.pypa.io/get-pip.py --no-check-certificate
python get-pip.py
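
A source build can finish successfully while skipping optional standard-library modules whose development headers were missing. As a minimal sketch (the helper name `missing_modules` and the choice of modules are assumptions made for illustration, not part of the original steps), the script below can be run with the newly built python3 to confirm that the modules backed by the `-devel` packages above are importable:

```python
import importlib
import sys

# Standard-library modules that depend on the -devel packages installed above
# (openssl, zlib, sqlite, readline, libffi, ncurses). The selection is an
# assumption chosen for illustration.
OPTIONAL_MODULES = ["ssl", "zlib", "sqlite3", "readline", "ctypes", "curses"]


def missing_modules(names):
    """Return the names from `names` that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    print("Python", sys.version.split()[0])
    failed = missing_modules(OPTIONAL_MODULES)
    if failed:
        print("Not built:", ", ".join(failed))
    else:
        print("All optional modules are importable")
```

If a module is reported as not built, installing the matching `-devel` package and re-running `make && make install` is the usual remedy.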

1.1.2. Integrated environment: Anaconda


# Integrated environment Anaconda (recommended)
wget https://repo.continuum.io/archive/Anaconda3-4.2.0-Linux-x86_64.sh
# Run the installer directly
./Anaconda3-4.2.0-Linux-x86_64.sh
# If it fails, extraction may have failed; install bzip2
yum install bzip2

1.2. Install MariaDB


# Install
yum -y install mariadb mariadb-server
# Start
systemctl start mariadb
# Enable start on boot
systemctl enable mariadb
# Set the root password (empty by default)
mysql_secure_installation
# Log in
mysql -u root -p
# Create a user; choose your own account name and password
CREATE USER 'user_name'@'localhost' IDENTIFIED BY 'user_pass';
GRANT ALL PRIVILEGES ON *.* TO 'user_name'@'localhost' WITH GRANT OPTION;
CREATE USER 'user_name'@'%' IDENTIFIED BY 'user_pass';
GRANT ALL PRIVILEGES ON *.* TO 'user_name'@'%' WITH GRANT OPTION;

1.3. Install pyspider

I use Anaconda


# Create the virtual environment sbird with Python 3.*
conda create -n sbird python=3*
# Enter the environment
source activate sbird
# Install pyspider
pip install pyspider
# Error:
# it does not exist. The exported locale is "en_US.UTF-8" but it is not supported
# Fix: run the following (can be added to .bashrc)
export LC_ALL=en_US.utf-8
export LANG=en_US.utf-8
# ImportError: pycurl: libcurl link-time version (7.29.0) is older than compile-time version (7.49.0)
conda install pycurl
# Leave the environment
source deactivate sbird
# If localhost:5000 is unreachable inside a virtual machine, stop the firewall
systemctl stop firewalld.service
# ---- Run directly from source ----
mkdir git;cd git
# Download
git clone https://github.com/binux/pyspider.git
# Run
/root/anaconda3/envs/sbird/bin/python /root/git/pyspider/run.py

Other methods


# Set up a virtual environment
pip install virtualenv
mkdir python;cd python
# Create the virtual environment pyenv3
virtualenv -p /usr/bin/python3 pyenv3
# Enter the virtual environment and activate it
cd pyenv3/
source ./bin/activate
pip install pyspider
# If pycurl reports an error
yum install libcurl-devel
# Then continue
pip install pyspider
# Deactivate
deactivate

I recommend installing with Anaconda.

If an error occurs while running pyspider, refer to the Anaconda installation section above. At this point, visiting localhost:5000 should show the WebUI page.
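
When the page does not load, it helps to know whether the port is reachable at all before suspecting pyspider itself. The following is a small sketch using only the standard library; the function name `is_port_open` is an assumption made for illustration, and 5000 is the WebUI port mentioned above:

```python
import socket


def is_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    print("WebUI reachable:", is_port_open("localhost", 5000))
```

If this prints `False` on the server itself, pyspider is probably not running; if it works locally but not from another machine, the firewall is the more likely cause.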

1.4. Install Supervisor


# Install
yum install supervisor -y
# If the package cannot be found, add the Aliyun EPEL repository
vim /etc/yum.repos.d/epel.repo
# Add the following content
[epel]
name=Extra Packages for Enterprise Linux 7 - $basearch
baseurl=http://mirrors.aliyun.com/epel/7/$basearch
        http://mirrors.aliyuncs.com/epel/7/$basearch
failovermethod=priority
enabled=1
gpgcheck=0
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-EPEL-7

[epel-debuginfo]
name=Extra Packages for Enterprise Linux 7 - $basearch - Debug
baseurl=http://mirrors.aliyun.com/epel/7/$basearch/debug
        http://mirrors.aliyuncs.com/epel/7/$basearch/debug
failovermethod=priority
enabled=0
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-EPEL-7
gpgcheck=0

[epel-source]
name=Extra Packages for Enterprise Linux 7 - $basearch - Source
baseurl=http://mirrors.aliyun.com/epel/7/SRPMS
        http://mirrors.aliyuncs.com/epel/7/SRPMS
failovermethod=priority
enabled=0
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-EPEL-7
gpgcheck=0
# Install
yum install supervisor -y
# Check that the installation succeeded
echo_supervisord_conf

1.4.1. Supervisor usage


supervisord    # start the Supervisor server process
supervisorctl  # open the Supervisor command-line client
# Suppose we create a process named pyspider01
vim /etc/supervisord.d/pyspider01.ini
# Write the following content
[program:pyspider01]

command   = /root/anaconda3/envs/sbird/bin/python /root/git/pyspider/run.py
directory  = /root/git/pyspider
user     = root
process_name = %(program_name)s
autostart  = true
autorestart = true
startsecs  = 3

redirect_stderr     = true
stdout_logfile_maxbytes = 500MB
stdout_logfile_backups = 10
stdout_logfile     = /pyspider/supervisor/pyspider01.log
# Reload
supervisorctl reload
# Start
supervisorctl start pyspider01
# Alternatively, start supervisord with an explicit config file
supervisord -c /etc/supervisord.conf
# Check status
supervisorctl status
# output
pyspider01            RUNNING  pid 4026, uptime 0:02:40
# Shut down
supervisorctl shutdown
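
The `[program:...]` section above is reused for every pyspider component later in this guide, with only the name, command, and log file changing. As a sketch of how that repetition could be generated rather than typed by hand (the function `render_program` and its `log_dir` parameter are hypothetical helpers, not part of Supervisor or pyspider), Python's standard `configparser` can write the same keys:

```python
import configparser
import io


def render_program(name, command, log_dir="/pyspider/supervisor"):
    """Return the text of a Supervisor [program:x] section for one process."""
    # interpolation=None keeps "%(program_name)s" literal so that Supervisor,
    # not configparser, expands it.
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep key names exactly as written
    parser["program:" + name] = {
        "command": command,
        "directory": "/root/git/pyspider",
        "user": "root",
        "process_name": "%(program_name)s",
        "autostart": "true",
        "autorestart": "true",
        "startsecs": "3",
        "redirect_stderr": "true",
        "stdout_logfile_maxbytes": "500MB",
        "stdout_logfile_backups": "10",
        "stdout_logfile": "{}/{}.log".format(log_dir, name),
    }
    buffer = io.StringIO()
    parser.write(buffer)
    return buffer.getvalue()


if __name__ == "__main__":
    print(render_program(
        "pyspider01",
        "/root/anaconda3/envs/sbird/bin/python /root/git/pyspider/run.py",
    ))
```

The printed text can be redirected into `/etc/supervisord.d/pyspider01.ini`; the values mirror the ones used in this guide and would need adjusting for other paths.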

1.5. Install redis


# redis is used as the message queue
mkdir download;cd download
wget http://download.redis.io/releases/redis-3.2.4.tar.gz
tar xzf redis-3.2.4.tar.gz
cd redis-3.2.4
make
# Or install directly with yum
yum -y install redis
# Start
systemctl start redis.service
# Restart
systemctl restart redis.service
# Stop
systemctl stop redis.service
# Check status
systemctl status redis.service
# Edit the file /etc/redis.conf
vim /etc/redis.conf
# Make these changes:
#   daemonize no    ->  daemonize yes
#   bind 127.0.0.1  ->  bind 10.211.55.22 (this server's IP)
# Restart redis
systemctl restart redis.service
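
After editing the file it is easy to leave an old `bind` or `daemonize` line in place. As a rough sketch (the helper `read_directives` is a name invented here, and the parsing only handles the simple `key value` lines that this step touches), the following reports what the file currently sets:

```python
def read_directives(conf_text, names):
    """Collect the values of the given directives from redis.conf-style text."""
    found = {name: [] for name in names}
    for raw_line in conf_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split(None, 1)
        key = parts[0].lower()
        if key in found:
            found[key].append(parts[1].strip() if len(parts) > 1 else "")
    return found


if __name__ == "__main__":
    try:
        with open("/etc/redis.conf") as handle:
            print(read_directives(handle.read(), ["daemonize", "bind"]))
    except OSError as error:
        print("cannot read /etc/redis.conf:", error)
```

A directive that appears more than once shows up as a list with several values, which is a hint that an old line should be removed.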

1.6. Start services on boot


# Enable Supervisor at boot
systemctl enable supervisord.service
# Enable redis at boot
systemctl enable redis.service
# Disable the firewall at boot
systemctl disable firewalld.service

At this point, the single-server pyspider environment has been built and deployed. Open localhost:5000 in a browser to enter the web interface.

You can also write a script to run pyspider, and check its running status in /pyspider/supervisor/pyspider01.log.

2. Distributed deployment

Name the server you just configured centos01. Following the same configuration, deploy two more servers, centos02 and centos03.

As follows:

Server name   IP             Description
centos01      10.211.55.22   redis, mariaDB, scheduler
centos02      10.211.55.23   fetcher, processor, result_worker, phantomjs
centos03      10.211.55.24   fetcher, processor, result_worker, webui
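
To keep track of which component runs where while writing the per-server configuration, the layout can be written down as data. This is only an illustrative sketch: the names `LAYOUT`, `REQUIRED`, `hosts_by_component`, and `unassigned` are invented here, and the set of required components is an assumption based on the components listed in the table.

```python
# Layout copied from the table above: server name -> components it runs.
LAYOUT = {
    "centos01": ["redis", "mariaDB", "scheduler"],
    "centos02": ["fetcher", "processor", "result_worker", "phantomjs"],
    "centos03": ["fetcher", "processor", "result_worker", "webui"],
}

# pyspider components assumed necessary for a working deployment.
REQUIRED = {"scheduler", "fetcher", "processor", "result_worker", "webui"}


def hosts_by_component(layout):
    """Invert the layout: component -> sorted list of servers running it."""
    result = {}
    for host, components in layout.items():
        for component in components:
            result.setdefault(component, []).append(host)
    return {name: sorted(hosts) for name, hosts in result.items()}


def unassigned(layout, required):
    """Return the required components that no server runs."""
    return sorted(required - set(hosts_by_component(layout)))


if __name__ == "__main__":
    for component, hosts in sorted(hosts_by_component(LAYOUT).items()):
        print(component, "->", ", ".join(hosts))
    print("unassigned:", unassigned(LAYOUT, REQUIRED))
```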

2.1. centos01

Log in to server centos01. After the steps in section 1, the basic environment is already set up. First edit the configuration file /pyspider/config.json:


{
 "taskdb": "mysql+taskdb://user_name:user_pass@10.211.55.22:3306/taskdb",
 "projectdb": "mysql+projectdb://user_name:user_pass@10.211.55.22:3306/projectdb",
 "resultdb": "mysql+resultdb://user_name:user_pass@10.211.55.22:3306/resultdb",
 "message_queue": "redis://10.211.55.22:6379/db",
 "logging-config": "/pyspider/logging.conf",
 "phantomjs-proxy":"10.211.55.23:25555",
 "webui": {
  "username": "",
  "password": "",
  "need-auth": false,
  "host":"10.211.55.24",
  "port":"5000",
  "scheduler-rpc":"http://10.211.55.22:5002",
  "fetcher-rpc":"http://10.211.55.23:5001"
 },
 "fetcher": {
  "xmlrpc":true,
  "xmlrpc-host": "0.0.0.0",
  "xmlrpc-port": "5001"
 },
 "scheduler": {
  "xmlrpc":true,
  "xmlrpc-host": "0.0.0.0",
  "xmlrpc-port": "5002"
 }
}
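
Because this file is copied to every server, a typo in one URL (stray whitespace, a missing host) affects the whole cluster. The following is a hedged sketch of a pre-flight check using only the standard library; `find_problems` and `URL_KEYS` are names made up for this example, and the checks are deliberately shallow (they do not contact the database or redis):

```python
import json
from urllib.parse import urlsplit

# Top-level keys in config.json whose values are connection URLs.
URL_KEYS = ["taskdb", "projectdb", "resultdb", "message_queue"]


def find_problems(config):
    """Return human-readable problems found in the URL fields of the config."""
    problems = []

    def check(label, value):
        if any(ch.isspace() for ch in value):
            problems.append(label + ": contains whitespace")
            return
        parts = urlsplit(value)
        if not parts.scheme or not parts.hostname:
            problems.append(label + ": missing scheme or host")

    for key in URL_KEYS:
        if key in config:
            check(key, config[key])
        else:
            problems.append(key + ": not set")
    for key in ("scheduler-rpc", "fetcher-rpc"):
        value = config.get("webui", {}).get(key)
        if value is not None:
            check("webui." + key, value)
    return problems


if __name__ == "__main__":
    try:
        with open("/pyspider/config.json") as handle:
            for problem in find_problems(json.load(handle)):
                print(problem)
    except OSError as error:
        print("cannot read config:", error)
```

An empty result only means the URLs are well formed; whether the services behind them are reachable still has to be tested by starting the components.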

Then try to run the scheduler:

/root/anaconda3/envs/sbird/bin/python /root/git/pyspider/run.py -c /pyspider/config.json scheduler
# Error:
ImportError: No module named 'mysql'
# Download mysql-connector-python
cd ~/git/
git clone https://github.com/mysql/mysql-connector-python.git
# Install
source activate sbird
cd mysql-connector-python
python setup.py install
# Install the redis client library
pip install redis
source deactivate
# Run
/root/anaconda3/envs/sbird/bin/python /root/git/pyspider/run.py -c /pyspider/config.json scheduler
# Output (OK)
[I 161010 15:57:25 scheduler:644] scheduler starting...
[I 161010 15:57:25 scheduler:779] scheduler.xmlrpc listening on 0.0.0.0:5002
[I 161010 15:57:25 scheduler:583] in 5m: new:0,success:0,retry:0,failed:0

Once it runs successfully, change /etc/supervisord.d/pyspider01.ini as follows:

[program:pyspider01]

command   = /root/anaconda3/envs/sbird/bin/python /root/git/pyspider/run.py -c /pyspider/config.json scheduler
directory  = /root/git/pyspider
user     = root
process_name = %(program_name)s
autostart  = true
autorestart = true
startsecs  = 3

redirect_stderr     = true
stdout_logfile_maxbytes = 500MB
stdout_logfile_backups = 10
stdout_logfile     = /pyspider/supervisor/pyspider01.log
# Reload
supervisorctl reload
# Check status
supervisorctl status

centos01 deployment complete.

2.2. centos02

On centos02, you need to run result_worker, processor, phantomjs, and fetcher. Create one file for each process:

/etc/supervisord.d/result_worker.ini

[program:result_worker]

command   = /root/anaconda3/envs/sbird/bin/python /root/git/pyspider/run.py -c /pyspider/config.json result_worker
directory  = /root/git/pyspider
user     = root
process_name = %(program_name)s
autostart  = true
autorestart = true
startsecs  = 3

redirect_stderr     = true
stdout_logfile_maxbytes = 500MB
stdout_logfile_backups = 10
stdout_logfile     = /pyspider/supervisor/result_worker.log

/etc/supervisord.d/processor.ini

[program:processor]

command   = /root/anaconda3/envs/sbird/bin/python /root/git/pyspider/run.py -c /pyspider/config.json processor
directory  = /root/git/pyspider
user     = root
process_name = %(program_name)s
autostart  = true
autorestart = true
startsecs  = 3

redirect_stderr     = true
stdout_logfile_maxbytes = 500MB
stdout_logfile_backups = 10
stdout_logfile     = /pyspider/supervisor/processor.log

/etc/supervisord.d/phantomjs.ini

[program:phantomjs]

command   = /pyspider/phantomjs --config=/pyspider/pjsconfig.json /pyspider/phantomjs_fetcher.js 25555
directory  = /root/git/pyspider
user     = root
process_name = %(program_name)s
autostart  = true
autorestart = true
startsecs  = 3

redirect_stderr     = true
stdout_logfile_maxbytes = 500MB
stdout_logfile_backups = 10
stdout_logfile     = /pyspider/supervisor/phantomjs.log

/etc/supervisord.d/fetcher.ini

[program:fetcher]

command   = /root/anaconda3/envs/sbird/bin/python /root/git/pyspider/run.py -c /pyspider/config.json fetcher
directory  = /root/git/pyspider
user     = root
process_name = %(program_name)s
autostart  = true
autorestart = true
startsecs  = 3

redirect_stderr     = true
stdout_logfile_maxbytes = 500MB
stdout_logfile_backups = 10
stdout_logfile     = /pyspider/supervisor/fetcher.log

Create pjsconfig.json in the /pyspider directory:

{
 /* Same as: --ignore-ssl-errors=true */
 "ignoreSslErrors": true,

 /* Same as: --ssl-protocol=any */
 "sslProtocol": "any",

 /* Same as: --output-encoding=utf8 */
 "outputEncoding": "utf8",

 /* persistent Cookies. */
 /*cookiesfile="e:/phontjscookies.txt",*/
 "cookiesFile": "pyspider/phontjscookies.txt",

 /* load image */
 "autoLoadImages": false
}
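
A syntax slip in this file is hard to spot by eye because of the embedded comments. As a quick check, and assuming PhantomJS tolerates the `/* ... */` comments as the file above relies on, the sketch below strips those comments and parses the rest with Python's strict `json` module. The helper name `load_commented_json` is invented here, and the regular expression is a simplification that would also remove a `/* ... */` sequence appearing inside a string value.

```python
import json
import re


def load_commented_json(text):
    """Parse JSON text after removing /* ... */ block comments."""
    without_comments = re.sub(r"/\*.*?\*/", "", text, flags=re.DOTALL)
    return json.loads(without_comments)


if __name__ == "__main__":
    try:
        with open("/pyspider/pjsconfig.json") as handle:
            print(load_commented_json(handle.read()))
    except OSError as error:
        print("cannot read file:", error)
    except ValueError as error:
        # json.JSONDecodeError is a subclass of ValueError.
        print("not valid JSON:", error)
```

Entries written as `key = value` or with unquoted keys are reported as invalid, since strict JSON requires `"key": value`.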

Download phantomjs to the /pyspider/ folder, and copy git/pyspider/pyspider/fetcher/phantomjs_fetcher.js to /pyspider/phantomjs_fetcher.js.

# Reload
supervisorctl reload
# Check status
supervisorctl status
# output
fetcher             RUNNING  pid 3446, uptime 0:00:07
phantomjs            RUNNING  pid 3448, uptime 0:00:07
processor            RUNNING  pid 3447, uptime 0:00:07
result_worker          RUNNING  pid 3445, uptime 0:00:07

centos02 is deployed.

2.3. centos03

The three processes fetcher, processor, and result_worker are deployed the same way as on centos02. On top of those, this server also runs webui.

Create file:

/etc/supervisord.d/webui.ini

[program:webui]

command   = /root/anaconda3/envs/sbird/bin/python /root/git/pyspider/run.py -c /pyspider/config.json webui
directory  = /root/git/pyspider
user     = root
process_name = %(program_name)s
autostart  = true
autorestart = true
startsecs  = 3

redirect_stderr     = true
stdout_logfile_maxbytes = 500MB
stdout_logfile_backups = 10
stdout_logfile     = /pyspider/supervisor/webui.log
# Reload
supervisorctl reload
# Check status
supervisorctl status
# output
fetcher             RUNNING  pid 2724, uptime 0:00:07
processor            RUNNING  pid 2725, uptime 0:00:07
result_worker          RUNNING  pid 2723, uptime 0:00:07
webui              RUNNING  pid 2726, uptime 0:00:07
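
With several processes per server, reading the status output by eye gets tedious. As a sketch (the function `not_running` is a hypothetical helper; it assumes the two-column `name STATE ...` layout shown in the output above), the status text can be filtered for anything that is not `RUNNING`:

```python
import subprocess


def not_running(status_text):
    """Return (name, state) pairs for processes whose state is not RUNNING."""
    problems = []
    for line in status_text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        name, state = fields[0], fields[1]
        if state != "RUNNING":
            problems.append((name, state))
    return problems


if __name__ == "__main__":
    try:
        result = subprocess.run(
            ["supervisorctl", "status"],
            stdout=subprocess.PIPE,
            universal_newlines=True,
        )
        for name, state in not_running(result.stdout):
            print(name, "is", state)
    except OSError as error:
        print("could not run supervisorctl:", error)
```

Running this on each of the three servers gives a quick view of which component needs attention; the matching log file under /pyspider/supervisor/ usually explains why.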
