Powerful web crawler system: pyspider
PySpider: A powerful web crawler system written by a Chinese with a powerful WebUI. Written in Python language, distributed architecture, supports multiple database backends, and powerful WebUI supports script editor, task monitor, project manager and result viewer.
1. Build environment:
System version: Linux centos-linux.shared 3.10.0-123.el7.x86_64 #1 SMP Mon Jun 30 12:09 :22 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
Python version: Python 3.5.1
1.1. Build python3 environment:
After trying it, I chose the integrated environment Anaconda
1.1.1. Compile
# 下载依赖 yum install -y ncurses-devel openssl openssl-devel zlib-devel gcc make glibc-devel libffi-devel glibc-static glibc-utils sqlite-devel readline-devel tk-devel gdbm-devel db4-devel libpcap-devel xz-deve # 下载python版本 wget https://www.python.org/ftp/python/3.5.1/Python-3.5.1.tgz # 或者使用国内源 wget http://mirrors.sohu.com/python/3.5.1/Python-3.5.1.tgz mv Python-3.5.1.tgz /usr/local/src;cd /usr/local/src # 解压 tar -zxf Python-3.5.1.tgz;cd Python-3.5.1 # 编译安装 ./configure --prefix=/usr/local/python3.5 --enable-shared make && make install # 建立软链接 ln -s /usr/local/python3.5/bin/python3 /usr/bin/python3 echo "/usr/local/python3.5/lib" > /etc/ld.so.conf.d/python3.5.conf ldconfig # 验证python3 python3 # Python 3.5.1 (default, Oct 9 2016, 11:44:24) # [GCC 4.8.5 20150623 (Red Hat 4.8.5-4)] on linux # Type "help", "copyright", "credits" or "license" for more information. # >>> # pip /usr/local/python3.5/bin/pip3 install --upgrade pip ln -s /usr/local/python3.5/bin/pip /usr/bin/pip # 本人在安装时出现问题 将pip重装 wget https://bootstrap.pypa.io/get-pip.py --no-check-certificate python get-pip.py
1.1.2. The integrated environment anaconda
# 集成环境anaconda(推荐) wget https://repo.continuum.io/archive/Anaconda3-4.2.0-Linux-x86_64.sh # 直接安装即可 ./Anaconda3-4.2.0-Linux-x86_64.sh # 若出错,可能是解压失败 yum install bzip2
1.2. Install mariaDB
# 安装 yum -y install mariadb mariadb-server # 启动 systemctl start mariadb # 设置为开机启动 systemctl enable mariadb # 配置密码 默认为空 mysql_secure_installation # 登录 mysql -u root -p # 创建一个用户 自己设定账户密码 CREATE USER 'user_name'@'localhost' IDENTIFIED BY 'user_pass'; GRANT ALL PRIVILEGES ON *.* TO 'user_name'@'localhost' WITH GRANT OPTION; CREATE USER 'user_name'@'%' IDENTIFIED BY 'user_pass'; GRANT ALL PRIVILEGES ON *.* TO 'user_name'@'%' WITH GRANT OPTION;
1.3. Install pyspider
I use Anaconda
# 搭建虚拟环境sbird python版本3.* conda create -n sbird python=3* # 进入环境 source activate sbird # 安装pyspider pip install pyspider # 报错 # it does not exist. The exported locale is "en_US.UTF-8" but it is not supported # 执行 可写入.bashrc export LC_ALL=en_US.utf-8 export LANG=en_US.utf-8 #ImportError: pycurl: libcurl link-time version (7.29.0) is older than compile-time version (7.49.0) conda install pycurl # 退出 source deactivate sbird # 若在虚拟机内 出现无法访问localhost:5000 可关闭防火墙 systemctl stop firewalld.service #########直接运行源码============== mkdir git;cd git # 下载 git clone https://github.com/binux/pyspider.git # 安装 /root/anaconda3/envs/sbird/bin/python /root/git/pyspider/run.py
Other methods
# 搭建虚拟环境 pip install virtualenv mkdir python;cd python # 创建虚拟环境pyenv3 virtualenv -p /usr/bin/python3 pyenv3 # 进入虚拟环境 激活环境 cd pyenv3/ source ./bin/activate pip install pyspider # 若pycurl报错 yum install libcurl-devel # 继续 pip install pyspider # 关闭 deactivate
I recommend using anaconda to install
If An error occurred during the running of pyspider. Please refer to the anaconda installation section. At this point, visit localhost:5000 to see the page.
1.4.Install Supervisor
##
# 安装 yum install supervisor -y # 若无法检索 则添加阿里的epel源 vim /etc/yum.repos.d/epel.repo # 添加以下内容 [epel] name=Extra Packages for Enterprise Linux 7 - $basearch baseurl=http://mirrors.aliyun.com/epel/7/$basearch http://mirrors.aliyuncs.com/epel/7/$basearch failovermethod=priority enabled=1 gpgcheck=0 gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-EPEL-7 [epel-debuginfo] name=Extra Packages for Enterprise Linux 7 - $basearch - Debug baseurl=http://mirrors.aliyun.com/epel/7/$basearch/debug http://mirrors.aliyuncs.com/epel/7/$basearch/debug failovermethod=priority enabled=0 gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-EPEL-7 gpgcheck=0 [epel-source] name=Extra Packages for Enterprise Linux 7 - $basearch - Source baseurl=http://mirrors.aliyun.com/epel/7/SRPMS http://mirrors.aliyuncs.com/epel/7/SRPMS failovermethod=priority enabled=0 gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-EPEL-7 gpgcheck=0 # 安装 yum install supervisor -y # 测试是否安装成功 echo_supervisord_conf
supervisord #supervisor的服务器端部分 启动 supervisorctl #启动supervisor的命令行窗口 # 假设创建进程pyspider01 vim /etc/supervisord.d/pyspider01.ini # 写入以下内容 [program:pyspider01] command = /root/anaconda3/envs/sbird/bin/python /root/git/pyspider/run.py directory = /root/git/pyspider user = root process_name = %(program_name)s autostart = true autorestart = true startsecs = 3 redirect_stderr = true stdout_logfile_maxbytes = 500MB stdout_logfile_backups = 10 stdout_logfile = /pyspider/supervisor/pyspider01.log # 重载 supervisorctl reload # 启动 supervisorctl start pyspider01 # 也可这样启动 supervisord -c /etc/supervisord.conf # 查看状态 supervisorctl status # output pyspider01 RUNNING pid 4026, uptime 0:02:40 # 关闭 supervisorctl shutdown
# 消息队列采用redis mkdir download;cd download wget http://download.redis.io/releases/redis-3.2.4.tar.gz tar xzf redis-3.2.4.tar.gz cd redis-3.2.4 make # 或者直接yum安装 yum -y install redis # 启动 systemctl start redis.service # 重启 systemctl restart redis.service # 停止 systemctl stop redis.service # 查看状态 systemctl status redis.service # 更改文件/etc/redis.conf vim /etc/redis.conf # 更改内容 daemonize no 改为 daemonize yes bind 127.0.0.1 改为 bind 10.211.55.22(当前服务器ip) # 重启redis systemctl restart redis.service
1.6. About self-start
# Supervisor添加到自启动服务 systemctl enable supervisord.service # redis添加到自启动服务 systemctl enable redis.service # 关闭防火墙自启动 systemctl disable firewalld.service
2. Distributed deployment
Name the server you just configured centos01. According to this configuration, deploy two centos02 and centos03 respectively. As follows:Server name ip descriptioncentos01 10.211.55.22 redis,mariaDB, scheduler centos02 10.211.55.23 fetcher, processor, result_worker,phantomjs centos03 10.211.55.24 fetcher, processor,,result_worker,webui
configuration file /pyspider/config.json
##
{ "taskdb": "mysql+taskdb://user_name:user_pass@10.211.55.22:3306/taskdb", "projectdb": "mysql+projectdb://user_name:user_pass@10.211.55.22:3306/projectdb", "resultdb": "mysql+resultdb://user_name:user_pass@10.211.55.22:3306/resultdb", "message_queue": "redis://10.211.55.22:6379/db", "logging-config": "/pyspider/logging.conf", "phantomjs-proxy":"10.211.55.23:25555", "webui": { "username": "", "password": "", "need-auth": false, "host":"10.211.55.24", "port":"5000", "scheduler-rpc":"http:// 10.211.55.22:5002", "fetcher-rpc":"http://10.211.55.23:5001" }, "fetcher": { "xmlrpc":true, "xmlrpc-host": "0.0.0.0", "xmlrpc-port": "5001" }, "scheduler": { "xmlrpc":true, "xmlrpc-host": "0.0.0.0", "xmlrpc-port": "5002" } }
and try to run:
/root/anaconda3/envs/sbird/bin/python /root/git/pyspider/run.py -c /pyspider/config.json scheduler # 报错 ImportError: No module named 'mysql' # 下载 mysql-connector-python cd ~/git/ git clone https://github.com/mysql/mysql-connector-python.git # 安装 source activate sbird cd mysql-connector-python python setup.py install # 安装redis pip install redis source deactivate # 运行 /root/anaconda3/envs/sbird/bin/python /root/git/pyspider/run.py -c /pyspider/config.json scheduler # 输出 ok [I 161010 15:57:25 scheduler:644] scheduler starting... [I 161010 15:57:25 scheduler:779] scheduler.xmlrpc listening on 0.0.0.0:5002 [I 161010 15:57:25 scheduler:583] in 5m: new:0,success:0,retry:0,failed:0
After successful operation, you can directly change /etc/supervisord.d/pyspider01.ini as follows:
[program:pyspider01] command = /root/anaconda3/envs/sbird/bin/python /root/git/pyspider/run.py -c /pyspider/config.json scheduler directory = /root/git/pyspider user = root process_name = %(program_name)s autostart = true autorestart = true startsecs = 3 redirect_stderr = true stdout_logfile_maxbytes = 500MB stdout_logfile_backups = 10 stdout_logfile = /pyspider/supervisor/pyspider01.log # 重载 supervisorctl reload # 查看状态 supervisorctl status
centos01 deployment complete. 2.2.centos02
In centos02, you need to run result_worker, processor, phantomjs, and fetcher
to create files respectively:
/etc/supervisord.d/result_worker.ini [program:result_worker] command = /root/anaconda3/envs/sbird/bin/python /root/git/pyspider/run.py -c /pyspider/config.json result_worker directory = /root/git/pyspider user = root process_name = %(program_name)s autostart = true autorestart = true startsecs = 3 redirect_stderr = true stdout_logfile_maxbytes = 500MB stdout_logfile_backups = 10 stdout_logfile = /pyspider/supervisor/result_worker.log /etc/supervisord.d/processor.ini [program:processor] command = /root/anaconda3/envs/sbird/bin/python /root/git/pyspider/run.py -c /pyspider/config.json processor directory = /root/git/pyspider user = root process_name = %(program_name)s autostart = true autorestart = true startsecs = 3 redirect_stderr = true stdout_logfile_maxbytes = 500MB stdout_logfile_backups = 10 stdout_logfile = /pyspider/supervisor/processor.log /etc/supervisord.d/phantomjs.ini [program:phantomjs] command = /pyspider/phantomjs --config=/pyspider/pjsconfig.json /pyspider/phantomjs_fetcher.js 25555 directory = /root/git/pyspider user = root process_name = %(program_name)s autostart = true autorestart = true startsecs = 3 redirect_stderr = true stdout_logfile_maxbytes = 500MB stdout_logfile_backups = 10 stdout_logfile = /pyspider/supervisor/phantomjs.log /etc/supervisord.d/fetcher.ini [program:fetcher] command = /root/anaconda3/envs/sbird/bin/python /root/git/pyspider/run.py -c /pyspider/config.json fetcher directory = /root/git/pyspider user = root process_name = %(program_name)s autostart = true autorestart = true startsecs = 3 redirect_stderr = true stdout_logfile_maxbytes = 500MB stdout_logfile_backups = 10 stdout_logfile = /pyspider/supervisor/fetcher.log
Create pjsconfig.json in the pyspider directory
{ /*--ignore-ssl-errors=true */ "ignoreSslErrors": true, /*--ssl-protocol=true */ "sslprotocol": "any", /* Same as: --output-encoding=utf8 */ "outputEncoding": "utf8", /* persistent Cookies. */ /*cookiesfile="e:/phontjscookies.txt",*/ cookiesfile="pyspider/phontjscookies.txt", /* load image */ autoLoadImages = false }
Download phantomjs to the /pyspider/ folder and add git/pyspider/pyspider/ fetcher/phantomjs_fetcher.js is copied to phantomjs_fetcher.js##
# 重载 supervisorctl reload # 查看状态 supervisorctl status # output fetcher RUNNING pid 3446, uptime 0:00:07 phantomjs RUNNING pid 3448, uptime 0:00:07 processor RUNNING pid 3447, uptime 0:00:07 result_worker RUNNING pid 3445, uptime 0:00:07
centos02 is deployed.
2.3.centos03
The deployment of these three processes fetcher, processor, result_worker is the same as centos02. This server mainly adds webui on the basis of the previous ones.
Create file:/etc/supervisord.d/webui.ini [program:webui] command = /root/anaconda3/envs/sbird/bin/python /root/git/pyspider/run.py -c /pyspider/config.json webui directory = /root/git/pyspider user = root process_name = %(program_name)s autostart = true autorestart = true startsecs = 3 redirect_stderr = true stdout_logfile_maxbytes = 500MB stdout_logfile_backups = 10 stdout_logfile = /pyspider/supervisor/webui.log # 重载 supervisorctl reload # 查看状态 supervisorctl status # output fetcher RUNNING pid 2724, uptime 0:00:07 processor RUNNING pid 2725, uptime 0:00:07 result_worker RUNNING pid 2723, uptime 0:00:07 webui RUNNING pid 2726, uptime 0:00:07
3. Summary
1.
Python Free video tutorial
2. Python learning manual
3. Python object-oriented video tutorial
The above is the detailed content of Powerful web crawler system: pyspider. For more information, please follow other related articles on the PHP Chinese website!

Hot AI Tools

Undresser.AI Undress
AI-powered app for creating realistic nude photos

AI Clothes Remover
Online AI tool for removing clothes from photos.

Undress AI Tool
Undress images for free

Clothoff.io
AI clothes remover

AI Hentai Generator
Generate AI Hentai for free.

Hot Article

Hot Tools

Notepad++7.3.1
Easy-to-use and free code editor

SublimeText3 Chinese version
Chinese version, very easy to use

Zend Studio 13.0.1
Powerful PHP integrated development environment

Dreamweaver CS6
Visual web development tools

SublimeText3 Mac version
God-level code editing software (SublimeText3)

Hot Topics



Since I have recently started to be responsible for the construction and stability operation and maintenance of object storage-related systems, as a novice in "object storage", I need to strengthen my learning in this area. Since the company currently uses MinIO to build the company's object storage system, I will gradually share my learning experience about MinIO in the future. Everyone is welcome to continue to pay attention. This article mainly introduces how to set up MinIO in a test environment, which is also the most basic step in building a MinIO learning environment. 1. Prepare the experimental environment using OracleVMVirtualBox virtual machine, install a minimum version of Linux, and then add 4 virtual disks to serve as MinIO virtual disks. The experimental environment is as follows: Next, let me briefly introduce it to you

When loading CentOS-7.0-1406, there are many optional versions. For ordinary users, they don’t know which one to choose. Here is a brief introduction: (1) CentOS-xxxx-LiveCD.ios and CentOS-xxxx- What is the difference between bin-DVD.iso? The former only has 700M, and the latter has 3.8G. The difference is not only in size, but the more essential difference is that CentOS-xxxx-LiveCD.ios can only be loaded into the memory and run, and cannot be installed. Only CentOS-xxx-bin-DVD1.iso can be installed on the hard disk. (2) CentOS-xxx-bin-DVD1.iso, Ce

Open the centos7 page and appear: welcome to emergency mode! afterloggingin, type "journalctl -xb" to viewsystemlogs, "systemctlreboot" toreboot, "systemctldefault" to tryagaintobootintodefaultmode. giverootpasswordformaintenance(??Control-D???): Solution: execute r

There is a lot of garbage in the tmp directory in the centos7 system. If you want to clear the garbage, how should you do it? Let’s take a look at the detailed tutorial below. To view the list of files in the tmp file directory, execute the command cdtmp/ to switch to the current file directory of tmp, and execute the ll command to view the list of files in the current directory. As shown below. Use the rm command to delete files. It should be noted that the rm command deletes files from the system forever. Therefore, it is recommended that when using the rm command, it is best to give a prompt before deleting the file. Use the command rm-i file name, wait for the user to confirm deletion (y) or skip deletion (n), and the system will perform corresponding operations. As shown below.

Set password rules for security reasons Set the number of days after which passwords expire. User must change password within days. This setting only affects created users, not existing users. If setting to an existing user, run the command "chage -M (days) (user)". PASS_MAX_DAYS60#Password expiration time PASS_MIN_DAYS3#Initial password change time PASS_MIN_LEN8#Minimum password length PASS_WARN_AGE7#Password expiration prompt time Repeat password restriction use [root@linuxprobe~]#vi/etc/pam.d/system-auth#nearline15:

1.UncaughtError:Calltoundefinedfunctionmb_strlen(); When the above error occurs, it means that we have not installed the mbstring extension; 2. Enter the PHP installation directory cd/temp001/php-7.1.0/ext/mbstring 3. Start phpize(/usr/local/bin /phpize or /usr/local/php7-abel001/bin/phpize) command to install php extension 4../configure--with-php-config=/usr/local/php7-abel

Centos7 does not have a mysql database. The default database is mariadb (a branch of mysql). You can install the mysql database manually by following the steps below. 1. Download the rpm installation file wgethttp://repo.mysql.com/mysql-community-release-el7.rpm 2. Execute rpm to install rpm-ivhmysql-community-release-el7.rpm. After the dependency resolution is completed, the following options appear: dependenciesresolved =================================

1. The compressed folder is a zip file [root@cgls]#zip-rmydata.zipmydata2. Unzip mydata.zip into the mydatabak directory [root@cgls]#unzipmydata.zip-dmydatabak3.mydata01 folder and mydata02.txt are compressed into mydata.zip[root@cgls]#zipmydata.zipmydata01mydata02.txt4. Decompress the mydata.zip file directly [root@cgls]#unzipmydata.zip5. View myd
