1 master and 2 slaves, all running in Ubuntu virtual machines. The configuration is as follows:
master's config.json:
{
  "taskdb": "mysql+taskdb://pyspider:pyspider-pass@192.168.209.128:3306/taskdb",
  "projectdb": "mysql+projectdb://pyspider:pyspider-pass@192.168.209.128:3306/projectdb",
  "resultdb": "mysql+resultdb://pyspider:pyspider-pass@192.168.209.128:3306/resultdb",
  "message_queue": "redis://192.168.209.128:6379/db",
  "phantomjs-proxy": "192.168.209.128:25555",
  "scheduler": {
    "xmlrpc-host": "0.0.0.0",
    "delete-time": 10
  },
  "webui": {
    "port": 5555,
    "username": "",
    "password": "",
    "need-auth": false
  }
}
Run on the master host:
/usr/local/bin/pyspider -c /home/pu/pyspider/conf.json scheduler
/usr/local/bin/pyspider -c /home/pu/pyspider/conf.json webui
/usr/local/bin/pyspider -c /home/pu/pyspider/conf.json phantomjs
slave's config.json:
{
  "taskdb": "mysql+taskdb://pyspider:pyspider-pass@192.168.209.128:3306/taskdb",
  "projectdb": "mysql+projectdb://pyspider:pyspider-pass@192.168.209.128:3306/projectdb",
  "resultdb": "mysql+resultdb://pyspider:pyspider-pass@192.168.209.128:3306/resultdb",
  "message_queue": "redis://192.168.209.128:6379/db",
  "phantomjs-proxy": "192.168.209.128:25555",
  "fetcher": {
    "xmlrpc-host": "192.168.209.128"
  }
}
Run on each of the two slave machines:
/usr/local/bin/pyspider -c /home/pu/pyspider/conf.json fetcher
/usr/local/bin/pyspider -c /home/pu/pyspider/conf.json processor
/usr/local/bin/pyspider -c /home/pu/pyspider/conf.json result_worker
Three terminals on each machine.
I ran everything from the command line first and haven't used Supervisor to manage the processes yet; I plan to switch to it once the distributed debugging works (a sketch of such a config is below). Using the command line just means opening a few more terminals. But here is the strange part: the crawler runs fine, yet the time taken on a single machine is essentially the same as when the three machines run together, only a few seconds apart. Can you please explain why?
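For later reference, a minimal sketch of one supervisord program section, assuming pyspider lives at /usr/local/bin/pyspider and the conf.json path used above; the program name and log paths are illustrative, and a similar section would be repeated for each component on each machine:

[program:pyspider-scheduler]
command=/usr/local/bin/pyspider -c /home/pu/pyspider/conf.json scheduler
directory=/home/pu/pyspider
autostart=true
autorestart=true
stdout_logfile=/var/log/pyspider/scheduler.out.log
stderr_logfile=/var/log/pyspider/scheduler.err.log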
Looking at the terminal output, the URLs fetched by the two slaves do not overlap, but their activity alternates in time: for example, slave1 works for about 4 seconds, then slave2 for about 3 seconds. They are not running in parallel but in turn, which is strange. Could it be that the scheduler hands out tasks one by one, so they cannot be taken at the same time?
The crawl speed is controlled by the rate/burst setting in the dashboard. Whether or not you run distributed, as long as the rate is set to the same value the crawl takes roughly the same time. Only when a single machine's hardware is the bottleneck, i.e. it cannot reach the rate you set, will adding machines make it faster. At least, that is my personal understanding.
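To illustrate the point, here is a simplified sketch (not pyspider's actual scheduler code; RATE, TASKS and the worker counts are made-up values): with a fixed dispatch rate, the total time is roughly TASKS / RATE no matter how many fetchers consume the queue, which also explains why the slaves appear to take turns.

# Simplified sketch: a shared rate limit caps throughput regardless of worker count.
import time

RATE = 2.0    # tasks released per second by the "scheduler" (assumed value)
TASKS = 10    # number of URLs to crawl (assumed value)

def run(workers):
    start = time.time()
    next_allowed = start
    for task in range(TASKS):
        now = time.time()
        if now < next_allowed:
            time.sleep(next_allowed - now)   # wait until the rate limit allows the next task
        next_allowed = max(next_allowed, time.time()) + 1.0 / RATE
        _ = task % workers                   # which fetcher receives the task doesn't change the pace
    return time.time() - start

print("1 fetcher :", round(run(1), 1), "s")
print("3 fetchers:", round(run(3), 1), "s")  # roughly the same total time, about TASKS / RATE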