python - pyspider runs successfully in distributed mode with 2 slaves, but the crawl time is not shorter?
滿天的星座 2017-06-28 09:22:42

I have 1 master and 2 slaves, all Ubuntu virtual machines, configured as follows:

master’s config.json

{
  "taskdb": "mysql+taskdb://pyspider:pyspider-pass@192.168.209.128:3306/taskdb",
  "projectdb": "mysql+projectdb://pyspider:pyspider-pass@192.168.209.128:3306/projectdb",
  "resultdb": "mysql+resultdb://pyspider:pyspider-pass@192.168.209.128:3306/resultdb",
  "message_queue": "redis://192.168.209.128:6379/db",
  "phantomjs-proxy": "192.168.209.128:25555",
  "scheduler":{
     "xmlrpc-host":"0.0.0.0",
     "delete-time":10},
  "webui": {
    "port": 5555,
    "username": "",
    "password": "",
    "need-auth": false}
}

Run on the master:

/usr/local/bin/pyspider -c /home/pu/pyspider/conf.json scheduler
/usr/local/bin/pyspider -c /home/pu/pyspider/conf.json webui
/usr/local/bin/pyspider -c /home/pu/pyspider/conf.json phantomjs

slave’s config.json:

{
  "taskdb": "mysql+taskdb://pyspider:pyspider-pass@192.168.209.128:3306/taskdb",
  "projectdb": "mysql+projectdb://pyspider:pyspider-pass@192.168.209.128:3306/projectdb",
  "resultdb": "mysql+resultdb://pyspider:pyspider-pass@192.168.209.128:3306/resultdb",
  "message_queue": "redis://192.168.209.128:6379/db",
  "phantomjs-proxy": "192.168.209.128:25555",
  "fetcher":{"xmlrpc-host":"192.168.209.128"}
}
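Since every component has to talk to the same databases and the same message queue, one quick sanity check is to compare those keys across the two files. This is only a minimal sketch: `mismatched_keys` and `SHARED_KEYS` are hypothetical helper names, not part of pyspider, and the inline JSON strings stand in for the real files.

```python
import json

# Keys that every pyspider component in this setup is expected to share.
SHARED_KEYS = ("taskdb", "projectdb", "resultdb", "message_queue")


def mismatched_keys(master_cfg, slave_cfg, keys=SHARED_KEYS):
    """Return the shared keys whose values differ between the two configs."""
    return [k for k in keys if master_cfg.get(k) != slave_cfg.get(k)]


if __name__ == "__main__":
    # Inline stand-ins for the two files; in practice load each with json.load().
    master = json.loads('{"message_queue": "redis://192.168.209.128:6379/db"}')
    slave = json.loads('{"message_queue": "redis://192.168.209.128:6379/db"}')
    print(mismatched_keys(master, slave))
```

An empty list means the shared settings agree, so a mismatch there can probably be ruled out as the cause.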

Run on each of the two slave machines:

/usr/local/bin/pyspider -c /home/pu/pyspider/conf.json fetcher
/usr/local/bin/pyspider -c /home/pu/pyspider/conf.json processor
/usr/local/bin/pyspider -c /home/pu/pyspider/conf.json result_worker

(One terminal per command, so three terminals on each machine.)

For now I start everything from the command line, one terminal per process. I have not set up Supervisor yet; I plan to use it to manage the processes once the distributed setup is debugged. The strange part is that the crawler runs without problems, but the total run time on a single machine is the same as with all three machines running together, give or take a few seconds. Can anyone explain why?

Looking at the terminal output, the two slaves do fetch different URLs with no duplicates, but they take turns instead of working at the same time: slave1 runs for about 4 seconds, then slave2 runs for about 3 seconds, and so on. They run one after the other, not in parallel, which is strange. Could it be that the scheduler hands out tasks one at a time and cannot dispatch several at once?


All replies (1)
迷茫

The speed is controlled in the console (the project's rate setting in the web UI). Whether or not you run distributed, the crawl takes the same time as long as the rate is set to the same value. Distributing the work only makes it faster when hardware resources are insufficient, that is, when a single machine hits a bottleneck and cannot reach the rate you set. That is my personal understanding.
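To illustrate the point, here is a rough simulation of a rate-limited dispatcher feeding several fetchers. It is a simplified token-bucket model, not pyspider's actual scheduler code. The function name `crawl_time`, the fixed 0.5 seconds per page, and the task count are all assumptions made up for the example.

```python
import heapq


def crawl_time(n_tasks, rate, n_fetchers, fetch_seconds, burst=1):
    """Simulated seconds to finish n_tasks under a rate-limited dispatcher."""
    # Each fetcher is represented by the time at which it next becomes free.
    free_at = [0.0] * n_fetchers
    heapq.heapify(free_at)
    finished = 0.0
    for i in range(n_tasks):
        # The bucket starts with `burst` tokens, then refills at `rate` per second.
        released = max(0, i - burst + 1) / rate
        # The task goes to whichever fetcher becomes free first.
        start = max(released, heapq.heappop(free_at))
        end = start + fetch_seconds
        heapq.heappush(free_at, end)
        finished = max(finished, end)
    return finished


if __name__ == "__main__":
    for rate in (1, 10):
        for n_fetchers in (1, 3):
            total = crawl_time(100, rate, n_fetchers, fetch_seconds=0.5)
            print(f"rate={rate}/s fetchers={n_fetchers} total={total:.1f}s")
```

In this model, at 1 task per second one fetcher and three fetchers finish at the same time, because the dispatcher is the limit. At 10 tasks per second a single fetcher becomes the limit, and only then do the extra fetchers shorten the run.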
