Web crawler - how to use requests with Python's multiprocessing
阿神 2017-06-22 11:52:30

This is the single-process, sequential version of the code:

import requests,time,os,random

def img_down(url):
    with open("{}".format(str(random.random())+os.path.basename(url)),"wb") as fob:
        fob.write(requests.get(url).content)

urllist=[]
with open("urllist.txt","r+") as u:
    for a in u.readlines():
        urllist.append(a.strip())

s=time.clock()
for i in range(len(urllist)):
    img_down(urllist[i])
e=time.clock()

print ("time: %d" % (e-s))

This is the multi-process version:

from multiprocessing import Pool
import requests,os,time,random

def img_down(url):
    with open("{}".format(str(random.random())+os.path.basename(url)),"wb") as fob:
        fob.write(requests.get(url).content)

if __name__=="__main__":
    urllist=[]
    with open("urllist.txt","r+") as urlfob:
        for s in urlfob.readlines():
            urllist.append(s.strip())

    s=time.clock()
    p=Pool()
    for i in range(len(urllist)):
        p.apply_async(img_down,args=(urllist[i],))
    p.close()
    p.join()
    e=time.clock()
    
    print ("time: {}".format(e-s))

But there is almost no difference in elapsed time between the single-process and multi-process versions. I suspect the problem is that requests blocks on IO. Is my understanding correct? How should I modify the code so that multiprocessing actually speeds things up?
Thanks!


Replies (2)
phpcn_u1582

The bottleneck when writing the files is disk IO, not CPU, so parallelism does not help much here. You can try skipping the file writes and then compare the times again.
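
For comparison, a minimal sketch of that experiment (it reuses the same urllist.txt as in the question and simply discards the response bodies instead of writing them to disk):

import time
import requests

def fetch_only(url):
    # download but do not write to disk, so disk IO is excluded from the measurement
    requests.get(url).content

with open("urllist.txt") as f:
    urllist = [line.strip() for line in f if line.strip()]

start = time.perf_counter()
for url in urllist:
    fetch_only(url)
print("download-only time: {:.2f}s".format(time.perf_counter() - start))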

刘奇

Pool() without arguments creates os.cpu_count() or 1 worker processes. If you are on a single-core machine, or the core count cannot be detected, you end up with only one process.

That is probably the reason.
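
If that is the case, you can pass an explicit worker count. And since downloading is IO-bound rather than CPU-bound, a thread pool (multiprocessing.dummy, which has the same Pool API) is usually a simpler fit. A rough sketch, assuming the same img_down and urllist.txt as in the question; the pool size of 8 is just an illustrative choice:

from multiprocessing.dummy import Pool as ThreadPool  # thread-based, same API as Pool
import requests, os, random, time

def img_down(url):
    with open(str(random.random()) + os.path.basename(url), "wb") as fob:
        fob.write(requests.get(url).content)

if __name__ == "__main__":
    with open("urllist.txt") as f:
        urllist = [line.strip() for line in f if line.strip()]

    start = time.perf_counter()
    pool = ThreadPool(8)          # explicit worker count instead of the default
    pool.map(img_down, urllist)   # blocks until every download has finished
    pool.close()
    pool.join()
    print("time: {:.2f}s".format(time.perf_counter() - start))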
