I use multiple processes to crawl data and write it to a file. No error is reported while running, but the file is garbled when opened.
When I rewrite it with multi-threading there is no such problem and everything works normally.
The following is the code for writing data to a file:
import os
import codecs
import requests
from bs4 import BeautifulSoup

def Get_urls(start_page, end_page):
    print ' run task {} ({})'.format(start_page, os.getpid())
    # every worker process opens the same url.txt in append mode
    url_text = codecs.open('url.txt', 'a', 'utf-8')
    for i in range(start_page, end_page + 1):
        # baseurl, baseurl1, baseurl2, searchword and header are defined elsewhere in the script
        pageurl = baseurl1 + str(i) + baseurl2 + searchword
        response = requests.get(pageurl, headers=header)
        soup = BeautifulSoup(response.content, 'html.parser')
        a_list = soup.find_all('a')
        for a in a_list:
            if a.text != '' and 'wssd_content.jsp?bookid' in a['href']:
                text = a.text.strip()
                url = baseurl + str(a['href'])
                url_text.write(text + '\t' + url + '\n')
    url_text.close()
A process pool is used for the multiple processes:
import time
import multiprocessing

def Multiple_processes_test():
    t1 = time.time()
    print 'parent process {} '.format(os.getpid())
    # three workers, each crawling its own range of pages
    page_ranges_list = [(1, 3), (4, 6), (7, 9)]
    pool = multiprocessing.Pool(processes=3)
    for page_range in page_ranges_list:
        pool.apply_async(func=Get_urls, args=(page_range[0], page_range[1]))
    pool.close()
    pool.join()
    t2 = time.time()
    print 'time:', t2 - t1
As shown in the picture, the file is detected with the wrong encoding when opened, which suggests that what gets written from multiple processes is not valid UTF-8.
Here is the first line of the file:
Having several workers open and write to the same file is quite dangerous, and the chance of corruption is high. If the multi-threaded version shows no problem, that is most likely because the threads share one file object and the GIL serialises their writes; the processes have no such lock, each one holds its own buffered handle on url.txt, so their writes can interleave at arbitrary byte boundaries and split multi-byte UTF-8 sequences, which is why the file looks garbled.
It is recommended to switch to a producer-consumer model!
Like this
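A minimal sketch of what that could look like, reusing the imports and the baseurl/baseurl1/baseurl2/searchword/header variables from the question's script; the Get_urls_mp and Write_worker names, the Manager queue and the None sentinel are illustrative assumptions, not taken from the original post. The crawler processes only put (text, url) pairs on a queue, and a single writer process is the only one that touches url.txt:

def Get_urls_mp(start_page, end_page, queue):
    # producer: same crawling as Get_urls, but it puts the results
    # on the queue instead of writing to the file itself
    for i in range(start_page, end_page + 1):
        pageurl = baseurl1 + str(i) + baseurl2 + searchword
        response = requests.get(pageurl, headers=header)
        soup = BeautifulSoup(response.content, 'html.parser')
        for a in soup.find_all('a'):
            if a.text != '' and 'wssd_content.jsp?bookid' in a['href']:
                queue.put((a.text.strip(), baseurl + str(a['href'])))

def Write_worker(queue):
    # consumer: the only process that opens url.txt
    url_text = codecs.open('url.txt', 'a', 'utf-8')
    while True:
        item = queue.get()
        if item is None:  # sentinel: all producers have finished
            break
        url_text.write(item[0] + '\t' + item[1] + '\n')
    url_text.close()

def Multiple_processes_test():
    # a Manager queue can be handed to Pool workers; a plain
    # multiprocessing.Queue cannot be passed through apply_async
    queue = multiprocessing.Manager().Queue()
    writer = multiprocessing.Process(target=Write_worker, args=(queue,))
    writer.start()
    pool = multiprocessing.Pool(processes=3)
    for page_range in [(1, 3), (4, 6), (7, 9)]:
        pool.apply_async(Get_urls_mp, args=(page_range[0], page_range[1], queue))
    pool.close()
    pool.join()      # wait for the producers
    queue.put(None)  # then tell the writer to stop
    writer.join()

The page ranges can still finish in any order, but because exactly one process writes to url.txt, every line is written whole and the file stays valid UTF-8.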
Results
foo /4/wssd_content.jsp?bookid
foo /5/wssd_content.jsp?bookid
foo /6/wssd_content.jsp?bookid
foo /1/wssd_content.jsp?bookid
foo /2/wssd_content.jsp?bookid
foo /3/wssd_content.jsp?bookid
foo /7/wssd_content.jsp?bookid
foo /8/wssd_content.jsp?bookid
foo /9/wssd_content.jsp?bookid