This article introduces how to crawl web pages with multiple threads in Python. Based on a concrete example, it analyzes the relevant techniques and caveats of Python multi-threaded programming and includes a demo that downloads pages in parallel. It is shared here for your reference; the details are as follows.
Recently I have been working on things related to web crawlers. I took a look at larbin, an open-source crawler written in C++, and read carefully through its design ideas and the implementation of some key techniques. A few points that stood out:
1. Larbin's URL deduplication uses a very efficient Bloom filter algorithm (a toy sketch follows this list);
2. DNS resolution uses adns, an asynchronous open-source component;
3. The URL queue is handled with a strategy of keeping part of it in memory and writing the rest to files.
4. Larbin has done a lot of work on file-related operations.
5. Larbin maintains a connection pool: it creates sockets, sends HTTP GET requests to the target sites, fetches the content, and then parses the headers.
6. It handles a large number of descriptors with I/O multiplexing via poll, which is very efficient.
7. Larbin is highly configurable.
8. Most of the data structures are the author's own, built from the ground up, with essentially no use of things like the STL.
...
There are many more. I will write an article and summarize them when I have time in the future.
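As a small illustration of item 1 above, a Bloom filter for URL deduplication can be sketched in a few lines of Python. This is only a toy for illustration, not larbin's implementation; the bit-array size and the md5-based hashing scheme are arbitrary choices of mine.

import hashlib

class BloomFilter(object):
    '''Toy Bloom filter for URL deduplication (illustration only, not larbin's code).'''
    def __init__(self, size_in_bits=1 << 24, num_hashes=4):
        # num_hashes must be <= 4 here, since we slice 4 positions out of one md5 digest
        self.size = size_in_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_in_bits // 8)
    def _positions(self, url):
        # derive num_hashes bit positions from a single md5 digest
        digest = hashlib.md5(url).hexdigest()
        for i in range(self.num_hashes):
            yield int(digest[i * 8:(i + 1) * 8], 16) % self.size
    def add(self, url):
        for pos in self._positions(url):
            self.bits[pos // 8] |= 1 << (pos % 8)
    def __contains__(self, url):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(url))

# usage: only schedule URLs that have (probably) not been seen before
seen = BloomFilter()
url = 'http://example.com/index.html'
if url not in seen:
    seen.add(url)

False positives are possible (a new URL may occasionally be reported as already seen), but there are no false negatives, which is the trade-off a crawler's URL deduplication can live with.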
In the past two days I wrote a multi-threaded page download program in Python. For an I/O-intensive application, multi-threading is clearly a good fit, and the thread pool I wrote earlier can be reused here. In fact, crawling a page in Python is very simple: the urllib2 module is convenient to use, and the job takes basically two or three lines of code. But while third-party modules solve the problem conveniently, they contribute little to personal technical growth, because the key algorithms are implemented by someone else rather than by you, and many of the details remain invisible to you. As technical people, we should not just call modules or APIs written by others; implementing things ourselves is how we learn more.
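For reference, the "two or three lines" with urllib2 look roughly like this (the URL is just a placeholder; the 5-second timeout mirrors the choice discussed below):

import urllib2
html = urllib2.urlopen('http://www.example.com/', timeout=5).read()  # fetch one page
print len(html)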
I decided to start from the socket level, wrapping the HTTP GET request and parsing the response headers myself. This also lets DNS resolution be handled separately (with a DNS cache, for example), so writing it myself keeps things more controllable and easier to extend. For timeouts I use a global 5-second timeout; for redirects (301 or 302) I allow at most 3 hops, because during earlier testing I found that many sites redirect to themselves, which creates an infinite loop, so an upper limit is needed. The principle is fairly simple; just look at the code.
After finishing it, I compared its performance with urllib2 and found my own version somewhat more efficient, while urllib2's error rate was slightly higher; I am not sure why. Some people online say urllib2 has minor problems in multi-threaded contexts, but I am not clear on the details.
First, here is the code:
fetchPage.py: download a page with the HTTP GET method and save it to a file.
'''
Created on 2012-3-13
Get Page using GET method
Default using HTTP Protocol , http port 80
@author: xiaojay
'''
import socket
import statistics   # the author's own bookkeeping module, see the sketch below
import datetime
import threading

socket.setdefaulttimeout(statistics.timeout)

class Error404(Exception):
    '''Can not find the page.'''
    pass

class ErrorOther(Exception):
    '''Some other exception'''
    def __init__(self, code):
        #print 'Code :', code
        pass

class ErrorTryTooManyTimes(Exception):
    '''try too many times'''
    pass

def downPage(hostname, filename, trytimes=0):
    try:
        # To avoid too many tries. Try times can not be more than max_try_times
        if trytimes >= statistics.max_try_times:
            raise ErrorTryTooManyTimes
    except ErrorTryTooManyTimes:
        return statistics.RESULTTRYTOOMANY, hostname + filename
    try:
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        # DNS cache
        if statistics.DNSCache.has_key(hostname):
            addr = statistics.DNSCache[hostname]
        else:
            addr = socket.gethostbyname(hostname)
            statistics.DNSCache[hostname] = addr
        # connect to http server, default port 80
        s.connect((addr, 80))
        msg = 'GET ' + filename + ' HTTP/1.0\r\n'
        msg += 'Host: ' + hostname + '\r\n'
        msg += 'User-Agent:xiaojay\r\n\r\n'
        code = ''
        f = None
        s.sendall(msg)
        first = True
        while True:
            msg = s.recv(40960)
            if not len(msg):
                if f != None:
                    f.flush()
                    f.close()
                break
            # Head information must be in the first recv buffer
            if first:
                first = False
                headpos = msg.index("\r\n\r\n")
                code, other = dealwithHead(msg[:headpos])
                if code == '200':
                    #statistics.fetched_url += 1
                    f = open('pages/' + str(abs(hash(hostname + filename))), 'w')
                    f.writelines(msg[headpos + 4:])
                elif code == '301' or code == '302':
                    # if code is 301 or 302, try down again using redirect location
                    if other.startswith("http"):
                        hname, fname = parse(other)
                        downPage(hname, fname, trytimes + 1)  # try again
                    else:
                        downPage(hostname, other, trytimes + 1)
                elif code == '404':
                    raise Error404
                else:
                    raise ErrorOther(code)
            else:
                if f != None: f.writelines(msg)
        s.shutdown(socket.SHUT_RDWR)
        s.close()
        return statistics.RESULTFETCHED, hostname + filename
    except Error404:
        return statistics.RESULTCANNOTFIND, hostname + filename
    except ErrorOther:
        return statistics.RESULTOTHER, hostname + filename
    except socket.timeout:
        return statistics.RESULTTIMEOUT, hostname + filename
    except Exception, e:
        return statistics.RESULTOTHER, hostname + filename

def dealwithHead(head):
    '''deal with HTTP HEAD'''
    lines = head.splitlines()
    fstline = lines[0]
    code = fstline.split()[1]
    if code == '404': return (code, None)
    if code == '200': return (code, None)
    if code == '301' or code == '302':
        for line in lines[1:]:
            p = line.index(':')
            key = line[:p]
            if key == 'Location':
                return (code, line[p + 2:])
    return (code, None)

def parse(url):
    '''Parse a url to hostname+filename'''
    try:
        u = url.strip().strip('\n').strip('\r').strip('\t')
        if u.startswith('http://'):
            u = u[7:]
        elif u.startswith('https://'):
            u = u[8:]
        if u.find(':80') > 0:
            p = u.index(':80')
            p2 = p + 3
        else:
            if u.find('/') > 0:
                p = u.index('/')
                p2 = p
            else:
                p = len(u)
                p2 = -1
        hostname = u[:p]
        if p2 > 0:
            filename = u[p2:]
        else:
            filename = '/'
        return hostname, filename
    except Exception, e:
        print "Parse wrong : ", url
        print e

def PrintDNSCache():
    '''print DNS dict'''
    n = 1
    for hostname in statistics.DNSCache.keys():
        print n, '\t', hostname, '\t', statistics.DNSCache[hostname]
        n += 1

def dealwithResult(res, url):
    '''Deal with the result of downPage'''
    statistics.total_url += 1
    if res == statistics.RESULTFETCHED:
        statistics.fetched_url += 1
        print statistics.total_url, '\t fetched :', url
    if res == statistics.RESULTCANNOTFIND:
        statistics.failed_url += 1
        print "Error 404 at : ", url
    if res == statistics.RESULTOTHER:
        statistics.other_url += 1
        print "Error Undefined at : ", url
    if res == statistics.RESULTTIMEOUT:
        statistics.timeout_url += 1
        print "Timeout ", url
    if res == statistics.RESULTTRYTOOMANY:
        statistics.trytoomany_url += 1
        print "Try too many times at", url

if __name__ == '__main__':
    print 'Get Page using GET method'
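Note that the statistics imported above is the author's own bookkeeping module, not the standard-library one, and it is not included in the post. A plausible minimal version, reconstructed purely from the attributes the code uses and from the prose (global 5-second timeout, at most 3 redirect attempts), might look like this; the concrete values of the RESULT* codes are arbitrary as long as they are distinct:

'''statistics.py -- assumed shared constants and counters (a reconstruction, not the original).'''

# configuration (values taken from the prose above)
timeout = 5            # global socket timeout in seconds
max_try_times = 3      # maximum number of redirect attempts

# result codes returned by downPage (arbitrary distinct values)
RESULTFETCHED = 0
RESULTCANNOTFIND = 1
RESULTOTHER = 2
RESULTTIMEOUT = 3
RESULTTRYTOOMANY = 4

# shared state
DNSCache = {}          # hostname -> resolved IP address

# counters updated by dealwithResult / writeFile
total_url = 0
fetched_url = 0
failed_url = 0
other_url = 0
timeout_url = 0
trytoomany_url = 0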
Below, I use the thread pool from the previous article to implement parallel crawling with multiple threads, and compare the performance of the page-download method I wrote above against urllib2.
'''
Created on 2012-3-16
@author: xiaojay
'''
import fetchPage
import threadpool
import datetime
import statistics
import urllib2

'''one thread'''
def usingOneThread(limit):
    urlset = open("input.txt", "r")
    start = datetime.datetime.now()
    for u in urlset:
        if limit <= 0: break
        limit -= 1
        hostname, filename = fetchPage.parse(u)
        res = fetchPage.downPage(hostname, filename, 0)
        fetchPage.dealwithResult(res[0], res[1])
    end = datetime.datetime.now()
    print "Start at :\t", start
    print "End at :\t", end
    print "Total Cost :\t", end - start
    print 'Total fetched :', statistics.fetched_url

'''threadpool and GET method'''
def callbackfunc(request, result):
    fetchPage.dealwithResult(result[0], result[1])

def usingThreadpool(limit, num_thread):
    urlset = open("input.txt", "r")
    start = datetime.datetime.now()
    main = threadpool.ThreadPool(num_thread)
    for url in urlset:
        try:
            hostname, filename = fetchPage.parse(url)
            req = threadpool.WorkRequest(fetchPage.downPage, args=[hostname, filename], kwds={}, callback=callbackfunc)
            main.putRequest(req)
        except Exception, e:
            print e
    while True:
        try:
            main.poll()
            if statistics.total_url >= limit: break
        except threadpool.NoResultsPending:
            print "no pending results"
            break
        except Exception, e:
            print e
    end = datetime.datetime.now()
    print "Start at :\t", start
    print "End at :\t", end
    print "Total Cost :\t", end - start
    print 'Total url :', statistics.total_url
    print 'Total fetched :', statistics.fetched_url
    print 'Lost url :', statistics.total_url - statistics.fetched_url
    print 'Error 404 :', statistics.failed_url
    print 'Error timeout :', statistics.timeout_url
    print 'Error Try too many times ', statistics.trytoomany_url
    print 'Error Other faults ', statistics.other_url
    main.stop()

'''threadpool and urllib2 '''
def downPageUsingUrlib2(url):
    try:
        req = urllib2.Request(url)
        fd = urllib2.urlopen(req)
        f = open("pages3/" + str(abs(hash(url))), 'w')
        f.write(fd.read())
        f.flush()
        f.close()
        return url, 'success'
    except Exception:
        return url, None

def writeFile(request, result):
    statistics.total_url += 1
    if result[1] != None:
        statistics.fetched_url += 1
        print statistics.total_url, '\tfetched :', result[0],
    else:
        statistics.failed_url += 1
        print statistics.total_url, '\tLost :', result[0],

def usingThreadpoolUrllib2(limit, num_thread):
    urlset = open("input.txt", "r")
    start = datetime.datetime.now()
    main = threadpool.ThreadPool(num_thread)
    for url in urlset:
        try:
            req = threadpool.WorkRequest(downPageUsingUrlib2, args=[url], kwds={}, callback=writeFile)
            main.putRequest(req)
        except Exception, e:
            print e
    while True:
        try:
            main.poll()
            if statistics.total_url >= limit: break
        except threadpool.NoResultsPending:
            print "no pending results"
            break
        except Exception, e:
            print e
    end = datetime.datetime.now()
    print "Start at :\t", start
    print "End at :\t", end
    print "Total Cost :\t", end - start
    print 'Total url :', statistics.total_url
    print 'Total fetched :', statistics.fetched_url
    print 'Lost url :', statistics.total_url - statistics.fetched_url
    main.stop()

if __name__ == '__main__':
    '''too slow'''
    #usingOneThread(100)
    '''use Get method'''
    #usingThreadpool(3000,50)
    '''use urllib2'''
    usingThreadpoolUrllib2(3000, 50)
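The threadpool module imported above is the thread pool from the previous article and is not reproduced in this post (its interface also matches the well-known third-party threadpool package). For readers who only have this article, here is a rough stand-in with the same class and method names (ThreadPool, WorkRequest, putRequest, poll, stop, NoResultsPending), built on Queue and threading; it is an assumed sketch, not the author's implementation, and is only meant to let the driver code above run.

'''threadpool.py -- assumed minimal stand-in, not the author's original thread pool.'''
import threading
import Queue

class NoResultsPending(Exception):
    '''Raised by poll() once every submitted request has been handled.'''
    pass

class WorkRequest(object):
    '''A unit of work: a callable plus its arguments and an optional callback.'''
    def __init__(self, callable_, args=None, kwds=None, callback=None):
        self.callable = callable_
        self.args = args or []
        self.kwds = kwds or {}
        self.callback = callback

class ThreadPool(object):
    def __init__(self, num_workers):
        self.requests = Queue.Queue()   # pending WorkRequest objects
        self.results = Queue.Queue()    # (request, result) pairs from workers
        self.pending = 0
        self.lock = threading.Lock()
        self.workers = []
        for _ in range(num_workers):
            t = threading.Thread(target=self._work)
            t.setDaemon(True)
            t.start()
            self.workers.append(t)

    def _work(self):
        # worker loop: take a request, run it, queue the result
        while True:
            request = self.requests.get()
            if request is None:         # shutdown sentinel
                break
            result = request.callable(*request.args, **request.kwds)
            self.results.put((request, result))

    def putRequest(self, request):
        with self.lock:
            self.pending += 1
        self.requests.put(request)

    def poll(self):
        '''Run the callback of one finished request in the calling thread,
        or raise NoResultsPending when nothing is left.'''
        with self.lock:
            if self.pending == 0:
                raise NoResultsPending
        try:
            request, result = self.results.get(timeout=1)
        except Queue.Empty:
            return
        if request.callback:
            request.callback(request, result)
        with self.lock:
            self.pending -= 1

    def stop(self):
        # wake every worker with a sentinel so the threads exit
        for _ in self.workers:
            self.requests.put(None)

With this stand-in, callbacks run in the thread that calls poll(), so the counters in statistics are only updated from the main loop, which is why the driver code above can get away without locking around them.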
Experimental analysis:
Experimental data: 3,000 URLs captured by larbin and processed with the Mercator queue model (which I implemented in C++; I will write a blog post about it when I get the chance). The URL collection is random and representative. A thread pool of 50 threads is used.
Experimental environment: Ubuntu 10.04, good network connection, Python 2.6
Storage: small files, one file per page
PS: Because the school charges for Internet access by traffic, web crawling eats into my regular traffic allowance! In a few days I may run a large-scale download experiment with hundreds of thousands of URLs.
Experimental results:
Using urllib2: usingThreadpoolUrllib2(3000,50)
Start at : 2012-03-16 22:18:20.956054
End at : 2012-03-16 22:22:15.203018
Total Cost : 0:03:54.246964
Total url : 3001
Total fetched : 2442
Lost url: 559
Physical storage size of downloaded pages: 84088 KB
Using my own GET downloader (downPage): usingThreadpool(3000,50)
Start at : 2012-03-16 22:23:40.206730
End at : 2012-03-16 22:26:26.843563
Total Cost : 0:02:46.636833
Total url : 3002
Total fetched : 2484
Lost url : 518
Error 404 : 94
Error timeout : 312
Error Try too many times 0
Error Other faults 112
Physical storage size of downloaded pages: 87168 KB
Summary: the download program I wrote myself is quite efficient and loses fewer pages. Still, there is plenty of room for optimization. For example, the files are too scattered: creating and releasing so many small files certainly costs performance, and naming files by hash adds extra computation; with a good strategy these costs could largely be avoided. For DNS, there is no need to rely on Python's built-in resolution, because it is a synchronous operation and DNS lookups are generally time-consuming; resolving asynchronously with multiple threads, combined with a suitable DNS cache, can improve efficiency considerably. Beyond that, a real crawl involves huge numbers of URLs that cannot all be held in memory at once, so they should be distributed according to a sensible strategy or algorithm. In short, there is still a lot to do, and a lot that can be optimized, in the page-collection stage.
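As one possible shape for the DNS suggestion above, the sketch below resolves hostnames ahead of time in a few helper threads and keeps the answers in a shared dictionary, so the download threads rarely block inside gethostbyname. The names dns_cache and prefetch_dns are mine, not part of the code above.

import socket
import threading
import Queue

dns_cache = {}                 # hostname -> resolved IP address
dns_lock = threading.Lock()

def prefetch_dns(hostnames, num_threads=10):
    '''Resolve a batch of hostnames concurrently and fill dns_cache.'''
    work = Queue.Queue()
    for h in hostnames:
        work.put(h)

    def resolver():
        while True:
            try:
                host = work.get_nowait()
            except Queue.Empty:
                return                  # no more hostnames to resolve
            try:
                addr = socket.gethostbyname(host)
            except socket.error:
                continue                # skip hosts that fail to resolve
            with dns_lock:
                dns_cache[host] = addr

    threads = [threading.Thread(target=resolver) for _ in range(num_threads)]
    for t in threads:
        t.setDaemon(True)
        t.start()
    for t in threads:
        t.join()

# downPage could then look in dns_cache first and only fall back to a blocking
# gethostbyname for hostnames that were not prefetched.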