An honest recommendation: 8 essential skills for Python crawler experts - Python Tutorial - php.cn

If you want to pick up web crawling quickly, Python is the language most worth learning. Python suits many scenarios, such as rapid web development, crawlers, and automated operations and maintenance. It can be used to build simple websites, automatic posting scripts, email sending and receiving scripts, and simple verification code recognition scripts.

Crawler development also involves many routines that get reused again and again. Here I summarize 8 essential skills that will save time and effort later and help you complete tasks efficiently.

1. Basic crawling of web pages

GET method

import urllib2
url = "http://www.baidu.com"
response = urllib2.urlopen(url)
print response.read()

POST method

import urllib
import urllib2
url = "http://abcde.com"
form = {'name':'abc','password':'1234'}
form_data = urllib.urlencode(form)
request = urllib2.Request(url,form_data)
response = urllib2.urlopen(request)
print response.read()
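
The snippets above target Python 2. As a rough sketch of the same GET and POST requests on Python 3, where urllib2 was merged into urllib.request and urlencode moved to urllib.parse, something like the following should work. The helper names build_get, build_post, and fetch are made up for illustration, and nothing is sent over the network until fetch is called.

```python
import urllib.parse
import urllib.request


def build_get(url):
    """Return a GET request object (nothing is sent yet)."""
    return urllib.request.Request(url)


def build_post(url, form):
    """Return a POST request whose body is the URL-encoded form."""
    # urlencode() returns str; Python 3 requires the request body to be bytes.
    body = urllib.parse.urlencode(form).encode("utf-8")
    # Supplying data switches the request method from GET to POST.
    return urllib.request.Request(url, data=body)


def fetch(request, timeout=10):
    """Send the request and return the raw response body as bytes."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()


post_request = build_post("http://abcde.com", {"name": "abc", "password": "1234"})
print(post_request.get_method(), post_request.data)
```

Calling fetch(build_get("http://www.baidu.com")) would then return the page body as bytes, which usually needs decoding before it is printed.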

2. Using a proxy IP

When developing crawlers, you will often find that your IP has been blocked, and then you need a proxy IP. The urllib2 module provides a ProxyHandler class that lets you route page requests through a proxy, as shown in the following code snippet:

import urllib2
proxy = urllib2.ProxyHandler({'http': '127.0.0.1:8087'})
opener = urllib2.build_opener(proxy)
urllib2.install_opener(opener)
response = urllib2.urlopen('http://www.baidu.com')
print response.read()
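
For comparison, here is a minimal Python 3 sketch of the same idea, assuming urllib.request in place of urllib2. The function name build_proxy_opener is hypothetical, and the proxy address is the one used above. Instead of installing the opener globally, this version returns it, so that only requests made through it use the proxy.

```python
import urllib.request


def build_proxy_opener(proxy_address):
    """Return an opener that routes HTTP and HTTPS traffic through one proxy."""
    handler = urllib.request.ProxyHandler({
        "http": proxy_address,
        "https": proxy_address,
    })
    # Passing our own ProxyHandler replaces the default, environment-based one.
    return urllib.request.build_opener(handler)


opener = build_proxy_opener("http://127.0.0.1:8087")
# opener.open("http://www.baidu.com", timeout=10).read() would go through the proxy.
```

Returning the opener keeps the proxy choice local, which makes it easier to switch proxies when one of them gets blocked.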

3. Cookie handling

Cookies are data (usually encrypted) that some websites store on the user's local machine to identify the user and track sessions. Python provides the cookielib module for handling cookies. Its main job is to provide objects that can store cookies, so that it can be used together with the urllib2 module to access Internet resources.

Code snippet:

import urllib2, cookielib
cookie_support= urllib2.HTTPCookieProcessor(cookielib.CookieJar())
opener = urllib2.build_opener(cookie_support)
urllib2.install_opener(opener)
content = urllib2.urlopen('http://XXXX').read()

The key is CookieJar(), which manages HTTP cookie values, stores the cookies generated by HTTP requests, and adds cookie objects to outgoing HTTP requests. All cookies are kept in memory and are lost once the CookieJar instance is garbage-collected. None of these steps has to be handled manually.
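
A minimal Python 3 sketch of the same setup, assuming http.cookiejar and urllib.request as the successors of cookielib and urllib2. The helper names build_cookie_opener and cookie_pairs are illustrative only. Keeping a reference to the jar prevents it from being garbage-collected and lets you inspect what the server has set.

```python
import http.cookiejar
import urllib.request


def build_cookie_opener():
    """Return (opener, jar); the jar collects cookies from every response."""
    jar = http.cookiejar.CookieJar()
    processor = urllib.request.HTTPCookieProcessor(jar)
    return urllib.request.build_opener(processor), jar


def cookie_pairs(jar):
    """List the (name, value) pairs currently stored in the jar."""
    return [(cookie.name, cookie.value) for cookie in jar]


opener, jar = build_cookie_opener()
# After opener.open(url), cookie_pairs(jar) shows what the server set.
print(cookie_pairs(jar))
```

Every request made through the same opener then sends back the stored cookies automatically.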

Add cookies manually:

import urllib2
request = urllib2.Request('http://XXXX')
cookie = "PHPSESSID=91rurfqm2329bopnosfu4fvmu7; kmsign=55d2c12c9b1e3; KMUID=b6Ejc1XSwPq9o756AxnBAg="
request.add_header("Cookie", cookie)
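
When the cookie values live in a dictionary rather than a ready-made string, the header can be assembled first. The following Python 3 sketch assumes urllib.request; the function name with_cookies and the example.com URL are hypothetical.

```python
import urllib.request


def with_cookies(url, cookies):
    """Return a request carrying the given name/value pairs as a Cookie header."""
    # Pairs in a Cookie header are separated by "; ".
    header = "; ".join(f"{name}={value}" for name, value in cookies.items())
    request = urllib.request.Request(url)
    request.add_header("Cookie", header)
    return request


request = with_cookies("http://example.com", {"PHPSESSID": "abc123", "lang": "en"})
print(request.get_header("Cookie"))
```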

4. Disguising as a browser

Some websites dislike crawler traffic and reject requests that come from crawlers, so accessing a site directly with urllib2 often fails with HTTP Error 403: Forbidden.

Pay special attention to the following headers, which the server checks:

User-Agent: some servers or proxies check this value to determine whether the request was initiated by a browser.
Content-Type: when a REST interface is used, the server checks this value to determine how the content of the HTTP body should be parsed.

The disguise is achieved by modifying the headers of the HTTP request. The code snippet is as follows:

import urllib2
headers = {
 'User-Agent':'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6'
}
request = urllib2.Request(
 url = 'http://my.oschina.net/jhao104/blog?catalog=3463517',
 headers = headers
)
print urllib2.urlopen(request).read()
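
The snippet above sets only User-Agent. As a hedged Python 3 sketch that sets both headers discussed in this section, the helper below builds a JSON POST request; the name build_json_request, the example.com URL, and the payload are assumptions made for illustration.

```python
import json
import urllib.request

# The same browser identification string used in the snippet above.
BROWSER_UA = ("Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) "
              "Gecko/20091201 Firefox/3.5.6")


def build_json_request(url, payload, user_agent=BROWSER_UA):
    """Return a POST request that looks like a browser sending JSON."""
    body = json.dumps(payload).encode("utf-8")
    headers = {
        "User-Agent": user_agent,            # checked to spot non-browser clients
        "Content-Type": "application/json",  # tells the server how to parse the body
    }
    return urllib.request.Request(url, data=body, headers=headers)


request = build_json_request("http://example.com/api", {"q": "python"})
# urllib.request stores header names capitalized, e.g. "User-agent".
print(request.header_items())
```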

5. Page parsing

The most powerful tool for page parsing is, of course, the regular expression. The patterns differ from site to site and from user to user, so there is no need to explain too much.
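
As one small illustration of the regular expression approach, the sketch below pulls the href values out of anchor tags using only the standard re module. The pattern and the sample HTML are assumptions chosen for the example; real pages usually need a pattern tuned to their markup.

```python
import re

# Matches <a ... href="..."> or <a ... href='...'> and captures the URL.
LINK_RE = re.compile(r'<a\s[^>]*?href=["\']([^"\']+)["\']', re.IGNORECASE)


def extract_links(html):
    """Return every href value found in anchor tags, in document order."""
    return LINK_RE.findall(html)


sample = ('<p><a href="/a.html">A</a> '
          '<A class="x" HREF=\'http://example.com/b\'>B</A></p>')
print(extract_links(sample))
```

A pattern like this is quick to write but brittle, since unquoted attributes or unusual spacing would slip through.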

Next come the parsing libraries; the two most commonly used are lxml and BeautifulSoup.

My assessment of these two libraries is that both are HTML/XML processing libraries. BeautifulSoup is implemented in pure Python, so it is slower, but it has practical features, such as retrieving the source code of an HTML node from search results. lxml is written in C, is efficient, and supports XPath.

6. Handling verification codes

Some simple verification codes can be recognized automatically, and simple ones are all I have worked on. Some deliberately user-hostile verification codes, such as those on 12306, can instead be solved by human workers through a code-solving platform, which of course costs money.

7. Gzip compression

Have you ever come across web pages that stay garbled no matter how you transcode them? Haha, that means you did not know that many web services can send compressed data, which can reduce the volume of data transmitted over the network by more than 60%. This is especially true for XML web services, since XML data compresses very well.

However, a server generally will not send compressed data unless you tell it that you can handle compressed data.

So you need to modify the code like this:

import urllib2
request = urllib2.Request('http://xxxx.com')
request.add_header('Accept-encoding', 'gzip')
opener = urllib2.build_opener()
f = opener.open(request)

This is the key: create a Request object and add an Accept-encoding header to tell the server that you can accept gzip compressed data.

Then it’s time to decompress the data:

import StringIO
import gzip
compresseddata = f.read()
compressedstream = StringIO.StringIO(compresseddata)
gzipper = gzip.GzipFile(fileobj=compressedstream)
print gzipper.read()
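
The snippet above always decompresses, which fails if the server decides to answer without compression. Here is a hedged Python 3 sketch that checks the Content-Encoding response header first, assuming urllib.request and gzip.decompress; the helper names decode_body and fetch_gzip are made up for illustration.

```python
import gzip
import urllib.request


def decode_body(raw, content_encoding):
    """Decompress the body only when the server says it is gzip-encoded."""
    if (content_encoding or "").lower() == "gzip":
        return gzip.decompress(raw)
    return raw


def fetch_gzip(url, timeout=10):
    """Request gzip data and return the decompressed body as bytes."""
    request = urllib.request.Request(url, headers={"Accept-Encoding": "gzip"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        encoding = response.headers.get("Content-Encoding")
        return decode_body(response.read(), encoding)


# Offline check of the decoding step with locally compressed data.
packed = gzip.compress(b"<html>hello</html>")
print(decode_body(packed, "gzip"))
```

Separating decode_body from the network call makes the decompression logic easy to verify without contacting a server.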

8. Multi-threaded concurrent crawling

If a single thread is too slow, you need multithreading. Here is a simple thread pool template. The program just prints the numbers 0 to 9, but you can see that it runs concurrently.

Although Python's multithreading is of limited use, it can still improve efficiency to some extent for crawlers, which spend most of their time on network requests.

from threading import Thread
from Queue import Queue
from time import sleep
# q is the task queue
# NUM is the total number of concurrent threads
# JOBS is the number of tasks
q = Queue()
NUM = 2
JOBS = 10
# The handler function, responsible for processing a single task
def do_somthing_using(arguments):
    print arguments
# The worker thread, which keeps taking data from the queue and processing it
def working():
    while True:
        arguments = q.get()
        do_somthing_using(arguments)
        sleep(1)
        q.task_done()
# Start NUM threads that wait on the queue
for i in range(NUM):
    t = Thread(target=working)
    t.setDaemon(True)
    t.start()
# Put the JOBS tasks into the queue
for i in range(JOBS):
    q.put(i)
# Wait for all JOBS to complete
q.join()
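
On Python 3 the same pattern can be written more compactly with the standard library's concurrent.futures.ThreadPoolExecutor. This is a different technique from the hand-written queue above, shown only as a sketch. The names handle_job and run_jobs are hypothetical, and the short sleep stands in for a network request.

```python
from concurrent.futures import ThreadPoolExecutor
from time import sleep


def handle_job(job):
    """Placeholder for real work such as downloading one URL."""
    sleep(0.1)  # stands in for network latency
    return job * 2


def run_jobs(jobs, workers=2):
    """Process all jobs on a pool of worker threads and collect the results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() runs the jobs concurrently but yields results in input order.
        return list(pool.map(handle_job, jobs))


results = run_jobs(range(10))
print(results)
```

The executor takes care of starting the threads, distributing the jobs, and waiting for completion, so no explicit queue or daemon threads are needed.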
