Some websites deploy anti-crawler measures. For example, many websites track how many requests a given IP makes within a certain period of time; if the request frequency is too high to look like a normal visitor, the site may ban that IP. So we need to set up some proxy servers and switch proxies every so often. Even if one IP gets banned, we can change to another IP and keep crawling.
In Python, you can use the ProxyHandler in urllib2 to set up a proxy server. The following code shows how to use a proxy:
import urllib2

# Build two proxy Handlers: one with a proxy IP, one without
httpproxy_handler = urllib2.ProxyHandler({"http": "124.88.67.81:80"})
nullproxy_handler = urllib2.ProxyHandler({})

# Define a proxy switch
proxySwitch = True

# Create a custom opener object from these proxy Handlers via urllib2.build_opener(),
# choosing the proxy mode according to whether the switch is on
if proxySwitch:
    opener = urllib2.build_opener(httpproxy_handler)
else:
    opener = urllib2.build_opener(nullproxy_handler)

request = urllib2.Request("http://www.baidu.com/")

# Only requests sent with opener.open() use the custom proxy; urlopen() does not.
response = opener.open(request)

# To apply the opener globally, so that every subsequent request, whether sent with
# opener.open() or urlopen(), uses the custom proxy:
# urllib2.install_opener(opener)
# response = urllib2.urlopen(request)

print response.read()
The proxy used above is a free open proxy. We can collect such free proxies from proxy-listing websites, test them, and keep the ones that work for use in the crawler (see the checker sketch after the list below).
Free proxy websites:
Xici free proxy (西刺免费代理)
Kuaidaili free proxy (快代理免费代理)
National proxy IP
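Whether a free proxy actually works can only be established by testing it. Below is a minimal checker sketch; the function name check_proxy, the test URL, and the timeout value are my own choices for illustration, not from the article:

import urllib2

def check_proxy(proxy_ip, test_url="http://www.baidu.com/", timeout=5):
    # Try to fetch the test URL through the proxy; treat any error
    # (connection refused, timeout, etc.) as an unusable proxy
    handler = urllib2.ProxyHandler({"http": proxy_ip})
    opener = urllib2.build_opener(handler)
    try:
        response = opener.open(test_url, timeout=timeout)
        return response.getcode() == 200
    except Exception:
        return False

print check_proxy("124.88.67.81:80")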
If you have enough proxies, you can put them in a list and randomly pick one each time you access the website, as follows:
import urllib2
import random

proxy_list = [
    {"http": "124.88.67.81:80"},
    {"http": "124.88.67.81:80"},
    {"http": "124.88.67.81:80"},
    {"http": "124.88.67.81:80"},
    {"http": "124.88.67.81:80"}
]

# Randomly pick one proxy
proxy = random.choice(proxy_list)

# Build a proxy handler object with the chosen proxy
httpproxy_handler = urllib2.ProxyHandler(proxy)
opener = urllib2.build_opener(httpproxy_handler)

request = urllib2.Request("http://www.baidu.com/")
response = opener.open(request)
print response.read()
The proxies above are all free ones; they are not very stable and often stop working. In that case, you can consider using a private proxy: you purchase proxies from a proxy provider, who supplies working proxies along with a username and password for them. Usage is the same as with a free proxy, except for the extra account authentication, as follows:
# Build a Handler with a private proxy IP, where user is the account name and passwd is the password
httpproxy_handler = urllib2.ProxyHandler({"http": "user:passwd@124.88.67.81:80"})
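For completeness, a minimal sketch of the full flow follows; user, passwd, and the IP are placeholders for whatever your provider gives you:

import urllib2

# user:passwd and the IP below are placeholders for the credentials and
# address supplied by the proxy provider
httpproxy_handler = urllib2.ProxyHandler({"http": "user:passwd@124.88.67.81:80"})
opener = urllib2.build_opener(httpproxy_handler)

request = urllib2.Request("http://www.baidu.com/")
response = opener.open(request)
print response.read()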
Setting up a proxy with urllib2, as shown above, is a bit cumbersome. Let's look at how to use a proxy with requests.
Using a free proxy:
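A minimal sketch; the proxy address here is a placeholder, not one from the article:

import requests

# Pick the proxy entry that matches the protocol of the target URL
proxy = {"http": "http://124.88.67.81:80"}

response = requests.get("http://www.baidu.com", proxies=proxy)
print response.text

Using a private proxy: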
import requests

# If the proxy requires HTTP Basic Auth, use the following format:
proxy = {"http": "mr_mao_hacker:sffqry9r@61.158.163.130:16816"}

response = requests.get("http://www.baidu.com", proxies=proxy)
print response.text
Note: you can store the account name and password in environment variables to avoid leaking them in code.
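For example, a minimal sketch; the variable names PROXY_USER and PROXY_PASS are my own choice, not from the article:

import os
import requests

# PROXY_USER and PROXY_PASS are hypothetical names; export them in your shell first, e.g.
#   export PROXY_USER="mr_mao_hacker"
#   export PROXY_PASS="sffqry9r"
user = os.environ.get("PROXY_USER")
passwd = os.environ.get("PROXY_PASS")

proxy = {"http": "%s:%s@61.158.163.130:16816" % (user, passwd)}
response = requests.get("http://www.baidu.com", proxies=proxy)
print response.text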