How to set up a proxy for a Python crawler
Many websites deploy anti-crawler measures. A common one is to count how many requests a single IP makes within a given time window; if the request rate is too high to look like a normal visitor, the site may ban that IP. To cope with this, we can set up a pool of proxy servers and switch proxies from time to time. Even if one IP gets banned, we can change to another IP and keep crawling.
In Python 2, you can use ProxyHandler in urllib2 to configure a proxy server. The following code shows how to use a proxy:
import urllib2

# Build two proxy Handlers: one with a proxy IP, one without
httpproxy_handler = urllib2.ProxyHandler({"http" : "124.88.67.81:80"})
nullproxy_handler = urllib2.ProxyHandler({})

# Define a proxy switch
proxySwitch = True

# Use urllib2.build_opener() with these Handler objects to create a custom opener.
# Depending on whether the switch is on, a different proxy mode is used.
if proxySwitch:
    opener = urllib2.build_opener(httpproxy_handler)
else:
    opener = urllib2.build_opener(nullproxy_handler)

request = urllib2.Request("http://www.baidu.com/")

# Only requests sent with opener.open() use the custom proxy; urlopen() does not.
response = opener.open(request)

# To install the opener globally, so that every later request, whether sent with
# opener.open() or urlopen(), uses the custom proxy:
# urllib2.install_opener(opener)
# response = urllib2.urlopen(request)

print response.read()
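If you are working in Python 3, urllib2 no longer exists; its classes moved into urllib.request. A minimal sketch of the same switch logic under that assumption (the proxy address is illustrative only, not a working proxy):

import urllib.request

# Build two proxy handlers: one with a proxy IP, one without (Python 3)
httpproxy_handler = urllib.request.ProxyHandler({"http": "124.88.67.81:80"})
nullproxy_handler = urllib.request.ProxyHandler({})

proxy_switch = True  # toggle between proxied and direct access

if proxy_switch:
    opener = urllib.request.build_opener(httpproxy_handler)
else:
    opener = urllib.request.build_opener(nullproxy_handler)

request = urllib.request.Request("http://www.baidu.com/")

# Only opener.open() uses the custom proxy; urllib.request.urlopen() does not,
# unless the opener is installed globally with urllib.request.install_opener(opener).
response = opener.open(request)
print(response.read())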
The proxy used above is a free, open proxy. We can collect such free proxies from proxy listing websites; if a proxy still works after testing, we keep it and use it in the crawler.
Free proxy websites:
Xici free proxy (西刺免费代理)
Kuaidaili free proxy (快代理)
National proxy IP
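Proxies collected from sites like these often stop working, so it helps to test each one before adding it to your pool. A minimal sketch using the requests library; the addresses and the is_proxy_alive helper are illustrative assumptions, not part of the original article:

import requests

def is_proxy_alive(proxy_address, test_url="http://www.baidu.com", timeout=5):
    # proxy_address looks like "124.88.67.81:80" (illustrative, probably dead)
    proxies = {"http": proxy_address}
    try:
        response = requests.get(test_url, proxies=proxies, timeout=timeout)
        return response.status_code == 200
    except requests.RequestException:
        return False

# Keep only the proxies that still respond
candidates = ["124.88.67.81:80", "124.88.67.82:80"]
working = [p for p in candidates if is_proxy_alive(p)]
print(working)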
If you have enough proxies, you can put them in a list and pick one at random for each request, as follows:
import urllib2
import random

proxy_list = [
    {"http" : "124.88.67.81:80"},
    {"http" : "124.88.67.81:80"},
    {"http" : "124.88.67.81:80"},
    {"http" : "124.88.67.81:80"},
    {"http" : "124.88.67.81:80"}
]

# Randomly choose one proxy
proxy = random.choice(proxy_list)

# Build a proxy handler object with the chosen proxy
httpproxy_handler = urllib2.ProxyHandler(proxy)
opener = urllib2.build_opener(httpproxy_handler)

request = urllib2.Request("http://www.baidu.com/")
response = opener.open(request)
print response.read()
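Because free proxies fail frequently, a common extension of the pattern above is to retry with a different randomly chosen proxy when one errors out. A minimal sketch, still with urllib2; the addresses and the fetch_with_random_proxy helper are illustrative assumptions, not part of the original article:

import urllib2
import random

proxy_list = [
    {"http": "124.88.67.81:80"},  # illustrative addresses, not working proxies
    {"http": "124.88.67.81:80"},
    {"http": "124.88.67.81:80"},
]

def fetch_with_random_proxy(url, max_retries=3):
    # Try up to max_retries different proxies; re-raise the last error if all fail
    last_error = None
    for _ in range(max_retries):
        proxy = random.choice(proxy_list)
        opener = urllib2.build_opener(urllib2.ProxyHandler(proxy))
        try:
            return opener.open(url, timeout=10).read()
        except Exception as e:
            last_error = e
    raise last_error

print(fetch_with_random_proxy("http://www.baidu.com/"))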
All of the above are free proxies, which are not very stable and often stop working. In that case, you can consider using a private proxy, that is, buying proxies from a proxy provider. The provider supplies working proxies along with your own username and password. Usage is the same as for a free proxy, except for the extra account authentication, as follows:
# Build a Handler with a private proxy IP, where user is the account and passwd is the password
httpproxy_handler = urllib2.ProxyHandler({"http" : "user:passwd@124.88.67.81:80"})
The above shows how to set a proxy with urllib2, which looks a bit cumbersome. Now let's see how to use a proxy with requests.
Using a proxy with requests:
import requests

# If the proxy requires HTTP Basic Auth, you can use the following format:
proxy = { "http": "mr_mao_hacker:sffqry9r@61.158.163.130:16816" }

response = requests.get("http://www.baidu.com", proxies = proxy)
print response.text
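requests can also carry the proxy settings on a Session, so every request made through that session is proxied. A minimal sketch; the addresses are illustrative only:

import requests

session = requests.Session()
# requests picks the entry whose key matches the URL scheme; both keys are optional
session.proxies = {
    "http": "124.88.67.81:80",    # illustrative, not a working proxy
    "https": "124.88.67.81:80",
}

response = session.get("http://www.baidu.com")
print(response.text)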
Note: you can store the username and password in environment variables instead of hard-coding them, to avoid leaking the credentials.
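A minimal sketch of reading the credentials from environment variables; the variable names PROXY_USER and PROXY_PASS and the proxy address are illustrative assumptions:

import os
import requests

# Hypothetical environment variable names; set them in your shell beforehand
user = os.environ.get("PROXY_USER")
passwd = os.environ.get("PROXY_PASS")

# Illustrative proxy address; substitute the one your provider gives you
proxy = {"http": "%s:%s@124.88.67.81:80" % (user, passwd)}

response = requests.get("http://www.baidu.com", proxies=proxy)
print(response.text)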
