How to prevent a Python crawler's IP from being blocked

little bottle
Release: 2019-04-10 17:07:35

When you write a crawler, especially one that fetches a large amount of data, it is easy to get your IP blocked and be unable to continue, because many websites deploy anti-crawler measures. This article summarizes several countermeasures. They can be used individually, or combined for better results.

Fake User-Agent

Set the User-Agent request header to a real browser's User-Agent string so that the request looks like it comes from a browser. For example:

import requests
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36'}
resp = requests.get(url, headers=headers)  # url: the page to crawl
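
Sending the same User-Agent on every request is itself a recognizable pattern, so one possible extension is to pick the header from a small pool. The sketch below is only an illustration: `USER_AGENTS` and `random_headers` are hypothetical names, and the second User-Agent string is an assumed example rather than one taken from this article.

```python
import random

# Hypothetical pool of browser User-Agent strings.
# The first entry is the one used in the example above; the second is illustrative.
USER_AGENTS = [
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/52.0.2743.116 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/52.0.2743.116 Safari/537.36",
]


def random_headers(user_agents=USER_AGENTS):
    """Return a headers dict whose User-Agent is chosen at random from the pool."""
    return {"User-Agent": random.choice(user_agents)}
```

The result can be passed straight to the call shown earlier, for example `requests.get(url, headers=random_headers())`.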

Wait a random interval between requests

import random
import time
# For example:
time.sleep(random.randint(0, 3))  # pause for a whole number of seconds in [0, 3]
# Or:
time.sleep(random.random())  # pause for a fractional number of seconds in [0, 1)
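
Both calls above can return a delay of zero, which means some requests go out back to back. A minimal sketch of a helper with a configurable lower bound follows; the name `random_pause` and its default bounds are assumptions chosen for illustration, not values recommended by the original article.

```python
import random
import time


def random_pause(min_seconds=0.5, max_seconds=3.0, sleep=time.sleep):
    """Sleep for a random duration between min_seconds and max_seconds.

    The chosen delay is returned so callers can log it. The sleep function
    is injectable, which makes the helper easy to exercise without waiting.
    """
    delay = random.uniform(min_seconds, max_seconds)
    sleep(delay)
    return delay
```

Calling `random_pause()` once before each request in the crawl loop spaces the requests out by at least `min_seconds`.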

Fake cookies

If you can open a page normally in the browser, you can copy the browser's cookies and send them with your requests, for example:

cookies = dict(uuid='b18f0e70-8705-470d-bc4b-09a8da617e15', UM_distinctid='15d188be71d50-013c49b12ec14a-3f73035d-100200-15d188be71ffd')
resp = requests.get(url, cookies=cookies)

# Convert the cookie string copied from the browser into a dict
def cookies2dict(cookies):
    d = {}
    for item in cookies.split(';'):
        item = item.strip()  # browsers separate cookies with '; ', so drop the space
        if not item:
            continue  # skip empty pieces such as a trailing ';'
        k, _, v = item.partition('=')  # split on the first '=' only
        d[k] = v
    return d
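
As an alternative to splitting the string by hand, the standard library's `http.cookies.SimpleCookie` can parse a cookie header. The sketch below swaps in that class in place of the hand-written function above; `parse_cookie_header` is a hypothetical helper name. Note that `SimpleCookie` is stricter than a plain split and may not accept every string a browser shows.

```python
from http.cookies import SimpleCookie


def parse_cookie_header(raw):
    """Parse a 'name=value; name2=value2' cookie string into a plain dict."""
    jar = SimpleCookie()
    jar.load(raw)
    # Each entry is a Morsel object; its .value attribute holds the cookie value.
    return {name: morsel.value for name, morsel in jar.items()}
```

The returned dictionary can be passed as the `cookies` argument of `requests.get`, in the same way as the hand-built dictionary earlier in this section.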

Note: Even with browser cookies, the IP will still be blocked if requests are sent too frequently. When that happens, complete the manual verification in the browser (for example, by clicking through the verification image), after which you can continue to send requests with the same cookies.

Use proxies

You can route requests through multiple proxy IPs so that no single IP sends too many requests and gets blocked, for example:

proxies = {'http': 'http://10.10.10.10:8765', 'https': 'https://10.10.10.10:8765'}
resp = requests.get(url, proxies=proxies)
# Note: free proxy IPs can be obtained from this site: http://www.xicidaili.com/nn/
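
The example above uses a single proxy, while the text recommends several. One way to spread requests across a pool is to cycle through it, as in the following sketch. The addresses in `PROXY_POOL` are placeholders, `proxy_cycle` is a hypothetical helper, and the code assumes each proxy is a plain HTTP proxy that also tunnels HTTPS traffic.

```python
import itertools

# Placeholder addresses; replace them with proxies you actually have access to.
PROXY_POOL = [
    "http://10.10.10.10:8765",
    "http://10.10.10.11:8765",
]


def proxy_cycle(pool):
    """Yield requests-style proxies dicts, looping over the pool indefinitely."""
    for address in itertools.cycle(pool):
        # Assumption: the same proxy address handles both http and https URLs.
        yield {"http": address, "https": address}
```

Create the generator once with `proxies_iter = proxy_cycle(PROXY_POOL)`, then pass `proxies=next(proxies_iter)` to each `requests.get` call so that consecutive requests leave through different addresses.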

source:cnblogs.com