
How to use python crawler to crawl IP proxies in batches (code)

不言
Release: 2019-03-15 13:40:29

This article shows how to use a Python crawler to grab IP proxies in batches (with code). It has some reference value; readers in need may find it helpful.

When crawling data, you often need multiple IP proxies to keep a single IP from being banned for sending requests too frequently.
IP proxies can be obtained from this site: http://www.xicidaili.com/nn/.
So let's write a Python program that fetches IP proxies and saves them locally.
Python version: 3.6.3
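
For context, requests routes a request through a proxy via its proxies argument, a mapping from URL scheme to proxy address; that is why each pool entry below is stored as scheme://ip:port. A minimal sketch (the proxy address and target URL here are made-up placeholders):

import requests

proxy = 'http://123.45.67.89:8080'          #placeholder proxy address, for illustration only
proxies = {'http': proxy, 'https': proxy}   #scheme -> proxy URL

#if the proxy works, the response shows the proxy's IP instead of your own
r = requests.get('http://httpbin.org/ip', proxies=proxies, timeout=3)
print(r.text)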

#grab ip proxies from xicidaili
import sys, time, requests
from multiprocessing.dummy import Pool as ThreadPool
from lxml import etree

IP_POOL = 'ip_pool.py'
URL = 'http://www.xicidaili.com/nn/' #IP proxies, high anonymity
#URL = 'http://www.xicidaili.com/wt/' #IP proxies, http
RUN_TIME = time.strftime("%Y-%m-%d %H:%M", time.localtime()) #run time

#store valid ip proxies in a dict, keyed by scheme
alive_ip = {'http': [], 'https': []}
#thread pool for concurrent checks
pool = ThreadPool(20)

#return the html text of a page
def get_html(url):
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:55.0) Gecko/20100101 Firefox/55.0",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3",
        "Accept-Encoding": "gzip, deflate",
        "Referer": "https://www.xicidaili.com/",
        "Connection": "keep-alive",
        "Upgrade-Insecure-Requests": "1"
    }
    r = requests.get(url, headers=headers)
    r.encoding = 'utf-8'
    return r.text

#test whether an ip proxy is alive
def test_alive(proxy):
    global alive_ip
    proxies = {'http': proxy, 'https': proxy} #map both schemes so the https test request actually goes through the proxy
    try:
        r = requests.get('https://www.baidu.com', proxies=proxies, timeout=3)
        if r.status_code == 200:
            if proxy.startswith('https'):
                alive_ip['https'].append(proxy)
            else:
                alive_ip['http'].append(proxy)
    except requests.RequestException:
        print("%s is invalid!" % proxy)

#parse the html text and extract ip proxies
def get_alive_ip_address():
    iplist = []
    html = get_html(URL)
    selector = etree.HTML(html)
    table = selector.xpath('//table[@id="ip_list"]')[0]
    lines = table.xpath('./tr')[1:]
    for line in lines:
        speed, connect_time = line.xpath('.//div/@title')
        data = line.xpath('./td')
        ip = data[1].xpath('./text()')[0]
        port = data[2].xpath('./text()')[0]
        anonymous = data[4].xpath('./text()')[0]
        ip_type = data[5].xpath('./text()')[0]
        #skip proxies that are slow or not high-anonymity ('高匿' on the site)
        if float(speed[:-1])>1 or float(connect_time[:-1])>1 or anonymous != '高匿':
            continue
        iplist.append(ip_type.lower() + '://' + ip + ':' + port)
    pool.map(test_alive, iplist)

#write the valid ip proxies to a local file
def write_txt(output_file):
    with open(output_file, 'w') as f:
        f.write('#create time: %s\n\n' % RUN_TIME)
        f.write('http_ip_pool = \\\n')
        f.write(str(alive_ip['http']).replace(',', ',\n'))
        f.write('\n\n')
        f.write('https_ip_pool = \\\n')
        f.write(str(alive_ip['https']).replace(',', ',\n'))
    print('write successful: %s' % output_file)

def main(output_file):
    get_alive_ip_address()
    write_txt(output_file)

if __name__ == '__main__':
    try:
        output_file = sys.argv[1] #use the first command-line argument as the output file name
    except IndexError:
        output_file = IP_POOL
    main(output_file)

Run the program:

root@c:test$ python get_ip_proxies.py
write successful: ip_pool.py

View the output file:

root@c:test$ vim ip_pool.py

#create time: 2019-03-14 19:53

http_ip_pool = \
['http://183.148.152.1:9999',
 'http://112.85.165.234:9999',
 'http://112.87.69.162:9999',
 'http://111.77.197.10:9999',
 'http://113.64.94.80:8118',
 'http://61.184.109.33:61320',
 'http://125.126.204.82:9999',
 'http://125.126.218.8:9999',
 'http://36.26.224.56:9999',
 'http://123.162.168.192:40274',
 'http://116.209.54.125:9999',
 'http://183.148.148.211:9999',
 'http://111.177.161.111:9999',
 'http://116.209.58.245:9999',
 'http://183.148.143.38:9999',
 'http://116.209.55.218:9999',
 'http://114.239.250.15:9999',
 'http://116.209.54.109:9999',
 'http://125.123.143.98:9999',
 'http://183.6.130.6:8118',
 'http://183.148.143.166:9999',
 'http://125.126.203.228:9999',
 'http://111.79.198.74:9999',
 'http://116.209.53.215:9999',
 'http://112.87.69.124:9999',
 'http://112.80.198.13:8123',
 'http://182.88.160.16:8123',
 'http://116.209.56.24:9999',
 'http://112.85.131.25:9999',
 'http://116.209.52.234:9999',
 'http://175.165.128.223:1133',
 'http://122.4.47.199:8010',
 'http://112.85.170.204:9999',
 'http://49.86.178.206:9999',
 'http://125.126.215.187:9999']

https_ip_pool = \
['https://183.148.156.98:9999',
 'https://111.79.199.167:808',
 'https://61.142.72.150:39894',
 'https://119.254.94.71:42788',
 'https://221.218.102.146:33323',
 'https://122.193.246.29:9999',
 'https://183.148.139.173:9999',
 'https://60.184.194.157:3128',
 'https://118.89.138.129:52699',
 'https://112.87.71.67:9999',
 'https://58.56.108.226:43296',
 'https://182.207.232.135:50465',
 'https://111.177.186.32:9999',
 'https://58.210.133.98:32741',
 'https://115.221.116.71:9999',
 'https://183.148.140.191:9999',
 'https://183.148.130.143:9999',
 'https://116.209.54.84:9999',
 'https://125.126.219.125:9999',
 'https://112.85.167.158:9999',
 'https://112.85.173.76:9999',
 'https://60.173.244.133:41306',
 'https://183.148.147.223:9999',
 'https://116.209.53.68:9999',
 'https://111.79.198.102:9999',
 'https://123.188.5.11:1133',
 'https://60.190.66.131:56882',
 'https://112.85.168.140:9999',
 'https://110.250.65.108:8118',
 'https://221.208.39.160:8118',
 'https://116.209.53.77:9999',
 'https://116.209.58.29:9999',
 'https://183.148.141.129:9999',
 'https://124.89.33.59:53281',
 'https://116.209.57.149:9999',
 'https://58.62.238.150:32431',
 'https://218.76.253.201:61408']

After that, the proxy pools can be imported and used directly:

from ip_pool import http_ip_pool, https_ip_pool
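For example, here is a minimal usage sketch built on the generated pools: pick a random proxy for each request (the target URL is just a placeholder, and http_ip_pool can be used the same way for plain http sites):

import random
import requests

from ip_pool import http_ip_pool, https_ip_pool

#pick a random proxy from the generated pool
proxy = random.choice(https_ip_pool)
proxies = {'http': proxy, 'https': proxy}

#httpbin.org is only a placeholder target for this example
r = requests.get('https://httpbin.org/ip', proxies=proxies, timeout=3)
print(r.text)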

The above is the detailed content of how to use a Python crawler to crawl IP proxies in batches (code).
