
How to Use a Python Crawler to Scrape IP Proxies in Batches (Code)

不言 · Published: 2019-03-15 13:40:29

This article presents a method (with code) for batch-scraping IP proxies with a Python crawler. It should be a useful reference for anyone who needs one; I hope it helps you.

When scraping data with a crawler, you often need several IP proxies so that no single IP sends requests frequently enough to get banned.
IP proxies can be obtained from this site: http://www.xicidaili.com/nn/.
So let's write a Python program that fetches IP proxies and saves them locally.
Python version: 3.6.3
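
For context, requests picks a proxy per URL scheme from a proxies dict, which is the mechanism the script below relies on. A minimal sketch (the proxy address and the httpbin.org test URL are illustrative placeholders, not part of the original article):

import requests

#route a single request through one proxy
#(the address below is a placeholder, not a live proxy)
proxies = {
    'http': 'http://183.148.152.1:9999',
    'https': 'http://183.148.152.1:9999',
}
r = requests.get('http://httpbin.org/ip', proxies=proxies, timeout=3)
print(r.text)  #should report the proxy's IP, not your own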

#grab ip proxies from xicidaili
import sys, time, requests
from multiprocessing.dummy import Pool as ThreadPool #thread-based pool with the multiprocessing.Pool API
from lxml import etree

IP_POOL = 'ip_pool.py'
URL = 'http://www.xicidaili.com/nn/' #high-anonymity IP proxies
#URL = 'http://www.xicidaili.com/wt/' #plain http IP proxies
RUN_TIME = time.strftime("%Y-%m-%d %H:%M", time.localtime()) #run timestamp

#store working ip proxies in a dict, keyed by scheme
alive_ip = {'http': [], 'https': []}
#pool of 20 worker threads
pool = ThreadPool(20)

#return the html text of a page
def get_html(url):
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:55.0) Gecko/20100101 Firefox/55.0",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3",
        "Accept-Encoding": "gzip, deflate",
        "Referer": "https://www.xicidaili.com/",
        "Connection": "keep-alive",
        "Upgrade-Insecure-Requests": "1"
    }
    r = requests.get(url, headers=headers)
    r.encoding = 'utf-8'
    return r.text

#test whether an ip proxy is alive
def test_alive(proxy):
    global alive_ip
    #register the proxy for both schemes so the https request below
    #actually goes through it; with only an 'http' key, requests would
    #fetch the https URL directly and every proxy would appear valid
    proxies = {'http': proxy, 'https': proxy}
    try:
        r = requests.get('https://www.baidu.com', proxies=proxies, timeout=3)
        if r.status_code == 200:
            if proxy.startswith('https'):
                alive_ip['https'].append(proxy)
            else:
                alive_ip['http'].append(proxy)
    except requests.RequestException:
        print("%s is invalid!" % proxy)

#parse the html and collect ip proxy candidates
def get_alive_ip_address():
    iplist = []
    html = get_html(URL)
    selector = etree.HTML(html)
    table = selector.xpath('//table[@id="ip_list"]')[0]
    lines = table.xpath('./tr')[1:]
    for line in lines:
        speed, connect_time = line.xpath('.//div/@title')
        data = line.xpath('./td')
        ip = data[1].xpath('./text()')[0]
        port = data[2].xpath('./text()')[0]
        anonymous = data[4].xpath('./text()')[0]
        ip_type = data[5].xpath('./text()')[0]
        #skip slow proxies and those that are not high-anonymity
        #('高匿' is the site's label; the div titles read like '0.5秒', i.e. seconds)
        if float(speed[:-1]) > 1 or float(connect_time[:-1]) > 1 or anonymous != '高匿':
            continue
        iplist.append(ip_type.lower() + '://' + ip + ':' + port)
    pool.map(test_alive, iplist)

#write the working ip proxies to a local file
def write_txt(output_file):
    with open(output_file, 'w') as f:
        f.write('#create time: %s\n\n' % RUN_TIME)
        f.write('http_ip_pool = \\\n')
        f.write(str(alive_ip['http']).replace(',', ',\n'))
        f.write('\n\n')
        f.write('https_ip_pool = \\\n')
        f.write(str(alive_ip['https']).replace(',', ',\n'))
    print('write successful: %s' % output_file)

def main():
    get_alive_ip_address()
    write_txt(output_file)

if __name__ == '__main__':
    try:
        output_file = sys.argv[1] #first CLI argument is the output filename
    except IndexError:
        output_file = IP_POOL
    main()
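
For reference, the XPath above assumes xicidaili's listing table (id="ip_list") keeps the IP in column 2, the port in column 3, the anonymity label in column 5 and the protocol in column 6, with the speed and connect-time bars carrying their values in div title attributes. A minimal, self-contained sketch against a hand-written row (the column layout is inferred from the indices in the code, not re-checked against the live site):

from lxml import etree

#hypothetical row mimicking xicidaili's table layout
SAMPLE = '''
<table id="ip_list">
  <tr><th>country</th><th>ip</th><th>port</th><th>location</th>
      <th>anonymity</th><th>type</th><th>speed</th><th>connect</th></tr>
  <tr>
    <td></td>
    <td>183.148.152.1</td>
    <td>9999</td>
    <td>Zhejiang</td>
    <td>高匿</td>
    <td>HTTP</td>
    <td><div title="0.5秒"></div></td>
    <td><div title="0.1秒"></div></td>
  </tr>
</table>'''

selector = etree.HTML(SAMPLE)
table = selector.xpath('//table[@id="ip_list"]')[0]
line = table.xpath('./tr')[1]          #skip the header row
speed, connect_time = line.xpath('.//div/@title')
data = line.xpath('./td')
print(data[1].xpath('./text()')[0],    #183.148.152.1
      data[2].xpath('./text()')[0],    #9999
      data[4].xpath('./text()')[0],    #高匿
      data[5].xpath('./text()')[0],    #HTTP
      speed, connect_time)             #0.5秒 0.1秒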

Run the program:

root@c:test$ python get_ip_proxies.py
write successful: ip_pool.py

View the file:

root@c:test$ vim ip_pool.py

#create time: 2019-03-14 19:53

http_ip_pool = \
['http://183.148.152.1:9999',
 'http://112.85.165.234:9999',
 'http://112.87.69.162:9999',
 'http://111.77.197.10:9999',
 'http://113.64.94.80:8118',
 'http://61.184.109.33:61320',
 'http://125.126.204.82:9999',
 'http://125.126.218.8:9999',
 'http://36.26.224.56:9999',
 'http://123.162.168.192:40274',
 'http://116.209.54.125:9999',
 'http://183.148.148.211:9999',
 'http://111.177.161.111:9999',
 'http://116.209.58.245:9999',
 'http://183.148.143.38:9999',
 'http://116.209.55.218:9999',
 'http://114.239.250.15:9999',
 'http://116.209.54.109:9999',
 'http://125.123.143.98:9999',
 'http://183.6.130.6:8118',
 'http://183.148.143.166:9999',
 'http://125.126.203.228:9999',
 'http://111.79.198.74:9999',
 'http://116.209.53.215:9999',
 'http://112.87.69.124:9999',
 'http://112.80.198.13:8123',
 'http://182.88.160.16:8123',
 'http://116.209.56.24:9999',
 'http://112.85.131.25:9999',
 'http://116.209.52.234:9999',
 'http://175.165.128.223:1133',
 'http://122.4.47.199:8010',
 'http://112.85.170.204:9999',
 'http://49.86.178.206:9999',
 'http://125.126.215.187:9999']

https_ip_pool = \
['https://183.148.156.98:9999',
 'https://111.79.199.167:808',
 'https://61.142.72.150:39894',
 'https://119.254.94.71:42788',
 'https://221.218.102.146:33323',
 'https://122.193.246.29:9999',
 'https://183.148.139.173:9999',
 'https://60.184.194.157:3128',
 'https://118.89.138.129:52699',
 'https://112.87.71.67:9999',
 'https://58.56.108.226:43296',
 'https://182.207.232.135:50465',
 'https://111.177.186.32:9999',
 'https://58.210.133.98:32741',
 'https://115.221.116.71:9999',
 'https://183.148.140.191:9999',
 'https://183.148.130.143:9999',
 'https://116.209.54.84:9999',
 'https://125.126.219.125:9999',
 'https://112.85.167.158:9999',
 'https://112.85.173.76:9999',
 'https://60.173.244.133:41306',
 'https://183.148.147.223:9999',
 'https://116.209.53.68:9999',
 'https://111.79.198.102:9999',
 'https://123.188.5.11:1133',
 'https://60.190.66.131:56882',
 'https://112.85.168.140:9999',
 'https://110.250.65.108:8118',
 'https://221.208.39.160:8118',
 'https://116.209.53.77:9999',
 'https://116.209.58.29:9999',
 'https://183.148.141.129:9999',
 'https://124.89.33.59:53281',
 'https://116.209.57.149:9999',
 'https://58.62.238.150:32431',
 'https://218.76.253.201:61408']

After that, the pools can be imported and used directly:

from ip_pool import http_ip_pool, https_ip_pool
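
For example, picking a random proxy from the pool on each request (a minimal sketch; https://www.baidu.com is just a reachable test URL):

import random
import requests

from ip_pool import http_ip_pool, https_ip_pool

#pick a fresh proxy from the scraped pool for each request
proxy = random.choice(https_ip_pool)
r = requests.get('https://www.baidu.com', proxies={'https': proxy}, timeout=3)
print(r.status_code)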
