To keep my crawler from getting its IP banned, the tutorials I found online say I need to build a proxy pool. But paid proxies are all so expensive... Luckily there are already quite a few websites that offer free proxies, so I plan to write a crawler to collect these free IPs.
Planned steps
Search several search engines with a seed keyword such as "proxy IP" to collect candidate URLs
Crawl the candidate URLs and store the proxy addresses they list (a rough extraction sketch follows this list)
Validate the proxy addresses and put the usable ones into the proxy pool
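For step 2, a minimal extraction sketch, assuming the listing page prints each ip and port close together in its HTML; the URL and the regex here are illustrative placeholders, not tied to any particular site:

```python
import re
import requests

# ip:port pairs written close together somewhere in the page HTML
IP_PORT_RE = re.compile(r"(\d{1,3}(?:\.\d{1,3}){3})\D{1,10}(\d{2,5})")

def extract_proxies(url):
    """Fetch one listing page and return the 'ip:port' strings found in it."""
    html = requests.get(url, timeout=10).text
    return [f"{ip}:{port}" for ip, port in IP_PORT_RE.findall(html)]

if __name__ == "__main__":
    # placeholder URL -- replace with a real free-proxy listing page
    print(extract_proxies("https://example.com/free-proxy-list")[:10])
```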
Difficulties
How to validate and maintain these proxy addresses
How to know which proxies suit which target sites (availability, response time)
Efficiency (I wrote a simple validation script before, but it was painfully slow)
Does anyone have good ideas for solving these problems?
Let me write down how I handled this. I happened to do the same thing before; I also needed proxies back then, so I wrote my own crawler to fetch and refresh them automatically.
As for the proxy sources, I didn't let the crawler pick websites on its own. Instead, I hand-picked several sites that publish free proxies and wrote a dedicated crawler for each of them.
Regarding the difficulties you mentioned:
Verification: every address is checked right after it is first crawled to see whether it works; if it does, it goes into the database (or whatever persistence you use). Because free proxies are unreliable, the stored ones also need to be re-checked regularly. I set up a scheduled task directly on the uWSGI server: it re-checks every half hour and fetches new proxies every hour. You could just as well use crontab or another scheduler.
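A rough sketch of that first-check plus periodic re-check idea, assuming an in-memory list stands in for the database and httpbin.org/ip serves as the probe URL (both are illustrative choices, not necessarily what was actually used):

```python
import time
import requests

TEST_URL = "https://httpbin.org/ip"   # any stable endpoint works as a probe

def is_alive(proxy, timeout=5):
    """Return True if TEST_URL can be fetched through the proxy in time."""
    proxies = {"http": f"http://{proxy}", "https": f"http://{proxy}"}
    try:
        return requests.get(TEST_URL, proxies=proxies, timeout=timeout).ok
    except requests.RequestException:
        return False

def recheck_loop(pool, interval=30 * 60):
    """Re-validate the stored pool every `interval` seconds, dropping dead proxies."""
    while True:
        pool[:] = [p for p in pool if is_alive(p)]
        time.sleep(interval)
```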
Site-specific proxies: simply test the captured proxies against the sites you actually need to visit. If different target sites need different proxies, run the validation per site and store the verification results together with the proxy.
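One possible shape for "store the verification info together", keyed by proxy and target site; the layout is only an illustration, not a fixed schema:

```python
import time
import requests

results = {}   # e.g. results["1.2.3.4:8080"]["https://example.com"] == (True, 0.82)

def check_against_site(proxy, site_url, timeout=5):
    """Record whether the proxy works for one target site, and how fast it was."""
    proxies = {"http": f"http://{proxy}", "https": f"http://{proxy}"}
    start = time.time()
    try:
        ok = requests.get(site_url, proxies=proxies, timeout=timeout).ok
    except requests.RequestException:
        ok = False
    results.setdefault(proxy, {})[site_url] = (ok, round(time.time() - start, 2))
```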
Efficiency is the easy part. Proxy validation is network I/O-bound, so coroutines, multiple threads, or multiple processes all work. Python's GIL does not hold multithreading back on I/O-bound tasks, so even plain threads noticeably improve throughput.
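For example, a thread pool over the same kind of check; the worker count and probe URL are arbitrary choices for the sketch:

```python
from concurrent.futures import ThreadPoolExecutor
import requests

def is_alive(proxy, timeout=5):
    # same check as in the earlier sketch, repeated so this snippet runs on its own
    proxies = {"http": f"http://{proxy}", "https": f"http://{proxy}"}
    try:
        return requests.get("https://httpbin.org/ip", proxies=proxies, timeout=timeout).ok
    except requests.RequestException:
        return False

def validate_many(proxies, workers=50):
    """Check proxies concurrently; the GIL is released while waiting on the network."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        flags = list(pool.map(is_alive, proxies))
    return [p for p, ok in zip(proxies, flags) if ok]
```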
multithreading-spider: I built a simple proxy crawler with multithreading + a queue a while ago; the demo under src is a concrete example. It uses a simple producer-consumer model: the crawler that scrapes proxy addresses acts as the producer, the crawler that verifies proxy availability acts as the consumer, and it can show the progress of the tasks.
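A stripped-down sketch of that producer-consumer layout (not the actual multithreading-spider code): one producer feeds proxies into a queue and several consumer threads validate them. The extraction step and the probe URL are stand-ins:

```python
import queue
import threading
import requests

tasks = queue.Queue()
alive = []

def producer(urls):
    """Scrape each listing page and enqueue every proxy found (stand-in logic)."""
    for url in urls:
        for proxy in ["1.2.3.4:8080"]:   # replace with real extraction from `url`
            tasks.put(proxy)
    tasks.put(None)                      # sentinel: nothing more to produce

def consumer():
    """Take proxies off the queue and keep the ones that respond."""
    while True:
        proxy = tasks.get()
        if proxy is None:                # propagate the sentinel and stop
            tasks.put(None)
            break
        try:
            proxies = {"http": f"http://{proxy}", "https": f"http://{proxy}"}
            if requests.get("https://httpbin.org/ip", proxies=proxies, timeout=5).ok:
                alive.append(proxy)
        except requests.RequestException:
            pass

threads = [threading.Thread(target=consumer) for _ in range(10)]
for t in threads:
    t.start()
producer(["https://example.com/free-proxy-list"])
for t in threads:
    t.join()
print(alive)
```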
You can try this one, a Python-based proxy pool.
It automatically fetches proxy resources from around the web and is easy to extend.
https://github.com/WiseDoge/P...
You can take a look at this project: https://github.com/jhao104/pr...
An open-source proxy pool service.