With the rapid growth of information on the Internet, web crawlers have become an essential tool: they retrieve pages, extract data from websites, and form an important part of data collection and analysis. At the same time, the spread of anti-crawler techniques puts crawlers at constant risk of being blocked.
Website owners can resist web crawlers in a variety of ways, such as access-frequency limits, CAPTCHAs, and IP blocking. These defenses are not 100% effective, and many crawlers get around them by routing requests through proxy services. One tool that has become popular in the crawler community for exactly this purpose is Crawlera, a proxy service for crawlers (operated by Scrapinghub) that focuses on proxy management and IP rotation.
Scrapy is a popular web crawler framework written in Python. It is built on the Twisted framework and uses asynchronous processing to improve crawling efficiency. In a Scrapy crawler, using Crawlera as the proxy layer is an effective way to get around anti-crawler measures. This article describes how to use the Crawlera proxy in Scrapy to crawl data from a target website.
First, you need a Crawlera account. You can sign up on the official website and obtain an API key. Next, you can start configuring Scrapy.
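The Scrapy integration for Crawlera is provided by the scrapy-crawlera package, so it needs to be installed in the same environment as Scrapy (assuming pip is available):

pip install scrapy-crawlera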
In the project's settings.py file, add the following snippet to enable the Crawlera middleware:
CRAWLERA_ENABLED = True
CRAWLERA_APIKEY = '<Your-API-KEY>'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_crawlera.CrawleraMiddleware': 610,
}
where <Your-API-KEY> should be replaced with your own Crawlera API key. Pay attention to the number assigned to the middleware, because it determines the order in which Scrapy's downloader middlewares run. Middlewares execute in order of this value, and 610 (the value used in the scrapy-crawlera examples) places Crawlera after most other middlewares.
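For reference, here is a minimal spider that the settings above would apply to. The spider name, start URL, and selector are placeholders for your own target site, not part of the Crawlera setup itself:

import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'                       # used as <spider-name> when running "scrapy crawl"
    start_urls = ['https://example.com/']  # replace with the site you want to crawl

    def parse(self, response):
        # With CRAWLERA_ENABLED = True in settings.py, this request has already
        # been routed through Crawlera; the spider code itself needs no proxy logic.
        for title in response.css('title::text').getall():
            yield {'title': title}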
You can now run the crawler and see if Crawlera is used successfully. The command to start the crawler is:
scrapy crawl <spider-name>
If everything is set up correctly, you should see output similar to the following in the terminal window:
2017-04-11 10:26:29 [scrapy.utils.log] INFO: Using Crawlera proxy <http://proxy.crawlera.com:8010>: tor-exit-crawlera
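If you prefer not to enable Crawlera globally, the scrapy-crawlera project also documents a per-spider toggle via spider attributes. The sketch below follows that documented pattern, but the attribute names should be verified against the version of scrapy-crawlera you have installed:

import scrapy

class PartialSpider(scrapy.Spider):
    name = 'partial'
    start_urls = ['https://example.com/']  # placeholder target

    # Enable the Crawlera middleware for this spider only, instead of
    # setting CRAWLERA_ENABLED = True for the whole project.
    crawlera_enabled = True
    crawlera_apikey = '<Your-API-KEY>'

    def parse(self, response):
        yield {'url': response.url, 'status': response.status}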
Note that Crawlera is a paid service. It offers two billing models: bandwidth-based billing and request-based billing. With bandwidth billing, the amount you pay is determined by the traffic your crawler consumes; with request billing, it is determined by the total number of requests made. You can choose whichever model suits your actual needs.
It is also worth mentioning that Crawlera provides load balancing and high-availability features. These let you spread traffic across multiple proxy servers and avoid depending on any single one. Another benefit is that the Crawlera integration is designed with Scrapy's asynchronous, concurrent request model in mind.
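Because Crawlera manages request pacing and bans on its side, it is common to adjust Scrapy's own throttling settings when routing traffic through it. The values below are only illustrative assumptions, not official recommendations; tune them to the concurrency limit of your Crawlera plan:

# settings.py (illustrative values, adjust to your plan)
CONCURRENT_REQUESTS = 32      # Scrapy-side concurrency, ideally matched to the plan limit
AUTOTHROTTLE_ENABLED = False  # let Crawlera handle pacing rather than AutoThrottle
DOWNLOAD_TIMEOUT = 600        # proxied requests can take longer than direct ones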
In short, Crawlera is a key ingredient for crawling websites reliably with Scrapy, and it is an effective answer to anti-crawler measures. By using Crawlera, you can crawl data stably while saving time and effort.