How to block crawlers in PHP: first, obtain the UA string via $_SERVER['HTTP_USER_AGENT']; then store the malicious USER_AGENT values in an array; finally, block requests with an empty USER_AGENT, which is what mainstream collection programs send.
We all know there are many crawlers on the Internet. Some are useful for site indexing, such as Baidu Spider, but there are also useless ones that ignore robots rules, put pressure on the server, and bring no traffic to the site, such as Yisou Spider. (Update: Yisou Spider has since been acquired by UC Shenma Search, so it has been removed from the ban list in this article! ==> Related articles.) Recently, Zhang Ge found a large number of crawl records from Yisou and other junk spiders in his nginx logs, so he compiled the various methods found online for blocking junk spiders from crawling a website. While configuring his own site, he also offers them as a reference for other webmasters.
Enter the conf directory under the nginx installation directory and save the following code as agent_deny.conf:
cd /usr/local/nginx/conf
vim agent_deny.conf
#Block crawling by tools such as Scrapy
if ($http_user_agent ~* (Scrapy|Curl|HttpClient)) {
    return 403;
}
#Block the listed UAs and requests with an empty UA
if ($http_user_agent ~* "FeedDemon|Indy Library|Alexa Toolbar|AskTbFXTV|AhrefsBot|CrawlDaddy|CoolpadWebkit|Java|Feedly|UniversalFeedParser|ApacheBench|Microsoft URL Control|Swiftbot|ZmEu|oBot|jaunty|Python-urllib|lightDeckReports Bot|YYSpider|DigExt|HttpClient|MJ12bot|heritrix|EasouSpider|Ezooms|^$") {
    return 403;
}
#Block request methods other than GET|HEAD|POST
if ($request_method !~ ^(GET|HEAD|POST)$) {
    return 403;
}
Then, insert the following line after location / { in your site's configuration:
include agent_deny.conf;
For example, the configuration of Zhang Ge's blog:
[marsge@Mars_Server ~]$ cat /usr/local/nginx/conf/zhangge.conf
location / {
    try_files $uri $uri/ /index.php?$args;
    #Add this one line here:
    include agent_deny.conf;
    rewrite ^/sitemap_360_sp.txt$ /sitemap_360_sp.php last;
    rewrite ^/sitemap_baidu_sp.xml$ /sitemap_baidu_sp.php last;
    rewrite ^/sitemap_m.xml$ /sitemap_m.php last;
}
After saving, run the following command to reload nginx gracefully:
/usr/local/nginx/sbin/nginx -s reload
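Before reloading, it is worth checking the configuration for syntax errors first; a minimal sketch, assuming the same /usr/local/nginx install prefix as above:
#Test the configuration, and reload only if the test passes
/usr/local/nginx/sbin/nginx -t && /usr/local/nginx/sbin/nginx -s reload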
For PHP, place the following code near the top of the site's entry file (e.g., after the first <?php in index.php):
//Get UA information
$ua = $_SERVER['HTTP_USER_AGENT'];
//Store malicious USER_AGENT values in an array
$now_ua = array('FeedDemon','BOT/0.1 (BOT for JCE)','CrawlDaddy','Java','Feedly','UniversalFeedParser','ApacheBench','Swiftbot','ZmEu','Indy Library','oBot','jaunty','YandexBot','AhrefsBot','MJ12bot','WinHttp','EasouSpider','HttpClient','Microsoft URL Control','YYSpider','Python-urllib','lightDeckReports Bot');
//Block empty USER_AGENT: mainstream collection programs such as dedecms send an empty USER_AGENT, and so do some SQL injection tools
if (!$ua) {
    header("Content-type: text/html; charset=utf-8");
    die('Please do not scrape this site!');
} else {
    foreach ($now_ua as $value) {
        //Check whether the UA matches an entry in the array; stripos() replaces the original eregi() call, which was removed in PHP 7
        if (stripos($ua, $value) !== false) {
            header("Content-type: text/html; charset=utf-8");
            die('Please do not scrape this site!');
        }
    }
}
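Note that die() above still answers with an HTTP 200 status plus a message. If you prefer the PHP check to answer with a real 403 like the nginx rules do, here is a minimal sketch; this is an illustrative variant rather than the original author's code, it assumes PHP 5.4+ for http_response_code(), and the shortened ban list is just for demonstration:
//Hypothetical variant: reply with a real 403 Forbidden instead of 200 + message
$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
$banned = array('FeedDemon', 'AhrefsBot', 'MJ12bot', 'Python-urllib'); //shortened list for illustration
$blocked = ($ua === ''); //an empty UA is blocked outright
foreach ($banned as $value) {
    //Case-insensitive substring match against the ban list
    if (stripos($ua, $value) !== false) {
        $blocked = true;
        break;
    }
}
if ($blocked) {
    http_response_code(403); //send the 403 status header
    exit('Forbidden');
}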
Testing is simple if you have a VPS: just use curl -A to spoof the User-Agent and crawl, for example:
Simulate Yisou Spider crawling:
curl -I -A 'YisouSpider' zhang.ge
Simulate crawling when UA is empty:
curl -I -A '' zhang.ge
Simulate the crawling of Baidu Spider:
curl -I -A 'Baiduspider' zhang.ge
The screenshots of the three crawl results are as follows:
It can be seen that both Yisou Spider and the empty-UA request get a 403 Forbidden response, while Baidu Spider gets a 200, which shows the rules are in effect!
①. Junk collectors with empty UA information are intercepted:
②. Banned UAs are intercepted:
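In plain text, the expected responses from the three curl -I tests above look roughly like this (headers abridged; note that YisouSpider was still on the ban list when these tests were run):
HTTP/1.1 403 Forbidden    <- YisouSpider and the empty UA
HTTP/1.1 200 OK           <- Baiduspider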
Therefore, for junk spiders that scrape content, we can analyze the site's access logs to find the names of spiders we have not seen before, confirm them, and add them to the ban list in the code above to block crawling. A quick way to pull the most frequent User-Agents out of the log is shown below.
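For example, a sketch that counts the top User-Agents in an nginx access log, assuming the default combined log format and a log path of /usr/local/nginx/logs/access.log (adjust the path to your setup):
#The User-Agent is the 6th double-quoted field in the combined log format
awk -F'"' '{print $6}' /usr/local/nginx/logs/access.log | sort | uniq -c | sort -rn | head -20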
The following is a list of common junk UAs found online, for reference only; additions are welcome.
FeedDemon              content scraper
BOT/0.1 (BOT for JCE)  SQL injection
CrawlDaddy             SQL injection
Java                   content scraper
Jullo                  content scraper
Feedly                 content scraper
UniversalFeedParser    content scraper
ApacheBench            CC attack tool
Swiftbot               useless crawler
YandexBot              useless crawler
AhrefsBot              useless crawler
YisouSpider            useless crawler (acquired by UC Shenma Search; this spider can be unblocked!)
MJ12bot                useless crawler
ZmEu                   phpMyAdmin vulnerability scanner
WinHttp                scraping / CC attack
EasouSpider            useless crawler
HttpClient             TCP attack
Microsoft URL Control  scanner
YYSpider               useless crawler
jaunty                 WordPress brute-force scanner
oBot                   useless crawler
Python-urllib          content scraper
Indy Library           scanner
FlightDeckReports Bot  useless crawler
Linguee Bot            useless crawler
The above is the detailed content of How to set php to prohibit crawling websites. For more information, please follow other related articles on the PHP Chinese website!