如何准确判断请求是搜索引擎爬虫(蜘蛛)发出的请求
网站经常会被各种爬虫光顾,有的是搜索引擎爬虫,有的不是,通常情况下这些爬虫都有UserAgent,而我们知道UserAgent是可以伪装的,UserAgent的本质是Http请求头中的一个选项设置,通过编程的方式可以给请求设置任意的UserAgent。
所以通过UserAgent判断请求的发起者是否是搜索引擎爬虫(蜘蛛)的方式是不靠谱的,更靠谱的方法是通过请求者的ip对应的host主机名是否是搜索引擎自己家的host的方式来判断。
要获得ip的host,在windows下可以通过nslookup命令,在linux下可以通过host命令来获得,例如:
这里我在windows下执行了nslookup ip 的命令,从上图可以看到这个ip的主机名是crawl-66-249-64-119.googlebot.com。 这说明这个ip是一个google爬虫,google爬虫的域名都是 xxx.googlebot.com.
我们也可以通过python程序的方式来获得ip的host信息,代码如下:
import socket def getHost(ip): try: result=socket.gethostbyaddr(ip) if result: return result[0], None except socket.herror,e: return None, e.message
上述代码使用了socket模块的gethostbyaddr的方法获得ip地址的主机名。
常用蜘蛛的域名都和搜索引擎官网的域名相关,例如:
百度的蜘蛛通常是baidu.com或者baidu.jp的子域名
google爬虫通常是googlebot.com的子域名
微软bing搜索引擎爬虫是search.msn.com的子域名
搜狗蜘蛛是crawl.sogou.com的子域名
基于以上原理,我写了一个工具页面提供判断ip是否是真实搜索引擎的工具页面,该页面上提供了网页判断的工具和常见的google和bing的搜索引擎爬虫的ip地址。
附带常见搜索引擎蜘蛛的IP段:
蜘蛛名称 | IP地址 |
---|---|
Baiduspider |
202.108.11.* 220.181.32.* 58.51.95.* 60.28.22.* 61.135.162.* 61.135.163.* 61.135.168.* |
YodaoBot |
202.108.7.215 202.108.7.220 202.108.7.221 |
Sogou web spider |
219.234.81.* 220.181.61.* |
Googlebot |
203.208.60.* |
Yahoo! Slurp |
202.160.181.* 72.30.215.* 74.6.17.* 74.6.22.* |
Yahoo ContentMatch Crawler |
119.42.226.* 119.42.230.* |
Sogou-Test-Spider |
220.181.19.103 220.181.26.122 |
Twiceler |
38.99.44.104 64.34.251.9 |
Yahoo! Slurp China |
202.160.178.* |
Sosospider | 124.115.0.* |
CollapsarWEB qihoobot |
221.194.136.18 |
NaverBot |
202.179.180.45 |
Sogou Orion spider |
220.181.19.106 220.181.19.74 |
Sogou head spider |
220.181.19.107 |
SurveyBot |
216.145.5.42 64.246.165.160 |
Yanga WorldSearch Bot v |
77.91.224.19 91.205.124.19 |
baiduspider-mobile-gate |
220.181.5.34 61.135.166.31 |
discobot |
208.96.54.70 |
ia_archiver | 209.234.171.42 |
msnbot |
65.55.104.209 65.55.209.86 65.55.209.96 |
sogou in spider |
220.181.19.216 |
ps:https协议网页能够被搜索引擎收录吗
百度现在只能收录少部分的https,大部分的https网页无法收录。
不过我查询了google资料,Google能够比较好地收录https协议的网站。
所以如果你的网站是中文的,而且比较关注搜索引擎自然排名流量这块,建议尽量不要将所有内容都放到https中去加密去。
可考虑的方式是:
1、对于需要加密传递的数据,使用https,比如用户登录以及用户登录后的信息;
2、对于普通的新闻、图片,建议使用http协议来传输;
3、网站首页建议使用http协议的形式。

Hot AI Tools

Undresser.AI Undress
AI-powered app for creating realistic nude photos

AI Clothes Remover
Online AI tool for removing clothes from photos.

Undress AI Tool
Undress images for free

Clothoff.io
AI clothes remover

AI Hentai Generator
Generate AI Hentai for free.

Hot Article

Hot Tools

Notepad++7.3.1
Easy-to-use and free code editor

SublimeText3 Chinese version
Chinese version, very easy to use

Zend Studio 13.0.1
Powerful PHP integrated development environment

Dreamweaver CS6
Visual web development tools

SublimeText3 Mac version
God-level code editing software (SublimeText3)

Hot Topics



It's easy to change the search engine in Safari, Google Chrome, or other browsers on your iPhone or iPad. This tutorial will show you how to do it on four different web browsers available on iPhone and iPad. How to Change the Safari Search Engine on iPhone or iPad Safari is the default web browser on iOS and iPadOS, but you might not like the search engine. Fortunately, you can use these steps to change it: On your iPhone or iPad, launch Settings from the Home screen. Swipe down and tap Safari from the list. In the next menu,

Baidu Cloud is a software that allows users to store many files. So what is the entrance to Baidu Cloud Disk search engine? Users can enter the URL https://pan.baidu.com to enter Baidu Cloud Disk. This sharing of the latest entrance to Baidu Cloud Disk search engine will give you a detailed introduction. The following is a detailed introduction. Take a look. . Baidu cloud disk search engine entrance 1. Qianfan search website: https://pan.qianfan.app Supports network disk: aggregate search, Alibaba, Baidu, Quark, Lanzuo, Tianyi, Xunlei network disk viewing method: login required, follow the company Advantages of obtaining the activation code: The network disk is comprehensive, there are many resources, and the interface is simple. 2. Maolipansou website: alipansou.c

Java development: How to implement search engine and full-text retrieval functions, specific code examples are required Search engines and full-text retrieval are important functions in the modern Internet era. Not only do they help users find what they want quickly, they also provide a better user experience for websites and apps. This article will introduce how to use Java to develop search engines and full-text retrieval functions, and provide some specific code examples. Full-text search using Lucene library Lucene is an open source full-text search engine library, developed by ApacheSo

PHP Search Engine Performance Optimization: Algolia’s Magical Way With the development of the Internet and the increasing user requirements for search experience, search engine performance optimization has become crucial. In the world of PHP development, Algolia is a powerful and easy-to-integrate search engine service. This article will introduce the magical uses of Algolia and how to optimize the performance of PHP search engines through Algolia. Algolia introduction Algolia is a search engine service provider based on SaaS model.

Since its launch late last year, ChatGPT has been seen as a major threat to traditional ways of searching for information. Because it is diverse, you can answer people's questions, write essays or poems, or even write program code. The ability of conversational AI to provide coherent answers is considered a threat to Google's search engine, which for decades has been the benchmark platform for people to search for information on the Internet. OpenAI’s ChatGPT can tailor answers to specific questions asked by users, which can save time browsing websites. A report published by The New York Times in December revealed that ChatGPT’s overnight success forced Google to call it “Code Red” and begin addressing the threat posed by artificial intelligence chatbots to its search engine business. according to

How to change the search engine in Google Chrome? Google Chrome is a very popular browser among users. It not only has simple and easy-to-use services, practical tools and other auxiliary functions, but also can meet the different needs of different users. Search engines generally default to Google. If we want to How should I set it up to replace it? Let me share the method below. Replacement method 1. Click to open Google Chrome. 2. Click the three-dot icon to open the menu interface. 3. Click the Settings option to enter the browser’s settings interface. 4. Find the search engine module in the settings interface. 5. Click the Manage Search Engine button. 6. You can see an add button. Click this add button to add a search engine.

With the continuous development of the information age, people increasingly rely on the Internet to obtain information. As one of the platforms for information sharing, web search engines are also constantly evolving and improving. This article will introduce how to implement a full-text search engine in PHP7.0, helping readers make better use of PHP technology and quickly build an efficient search engine. 1. Overview of full-text search engines Full-text search uses keywords or phrases to search throughout the document to find the most matching results. Full-text search engines use algorithms to index documents to speed up searches. exist

Google Chrome is very good. There are many friends who use it. Many friends want to use Google’s own search engine, but don’t know how to use it. Here is a quick look at how to use Google Chrome’s Google search engine. Bar. How to use the Google search engine in Google Chrome: 1. Open Google Chrome and click More in the upper right corner to open settings. 2. After entering settings, click "Search Engine" on the left. 3. Check whether your search engine is "Google". 4. If not, you can click the drop-down button and change it to "Google".
