The Robots protocol (also known as the crawler protocol or robot protocol) is formally called the "Robots Exclusion Protocol". Websites use it to tell search engines which pages may be crawled and which may not. This article introduces the robots crawler protocol in detail.
Through a robots.txt file, the protocol tells search engines which pages can be crawled, which cannot, and what crawling rules apply. The file is a plain text file placed in the root directory of the website, and it can be edited with any common text editor. For webmasters, a well-written robots.txt file makes better use of search engines, keeps crawlers away from low-quality pages, and improves the quality of the website and its friendliness to search engines.
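Because the file always sits in the root directory, its address can be derived from any page URL on the same site. The following is a minimal sketch of that idea; the function name `robots_url` and the example.com address are illustrative assumptions, not part of the protocol.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt URL for the site that serves page_url (hypothetical helper)."""
    parts = urlsplit(page_url)
    # Keep only the scheme and host; robots.txt lives at the root path,
    # so the page's own path, query string and fragment are discarded.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


if __name__ == "__main__":
    print(robots_url("https://www.example.com/ab/adc.html?x=1"))
```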
The directives are written as follows (* is a wildcard character):
Disallow: /ab/adc.html    # disallows crawling of the adc.html file under the ab folder
Allow: /cgi-bin/          # allows crawling of everything under the cgi-bin directory
Allow: /tmp               # allows crawling of the entire tmp directory
Allow: /*.htm$            # allows access to URLs with the suffix ".htm"
Allow: /*.gif$            # allows crawling of images in gif format
Sitemap: (sitemap URL)    # tells the crawler that this URL is the site's sitemap
Overview
robots.txt is a text file, and it is the first file a search engine looks at when visiting a website. It tells the spider which files on the server may be viewed.
When a search spider visits a site, it first checks whether robots.txt exists in the root directory of the site. If it exists, the search robot determines its scope of access from the contents of the file; if it does not exist, all search spiders can access every page on the website that is not password protected.
[Principle]
The Robots protocol is a widely accepted code of conduct in the international Internet community. It is based on the following principles:
1. Search technology should serve people, while respecting the wishes of information providers and protecting their privacy rights;
2. Websites have an obligation to protect their users' personal information and privacy from infringement.
[Note] robots.txt must be placed in the root directory of a site, and the file name must be all lowercase.
Writing
[User-agent] In the following code, * is a wildcard character that stands for all search engine robots.
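User-agent: *              # all search engine robots
User-agent: Baiduspider    # only Baidu's spider
Disallow: /admin/          # do not crawl anything under the admin directory
Disallow: /*.jpg$          # do not crawl any .jpg image on the site
Disallow: /ab/adc.html     # do not crawl the adc.html file under the ab folder
Disallow: /*?*             # do not crawl any URL containing a question mark (?)
Disallow: /                # do not crawl any part of the site
Allow: /*.html$            # allow URLs ending in ".html"
Allow: /tmp                # allow the entire tmp directory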
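To make the wildcard rules concrete, here is a minimal sketch of how a path pattern such as `/*?*` or `/admin/` could be tested against a URL path. It assumes the widely used semantics in which * matches any run of characters, a trailing $ anchors the end of the URL, and every rule is matched from the start of the path; individual crawlers may differ. The helper names `rule_to_regex` and `rule_matches` are hypothetical.

```python
import re


def rule_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a compiled regex (illustrative only)."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape each literal piece, then join the pieces with '.*' so that
    # every '*' in the pattern matches any sequence of characters.
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def rule_matches(pattern: str, path: str) -> bool:
    # re.match anchors at the start of the string, giving prefix matching.
    return rule_to_regex(pattern).match(path) is not None


if __name__ == "__main__":
    for pattern, path in [
        ("/admin/", "/admin/login.php"),
        ("/*?*", "/list.php?page=2"),
        ("/*?*", "/index.html"),
        ("/*.jpg$", "/img/a.jpg"),
    ]:
        print(pattern, path, rule_matches(pattern, path))
```

Note that without a trailing $ a rule is a prefix match, so under these assumptions `/tmp` also covers `/tmp/x` and `/tmpfile`.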
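Allow all robots to access the whole site:
User-agent: *
Allow: /

Block all search engines from every part of the site:
User-agent: *
Disallow: /

Block one particular search engine (here Baiduspider):
User-agent: Baiduspider
Disallow: /

Block all robots from certain directories:
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /~joe/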
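Rule sets like these can be checked programmatically. The sketch below feeds a combined rule set, built from the Baiduspider and directory examples, to Python's standard `urllib.robotparser` module and asks whether a given robot may fetch a URL. The rule text, the robot name "SomeBot", the example.com URLs and the `allowed` helper are assumptions made for illustration. As far as I know, this standard-library parser does plain prefix matching and does not interpret * or $ inside paths, so the example sticks to directory rules.

```python
from urllib.robotparser import RobotFileParser

# Illustrative rule set: Baiduspider is blocked entirely,
# every other robot is kept out of three directories.
RULES = """\
User-agent: Baiduspider
Disallow: /

User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /~joe/
"""

parser = RobotFileParser()
# parse() accepts an iterable of lines, so no network access is needed.
parser.parse(RULES.splitlines())


def allowed(agent: str, url: str) -> bool:
    """Return True if the parsed rules let `agent` fetch `url` (hypothetical helper)."""
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    for agent, url in [
        ("Baiduspider", "https://www.example.com/index.html"),
        ("SomeBot", "https://www.example.com/index.html"),
        ("SomeBot", "https://www.example.com/tmp/a.txt"),
    ]:
        print(agent, url, allowed(agent, url))
```

To check a live site instead, `RobotFileParser` also offers `set_url()` and `read()`, which download the file before `can_fetch()` is called.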
[Myth 1]: All files on the website need to be crawled by spiders, so there is no need to add a robots.txt file; if the file does not exist, all search spiders can by default access every page on the website that is not password protected.
Whenever a user tries to access a URL that does not exist, the server records a 404 error (file not found) in its log. Likewise, whenever a search spider requests a robots.txt file that does not exist, the server records a 404 error in the log, so a robots.txt file should still be added to the website.
[Myth 2]: Setting every file to be crawlable in robots.txt increases the indexing rate of the website.
Even if program scripts, style sheets and other such files are indexed by spiders, they do not increase the site's indexing rate and only waste server resources. The robots.txt file should therefore be set so that search spiders are not allowed to index these files.