What Directives are Used in the Robots Exclusion Protocol?

Written by Oleg Tyshchenko

How to use the "User-agent" directive?

Here are some of the most common user agents:

  • * – The rules apply to every bot, unless there is a more specific set of rules
  • Googlebot – All Google crawlers
  • Googlebot-Image – Crawler for Google Images
  • Googlebot-News – Crawler for Google News
  • Googlebot-Video – Crawler for Google Video
  • Mediapartners-Google – Google AdSense crawler
  • FeedFetcher-Google – Fetches RSS or Atom feeds for Google Podcasts, Google News, and PubSubHubbub
  • Bingbot – Bing’s crawler
  • Yandex – All Yandex crawlers
  • YandexBot – The main indexing robot of Yandex
  • YandexImages – Indexes images to display them in Yandex.Images
  • YandexMedia – Indexes multimedia data (video and audio)
  • Baiduspider – Baidu’s crawler

Examples of using user agents

User-agent: Googlebot
Disallow: /admin/

User-agent: Bingbot
Disallow: /private/

User-agent: AhrefsBot
Disallow: /

In this example, the directives apply to three different user agents: Googlebot, Bingbot, and AhrefsBot. The first "Disallow" directive instructs Googlebot not to crawl the "/admin/" directory, while still allowing it to access "/private/" and the rest of the site. The second "Disallow" directive instructs Bingbot not to crawl the "/private/" directory, while still allowing it to access "/admin/" and the rest of the site. The last "Disallow" directive instructs AhrefsBot not to crawl any page of the site.
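To check how these rules are interpreted in practice, you can use Python's standard-library urllib.robotparser, which evaluates a robots.txt file for a given user agent. The example.com URLs below are placeholders:

```python
import urllib.robotparser

# The robots.txt rules from the example above
robots_txt = """\
User-agent: Googlebot
Disallow: /admin/

User-agent: Bingbot
Disallow: /private/

User-agent: AhrefsBot
Disallow: /
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

# Googlebot is blocked from /admin/ but may crawl /private/
print(rp.can_fetch("Googlebot", "https://example.com/admin/"))    # False
print(rp.can_fetch("Googlebot", "https://example.com/private/"))  # True

# AhrefsBot is blocked from the entire site
print(rp.can_fetch("AhrefsBot", "https://example.com/"))          # False

# Robots with no matching group (e.g. Sogou) may crawl everything
print(rp.can_fetch("Sogou", "https://example.com/admin/"))        # True
```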

Other robots (Sogou, SemrushBot, MJ12bot, and so on) have no matching group here and may crawl the entire site.

What does "User-agent: *" mean?

"User-agent: *" is a directive in a robots.txt file that applies to all web robots and search engine crawlers.

When a web robot or search engine crawler reads a robots.txt file and encounters the "User-agent: *" directive, it applies the rules that follow to all user agents, unless there is a more specific group of rules for a particular user agent.

For example, the following robots.txt file allows all web robots and search engine crawlers to crawl all pages of a website:

User-agent: *
Disallow:

In this example, the "User-agent: *" directive applies to all user agents, and the empty "Disallow:" directive blocks nothing, so all pages of the website can be crawled and indexed.

However, it's important to note that the "User-agent: *" directive can be overridden by more specific directives that apply to particular user agents. For example, the following robots.txt file allows all user agents to crawl all pages of a website except for the pages in the "/admin/" directory:

User-agent: *
Disallow: /admin/

In this example, the "User-agent: *" directive applies to all user agents, and the "Disallow: /admin/" directive excludes the "/admin/" directory (and everything under it) from crawling for all of them. A more specific group, such as one starting with "User-agent: Googlebot", would take precedence over the "*" group for that particular crawler.
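The precedence of a specific group over the "*" group can be demonstrated with Python's urllib.robotparser. In this sketch, a Googlebot group with an empty "Disallow:" overrides the general block:

```python
import urllib.robotparser

# A "*" group blocks /admin/ for everyone, but a more specific
# Googlebot group (with an empty Disallow) takes precedence for it
robots_txt = """\
User-agent: *
Disallow: /admin/

User-agent: Googlebot
Disallow:
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch("Googlebot", "/admin/"))  # True  (specific group wins)
print(rp.can_fetch("Bingbot", "/admin/"))    # False (falls back to "*")
```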

How do "Disallow" commands work in a robots.txt file?

The "Disallow" directive in a robots.txt file is used to instruct web robots and search engine crawlers not to crawl or index specific pages or directories on a website.

The syntax of the "Disallow" directive is as follows:

User-agent: [user-agent name]
Disallow: [URL path]

The "[user-agent name]" specifies the user agent to which the directive applies. If the user agent name is "*", the directive applies to all user agents.

The "[URL path]" specifies the URL path that the user agent is not allowed to access. The URL path can be a single file, a directory, or a pattern of files or directories using the wildcard character "*".

For example, the following robots.txt snippets disallow all user agents from accessing various directories and files.

Example 1

User-agent: *
Disallow: /admin/

In this example, the "Disallow: /admin/" directive applies to all user agents and specifies that the "/admin/" directory and all of its subdirectories should not be crawled or indexed.

Blocked:
/admin/
/admin/login.php

Not blocked:
/cms/admin/
/administrator/

Example 2

Disallow: /*admin/

Blocked:
/admin/
/admin/login.php
/cms/admin/

Not blocked:
/administrator/

Example 3

Disallow: /*admin*/

Blocked:
/admin/
/admin/login.php
/cms/admin/
/administrator/

Example 4

Disallow: /*admin*/$

Blocked:
/admin/
/cms/admin/
/administrator/

Not blocked:
/admin/login.php
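The wildcard matching in Examples 2–4 can be sketched in Python. Note that the standard-library urllib.robotparser follows the classic protocol and does not expand "*" and "$", so the helper below translates a pattern into a regular expression instead: "*" matches any sequence of characters, and a trailing "$" anchors the pattern to the end of the URL. This is an illustrative sketch of the matching rules, not a full robots.txt parser:

```python
import re

def robots_match(pattern: str, path: str) -> bool:
    """Return True if a robots.txt path pattern blocks the given path."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape regex metacharacters, then turn "*" into ".*"
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    if anchored:
        regex += "$"  # "$" pins the pattern to the end of the URL
    # Without "$", robots.txt rules are prefix matches, like re.match
    return re.match(regex, path) is not None

# Example 4: Disallow: /*admin*/$
for path in ["/admin/", "/cms/admin/", "/administrator/", "/admin/login.php"]:
    print(path, robots_match("/*admin*/$", path))
# /admin/ True, /cms/admin/ True, /administrator/ True,
# /admin/login.php False
```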

It's important to note that web robots and search engine crawlers are not required to obey the "Disallow" directive in a robots.txt file. Some web robots and crawlers may still access the pages and directories that are disallowed by the directive. Additionally, the "Disallow" directive only applies to well-behaved web robots and search engine crawlers that follow the Robots Exclusion Protocol, and not to malicious web robots that may ignore the directive altogether.

How do "Allow" commands work in a robots.txt file?

In addition to the "Disallow" directive, the Robots Exclusion Protocol includes the "Allow" directive, which gives finer-grained control over what web robots and search engine crawlers may access:

"Allow": This directive is used to override a previous "Disallow" directive for a specific URL path or directory. It specifies that the user agent is allowed to access the specified URL path or directory, even if it was previously disallowed.

User-agent: Googlebot
Disallow: /admin/
Allow: /admin/login.php

In this example, the "Disallow" directive instructs Googlebot not to crawl the "/admin/" directory, while the "Allow" directive makes an exception so that the "/admin/login.php" page can still be crawled.
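Google documents that when "Allow" and "Disallow" rules conflict, the most specific (longest) matching rule wins. A minimal sketch of that precedence, assuming plain prefix rules without wildcards:

```python
# Rules from the example above, as (directive, path) pairs
rules = [("disallow", "/admin/"), ("allow", "/admin/login.php")]

def is_allowed(path: str) -> bool:
    """The longest matching rule wins; on a tie, allow wins,
    mirroring Google's documented Allow/Disallow precedence."""
    matches = [(len(p), kind == "allow") for kind, p in rules
               if path.startswith(p)]
    if not matches:
        return True  # no rule applies: crawling is permitted
    _, allowed = max(matches)
    return allowed

print(is_allowed("/admin/login.php"))  # True  (longer Allow rule wins)
print(is_allowed("/admin/settings"))   # False (only Disallow matches)
print(is_allowed("/blog/post.html"))   # True  (no rule matches)
```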

Why use the "Crawl-delay" directive?

The Crawl-delay directive is used to specify the amount of time that a search engine crawler should wait between requests to a website. It can be used to reduce the load on the server caused by frequent requests from a web robot or crawler.

It's important to note that not all search engines support the Crawl-delay directive, and even those that do may not always adhere to the specified delay time. In addition, some search engines may use alternative methods to regulate the crawl rate of their crawlers, such as the Google Search Console crawl rate setting. Therefore, it's generally a good idea to use other methods, such as optimizing your server and website for efficient crawling, to manage crawl rates and avoid overloading your website.
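For crawlers you write yourself, Python's urllib.robotparser (3.6+) exposes the value via crawl_delay(); whether the delay is honored is entirely up to the crawler:

```python
import urllib.robotparser

robots_txt = """\
User-agent: *
Crawl-delay: 10
Disallow:
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

# A polite crawler reads the delay and sleeps between requests
print(rp.crawl_delay("MyBot"))  # 10
```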

What is the Sitemaps protocol? Why is it included in robots.txt?

The Sitemaps protocol is a method used by website owners to inform search engines about the pages on their site that are available for crawling and indexing. The Sitemaps protocol allows website owners to submit a file (usually in XML format) to search engines that contains a list of all the URLs on their site, along with additional information such as when each URL was last modified and how frequently it is updated.

The Sitemaps protocol is included in the robots.txt file as a way to tell search engines where the sitemap file is located on the website. This is done using the "Sitemap" directive, which specifies the URL of the sitemap file. By including the sitemap URL in the robots.txt file, website owners can make it easier for search engines to find and crawl all the pages on their site.

Here's an example of a robots.txt file that includes the "Sitemap" directive:

User-agent: *
Disallow:
Sitemap: https://example.com/sitemap.xml

In this example, the "User-agent" directive applies to all user agents, and the "Disallow" directive allows all pages of the website to be crawled and indexed. The "Sitemap" directive specifies the location of the website's sitemap file, which is located at https://example.com/sitemap.xml.
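Python's urllib.robotparser (3.8+) can also read the "Sitemap" directive via site_maps(), which returns the listed sitemap URLs or None if the file has none:

```python
import urllib.robotparser

# The robots.txt file from the example above
robots_txt = """\
User-agent: *
Disallow:
Sitemap: https://example.com/sitemap.xml
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.site_maps())  # ['https://example.com/sitemap.xml']
```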

By including the Sitemaps protocol in the robots.txt file, website owners can ensure that their site is fully crawled and indexed by search engines, which can lead to improved search engine rankings and increased traffic to their site.