Guide to Google Robots.txt File and Robots Exclusion Standard Protocols

By Logicserve News Desk

March 2, 2022
Beginner's guide to the robots.txt file

A robots.txt file is basically a text file that webmasters create to instruct search engine robots which URLs/pages to crawl on a website. It is used mainly to make better use of the site's crawl budget.

What is a robots.txt file used for?

A robots.txt file is used to manage crawler traffic to the site and, depending on the file type, to keep certain files out of Google:

Web pages

  • Manages your website's crawl budget
  • Avoids crawling unimportant or similar pages on your site
  • A disallowed page can still appear in search results, but without a description

Note – If other pages link to your page with descriptive anchor text, Google could still index the page without visiting it.

Digital assets (image, video, and audio files)

  • Prevents image, video, and audio files from appearing in Google search results
  • Does not prevent other pages or users from linking to your image, video, or audio file

Robots.txt file limitations:

Depending on your goals, consider other mechanisms to ensure your URLs are not findable on the web.

  • Robots.txt directives may not be supported by all search engines.
  • Different search engine crawlers may interpret the syntax differently.
  • A page that is disallowed in robots.txt can still be indexed if it is linked to from other sites.

Steps to create a perfect Robots.txt file:

A robots.txt file must sit at the root of the domain. For instance, if the site URL is www.abc.com, the robots.txt file lives at www.abc.com/robots.txt. robots.txt is a plain text file that follows the Robots Exclusion Standard and consists of one or more rules. Each rule allows or blocks access for a specified search engine crawler to a specified file path on that website. Unless rules are specified in the robots.txt file, all files on the site are allowed for crawling.

A simple robots.txt file with a few rules:

User-agent: Googlebot
Disallow: /nogooglebot/

User-agent: *
Allow: /

Sitemap: http://www.abc.com/sitemap.xml

This file means:

  1. The user agent named Googlebot cannot crawl any URL that starts with http://www.abc.com/nogooglebot/.
  2. All other user agents can crawl the entire site. By default, other user agents are allowed to crawl the entire website even if this rule isn’t added.
  3. The site’s sitemap file is located at http://www.abc.com/sitemap.xml
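
To sanity-check how these rules behave, here is a minimal sketch using Python's standard urllib.robotparser module; the domain www.abc.com and the rules are the illustrative ones above, and the parser is only a simplified approximation of Google's own matcher:

from urllib import robotparser

# The sample robots.txt from above (illustrative domain)
rules = """
User-agent: Googlebot
Disallow: /nogooglebot/

User-agent: *
Allow: /

Sitemap: http://www.abc.com/sitemap.xml
""".splitlines()

parser = robotparser.RobotFileParser()
parser.parse(rules)

# Googlebot is blocked from /nogooglebot/ but may crawl the rest of the site
print(parser.can_fetch("Googlebot", "http://www.abc.com/nogooglebot/page.html"))  # False
print(parser.can_fetch("Googlebot", "http://www.abc.com/about-us/"))              # True

# Every other user agent may crawl the whole site
print(parser.can_fetch("Bingbot", "http://www.abc.com/nogooglebot/page.html"))    # True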

Formats and Location Rules:

  •  The file name must be robots.txt
  •  A site can have only one robots.txt file
  • The robots.txt file must be placed at the root of the website host to which it applies, for instance http://www.abc.com/robots.txt. It cannot be placed in a subdirectory, e.g. http://www.abc.com/page/robots.txt
  • A robots.txt file can also apply to a subdomain (http://web.abc.com/robots.txt) or to non-standard ports (http://www.abc.com:8181/robots.txt), as shown in the sketch below
  • Google may ignore characters that are not part of the UTF-8 range, potentially rendering robots.txt rules invalid.
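
A quick way to check where a crawler will look for the file is to strip any URL down to its scheme and host (including the port, if any) and append /robots.txt. Below is a small sketch of that logic in Python using urllib.parse; the URLs are the illustrative ones from this list:

from urllib.parse import urlsplit

def robots_txt_location(page_url: str) -> str:
    """Return the robots.txt URL that governs the given page URL.

    robots.txt always sits at the root of the host, so only the scheme
    and network location (hostname plus optional port) are kept.
    """
    parts = urlsplit(page_url)
    return f"{parts.scheme}://{parts.netloc}/robots.txt"

# Subdomains and non-standard ports each get their own robots.txt
print(robots_txt_location("http://www.abc.com/page/article.html"))   # http://www.abc.com/robots.txt
print(robots_txt_location("http://web.abc.com/blog/"))               # http://web.abc.com/robots.txt
print(robots_txt_location("http://www.abc.com:8181/search?q=loan"))  # http://www.abc.com:8181/robots.txt
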
Adding rules to the robots.txt file

Rules are the instructions that tell search engine crawlers which parts of your site they can crawl. Follow these guidelines when adding rules to your robots.txt file:

  • A robots.txt file consists of one or more groups
  • Each group consists of multiple rules or directives, one directive per line. Each group begins with a User-agent line that specifies the target of the group. For example:

User-agent: *
Allow: /
Disallow: /search/

  • A group gives the following information:
  • Who the group applies to (the user agent)
  • Which directories or files the user agent can access
  • Which directories or files the user agent cannot access

  • Search engine crawlers process groups from top to bottom
  • By default, the user agent can crawl any page or directory not blocked by a disallow rule
  • Rules are case-sensitive. For example, disallow: /login.asp applies to https://www.abc.com/login.asp, but not https://www.abc.com/LOGIN.asp
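
As a rough illustration of the case-sensitivity point, Python's urllib.robotparser (again, only an approximation of Google's matcher) treats the two spellings differently; the domain and rule are the illustrative ones from the bullet above:

from urllib import robotparser

parser = robotparser.RobotFileParser()
parser.parse("""
User-agent: *
Disallow: /login.asp
""".splitlines())

# Paths are compared case-sensitively, so only the lowercase URL is blocked
print(parser.can_fetch("Googlebot", "https://www.abc.com/login.asp"))  # False (blocked)
print(parser.can_fetch("Googlebot", "https://www.abc.com/LOGIN.asp"))  # True (allowed)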

The # character marks the start of a comment.
Google's crawlers support the following directives in robots.txt files:
User-agent:

Each group must name the search engine crawler it applies to with a User-agent line, which defines the start of the group of directives. Using an asterisk (*) matches all search engine crawlers except for the various AdsBot crawlers, which must be named explicitly.

For example:
# Example 1: Block all but AdsBot crawlers
User-agent: *
Disallow: /

# Example 2: Block only Googlebot
User-agent: Googlebot
Disallow: /

# Example 3: Block Googlebot and AdsBot
User-agent: Googlebot
User-agent: AdsBot-Google
Disallow: /

disallow:
Each rule needs at least one disallow or allow entry. A disallow entry names a directory or page, relative to the root domain, that you do not want the user agent to crawl. If the rule refers to a page, it must be the full page path as shown in the browser; if it refers to a directory, it must start with a / character and end with the / mark. For example:

#Disallow: Page and directory

User-agent: *
Disallow: /login.html
Disallow: /search/
allow:

Each rule needs at least one allow or disallow entry. An allow entry names a directory or page, relative to the root domain, that the user agent mentioned above may crawl. The allow rule is used to override a disallow directive, so that a subdirectory or page within a disallowed directory can still be crawled.

For a single page, specify the entire URL path as shown in the browser. For a directory, end the rule with a forward slash (/).

For example:

#Allow: Page and directory, and override
User-agent: *
Allow: /login.html
Allow: /loan/
Disallow: /search/
Allow: /search/public/

sitemap:

One or more XML sitemaps can be declared in robots.txt. Sitemaps are an effective way to indicate which content Google should crawl, as opposed to which content it can or cannot crawl. For example:

Sitemap: https://www.abc.com/sitemap-images.xml

Sitemap: https://www.abc.com/sitemap-index.xml
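
If you manage robots.txt programmatically, Python's urllib.robotparser (3.8+) can read the declared sitemaps back out of a parsed file; the sitemap URLs are the illustrative ones above:

from urllib import robotparser

parser = robotparser.RobotFileParser()
parser.parse("""
User-agent: *
Allow: /

Sitemap: https://www.abc.com/sitemap-images.xml
Sitemap: https://www.abc.com/sitemap-index.xml
""".splitlines())

# site_maps() returns the declared sitemap URLs, or None if there are none
print(parser.site_maps())
# ['https://www.abc.com/sitemap-images.xml', 'https://www.abc.com/sitemap-index.xml']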

Upload the robots.txt file

Once the robots.txt file is created, make it available to web crawlers by uploading it to your site. How you upload it depends on your site and server architecture; contact your hosting company if you are unsure.

Test robots.txt markup

To test whether your uploaded robots.txt file is publicly accessible, open a browser window and navigate to its location, e.g. https://www.abc.com/robots.txt
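
The same check can be scripted. A minimal sketch with Python's urllib.request, assuming the illustrative URL from this guide (swap in your own site's robots.txt location):

from urllib import request

# Illustrative URL; replace with your own site's robots.txt location
url = "https://www.abc.com/robots.txt"

with request.urlopen(url) as response:
    # A 200 status means the file is publicly reachable
    print("HTTP status:", response.status)
    print("Content type:", response.headers.get("Content-Type"))
    # Show the first few lines so you can confirm they match what you uploaded
    body = response.read().decode("utf-8", errors="replace")
    print("\n".join(body.splitlines()[:5]))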

Google also offers a robots.txt testing tool in Search Console. You can only use this tool for robots.txt files that are already accessible on your site.

Submit robots.txt file to Google

Google's crawlers automatically find and start using your robots.txt file once it is uploaded to the domain's root folder. No action is required here.

If you update the robots.txt file, it is essential that you refresh Google's cached copy.

Useful robots.txt rules

Here are a few common useful robots.txt rules:

Disallow crawling of the entire website

User-agent: *
Disallow: /

URLs from the website may still be indexed, even if they have not been crawled.

Disallow crawling of the entire site, but allow Mediapartners-Google

User-agent: *
Disallow: /

User-agent: Mediapartners-Google
Allow: /

This hides your pages from search results, but the Mediapartners-Google crawler can still examine them to decide which ads to show visitors on your site.

Disallow crawling of a directory and its contents

User-agent: *
Disallow: /search/
Disallow: /about-us/
Disallow: /about-us/archive/

Add a forward slash to the directory name to disallow crawling of the whole directory. The disallowed path is matched from the start of the URL path, so Disallow: /about-us/ matches https://abc.com/about-us/ and https://abc.com/about-us/archive/, but not https://abc.com/why-us/other/about-us/ (a wildcard such as Disallow: /*about-us/ would be needed for that).

Allow access to a single crawler

User-agent: Googlebot-news
Allow: /

User-agent: *
Disallow: /

Only Googlebot-news can crawl the entire site.

Allow access to all crawlers except a single one

User-agent: Unnecessarybot
Disallow: /

User-agent: *
Allow: /

All bots can crawl the site except Unnecessarybot.

Disallow crawling of a single web page

User-agent: *
Disallow: /login_file.html

Bots can crawl all web pages on your site except the login_file.html page.

Disallow crawling of files of a specific type or containing a specific string

User-agent: Googlebot
Disallow: /*.gif$

User-agent: *
Disallow: /*loan/

Adding * before a string or file extension disallows paths containing that string or file type. The first group disallows all .gif files for Googlebot; the second disallows any path containing /loan/ for all crawlers.

Block all images on your site from Google Images

User-agent: Googlebot-Image
Disallow: /

Google cannot index images and videos without crawling them.

Block a specific image from Google Images

User-agent: Googlebot-Image
Disallow: /images/logo.jpg

This disallows the logo.jpg file from being crawled.

Use $ to match URLs that end with a specific string

User-agent: Googlebot
Disallow: /*.pdf$

This disallows all PDF files from being crawled.
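
To sanity-check rules like these before publishing them, Python's urllib.robotparser can evaluate the simpler prefix-based ones (it does not understand the * and $ wildcards used in the last few rules, and it only approximates Google's matcher). A sketch for the "allow access to a single crawler" rule above, using the illustrative abc.com domain:

from urllib import robotparser

# The "allow access to a single crawler" rule from the list above
parser = robotparser.RobotFileParser()
parser.parse("""
User-agent: Googlebot-news
Allow: /

User-agent: *
Disallow: /
""".splitlines())

for agent in ("Googlebot-news", "Googlebot", "Bingbot"):
    allowed = parser.can_fetch(agent, "https://www.abc.com/some-page.html")
    print(f"{agent}: {'allowed' if allowed else 'blocked'}")
# Googlebot-news: allowed
# Googlebot: blocked
# Bingbot: blocked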

 

Learn more about how Google interprets the robots.txt specification in the coming chapters.
