A robots.txt file is a plain text file that webmasters create to tell search engine robots which URLs they may crawl on a website. It is used primarily to manage the site's crawl budget.
What is a robots.txt file used for?
A robots.txt file is used to manage crawler traffic to the site and, depending on the file type, to keep certain files out of Google:

|File Type|Robots.txt effect|
|---|---|
|Web page|Manages your website's crawl budget and avoids crawling unimportant or similar pages on your site. Disallowed pages can still appear in search results, but with no description. Note: if other pages link to your page with descriptive anchor text, Google can still index the page without visiting it.|
|Digital assets|Prevents image, video, and audio files from appearing in Google search results. This does not prevent other pages or users from linking to your image, video, or audio file.|
Robots.txt file limitations:
Depending on your goals, consider other mechanisms to ensure your URLs are not findable on the web.
- robots.txt directives may not be supported by all search engines.
- Search engine crawlers may interpret syntax differently.
- A page that is disallowed in robots.txt can still be indexed if other sites link to it.
Steps to create a perfect robots.txt file:
A robots.txt file lives at the root of its domain. For instance, if the site URL is www.abc.com, the robots.txt file lives at www.abc.com/robots.txt. robots.txt is a plain text file that follows the Robots Exclusion Standard and consists of one or more rules. Each rule allows or blocks access for a specified search engine crawler to a specified file path on that website. Unless you specify otherwise in the robots.txt file, all files on the site are implicitly allowed for crawling.
Simple example for robots.txt with few rules:
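A file matching the description below might look like this (www.abc.com and its sitemap URL are the document's example values):

```
# Googlebot may not crawl any URL starting with /nogooglebot/
User-agent: Googlebot
Disallow: /nogooglebot/

# All other user agents may crawl the entire site
User-agent: *
Allow: /

# Location of the site's sitemap file
Sitemap: http://www.abc.com/sitemap.xml
```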
This file means:
- Googlebot, i.e., the user agent, cannot crawl any URL that starts with abc.com/nogooglebot/.
- All other user agents can crawl the entire site. By default, other user agents are allowed to crawl the entire website even if this rule isn’t added.
- The site’s sitemap file is located at http://www.abc.com/sitemap.xml
Formats and Location Rules:
- The file name must be robots.txt
- A site can have only one robots.txt file
- The robots.txt file must be placed at the root of the website host to which it applies, e.g. http://www.abc.com/robots.txt. It cannot be placed in a subdirectory, e.g. http://www.abc.com/page/robots.txt
- This file can be applied to
- a subdomain (http://web.abc.com/robots.txt) or on
- non-standard ports (http://www.abc.com:8181/robots.txt)
- Google may ignore characters that are not part of the UTF-8 range, potentially rendering robots.txt rules invalid.
Adding rules to the robots.txt file:
Rules are instructions for search engine crawlers about which parts of your site they may crawl. Follow these guidelines when adding rules to your robots.txt file:
- A robots.txt file consists of one or more groups
- Each group consists of multiple rules or directives, one directive per line. Each group begins with a User-agent line specifying which crawler the group applies to.
- A group gives the following information:
- Who the group applies to (the user agent)
- Which directories or files that user agent can access
- Which directories or files that user agent cannot access
- Search engine crawlers process groups from top to bottom
- By default, the user agent can crawl any page or directory not blocked by a disallow rule
- Rules are case-sensitive. For example, disallow: /login.asp applies to https://www.abc.com/login.asp, but not https://www.abc.com/LOGIN.asp
The # character marks the start of a comment
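The grouping rules above can be illustrated with a short sketch (the /private/ path is a hypothetical placeholder):

```
# Group 1: applies only to Googlebot
User-agent: Googlebot
Disallow: /private/

# Group 2: applies to all other crawlers
User-agent: *
Allow: /
```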
Google's crawlers support the following directives in robots.txt files:
User-agent: Each group must begin with one or more User-agent lines naming the crawlers the group applies to. This line defines the start of a group of directives. An asterisk (*) matches all search engine crawlers except the various AdsBot crawlers, which must be named explicitly.
# Example 1: Block all but AdsBot crawlers
# Example 2: Block only Googlebot
# Example 3: Block Googlebot and Adsbot
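Based on Google's documented user-agent tokens, the three examples above might be written like this:

```
# Example 1: Block all but AdsBot crawlers
User-agent: *
Disallow: /

# Example 2: Block only Googlebot
User-agent: Googlebot
Disallow: /

# Example 3: Block Googlebot and Adsbot
User-agent: Googlebot
User-agent: AdsBot-Google
Disallow: /
```

In Example 1, the asterisk does not cover the AdsBot crawlers, so they remain unblocked unless named explicitly.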
Disallow: At least one or more Disallow entries per rule group. A directory or page, relative to the root domain, that the user agent should not crawl. If the rule refers to a page, it must be the full page path as shown in the browser. If it refers to a directory, it must start with a / character and end with a / character. For example:
#Disallow: Page and directory
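A sketch of both forms (the file and directory names are illustrative placeholders):

```
User-agent: *
Disallow: /useless_file.html
Disallow: /junk/
```

The first rule blocks a single page; the second blocks a whole directory and its contents.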
Allow: At least one or more Allow entries per rule group. A directory or page, relative to the root domain, that the named user agent may crawl. An Allow rule is typically used to override a Disallow directive and permit crawling of a subdirectory or page inside a disallowed directory.
For a single page, specify the full page path. For a directory, end the rule with a forward slash (/).
#Allow: Page and directory, and override
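A sketch of an Allow rule overriding a Disallow (the /archive/ paths are hypothetical placeholders):

```
User-agent: *
Disallow: /archive/
Allow: /archive/press-releases/
```

Here the /archive/ directory is blocked, but its press-releases/ subdirectory may still be crawled.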
Sitemap: One or more XML sitemaps can be declared in robots.txt. Sitemaps are an effective way to indicate which content Google should crawl, as opposed to which content it can or cannot crawl. For example:
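Using the document's example domain:

```
Sitemap: http://www.abc.com/sitemap.xml
```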
Upload the robots.txt file
Once the robots.txt file is created, make it available to web crawlers by uploading it to your site. How you upload it depends on your site and server architecture; contact your hosting company if you are unsure.
Test robots.txt markup
To test whether your uploaded robots.txt file is publicly accessible, open a private browsing window and navigate to the file's location, e.g. https://www.abc.com/robots.txt
Google offers a robots.txt testing tool in Search Console. You can only use this tool for robots.txt files that are already accessible on your site.
Submit robots.txt file to Google
Google's crawlers automatically find and start using your robots.txt file once it is uploaded to the domain's root folder, so no further action is required.
If you update the robots.txt file, it is essential to refresh Google's cached copy.
Useful robots.txt rules
Here are a few common useful robots.txt rules:
|Rule|robots.txt|Notes|
|---|---|---|
|Disallow crawling of the entire website|`User-agent: *`<br>`Disallow: /`|URLs from the website may still be indexed, even if they have not been crawled.|
|Disallow crawling of an entire site, but allow Mediapartners-Google|`User-agent: *`<br>`Disallow: /`<br><br>`User-agent: Mediapartners-Google`<br>`Allow: /`|Hides your pages from search results, but the Mediapartners-Google web crawler can still examine them to decide which ads to show visitors on your site.|
|Disallow crawling of a directory and its contents|`User-agent: *`<br>`Disallow: /about-us/`|Append a forward slash to the directory name to disallow crawling of a whole directory. The rule matches any URL path that begins with the disallowed string, so `Disallow: /about-us/` blocks /about-us/ and everything beneath it.|
|Allow access to a single crawler|`User-agent: Googlebot-news`<br>`Allow: /`<br><br>`User-agent: *`<br>`Disallow: /`|Only Googlebot-news can crawl the entire site.|
|Allow access to all except for a single crawler|`User-agent: Unnecessarybot`<br>`Disallow: /`<br><br>`User-agent: *`<br>`Allow: /`|All bots can crawl the site except for Unnecessarybot.|
|Disallow crawling of a single web page|`User-agent: *`<br>`Disallow: /login_file.html`|Bots can crawl all web pages on your site except for the login_file.html page.|
|Disallow crawling of files of a specific file type|`User-agent: Googlebot`<br>`Disallow: /*.gif$`|Adding `*` before the string or file extension disallows paths containing that string or file type. This example disallows crawling of all .gif files.|
|Block all images on your site from Google Images|`User-agent: Googlebot-Image`<br>`Disallow: /`|Google cannot index images and videos without crawling them.|
|Block a specific image from Google Images|`User-agent: Googlebot-Image`<br>`Disallow: /logo.jpg`|This disallows crawling of the logo.jpg file (shown here at the site root).|
|Use $ to match URLs that end with a specific string|`User-agent: Googlebot`<br>`Disallow: /*.pdf$`|This disallows crawling of all .pdf files.|
Learn how Google interprets the robots.txt specification in the coming chapters.