Web admins need to take care of several aspects of their website to ensure it is getting indexed properly.
And one of the first and foremost things to do, even before you start publishing search engine optimized content, is to ensure your robots.txt file is configured correctly.
No, a perfect robots.txt file won’t directly improve your rankings, but it is a critical technical SEO component; if you don’t get it right, it can negatively affect your rankings.
Today, we will explore all the robots.txt rules and best practices: how you can use them to manage search engine crawl budgets, block Googlebot and other crawlers from accessing particular pages, and get the ranking you deserve on Google. In short, everything you need to know about robots.txt files and directives. Let’s start from the basics.
What is a robots.txt file?
In simple words, a robots.txt file is an instruction manual for web crawlers and bots that tells them which sections of the website they should and should not crawl.
It is a plain text file (hence the .txt extension) placed at the root of the website.
Some pages on websites need not be crawled and indexed on search engines. For example, the admin login page of your website should not be visible on search engines for security reasons.
You can block such pages from search engines by simply writing a few lines of code into the robots.txt file. The instructions written into the robots.txt file are called robots.txt directives.
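To give you a quick taste (the syntax is explained in detail below), here is a minimal sketch that blocks every crawler from a hypothetical admin login page:

User-agent: *
Disallow: /admin-login

The first line says the rules apply to all bots, and the second line names the path they should stay away from.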
Most search engines check for a robots.txt file before crawling a website. If there is no robots.txt file, they will crawl the entire website.
That being said, keep in mind that robots.txt is just a ‘code of conduct’ or a set of instructions. Although most search engines choose to obey the instructions, it is totally up to them whether to follow or ignore the directives. Google tends to obey the directives in most cases.
Why use a robots.txt file?
No, robots.txt files are not compulsory; search engines can find and index the pages on your website even if you don’t have one. They can also work out which pages matter most, and they generally filter out duplicate content on their own.
However, a robots.txt file becomes necessary as your website grows. It gives you better control over what search engine crawlers do on your website and, more importantly, over how your website is presented in search results. Having a robots.txt file is always good practice and highly recommended.
Here are a few benefits of having a robots.txt file:
1. Block irrelevant and non-public pages from search engines
Even if you don’t want them to, search engines can crawl and index every page on your website. For example, you probably don’t want users to find your website in search results while it is still under development, yet without a robots.txt file search engines can still access and index all of its pages.
If you have a robots.txt file, blocking such pages takes only a minute or two.
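For instance, here is a minimal sketch that keeps a hypothetical /staging/ area out of search engines while the rest of the site stays crawlable:

User-agent: *
Disallow: /staging/

Replace /staging/ with whatever path your under-development or non-public section actually lives at.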
2. Manage search engine crawl budget
The crawl budget is the number of pages on your website a search engine crawls in a certain period of time. Since search engines need to crawl the entire web and update the SERPs frequently, they crawl only a certain number of pages in a visit.
Your crawl budget depends on the website size, update frequency, internal linking structure, and server speed. Once the crawler uses up your budget or reaches a dead end, it moves on to other websites.
To make the most of each crawl, you need to manage your crawl budget, especially if your website is large. If you allow bots to access irrelevant and non-public pages, you may exhaust your crawl budget before the crawler even reaches the content with the highest ranking potential.
You can use robots.txt to block crawlers from accessing irrelevant pages and thus spend your crawl budget efficiently.
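For example, here is a sketch that keeps crawlers away from internal search results and a hypothetical filter parameter, two common crawl-budget sinks (the paths and parameter name are placeholders; check your own URL patterns first):

User-agent: *
Disallow: /search
Disallow: /*?filter=

The * wildcard inside a path is supported by the major search engines and matches any sequence of characters.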
3. Prevent indexing of attachment pages
Many content management systems, including WordPress, create separate pages for attachments. It is easy to block search engines from indexing individual pages with the ‘noindex’ meta tag, but that is not practical for attachment pages, which the CMS generates automatically.
If you have a robots.txt, a simple line of code will do it for you.
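Here is a minimal sketch, assuming your CMS exposes attachment pages through an attachment_id query parameter (this varies with permalink settings, so check your own attachment URLs first):

User-agent: *
Disallow: /*?attachment_id=

If your attachment pages live under a dedicated path instead, disallow that path the same way.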
Where to put the robots.txt file?
The robots.txt file should always be placed at the root of your domain—if your website is www.example.com, the robots.txt file should be accessible at www.example.com/robots.txt.
Keep in mind that the file name also matters: it should always be named robots.txt in lowercase letters, and the extension must be .txt. Have a look at Starbucks.com’s robots.txt file at www.starbucks.com/robots.txt:
Now let’s explore how easy robots.txt files are to understand and create. By the end of the next section, you’ll be able to read every line of any robots.txt file.
Robots.txt components and rules with examples
All robots.txt files, irrespective of how large they are, have only two components: user-agents (crawler identifier) and directives (rules).
And it always has the same format: it specifies the user-agent first and then the directives specific to that user-agent.
Here’s the standard robots.txt format:
User-agent: [crawler identifier 1]
[directive 1]
[directive 2]
User-agent: [crawler identifier 2]
[directive 1]
[directive 2]
And here’s an actual robots.txt file example:
User-agent: *
Disallow: /admin-login
User-agent: Googlebot
Disallow: /not-for-google
User-agent: Yandex
Disallow: /not-for-yandex/
In this example, Googlebot is blocked from accessing example.com/not-for-google, Yandex is blocked from accessing example.com/not-for-yandex/, and every other crawler is blocked from accessing example.com/admin-login. Note that Googlebot and Yandex are not bound by the admin-login rule here: when a crawler finds a group that names it specifically, it follows that group and ignores the wildcard group (more on this below).
There are two things you should note from the above example:
● You can set different rules for different web crawlers and bots.
● ‘*’ is used as a wildcard. The directives under the wildcard group apply to any bot that doesn’t have a group of its own.
Not following? Just wait, we will explain.
User-agent
In simple terms, the user-agent specifies which crawler, browser, device, or software should listen to the rules mentioned in the lines that follow. There are hundreds of user-agents, but for us, only user-agents of search engine robots are relevant.
From an SEO perspective, these are the user-agents you should know:
Search engine | Field | User-agent
All crawlers | General | *
Google | General | Googlebot
Google | Images | Googlebot-Image
Google | Mobile | Googlebot-Mobile
Google | News | Googlebot-News
Google | Video | Googlebot-Video
Google | AdSense | Mediapartners-Google
Google | AdWords | AdsBot-Google
Bing | General | bingbot
Bing | General | msnbot
Bing | Images & Video | msnbot-media
Bing | Ads | adidxbot
Yahoo! | General | slurp
Yandex | General | yandex
Baidu | General | baiduspider
Baidu | Images | baiduspider-image
Baidu | Mobile | baiduspider-mobile
Baidu | News | baiduspider-news
Baidu | Video | baiduspider-video
Now, let’s say you want your entire website indexed by all search engines but Google. Here’s what robots.txt will look like:
User-agent: *
Allow: /
User-agent: Googlebot
Disallow: /
The Allow and Disallow under the user-agent specifications are the directives for those user-agents.
If you haven’t noticed it yet, in the above example the robots.txt first allows all bots to access the website, including Googlebot, and then disallows Googlebot. It may seem contradictory, but that’s just how robots.txt works: each crawler follows the most specific group that matches its user-agent and ignores the rest. In this case, Googlebot ignores the wildcard group and follows the group addressed specifically to Googlebot, so it stays out of the site while every other bot crawls it.
Directives
Directives are rules you want the user-agent to follow. They define how the search engine bot should crawl your website.
There are a few more directives other than the Disallow and Allow we saw above. It is important to know all of them to configure your robots.txt properly.
Robots.txt Disallow directive
The robots.txt Disallow directive tells search engines not to access a certain page or a set of pages that have the same URL structure.
For example, let’s say you want to block the Bing bot from accessing your category pages. The robots.txt syntax for this instruction will look like this:
User-agent: bingbot
Disallow: /category
This command tells Bingbot not to access pages whose URL starts with www.yourdomain.com/category. If you have a category page named case studies with the URL www.yourdomain.com/category/case-studies, that page will also be blocked from Bingbot, since its URL starts with the same path.
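One detail worth knowing: Disallow values are matched as path prefixes, so Disallow: /category would also block a hypothetical page at www.yourdomain.com/category-archive. If you only want to block what sits inside the directory, add a trailing slash:

User-agent: bingbot
Disallow: /category/

This version still blocks /category/case-studies but leaves /category-archive alone.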
Now, if you leave the value after the Disallow directive blank, nothing is blocked and crawlers can access all the pages; it works as an ‘allow all’ rule.
Here’s an example:
User-agent: *
Disallow:
If you don’t want any bots or search engines to crawl your website, you can disallow everything in robots.txt. The code looks like this:
User-agent: *
Disallow: /
Robots.txt Allow directive
Since search engines can access all pages by default, the Allow directive is used to carve out exceptions to a Disallow directive.
For example, the command to disallow Yandex from crawling the blog pages will look something like this:
User-agent: yandex
Disallow: /blog
This tells Yandex not to access any page whose URL starts with yourdomain.com/blog, including individual posts such as yourdomain.com/blog/blog-post-slug. But if you want Yandex to crawl one specific blog post, you can simply add the Allow directive on the next line.
User-agent: yandex
Disallow: /blog
Allow: /blog/allowed-blog-post
This tells Yandex to leave the other blog posts alone but still crawl the allowed one. When Allow and Disallow rules conflict, crawlers follow the most specific (longest) matching rule.
The Sitemap directive
The Sitemap directive lets you tell search engines where they can access your XML sitemap. This directive is accepted by Google, Bing, Ask, Yahoo, and Yandex.
A sitemap, in case you’re new to SEO, lists the pages you want to be indexed by search engines.
The ideal way to submit an XML sitemap to search engines is through their respective webmaster tools. If you have already submitted it there, adding the sitemap to robots.txt might seem redundant, but it will not harm your website.
There are hundreds of web crawlers and bots on the internet, and it is nearly impossible to submit your sitemap to all of them. Referencing the sitemap in your robots.txt makes it discoverable to any crawler that reads the file, which can only help your SEO.
Here’s what the Sitemap directive in a robots.txt looks like:
Sitemap: https://yourdomain.com/sitemap.xml
User-agent: *
Disallow: /blog
Allow: /blog/allowed-blog-post
Notice the difference from the other directives? Disallow and Allow take relative paths, while the Sitemap directive must state the absolute URL of the sitemap.
You can place the Sitemap directive either at the top or at the bottom of your robots.txt.
Comments
Just like developers add comments in their code for readability, you can add comments to your robots.txt. Search engines and bots ignore robots.txt comments.
Simply add a ‘#’ before your comment; everything after it on that line is ignored by crawlers.
You won’t really need comments unless you have a large robots.txt file, but there’s no harm in adding them, and they are handy for reminding yourself why a directive exists.
See an example:
# Block website backend from getting indexed on Google.
User-agent: Googlebot
Disallow: /wp-admin/ # Block /wp-admin/ directory.
As you can see from the example, a comment can sit on its own line or at the end of a directive; whatever follows the ‘#’ on a line is treated as a comment.
Google supports only the directives we’ve discussed so far—User-agent, Disallow, Allow, Sitemap—and comments. But we have a few more directives that are supported by other search engines.
Host
The Host directive lets you tell search engines—specifically Yandex—whether to show the www.example.com or the example.com version of your website.
host: example.com
No, Google does not support this directive; only Yandex does. For that reason, it is not ideal to use it unless you focus exclusively on Yandex SEO or have another good reason to. Also, the Host directive doesn’t let you specify a protocol scheme (http:// or https://).
The ideal way to make search engines display a particular hostname is to 301 redirect the hostname you don’t want to the one you want. You can usually set this up through your hosting account. A 301 redirect is not search engine specific; it works for all search engines and browsers.
Crawl-delay
Search engine robots are powerful crawlers and can overload your website with too many crawl requests, especially Bing and Yandex. Thankfully, Bing, Yahoo, and Yandex respond to the crawl-delay directive and slow down their crawl rate when they see it in your robots.txt.
However, the crawl-delay directive is only a temporary fix for server overload; to solve the problem permanently, you will have to move your website to better hosting.
Here’s how you can use this directive:
crawl-delay: 10
A crawl-delay value of 10 tells these search engines to wait 10 seconds between consecutive requests, which means they can crawl at most 8,640 pages a day (86,400 seconds in a day divided by 10).
This directive makes sense only if you don’t get much traffic from Bing, Yahoo, or Yandex, since a slower crawl also means your new and updated pages take longer to show up in their results.
Note that Google doesn’t respond to this directive. You can control the crawl rate of Google in Google Search Console settings.
How to create a robots.txt file for your website?
You have plenty of options when it comes to creating your own robots.txt. Several free tools will generate a robots.txt file for you, or you can write one yourself.
If you don’t want to write the directives yourself, you can use SEOptimer’s robots.txt generator or Ryte’s Robots.txt generator: select the user-agents, add the URLs of the pages you want to allow and disallow in their respective fields, and hit ‘create’ to generate a robots.txt for your site. Pretty easy, but the customization these tools offer is minimal.
If you can write the commands yourself, then the easiest way is to create a text file on your computer and edit it accordingly. Let’s see how to get it right:
Open Notepad, TextEdit, or any other plain text editor on your computer. Avoid word processors such as Microsoft Word, which can add hidden formatting that breaks the file.
Add your robots.txt commands and directives. If you want to include a sitemap in robots.txt, make sure you add it at the very top or at the bottom.
Save the file as ‘robots.txt’. Make sure you use lowercase letters and choose txt as the file extension.
Now, log in to your cPanel or hosting account and upload the file to the root of your domain. On most cPanel hosts, including typical WordPress installs, the web root is the public_html folder.
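Put together, a finished file for a typical WordPress site might look like the following minimal sketch (the sitemap URL and paths are placeholders; adjust them to your own site):

Sitemap: https://yourdomain.com/sitemap.xml

User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php

The Allow line keeps admin-ajax.php reachable because many WordPress themes and plugins rely on it for front-end functionality.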
How to check robots.txt for errors?
How would you know whether your robots.txt is working properly and hasn’t blocked any necessary pages?
You can use either Google Search Console or TechnicalSEO.com’s robots.txt tester. With Google’s tester, you can only check how the file behaves for Google’s own bots and crawlers.
To test using TechnicalSEO’s tool, open the tester, enter the URL you want to check, choose the user-agent from the drop-down menu, and click the ‘TEST’ button. The tool will tell you whether your robots.txt allows or blocks that URL.
We will test Starbucks’ robots.txt against the Googlebot Smartphone crawler:
You can see the result in the bottom right corner. Notice the warning sign on line 3? Hover over it and the tool explains why; in this case, it tells us that Googlebot Smartphone ignores the crawl-delay directive.
Things you must know about robots.txt
Now that you know all the robots.txt user-agents and directives, you can create a valid robots.txt file on your own. However, there are still a few more critical things you must know about robots.txt.
If you don’t know these things, you may encounter some errors. An error in robots.txt is something that you can’t afford from an SEO perspective.
Here you go:
Case sensitivity can cause errors
The names of user-agents and robots.txt directives are not case sensitive, but the file name and the directive values (paths) are. For example, if you use Disallow: /case-studies/, search engines will still crawl /Case-studies/.
Also, if you use an uppercase letter in the file name, it just won’t work. All the characters in the filename should be lowercase, and the file extension must be txt.
Disallowing a page doesn’t guarantee it won’t be indexed
A page blocked by robots.txt may still get indexed if it is linked from other pages that crawlers can access. Such pages are marked ‘Indexed, though blocked by robots.txt’ in Google Search Console. It is better to use the noindex meta tag on pages you don’t want indexed, and to make sure those pages are not blocked in robots.txt, or crawlers will never see the tag.
Also, note that there is no such thing as a robots.txt noindex directive. Noindex is a meta tag added to individual pages, not a rule in the robots.txt file.
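For reference, the noindex rule goes in the <head> of the page itself and looks like this:

<meta name="robots" content="noindex">

Crawlers have to be able to fetch the page to see this tag, which is exactly why pairing it with a robots.txt Disallow for the same page defeats the purpose.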
Don’t disallow backlinked pages
If you are about to block a page that has good backlinks from authoritative websites, think again. When crawlers can’t access the page, the PageRank from those backlinks can’t flow into the rest of your website, which essentially makes them useless.
Have a rule for all the bots
If you write rules only for specific user-agents and never include the wildcard, bots that aren’t named will have no rules to follow at all. It is better to add a fallback group of rules for all other bots, as in the sketch below.
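For example, here is a sketch with one Googlebot-specific group plus a catch-all group for every other bot (the paths are hypothetical):

User-agent: Googlebot
Disallow: /not-for-google/

User-agent: *
Disallow: /private/

Googlebot follows only its own group, while every other crawler falls back to the wildcard group.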
Keep your robots.txt always up to date
It is common to set up the robots.txt during website development and then forget about it at launch. You may disallow important pages while they’re in development, but remember to allow them again when the website goes live. If you don’t, all your SEO efforts will be wasted until search engines can crawl those pages.