
How to Create a Robots txt That’s Perfect for SEO

Web admins need to take care of several aspects of their website to ensure it is getting indexed properly.

And one of the first and foremost things to do, even before you start publishing search engine optimized content, is to ensure your robots.txt file is configured correctly.

No, a perfect robots txt file won’t directly improve your rankings, but it is a critical technical SEO component; if you don’t get it right, it can negatively affect your rankings.

Today, we will explore all the robots txt rules and best practices: how to use them to manage search engine crawl budgets, block Googlebot and other crawlers from accessing particular pages, get the ranking you deserve on Google, and a lot more. In short, everything you need to know about robots.txt files and directives. Let’s start from the basics.

What is a robots txt file?

In simple words, a robots.txt file is an instruction manual for web crawlers and bots that tells them which sections of the website they should and should not crawl.

It is a text file (hence the .txt extension) and is placed at the root of the website.

Some pages on websites need not be crawled and indexed on search engines. For example, the admin login page of your website should not be visible on search engines for security reasons.

You can block such pages from search engines by simply writing a few lines of code into the robots.txt file. The instructions written into the robots.txt file are called robots.txt directives.
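For example, if your admin login page sits at a hypothetical path like /admin-login, a couple of lines are enough to ask every crawler to stay away:

User-agent: *
Disallow: /admin-login

We’ll break down exactly what each of these lines means later in this article.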

Most search engines check for a robots.txt file before crawling a website. If there is no robots.txt file, they will crawl the entire website.

That being said, keep in mind that robots.txt is just a ‘code of conduct’ or a set of instructions. Although most search engines choose to obey the instructions, it is totally up to them whether to follow or ignore the directives. Google tends to obey the directives in most cases.

Why use a robots.txt file?

No, robots.txt files are not compulsory; search engines can find and index all of the pages on your website even if you don’t have one. They can also work out which pages are important and which are not, and they generally avoid indexing irrelevant pages or duplicate content by default.

However, a robots.txt file becomes necessary if your website is large. It gives you better control over what search engines do on your website and, more importantly, how your website is presented to search engine traffic. It is always good, and highly recommended, to have a robots.txt file.

Here are a few benefits of having a robots.txt file:

1. Block irrelevant and non-public pages from search engines

Even if you don’t want them to, search engines can crawl and index all the pages on your website. For example, you may not want users to see your website while it is still under development, but search engines can still access and index those pages.

If you have a robots.txt file, blocking such pages takes only a minute or two.
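For instance, if your under-development pages live under a hypothetical /staging/ directory, a short block like this asks every crawler to stay out of it:

User-agent: *
Disallow: /staging/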

2. Manage search engine crawl budget

The crawl budget is the number of pages on your website a search engine crawls in a certain period of time. Since search engines need to crawl the entire web and update the SERPs frequently, they crawl only a certain number of pages in a visit.

Your crawl budget depends on the website size, update frequency, internal linking structure, and server speed. Once the crawler uses up your budget or reaches a dead end, it moves on to other websites.

To make the best of each crawl, you need to manage your crawl budget, especially if your website is large. If you allow bots to access irrelevant and non-public pages, you may exhaust your crawl budget before the crawler ever reaches the content with the highest ranking potential.

You can use robots.txt to block crawlers from accessing irrelevant pages, hence managing your crawl budget efficiently.
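As a rough sketch, assuming your site has internal search result pages under /search/ and filtered URLs containing a ?filter= parameter (both hypothetical), you could keep crawlers focused on your real content like this:

User-agent: *
Disallow: /search/
Disallow: /*?filter=

Wildcards like * in paths are supported by the major crawlers, including Google and Bing.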

3. Prevent indexing of attachment pages

Many content management systems, including WordPress, create separate pages for attachments. It is easy to block search engines from indexing individual pages with the ‘noindex’ meta tag, but that doesn’t work well for auto-generated attachment pages.

If you have a robots.txt, a simple line of code will do it for you.
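If your CMS exposes attachment pages through a query parameter such as attachment_id (an assumption; check how your own CMS builds attachment URLs before copying this), a single Disallow line with a wildcard can cover all of them:

User-agent: *
Disallow: /*attachment_id=

If your attachment pages use pretty permalinks instead, adjust the path pattern to match them.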

Where to put robots.txt file?

The robots.txt file should always be placed at the root of your domain—if your website is www.example.com, the robots.txt file should be accessible at www.example.com/robots.txt.

Keep in mind that the file name also matters: it should always be named robots.txt in lowercase letters, and the extension must be .txt. For a live example, have a look at Starbucks.com’s robots.txt file at www.starbucks.com/robots.txt.

Now let’s explore how easy it is to understand and create robots.txt files. By the end of the next section, you’ll be able to understand what each line of code in any robots.txt means.

Robots.txt components and rules with examples

All robots.txt files, irrespective of how large they are, have only two components: user-agents (crawler identifier) and directives (rules).

And it always has the same format: it specifies the user-agent first and then the directives specific to that user-agent.

Here’s the standard robots txt format:

User-agent: [crawler identifier 1]
[directive 1]
[directive 2]

User-agent: [crawler identifier 2]
[directive 1]
[directive 2]

And here’s an actual robots.txt file example:

User-agent: *
Disallow: /admin-login

User-agent: Googlebot
Disallow: /not-for-google

User-agent: Yandex
Disallow: /not-for-yandex/

In this example, crawlers other than Googlebot and Yandex are blocked from accessing the example.com/admin-login page. Googlebot follows only the group written for it, so it is blocked from example.com/not-for-google but can still crawl the admin login page; likewise, Yandex is blocked only from example.com/not-for-yandex/.

There are two things you should note from the above example:

● You can set different rules for different web crawlers and bots.
● ‘*’ is used as a wildcard. The directives under the wildcard apply to every bot that doesn’t have a more specific group of its own.

Not following? Just wait, we will explain.

User-agent

In simple terms, the user-agent specifies which crawler, browser, device, or software should listen to the rules mentioned in the lines that follow. There are hundreds of user-agents, but for us, only user-agents of search engine robots are relevant.

From an SEO perspective, these are the user-agents you should know:

Search engine | Field | User-agent
All crawlers | General | *
Google | General | Googlebot
Google | Images | Googlebot-Image
Google | Mobile | Googlebot-Mobile
Google | News | Googlebot-News
Google | Video | Googlebot-Video
Google | AdSense | Mediapartners-Google
Google | AdWords | AdsBot-Google
Bing | General | bingbot
Bing | General | msnbot
Bing | Images & Video | msnbot-media
Bing | Ads | adidxbot
Yahoo! | General | slurp
Yandex | General | yandex
Baidu | General | baiduspider
Baidu | Images | baiduspider-image
Baidu | Mobile | baiduspider-mobile
Baidu | News | baiduspider-news
Baidu | Video | baiduspider-video

Now, let’s say you want your entire website indexed by all search engines except Google. Here’s what the robots.txt will look like:

User-agent: *
Allow: /

User-agent: Googlebot
Disallow: /

The Allow and Disallow under the user-agent specifications are the directives for those user-agents.

If you haven’t noticed it yet, the robots.txt above first allows all bots, including Googlebot, to access the website and then disallows Googlebot. It may seem contradictory, but that’s just how robots.txt works: a crawler obeys only the most specific user-agent group that matches it. In this case, Googlebot ignores the wildcard group and follows the rules written specifically for Googlebot.

Directives

Directives are rules you want the user-agent to follow. They define how the search engine bot should crawl your website.

There are a few more directives other than the Disallow and Allow we saw above. It is important to know all of them to configure your robots.txt properly.

Robots txt Disallow directive

The robots.txt Disallow directive tells search engines not to access a certain page or a set of pages that have the same URL structure.

For example, let’s say you want to block the Bing bot from accessing your category pages. The robots.txt syntax for this instruction will look like this:

User-agent: bingbot
Disallow: /category

This command tells Bingbot not to access any URL on www.yourdomain.com whose path starts with /category. If you have a category page named case studies with the URL www.yourdomain.com/category/case-studies, that page will also be blocked for Bingbot, because Disallow values are matched as path prefixes.

Now, if you leave the space after the Disallow directive blank, nothing is disallowed and every page stays accessible. In effect, it works as a robots txt ‘allow all’ directive.

Here’s an example:

User-agent: *
Disallow:

If you don’t want any bots or search engines to crawl your website, you can disallow all in robots txt. The code looks like this:

User-agent: *
Disallow: /

Robots txt Allow directive

Since search engines can access all pages by default, the Allow directive is used to carve out exceptions to a Disallow directive.

For example, the command to disallow Yandex from crawling the blog pages will look something like this:

User-agent: yandex
Disallow: /blog

This tells Yandex not to access any page whose URL starts with yourdomain.com/blog, including individual posts like yourdomain.com/blog/blog-post-slug. But if you want Yandex to crawl one specific blog post, you can simply add an Allow directive on the next line.

User-agent: yandex
Disallow: /blog
Allow: /blog/allowed-blog-post

This tells Yandex to crawl the allowed blog post while leaving the rest of the blog alone. As mentioned earlier, crawlers follow the most specific matching rule.

The Sitemap directive

The Sitemap directive lets you tell search engines where they can access your XML sitemap. This directive is accepted by Google, Bing, Ask, Yahoo, and Yandex.

A sitemap, in case you’re new to SEO, lists the pages you want to be indexed by search engines.

The ideal way to submit an XML sitemap to search engines is through their respective webmaster tools. If you have already submitted it there, adding the sitemap to robots.txt might seem redundant, but it will not harm your website.

There are hundreds of web crawlers and bots on the internet, and it is nearly impossible to submit your sitemap to all of them. Referencing your sitemap in your robots.txt can actually be good for SEO.

Here’s what the Sitemap directive in a robots.txt looks like:

Sitemap: https://yourdomain.com/sitemap.xml

User-agent: *
Disallow: /blog
Allow: /blog/allowed-blog-post

Notice any difference from the other directives? The Sitemap directive should state the absolute URL of the sitemap. Other directives take relative paths, while the Sitemap directive always uses the full URL.

You can place the Sitemap directive either at the top or at the bottom of your robots.txt.

Comments

Just like developers add comments in their code for convenience and ease of understanding, you can add comments to the robots.txt. Search engines and bots ignore robots.txt comments.

Simply add a ‘#’ before your comment, and the rest of the text in the line will be ignored by search engines.

That said, you won’t really need comments unless you have a large robots.txt file. There’s no harm in adding them, though, so you can use them to remind yourself why your directives exist.

See an example:

# Block website backend from getting indexed on Google.
User-agent: Googlebot
Disallow: /wp-admin/ # Block /wp-admin/ directory.

As you can see from the above example, the placement of the comment doesn’t matter. Whatever text follows the ‘#’ on a line is treated as a comment.

Google supports only the directives we’ve discussed so far—User-agent, Disallow, Allow, Sitemap—and comments. But we have a few more directives that are supported by other search engines. 
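Before moving on to those, here is a small recap file (with hypothetical paths and a placeholder sitemap URL) that uses only the directives Google supports:

# Rules for all crawlers
User-agent: *
Disallow: /admin-login
Allow: /admin-login/help-page

Sitemap: https://yourdomain.com/sitemap.xml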

Host

The Host directive lets you tell search engines—specifically Yandex—whether to show the www.example.com or the example.com version of your website.

host: example.com

Google does not support this directive; only Yandex does. For that reason, it is not ideal to use it unless you focus exclusively on Yandex SEO or have another valid reason. Also, the Host directive doesn’t let you specify a protocol scheme (http:// or https://).

The ideal way to display a particular hostname on search engines is to 301 redirect the hostname you don’t want to the one you want. You can usually set this up through your hosting account. A 301 redirect is not search engine specific; it works for all search engines and browsers.

Crawl-delay

Search engine crawlers can overload your website with too many crawl requests, Bing and Yandex in particular. Thankfully, Bing, Yahoo, and Yandex respond to the robots txt crawl-delay directive and slow down their crawl rate when they see it in the robots.txt.

However, the crawl-delay directive is just a temporary fix for server overload; to solve the issue permanently, you will have to move your website to better hosting.

Here’s how you can use this directive:

crawl-delay: 10

A crawl-delay value of 10 tells search engines to wait 10 seconds between two consecutive requests, which caps crawling at 8,640 pages a day (86,400 seconds in a day divided by 10).

It is ideal to use this directive when you don’t have much traffic coming from Bing, Yahoo, or Yandex.
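Crawl-delay is set per user-agent group, so you can give different bots different limits. A sketch with illustrative values:

User-agent: bingbot
Crawl-delay: 10

User-agent: Yandex
Crawl-delay: 5

Here Bingbot is asked to wait 10 seconds between requests and Yandex 5 seconds; pick values based on how much load your server can handle.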

Note that Google doesn’t respond to this directive. You can control the crawl rate of Google in Google Search Console settings.

How to create a robots txt file for your website?

You have plenty of options when it comes to creating your own robots.txt. Several tools let you generate robots txt files for free; you can either use one of them or write the file yourself.

If you don’t want to write the commands, you can go for SEOptimer’s robots.txt generator or Ryte’s Robots.txt generator. Select the user-agents, add the URLs of the pages you want to allow and disallow in their respective columns, and hit ‘create’ to generate a robots.txt for your site. Pretty easy, but the customization these tools offer is minimal.

If you can write the commands yourself, then the easiest way is to create a text file on your computer and edit it accordingly. Let’s see how to get it right:

Open Notepad, TextEdit, or any other plain-text editor on your computer. Avoid word processors like Microsoft Word, which can add hidden formatting to the file.

Add your robots.txt commands and directives. If you want to include a sitemap in robots.txt, make sure you add it at the very top or at the bottom.

Save the file as ‘robots.txt’. Make sure you use lowercase letters and choose txt as the file extension.

Now, log in to your cPanel or hosting account and upload this file to the root of your domain. If you’re using WordPress, that is usually the public_html folder.
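To give you a concrete starting point, here is a minimal robots.txt similar to WordPress’s default, with a placeholder sitemap URL added; adjust the paths to match your own site:

User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php

Sitemap: https://yourdomain.com/sitemap.xml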

How to check robots.txt for errors?

How would you know whether your robots.txt is working properly and hasn’t blocked any necessary pages?

You can use either Google Search Console or TechnicalSEO.com’s robots.txt tester. Google’s tester only checks whether the file works properly for Google’s own bots and crawlers.

To test using TechnicalSEO’s tool, open the tester, enter the URL you want to check, choose the user-agent from the drop-down menu, and click the ‘TEST’ button. The tool will tell you whether your robots.txt is working properly.

For example, when we test Starbucks’ robots.txt for the Googlebot Smartphone crawler, the result comes back with a warning next to the crawl-delay line. Hovering over the warning sign tells you the reason: in this case, Googlebot Smartphone ignores the crawl-delay directive.

Things you must know about robots.txt

Now that you know all the robots.txt user-agents and directives, you can create a valid robots.txt file on your own. However, there are still a few more critical things you must know about robots.txt.

If you don’t know these things, you may encounter some errors. An error in robots.txt is something that you can’t afford from an SEO perspective.

Here you go:

Case sensitivity can cause errors

The names of user-agents and robots.txt directives are not case sensitive, but the file name and the directive values (the paths) are. For example, if you use Disallow: /case-studies/, search engines will still crawl /Case-studies/.

Also, if you use an uppercase letter in the file name, it just won’t work. All the characters in the filename should be lowercase, and the file extension must be txt.

Disallowing doesn’t guarantee that it won’t be indexed

A page blocked by robots.txt may still get indexed if it is linked from other allowed pages. Such pages are marked ‘Indexed, though blocked by robots.txt’ in Google Search Console. For pages you don’t want indexed, it is better to use the noindex meta tag and leave the page crawlable so search engines can actually see the tag.

Also, note that there is no such thing as a robots txt noindex directive. Noindex is a meta tag that is added to individual pages, not to the robots.txt file.

Don’t disallow backlinked pages

If you are blocking a page that has good backlinks from authority websites, think again. The PageRank won’t flow into your website, which essentially means that those backlinks will be useless if you disallow that page.

Have a rule for all the bots

If you write rules for specific user-agents but never use the wildcard, bots you haven’t named will have no rules to follow at all. It is better to include a fallback block of rules for all bots.
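For example, assuming the paths below are placeholders for your own pages, you can combine a Googlebot-specific group with a wildcard fallback that every other bot will follow:

User-agent: Googlebot
Disallow: /not-for-google/

User-agent: *
Disallow: /admin-login

Remember that Googlebot will follow only its own group here, so anything you also want to hide from Google has to be repeated in the Googlebot block.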

Keep your robots.txt always up to date

It is common to set up the robots.txt during website development and then forget about it at launch. You may disallow important pages while they’re in development, but remember to allow those pages when the website goes live. If you don’t, all your SEO efforts will be wasted until search engines are allowed to crawl those pages.

Frequently asked questions

What is robots txt used for?

A robots.txt file is used to tell bots and search engine crawlers which URLs they may crawl and which they may not. Search engines can choose to follow or ignore the robots.txt, but most of them, including Google, respect its directives.

Where can I find robots txt?

The robots.txt must be located at the root of the domain. You can find your robots.txt file at yourdomain.com/robots.txt.

Does my website need a robots txt file?

Even if you don’t have a robots.txt, your website will be fine. But having a robots txt file gives more flexibility and control over which pages of your site search engines can crawl and index.

What happens if robots txt is missing?

If your website doesn’t have a robots.txt file, search engine bots will crawl and index all of your website pages. This may include irrelevant pages, resource pages, and the admin login page.

Is robot txt good for SEO?

Yes, having a robots.txt file is good for SEO. It helps you block irrelevant/under-development pages from search engines and control which pages of your website get indexed on search engines.

How do I add robots txt to Webflow?

To add robots txt to Webflow, go to Project Settings→ SEO→ Indexing, add the robots txt rules you want to use, and click on the ‘Save Changes’ button in the top right corner.

How do I optimize a robots txt file?

Robots txt best practices for SEO:

1. Make sure you are disallowing only irrelevant pages and all the important pages are allowed in robots.txt.

2. When you don’t want a page to be indexed at all, such as pages with sensitive data, make sure you either protect it with a password or use the noindex meta tag.

3. It’s better to add your sitemap to robots.txt. Use the Sitemap directive either at the top or at the bottom of the file.

4. Some search engines have multiple crawlers. When blocking a page for a search engine, make sure you mention all the user-agents of that search engine (see the example after this list).

5. Submit your robots.txt URL in Google Search Console and to other search engines whenever you make major changes to your robots.txt.
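For instance, to block a hypothetical /private-page/ from Google’s main, image, and news crawlers at once, you can list several user-agents above a single set of rules:

User-agent: Googlebot
User-agent: Googlebot-Image
User-agent: Googlebot-News
Disallow: /private-page/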

Should a sitemap be in robots txt?

It is good to have your sitemap mentioned in your robots.txt since it helps search engines locate your sitemap, thereby understanding the pages you want them to index.

What should I disallow in robots txt?

Any pages that are irrelevant for search engine traffic or have sensitive data should be disallowed in robots.txt. You can block resource pages and non-public pages as well.

How to stop bots from crawling my site?

To disallow all robots from crawling your site, you can use the ‘robots.txt disallow all’ command in your robots.txt file. Here’s what it looks like:

User-agent: *
Disallow: /

Note that this command will block all crawlers and bots that respect robots.txt directives, including Google, Bing, and Yahoo. Your pages won’t be crawled, and most of them won’t be indexed, even if you add a sitemap to robots.txt or Google Search Console.

To stop a specific bot, replace the User-agent value with the identifier of that bot.
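For example, using the identifiers from the user-agent table earlier in this article, blocking only Baidu’s main crawler looks like this:

User-agent: baiduspider
Disallow: /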

Can a page that's disallowed in robots txt still be indexed?

Yes, a page that’s disallowed in robots.txt can still be indexed if it is linked from other pages on your website or from other websites. Also, search engines can choose to ignore robots.txt rules. To reliably keep a page out of search results, use the noindex meta tag on that page and leave it crawlable; if you block it in robots.txt, search engines may never see the noindex instruction.

Does Baidu respect robots.txt?

Yes, Baidu respects robots.txt rules and will not crawl, index, or show the content of pages blocked by robots.txt.

What is crawl delay in robots txt?

Crawl-delay in robots.txt tells search engines to wait a specified number of seconds between two consecutive requests. Search engines can crawl your website aggressively, which may overload your server. The crawl-delay directive helps in that case, but it is supported only by Yahoo, Bing, and Yandex.

How do I test if my robots txt file blocks Google crawlers?

To test your robots.txt, you can use the Google robots.txt tester tool. Once you select your property, you can see if there are any errors or warnings.

To check if a page is blocked from Google crawlers, just paste the URL in the box at the bottom and click on the ‘TEST’ button. It’ll tell you whether it’s allowed or not.

However, note that the Google robots txt tester works only for Google crawlers. To test whether a page is allowed for other search engines, you can use TechnicalSEO.com’s tester mentioned earlier.