What is robots.txt?

robots.txt is a text file located at the root of a website that tells web crawlers which pages, folders, or URLs they may crawl and which they may not. Its purpose is to help search engines crawl your site more efficiently. It is not a security tool, and on its own it does not keep a page out of Google's results.

In practice, robots.txt is useful when you want to keep bots from wasting time on areas with no SEO value, like admin sections, login pages, internal search, parameter-based filters, or staging paths. On larger websites, this helps Google and other bots focus their crawling on the content that actually matters.

What is the robots.txt file?

robots.txt is part of the Robots Exclusion Protocol, the protocol that defines how web crawlers read a site's instructions before they start crawling. Essentially, it is the set of rules that the robots.txt file is based on. This protocol is now officially standardized through RFC 9309, which defines the basic syntax and how its directives are interpreted.

The file has to be in the right place, meaning at the root of the host. For example:

https://www.example.com/robots.txt

If you put it in a subfolder, it won't apply to the whole site the way you expect. Google explicitly says that robots.txt has to be located at the root of the host it applies to.

What is robots.txt used for?

robots.txt is used to control crawling. With it, you can tell bots to avoid URLs that you don't want to spend crawl resources on, like admin pages, filtered URLs, internal search pages, temporary test sections, or areas with no organic value. Its role is to manage crawler traffic more efficiently, not to guarantee removal of content from the index.

In simple terms, robots.txt says, "don't crawl this path." It does not say, "definitely remove it from Google." That's the most basic distinction you need to understand from the start.
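As a quick illustration, a minimal rule set that only limits crawling might look like this; the paths are hypothetical and would need to match your own site's URLs:

User-agent: *
Disallow: /search/
Disallow: /login/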

Why is it important for SEO?

robots.txt is important for SEO because it can help search engines focus on the pages that have the most value. When a site has a lot of low-value URLs, like filters, sort parameters, or internal search results, using robots.txt correctly can reduce unnecessary crawling and make the site's overall technical structure cleaner.

That doesn't mean every website needs a complicated robots.txt. On smaller, simpler sites, a very lean file or just a sitemap declaration may be enough. Its value increases mainly in more complex architectures. This approach also matches the direction of the official guidelines, which treat it as a management tool and not as a required "SEO hack."

How do I create a robots.txt file?

Creating robots.txt is simple. Open a text editor, write your rules, save the file as robots.txt in plain text format, and upload it to the root of your host. Google recommends that the file be plain text and that you avoid formatted documents that may add unwanted characters.

A simple example looks like this:

User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php

Sitemap: https://www.example.com/sitemap.xml

The example above tells all bots not to crawl /wp-admin/, but to allow admin-ajax.php, while also declaring the XML sitemap. The core directives User-agent, Disallow, Allow, and Sitemap are among the most common and expected in robots.txt files.

What are the basic robots.txt directives?

The most basic directive is User-agent, which identifies which bot the rules apply to; * means "all bots." Disallow specifies a path that should not be crawled, for example a folder of duplicate or low-value content, while Allow permits a more specific path even when its parent folder is blocked. Sitemap declares the location of the XML sitemap for easier discovery.

Examples:

User-agent: *
Disallow: /private/

User-agent: Googlebot
Allow: /private/public-file.pdf

Sitemap: https://www.example.com/sitemap.xml

The standard and the official guidelines agree on the basic logic of these directives, although not all search engines support every non-basic rule in the same way.

robots.txt example for WordPress

On a typical WordPress site, a safe and usually sufficient example is the following:

User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php

Sitemap: https://www.example.com/sitemap.xml

The logic here is simple: block the admin area from unnecessary crawling without breaking functions needed for the front end. In many cases, the better practice is not to "load up" robots.txt with dozens of lines, but to keep it as clean and targeted as possible.

robots.txt example for Shopify

On a typical Shopify store, you usually don't need to create a brand-new robots.txt from scratch, because the platform already provides a default robots.txt file that's optimized for SEO. But if you want to make changes, you do that through the robots.txt.liquid file.
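For orientation, the default file already blocks paths along these lines. This is illustrative of the kind of rules it contains, not a verbatim copy of Shopify's current defaults, and the store URL is a placeholder:

User-agent: *
Disallow: /cart
Disallow: /checkout
Disallow: /account
Disallow: /search

Sitemap: https://your-store.com/sitemap.xml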

Where does robots.txt go?

robots.txt has to go in the root of the host. For example, if you want to control crawling for https://www.example.com/, the file has to be at https://www.example.com/robots.txt. It can't be placed in a random subfolder and still be expected to work everywhere.

Also, robots.txt is applied at the host and protocol level. One robots.txt may apply to the main domain and another to a subdomain, like shop.example.com. The file's location is technically critical to correct application.
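For example, these are read as two separate files, each governing only its own host, and a file served over http:// does not cover the https:// version of the host:

https://www.example.com/robots.txt applies only to www.example.com
https://shop.example.com/robots.txt applies only to shop.example.com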

How do I check whether robots.txt is working correctly?

The first check is to open /robots.txt in your browser and see whether the file loads correctly, in plain text form, and from the right location. After that, you need to confirm that you haven't accidentally blocked important URLs and that the sitemap URL is correct.

To check whether the file is error-free, the best-known methods are opening it directly in a browser, reviewing the robots.txt report in Google Search Console, and running it through a third-party robots.txt validator.

You also need to check whether the file is clean, whether it has correct syntax, and whether it isn't too large. Google mentions a limit of 500 KiB for robots.txt; content beyond that limit may be ignored.

Can I block pages from Google with robots.txt?

Not in the way most people imagine. robots.txt can prevent crawling, but it does not guarantee that a page won't appear in search results. Google explains that if it knows about a URL from links or other sources, it may still show it even if the content couldn't be crawled because of robots.txt.

So if your goal is "to keep a page out of the index," robots.txt is not the right tool. In that case, you need a different solution, like noindex or X-Robots-Tag, depending on the type of content.

What is the difference between robots.txt and noindex?

The difference is critical. robots.txt controls whether a bot can crawl a URL. noindex controls whether a page should stay out of the index. These two functions are not the same and should not be confused.

If you block a page with robots.txt and expect the search engine to see the noindex, then you've created a contradiction. You may have stopped the bot before it could read the instruction that would tell it not to index the page. That's why robots.txt should not be used as a substitute for noindex.

Robots.txt vs noindex vs canonical vs X-Robots-Tag

If you want to control crawling, use robots.txt. If you want to control indexing, use noindex. If you want to consolidate signals between similar or duplicate URLs, use canonical. And if you want to give crawlers instructions for non-HTML files like PDFs, or at the HTTP header level, use X-Robots-Tag.

Google explicitly documents meta robots and X-Robots-Tag for these cases.

This is one of the most important decision frameworks in technical SEO: one tool for crawl control, another for index control, another for canonicalization. The more clearly you separate them, the fewer technical mistakes you'll make.
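As a side-by-side sketch, the four mechanisms look like this in practice; the paths and URL are placeholders:

In robots.txt (crawl control):
Disallow: /internal-search/

Meta robots tag in the page's HTML head (index control):
<meta name="robots" content="noindex">

Canonical link element in the HTML head (signal consolidation):
<link rel="canonical" href="https://www.example.com/preferred-url/">

HTTP response header for non-HTML files like PDFs (index control):
X-Robots-Tag: noindex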

Does robots.txt protect sensitive content?

No. The official standard is clear that robots.txt does not constitute a "form of access authorization". In other words, it's not a security system. If a page is publicly accessible and someone knows the URL, robots.txt by itself does not protect it.

If you want real content protection, you need access control, login, password protection, or some other server-side restriction. Google says the same thing very clearly: robots.txt is not the right solution for keeping information secure.

Which pages are worth blocking in robots.txt?

It is worth blocking pages and paths that have no organic value and that you don't want consuming crawl resources. Usually, this category includes admin areas, login pages, carts, checkouts, internal search result pages, parameter-heavy filter pages, test sections, and staging environments. These are among the most common and most logical use cases in technical SEO.

But you don't need to block "everything just to be safe." The right robots.txt is based on intent and purpose. If a URL has SEO value or needs to render correctly and be understood by search engines, blocking it can do more harm than good.
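A lean, targeted file built on that logic might look like this; the paths are generic examples and must be adapted to your own site's structure:

User-agent: *
Disallow: /admin/
Disallow: /login/
Disallow: /cart/
Disallow: /checkout/
Disallow: /search/

Sitemap: https://www.example.com/sitemap.xml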

Should I put a sitemap in robots.txt?

Yes, it's a very good practice to declare the XML sitemap in robots.txt. It's not the only way a sitemap can be found, but it is a clean and useful signal to crawlers. The Sitemap directive is widely supported and documented in the relevant guidelines.

Example:

Sitemap: https://www.example.com/sitemap.xml

On a technically well-organized site, having a proper sitemap declaration in robots.txt is considered almost a given.

Does every website need robots.txt?

Not necessarily. Google presents it as a tool used when there is a need to control crawler access or crawler traffic. If you have a small site, without a complex structure, without filters, without staging areas, and without technical sections that create noise, you may not need a complex robots.txt at all.

Often, a simple file or even just the site's proper technical structure and XML sitemap are enough. Complexity should only be added when there is a real reason.

robots.txt for WordPress, e-commerce, blogs, and staging sites

On a WordPress site, the most common use case is to limit crawling of /wp-admin/ and allow files needed for the front end, like admin-ajax.php. That's a classic and practical scenario.

On an e-commerce site, robots.txt is often used to limit crawling of filter combinations, sorting URLs, carts, checkout pages, and other paths with no search value that just create noise. The strategy there is to protect crawl efficiency and avoid flooding the crawl with low-value parameter URLs.

On a blog or news site, robots.txt is usually simpler. The goal is not to accidentally block sections that need crawling and to clearly declare the sitemap. On these sites, oversimplifying is often better than overconfiguring.

In staging or dev environments, robots.txt can be used to limit crawling, but it is not enough as a protection method. Because it is not a security mechanism, a staging site that should not be public needs to be protected with authentication or some other access control.

How are Allow and Disallow rules interpreted?

The standard does not say that "the last line wins." The logic is more specific: the rule that makes the most specific match to the path wins. If there are equivalent matches, then Allow takes precedence.

Example:

User-agent: *
Disallow: /private/
Allow: /private/public-file.pdf

In the example, the /private/ folder is blocked, but the specific file is allowed because the Allow rule is more specific.

Does robots.txt support wildcards?

Yes. The standard describes the use of special characters like * for wildcard matching and $ to indicate the end of a pattern. They're useful in advanced scenarios, especially when you want to handle parameter URLs or specific file types.

Example:

Disallow: /*?sort=
Disallow: /*.pdf$

The first blocks URLs that contain a sort parameter, and the second blocks URLs that end in .pdf. But you need to be careful, because more complex patterns also mean a greater chance of mistakes.

Does it matter if I write /Admin/ instead of /admin/?

Yes, it can. The RFC defines path matching on the exact characters of the path, which makes it effectively case-sensitive: a rule written as /Admin/ will not match /admin/. On servers where paths are case-sensitive, that difference is a big deal.

Rules should be written exactly in the same form the site itself uses in its URLs. Small syntax mistakes in robots.txt can have disproportionately large effects.
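For example, assuming the site's URLs actually use lowercase paths:

Disallow: /Admin/   # does not match /admin/ on a case-sensitive server
Disallow: /admin/   # matches the lowercase path the URLs actually use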

What about special characters and international URLs?

RFC 9309 also describes the comparison logic for characters that are not plain ASCII or that require percent-encoding. It's a more specialized topic, but it becomes important on multilingual sites, legacy systems, and URLs with special characters.

You may not run into it often on simple sites. But on international projects, handling encoded paths and special characters correctly is part of a truly solid technical SEO setup.

How quickly do robots.txt changes take effect?

Changes don't always take effect right away. The RFC describes caching behavior, and Google's guidelines explain that robots.txt can be cached, generally for up to 24 hours, and may not be re-read the instant it changes. A fix to the file may therefore take a little time before it affects crawling.

So when you make an important change, it isn't enough to just upload a new file. You also need to check that it loads correctly, that it doesn't return an error, and that the new version is actually the one the crawler sees.

What happens if robots.txt returns an error?

If robots.txt is unreachable, returns errors, or responds inconsistently, that can affect crawler behavior. RFC 9309 and the search engines' documentation treat fetching, caching, and error handling as essential parts of the protocol: a clean 404 is generally treated as if no robots.txt exists, while persistent server errors can temporarily be treated as if the whole site were disallowed.

That means a robots.txt that fails, redirects for no reason, or is in the wrong location can create technical uncertainty. That's why correct hosting, stable responses, and clean access to /robots.txt are basic requirements.

Common robots.txt problems and how to fix them

The most serious mistake is accidentally blocking the entire site with something like this:

User-agent: *
Disallow: /

That tells bots not to crawl anything. It's a common mistake in staging-to-live migrations and can go unnoticed if no technical check is done after launch.

A second common problem is that Google keeps showing URLs you've blocked. That does not necessarily mean robots.txt "isn't working." It may simply mean that the URL is still known from links or other sources and that a crawling block is not the same as deindexing. In that case, the right tool is noindex or another appropriate index-control method.

Another common mistake is blocking a page with robots.txt while also relying on a noindex tag to get it removed from the index. The crawl block can keep the search engine from ever seeing the noindex directive. The correct solution is to leave crawl access open, at least temporarily, so the noindex can be read.
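A sketch of the contradiction, using a hypothetical /old-page/ URL:

In robots.txt (the mistake: this block keeps the tag below from ever being read):
Disallow: /old-page/

On the page itself (the directive that actually handles indexing):
<meta name="robots" content="noindex">

Removing the Disallow until the page has been recrawled and dropped from the index resolves the conflict.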

Finally, old migration rules, too many directives, bad file encoding, or a misplaced sitemap often create problems. The simpler and better maintained robots.txt is, the safer it becomes.

Do all engines support the same directives?

No. Google clearly says that not all search engines support the same robots.txt rules, and practical documentation differs between Google, Bing, and Yandex. That means you should not assume that every search engine behaves the same way everywhere.

So when you're writing robots.txt for an international or multi-engine environment, the safest approach is to rely on the basic directives and know what each platform specifically supports.

What about Bing and crawl-delay?

Bing documents crawl-delay and also offers ways to adjust crawl speed through its webmaster tools. That's an important difference from Google, which ignores the crawl-delay directive in robots.txt.

That doesn't mean every site needs crawl-delay. In fact, even Bing presents it as something to use when there is a real server-load issue. For most websites, overconfiguring these parameters isn't necessary.
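If there is a genuine server-load issue, a crawl-delay rule for Bing's crawler looks like the sketch below; a higher value asks for slower crawling, and 10 is just an example value:

User-agent: bingbot
Crawl-delay: 10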

What is Clean-param and when does it matter?

Clean-param is a directive associated with Yandex and is used for URLs that contain parameters which don't materially affect the content, such as UTM-style tracking tags. Yandex documents it explicitly as a way to handle these cases.
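Its syntax lists the parameters to ignore, separated by &, optionally followed by the path they apply to. The parameter names and the /blog/ path below are illustrative:

User-agent: Yandex
Clean-param: utm_source&utm_medium&utm_campaign /blog/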

When to use robots.txt and when not to

Use robots.txt when you want to limit crawling in sections with no SEO value or in technical paths that shouldn't consume crawl resources. Don't use it when your goal is to permanently remove a page from the index or protect sensitive content. For those cases, you can use other tools, such as noindex, X-Robots-Tag, or access control.

Best practices for a proper robots.txt

A proper robots.txt usually has the following characteristics:

  • it's small and clean

  • it has a clear purpose

  • it contains only what is needed

  • it doesn't accidentally block important sections

  • it includes the sitemap

  • it is not used as a security tool

  • it does not confuse crawling with indexing

The clearer its role is, the more useful it becomes for your site.

Frequently asked questions about robots.txt

Can I hide pages from Google with robots.txt?

No. robots.txt controls crawl access, not the final indexing status of a URL.

Where do I upload robots.txt?

At the root of the host, meaning at /robots.txt.

Is robots.txt required?

No. It's useful when there is a need to control crawling, but it's not required for every site.

Should I put a sitemap in robots.txt?

Yes, it's considered a very good practice.

Does robots.txt protect private content?

No. It's not a security or authorization mechanism.

What's the difference between robots.txt and meta robots?

robots.txt controls crawling, while meta robots is used for page-level directives like noindex.

Can I have a different robots.txt on a subdomain?

Yes, because robots.txt applies at the host level.

What does Disallow: / mean?

It means you're asking bots not to crawl anything under that host.

When do I use Allow?

When you want to allow a more specific path within a broader blocked section.

Should I block CSS and JavaScript?

Not without a reason, because the crawler may need access to resources for proper rendering and page understanding. The safe strategy is to block only what has a clear reason not to be crawled.

Conclusion

robots.txt is one of the most useful, but also one of the most misunderstood, files in technical SEO. When used correctly, it helps manage crawling, reduce wasted crawler traffic, and better organize complex websites. When used incorrectly, it can block valuable pages, confuse the indexing strategy, and create technical problems that aren't obvious right away.

The best mindset is to treat it as a crawl control tool, not a security tool or a deindexing tool. Keep it simple, strategic, and technically correct. That's when robots.txt stops being a "small technical detail" and becomes part of a truly advanced SEO strategy.
