The robots.txt Syntax Guide: User-agent, Allow, Disallow, Sitemap

The first time I shipped a robots.txt file by hand, I deindexed a client's blog for nine days. One stray Disallow: / under the wrong group, one upload, and Google quietly dropped every URL. The file is four directives and maybe ten lines, yet it sits at the exact point where a typo costs you traffic. This guide walks through the syntax line by line, clears up the single biggest misconception about what the file actually does, and tells you precisely where to put it.

The four directives that do everything

A robots.txt file is a plain text document made of groups. Each group starts with a User-agent line that names which crawler the rules apply to, followed by Allow and Disallow lines that govern paths. Sitemap lines live outside any group and point to your sitemap.

User-agent — the crawler name this block targets. * is the wildcard catch-all. Googlebot, Bingbot, GPTBot, and ClaudeBot are common specific targets. A crawler reads only the most specific group that matches its name; if GPTBot has its own block, it ignores the * block entirely.
Disallow — a URL path prefix the crawler should not fetch. Disallow: /admin/ blocks everything under /admin/. An empty Disallow: means "block nothing," which is how you explicitly allow a bot.
Allow — a path prefix that overrides a broader Disallow in the same group. This is the carve-out directive: block a folder, then allow one file inside it.
Sitemap — an absolute URL to your XML sitemap. It is not tied to any User-agent group and can appear anywhere in the file. You can list more than one.
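
To make the grouping rules concrete, here is a minimal Python sketch of how a crawler might split a file into groups and pick the one that applies to it. The function names parse_groups and select_group are my own, the substring match on the crawler name is a simplification, and the merging of groups that name the same agent follows my reading of RFC 9309. It is not how any particular crawler is implemented.

```python
def parse_groups(text):
    """Return ({lowercased user-agent: [(directive, path), ...]}, [sitemap URLs])."""
    groups = {}
    sitemaps = []
    current_agents = []      # agents named by the group currently being read
    reading_agents = False   # True while consecutive User-agent lines stack up
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments and whitespace
        if ":" not in line:
            continue
        key, value = (part.strip() for part in line.split(":", 1))
        key = key.lower()
        if key == "user-agent":
            if not reading_agents:            # a new group starts here
                current_agents = []
                reading_agents = True
            agent = value.lower()
            current_agents.append(agent)
            groups.setdefault(agent, [])      # repeated agents share one rule list
        elif key in ("allow", "disallow"):
            reading_agents = False
            for agent in current_agents:
                groups[agent].append((key, value))
        elif key == "sitemap":
            sitemaps.append(value)            # Sitemap lines belong to no group
    return groups, sitemaps


def select_group(groups, crawler_name):
    """Pick the rules of the most specific matching agent, falling back to *."""
    name = crawler_name.lower()
    matches = [a for a in groups if a and a != "*" and a in name]
    if matches:
        return groups[max(matches, key=len)]  # longest agent token wins
    return groups.get("*", [])
```

Fed a file with a GPTBot group and a * group, select_group(groups, "GPTBot") returns only the GPTBot rules, while any crawler without its own group gets the * rules.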

Matching follows a "longest match wins" rule for Googlebot and Bingbot: between a Disallow: /downloads/ and an Allow: /downloads/pricing.pdf, the longer, more specific Allow path wins for that one file. This behavior is now codified — the Robots Exclusion Protocol was published as RFC 9309 in 2022, and Google's own robots.txt documentation spells out how its crawlers apply these rules. Before RFC 9309 the protocol was a 1994 informal convention, which is exactly why older or smaller crawlers sometimes implement Disallow but not Allow.
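
The longest-match rule can be sketched in a few lines of Python. This is a simplified illustration under my own assumptions: it treats every rule as a plain path prefix, ignores the * and $ wildcards that real crawlers also understand, and breaks a tie between equally long rules in favor of Allow. The function name is_allowed is hypothetical.

```python
def is_allowed(rules, path):
    """rules: list of (directive, path_prefix) tuples taken from one group."""
    best_len = -1
    allowed = True   # no matching rule means the path may be fetched
    for directive, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue   # an empty pattern matches nothing; skip non-matching prefixes
        is_allow = directive == "allow"
        # A longer prefix beats a shorter one; on a tie, Allow wins.
        if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
            best_len = len(prefix)
            allowed = is_allow
    return allowed


rules = [("disallow", "/downloads/"), ("allow", "/downloads/pricing.pdf")]
print(is_allowed(rules, "/downloads/pricing.pdf"))  # True
print(is_allowed(rules, "/downloads/archive.zip"))  # False
```

Note that the order of the rules in the file plays no part here; only the length of the matching prefix does.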

A real file, line by line

Here is a working robots.txt for a content site that wants Google in, training scrapers out, and its sitemap advertised:

User-agent: *
Disallow:

User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: *
Disallow: /search/
Allow: /search/help.html

Sitemap: https://example.com/sitemap.xml

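Read it top to bottom. The first group allows every crawler everything (Disallow: with nothing after it). The next three groups name specific AI training bots and block them outright with Disallow: /. The fifth group blocks the internal /search/ results, which waste crawl budget and generate thin pages, while keeping /search/help.html reachable through the longer Allow path. Because the first and fifth groups both name *, RFC 9309 has crawlers merge them into a single rule set, so the empty Disallow: in the first group is redundant but harmless. Some older parsers may honor only the first * group, so folding both into one group is the safer layout. The final line points every crawler at the sitemap. That is the whole grammar for these four directives.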

If hand-assembling groups and remembering which AI user-agents to name sounds tedious, that is the job our robots.txt Generator was built for. It stacks User-agent groups visually, rebuilds the output on every keystroke, and ships presets for "Allow all," "Block all," and a curated "Block AI scrapers" list covering GPTBot, ClaudeBot, CCBot, PerplexityBot, Google-Extended, and the rest of the training crawlers that show up in real server logs.

The misconception that breaks SEO: access versus indexing

Here is the single most important thing to internalize, and it trips up experienced developers: robots.txt is a request, not a wall, and Disallow does not mean "hide from search."

Two separate facts hide inside that sentence. First, the file is advisory. Well-behaved crawlers — Googlebot, Bingbot, GPTBot, ClaudeBot — read it and comply because they choose to. A malicious scraper, a spoofed user-agent, or a bot rotating through IP addresses simply ignores it. There is no enforcement at the protocol level. If you genuinely need to stop a fetch, you need server-side blocks: a firewall rule on the user-agent string, Cloudflare's bot controls, or per-IP rate limits. robots.txt handles the polite majority; it does nothing for the rest.
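
As one illustration of a server-side block, here is a small WSGI middleware sketch in Python that refuses requests whose User-Agent header contains a listed token. The token list and the name block_agents are assumptions of mine, and this only stops bots that announce themselves honestly. A spoofed user-agent still needs IP-level or rate-limit controls.

```python
BLOCKED_AGENT_TOKENS = ("gptbot", "ccbot", "claudebot")  # illustrative list only


def block_agents(app, blocked=BLOCKED_AGENT_TOKENS):
    """Wrap a WSGI app so requests from listed user-agents receive a 403."""
    def middleware(environ, start_response):
        agent = environ.get("HTTP_USER_AGENT", "").lower()
        if any(token in agent for token in blocked):
            # Refuse the fetch outright instead of relying on the bot's manners.
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        return app(environ, start_response)
    return middleware
```

Unlike a Disallow line, this is enforced by your server, so it applies whether or not the crawler ever reads robots.txt.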

Second, and more subtle: a Disallowed URL can still appear in Google's index. If another site links to a page you blocked, Google knows the URL exists — it just is not allowed to fetch the contents. The result is a search listing with no description, the dreaded "No information is available for this page." To actually keep a page out of results, you need a <meta name="robots" content="noindex"> tag or the equivalent HTTP header — and the crawler must be allowed to fetch the page to see that directive. So Disallow and noindex are mutually exclusive strategies: use Disallow to save crawl budget on URLs you do not want fetched, and use noindex to keep pages crawlable but out of the index. Blocking a page in robots.txt and expecting it to vanish from search is the most common mistake I see, and it does the opposite of what people intend.
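
The HTTP header equivalent of the meta tag is X-Robots-Tag. A minimal Python sketch of sending it from a WSGI app follows; the path prefixes and the name add_noindex are hypothetical, and the paths must stay crawlable (not Disallowed) for the header to be seen at all.

```python
NOINDEX_PREFIXES = ("/drafts/", "/thank-you/")  # hypothetical paths


def add_noindex(app, prefixes=NOINDEX_PREFIXES):
    """Wrap a WSGI app so matching paths are served with X-Robots-Tag: noindex."""
    def middleware(environ, start_response):
        path = environ.get("PATH_INFO", "")

        def patched_start(status, headers, exc_info=None):
            if path.startswith(prefixes):   # str.startswith accepts a tuple
                headers = list(headers) + [("X-Robots-Tag", "noindex")]
            return start_response(status, headers, exc_info)

        return app(environ, patched_start)
    return middleware
```

The header route is handy for PDFs and other non-HTML files, where there is no place to put a meta tag.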

Where the file goes (and where it does not)

Placement is non-negotiable and there is exactly one correct location: the root of the host, served at https://yourdomain.com/robots.txt. Crawlers fetch that precise path and nowhere else. A few rules that follow from this:

It must be at the root, not in a subfolder. https://yourdomain.com/blog/robots.txt is invisible to crawlers; only the host root is checked.
It is per-host and per-protocol. https://example.com/robots.txt does not govern https://shop.example.com/ or https://www.example.com/. Each subdomain needs its own file. So does HTTP versus HTTPS if you serve both.
It must return HTTP 200, ideally with a text/plain content type. A 404 means "crawl everything"; a 5xx error makes Google treat the whole site as disallowed at first and, if the outage drags on, fall back to its last cached copy of the file, which is its own quiet disaster.
The filename is lowercase, exactly robots.txt. Robots.txt or robots.TXT will not be found.
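
The placement rules above boil down to two small functions, sketched here in Python. The names robots_url and interpret_status are mine, and the status mapping is a deliberate simplification of the behavior described above: it leaves out redirects and the special cases individual crawlers apply.

```python
from urllib.parse import urlsplit


def robots_url(page_url):
    """Return the one robots.txt URL that governs page_url (same scheme and host)."""
    parts = urlsplit(page_url)
    return f"{parts.scheme}://{parts.netloc}/robots.txt"


def interpret_status(status):
    """Roughly how a crawler reads the HTTP status of the robots.txt fetch."""
    if 200 <= status < 300:
        return "parse"          # read the rules in the body
    if 400 <= status < 500:
        return "allow-all"      # no file: nothing is restricted
    if status >= 500:
        return "disallow-all"   # server trouble: assume everything is blocked
    return "other"              # redirects and the like, not modeled here


print(robots_url("https://shop.example.com/blog/post?x=1"))
# https://shop.example.com/robots.txt
```

Because the scheme and host are copied straight from the page URL, http://example.com/ and https://www.example.com/ each resolve to a different robots.txt.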

One more practical note: Google generally caches robots.txt for up to 24 hours per host. After you upload a change, expect it to take effect within a few hours, worst case about a day. You can ask for a faster refresh through the robots.txt report in Search Console, which offers a recrawl request for the file. When you are also publishing structured metadata for those pages, our meta tag generator pairs naturally with this workflow: robots rules control fetching, meta tags control how the fetched pages present in results.

A short checklist before you upload

Re-read your User-agent: * block and confirm there is no accidental Disallow: /.
Confirm the file is at the host root and returns 200 text/plain.
Add at least one Sitemap line; it costs nothing, and crawlers that offer no webmaster console for submitting a sitemap can discover yours this way.
Remember that anything you Disallow may still be indexed via backlinks — use noindex for true removal.
Test a few live URLs in Search Console after the cache window passes.
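
The first item on that checklist is easy to automate. Here is a minimal Python sketch of a pre-upload check that reports any Disallow: / sitting under User-agent: *. The function name find_blanket_blocks is hypothetical, and the parsing is intentionally naive: it assumes one directive per line and no wildcards.

```python
def find_blanket_blocks(text):
    """Return line numbers of 'Disallow: /' rules that apply to User-agent: *."""
    hits = []
    agents, reading_agents = [], False
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()   # drop comments and whitespace
        if ":" not in line:
            continue
        key, value = (part.strip() for part in line.split(":", 1))
        key = key.lower()
        if key == "user-agent":
            if not reading_agents:            # a new group starts here
                agents, reading_agents = [], True
            agents.append(value)
        elif key in ("allow", "disallow"):
            reading_agents = False
            if key == "disallow" and value == "/" and "*" in agents:
                hits.append(number)
    return hits


draft = "User-agent: GPTBot\nDisallow: /\n\nUser-agent: *\nDisallow: /\n"
print(find_blanket_blocks(draft))  # [5]
```

An empty list means no catch-all block was found; the GPTBot rule on line 2 is ignored because it does not apply to *.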

Four directives, one location, one big misconception to avoid. Get those right and robots.txt quietly does its job for years. Get the Disallow: / wrong and you will, like me, spend a tense week watching URLs reappear in the index one slow crawl at a time.

Made by Toolora · Updated 2026-06-13