llms.txt vs robots.txt vs ai.txt: AI Crawler Guide

Three Files, Three Jobs, and the Confusion Tax

If you've spent any time in operator Slacks or marketing newsletters lately, you've probably been told to "add an llms.txt" the same way people once told you to add a sitemap. The advice is usually short on detail and short on accuracy. Some of it suggests llms.txt will get you cited in ChatGPT. Some of it implies it controls crawling. Neither is true.

Three files with similar-sounding names showed up over the past few years, each solving a different problem:

robots.txt controls whether a crawler can fetch your pages at all. It's been around since 1994 and has actual teeth, in the sense that legitimate operators honor it.
ai.txt is a permission and licensing statement aimed at AI training. It tells operators what you do and don't consent to. It does not block anything.
llms.txt is a curated index for AI coding agents and similar tools. It tells a developer agent which docs matter and where to find them. It is not a crawl directive and not a citation request.

Mixing these up is costly. Block the wrong bot and you lose visibility in AI Overviews. Trust the wrong file to stop training and you end up in someone's dataset anyway. Add llms.txt because a blog told you it boosts rankings, and you've added maintenance overhead for zero ranking signal.

robots.txt for AI Crawlers: What Actually Works in 2026

robots.txt is the only one of the three files that has broad, deliberate support from the major AI crawler operators. OpenAI, Anthropic, Google, Meta, Common Crawl, Perplexity, and Apple all publish user-agent strings and instructions for blocking them via robots.txt. The compliance isn't legally binding, but the major operators do follow the directive in practice, and getting caught violating it tends to be a PR disaster.

Here's the user-agent menu you actually need to know about in 2026:

Bot Name	Operator	Purpose	robots.txt User-agent Line
GPTBot	OpenAI	Training data for ChatGPT	`User-agent: GPTBot`
OAI-SearchBot	OpenAI	Indexing for ChatGPT search results	`User-agent: OAI-SearchBot`
ChatGPT-User	OpenAI	User-initiated fetches (browsing)	`User-agent: ChatGPT-User`
ClaudeBot	Anthropic	Training data for Claude	`User-agent: ClaudeBot`
Claude-SearchBot	Anthropic	Indexing for Claude search	`User-agent: Claude-SearchBot`
Google-Extended	Google	Training for Gemini and Vertex AI	`User-agent: Google-Extended`
CCBot	Common Crawl	Open web archive, fed into many models	`User-agent: CCBot`
Meta-ExternalAgent	Meta	Training data for Llama and Meta AI	`User-agent: Meta-ExternalAgent`
Bytespider	ByteDance	Training data for TikTok and Doubao	`User-agent: Bytespider`
PerplexityBot	Perplexity	Indexing for Perplexity Answers	`User-agent: PerplexityBot`
Applebot-Extended	Apple	Training for Apple Intelligence	`User-agent: Applebot-Extended`

A few things worth understanding before you start blocking:

Training and fetching are different jobs. GPTBot collects data to train the model. ChatGPT-User fetches a page when a user explicitly asks ChatGPT to read it. Block GPTBot but not ChatGPT-User, and you opt out of training while staying readable when users send your link to ChatGPT.

Search bots are separate. OAI-SearchBot, Claude-SearchBot, and PerplexityBot crawl for retrieval, not training. Blocking them removes you from those products' search results. If you care about being cited in ChatGPT, Claude, or Perplexity, leave those bots alone.

Google-Extended is opt-out for Gemini training only. Disallowing it doesn't affect regular Googlebot or your Google Search ranking. It's a separate user agent specifically so publishers can opt out of training without losing search traffic.

A reasonable starter configuration for a content site that wants AI visibility without being a training corpus looks like this:

# Block training bots
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: Applebot-Extended
Disallow: /

# Allow search and user-fetch bots
User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Claude-SearchBot
Allow: /

This pattern, blocking trainers while allowing fetchers and search bots, has become common among publishers. Per Originality.ai's tracking, 88% of top global news outlets now block at least one major AI training crawler. For commerce or SaaS sites the calculus is different: most leave training bots open because being in the training set helps brand recall in model outputs.
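One way to sanity-check a configuration like the one above is to run it through a robots.txt parser and ask, per user agent, whether a URL may be fetched. The sketch below uses Python's standard-library urllib.robotparser on an abbreviated copy of the starter file. The example.com URL and the check_agents helper are illustrative assumptions, and Python's matching rules are only an approximation of how any given crawler interprets the file.

```python
from urllib.robotparser import RobotFileParser

# Abbreviated copy of the starter configuration above.
ROBOTS_TXT = """\
# Block training bots
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

# Allow search and user-fetch bots
User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /
"""


def check_agents(robots_txt, agents, url):
    """Return {agent: bool} saying whether each agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


if __name__ == "__main__":
    agents = ["GPTBot", "ClaudeBot", "Google-Extended",
              "OAI-SearchBot", "ChatGPT-User", "Googlebot"]
    # Hypothetical URL; any path on the site works for a site-wide rule.
    results = check_agents(ROBOTS_TXT, agents, "https://example.com/articles/post")
    for agent, allowed in results.items():
        print(f"{agent:18} {'allowed' if allowed else 'blocked'}")
```

Agents with no matching group (Googlebot here) come back as allowed, which matches the intent of the starter file: only the named training bots are shut out.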

ai.txt: The Permission and Licensing Layer

ai.txt is a different beast. It was proposed by Spawning AI, the team behind Have I Been Trained, as a standardized file that expresses your training preferences in a structured, machine-readable way. The intent isn't to block crawlers. It's to declare consent.

A minimal ai.txt looks roughly like this:

User-Agent: *
Disallow: images/
Disallow: video/
Disallow: text/

Instead of URL paths, the Disallow lines in Spawning's spec name content types, so the example above signals "I don't consent to my images, video, or text being used for training." The file is meant to be read by good-faith training operators, dataset curators, and (in theory) auditors who want to know who opted out.

A few honest observations about ai.txt in 2026:

Adoption is thin. Most sites don't have one. The audience is dataset curators rather than mainstream crawler engineers, and the compliance loop is slower.
It's a signal, not a barrier. ai.txt doesn't prevent fetches. It expresses preferences. A crawler that ignores ai.txt isn't breaking anything technically; it's just behaving in an ethically questionable way.
It complements robots.txt. robots.txt says "do not crawl." ai.txt says "if you do crawl, here's what you may use it for."
It matters more for creator-heavy sites. Image hosts, art portfolios, music sites, and stock platforms are the most likely to use ai.txt because the licensing question is more acute for them.

If you care about being able to say "we expressed non-consent for training," ai.txt is worth adding. It's a five-minute change. If you only care about access control, robots.txt does more.

llms.txt: The Developer Discovery File

Now the file with the most hype and the most misunderstanding.

llms.txt was proposed by Jeremy Howard in September 2024, and the spec lives at llmstxt.org. Its purpose is narrow and specific. It's a markdown file at the root of a domain that gives AI coding agents (Cursor, Claude Code, Devin, and similar) a curated map of your documentation. The format looks like this:

# My Project

> A short description of the project so an LLM has context.

## Docs

- [Getting Started](https://example.com/docs/getting-started.md): Quick setup
- [API Reference](https://example.com/docs/api.md): Full API surface
- [Configuration](https://example.com/docs/config.md): Config options

## Optional

- [Changelog](https://example.com/changelog.md): Release notes

The format is intentionally simple. It's H1 (project name), blockquote (description), then sections of links. Each link points to a markdown version of the page. An agent reading llms.txt can quickly understand what your project does and where the canonical docs live, without parsing your full HTML, sidebar, and navigation.
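To make the structure concrete, here is a minimal sketch of how a tool might read that layout: first H1 as the title, first blockquote as the summary, then each H2 section as a list of links. The parse_llms_txt function and its regex are illustrative assumptions based only on the example above, not a reference parser from llmstxt.org.

```python
import re

# Matches "- [Title](url): optional note" list items.
LINK_RE = re.compile(
    r"^-\s*\[(?P<title>[^\]]+)\]\((?P<url>[^)]+)\)(?::\s*(?P<note>.*))?$"
)


def parse_llms_txt(text):
    """Parse the H1 / blockquote / H2-sections-of-links layout into a dict."""
    doc = {"title": None, "summary": None, "sections": {}}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("# ") and doc["title"] is None:
            doc["title"] = line[2:].strip()
        elif line.startswith("> ") and doc["summary"] is None:
            doc["summary"] = line[2:].strip()
        elif line.startswith("## "):
            current = line[3:].strip()
            doc["sections"][current] = []
        elif current is not None:
            match = LINK_RE.match(line)
            if match:
                doc["sections"][current].append({
                    "title": match["title"],
                    "url": match["url"],
                    "note": match["note"] or "",
                })
    return doc


SAMPLE = """\
# My Project

> A short description of the project so an LLM has context.

## Docs

- [Getting Started](https://example.com/docs/getting-started.md): Quick setup
- [API Reference](https://example.com/docs/api.md): Full API surface

## Optional

- [Changelog](https://example.com/changelog.md): Release notes
"""

if __name__ == "__main__":
    parsed = parse_llms_txt(SAMPLE)
    print(parsed["title"], "-", parsed["summary"])
    for section, links in parsed["sections"].items():
        print(section, [link["url"] for link in links])
```

An agent can use the section names to prioritize: fetch everything under Docs, and pull the Optional links only if there is context budget left.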

Mintlify and Anthropic extended this with llms-full.txt, an inline-everything version. Instead of linking to separate files, llms-full.txt contains the full markdown of all your documentation in one document. Mintlify's breakdown of the file explains the use case: when a coding agent is reasoning about your library, it can pull one file and have all your docs in its context window. No follow-up fetches needed.
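As a rough illustration of the inline-everything idea, the sketch below concatenates a set of markdown pages into a single document. The build_llms_full helper, the heading layout, and the sample pages are assumptions made for the example; real generators may order and delimit pages differently.

```python
def build_llms_full(project, summary, pages):
    """Concatenate (title, markdown) pairs into one llms-full.txt-style string.

    Layout is an assumption: project H1, blockquote summary, then one
    H2 per page followed by that page's markdown body.
    """
    parts = [f"# {project}", "", f"> {summary}", ""]
    for title, markdown in pages:
        parts += [f"## {title}", "", markdown.strip(), ""]
    return "\n".join(parts)


if __name__ == "__main__":
    # Hypothetical page contents standing in for real docs files.
    pages = [
        ("Getting Started", "Install the package, then run the setup command.\n"),
        ("Configuration", "Set options in the config file.\n"),
    ]
    print(build_llms_full("My Project", "A short description.", pages))
```

In practice the page bodies would be read from the same markdown files that llms.txt links to, so the two files are generated together and cannot drift apart.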

Now, the part that gets misreported in SEO content:

llms.txt is not a citation signal. It does not tell ChatGPT, Claude, or Perplexity to cite you more often.
llms.txt is not a crawl directive. It does not block or invite any crawler.
llms.txt is not used by Google. Gary Illyes from Google stated publicly that Google has no plans to use it.
llms.txt does not improve your AI search ranking. There is no measurable effect on visibility in ChatGPT, Perplexity, or Claude Web because none of those products read it as a ranking input.

What it does do, well: if your audience uses coding agents to consume your docs, llms.txt makes that experience cleaner. The Anthropic docs site, Cloudflare's docs, Mintlify-hosted projects, and many open-source SDKs publish llms.txt because their docs are routinely loaded into Cursor or Claude Code by developers building integrations.

That's the real use case. It's a developer-tool feature, not a marketing feature.

What Each File Controls, Side by Side

Property	robots.txt	ai.txt	llms.txt
Primary purpose	Crawl access control	Training/licensing preference	Curated doc index for AI agents
Who reads it	All search and AI crawlers	Dataset curators, Spawning AI tools	AI coding agents (Cursor, Claude Code, etc.)
Who proposed it	Martijn Koster, 1994 (RFC 9309 in 2022)	Spawning AI	Jeremy Howard, Sept 2024
Enforcement	Voluntary, honored in practice by major operators	Voluntary, audited externally	Voluntary, agent-side decision
Current adoption	Near-universal	Low single-digit %	~10% of crawled domains (SE Ranking)
Effect on AI search visibility	Direct (allows/blocks indexing bots)	None	None
Effect on training inclusion	Direct (blocks training bots)	Signal only	None
Time to impact	Hours to days	Months (depends on dataset cadence)	Immediate for agents that support it
Maintenance burden	Low	Very low	Medium (must stay in sync with docs)

The most important row in that table is "effect on AI search visibility." Only one of these files actually moves the needle there, and it's the one that's been around for 30 years.

The Cloudflare Watershed: July 2025

A short history lesson, because it matters for what's coming.

In July 2024, Cloudflare launched a one-click toggle to block AI bots, scrapers, and crawlers for any site on their network. It was framed as "Declaring Your AIndependence." It was opt-in. Many sites adopted it quickly, especially publishers.

A year later, on July 1, 2025, Cloudflare flipped the default. New domains added to Cloudflare now block AI crawlers by default. Existing customers were given a one-click upgrade. Cloudflare called it a "permission-based" model: AI operators have to negotiate access rather than scrape by default.

Cloudflare sits in front of roughly 20% of the public web. Their move effectively converted a substantial chunk of the internet from open-by-default to closed-by-default for AI training.

Some numbers from Cloudflare's own data for H2 2025:

416 billion AI bot requests logged across the network.
GPTBot traffic up 147% year-over-year, indicating OpenAI is fetching more aggressively even as more sites block.
Meta-ExternalAgent traffic up 843% YoY, the largest growth of any AI crawler in their dataset.
2.5 million sites opted into Cloudflare's managed robots.txt for AI, where Cloudflare maintains the bot list for you.

The "managed robots.txt" detail hints at where the ecosystem is going: bot lists change too fast for individual sites to maintain. A new AI startup launches every month, each with its own user-agent. Increasingly, sites delegate to an infrastructure layer that maintains the list centrally.

If you're on Cloudflare and haven't checked your bot management settings since 2024, check them. The default has changed under you.

The Adoption Reality Check

It's tempting, reading SEO Twitter, to think llms.txt is everywhere. It isn't.

SE Ranking analyzed over 300,000 domains in early 2026 and found llms.txt adoption sits around 10% (and skews heavily toward technical and developer-facing sites). Presenc.ai's State of llms.txt 2026 report found similar numbers, with adoption concentrated in SaaS docs, AI tooling companies, and open-source projects.

A few patterns from the data:

Documentation-heavy SaaS leads adoption. Anthropic, Cursor, Mintlify, Vercel, Cloudflare, and Supabase almost all publish llms.txt and llms-full.txt.
Marketing and content sites lag. News outlets, blogs, and B2B marketing sites mostly don't have llms.txt. The use case is weaker there because the audience isn't coding agents.
Adoption is growing, slowly. Roughly doubling year over year, but from a small base.
Support among agents is partial. Cursor and Claude Code support reading llms.txt when a user references a domain. Most other agents either don't read it or use it only as a fallback.

The honest take: llms.txt is a real spec with a real, narrow use case. It's not a hidden ranking factor. It's not a substitute for good docs. It's a convenience file for one specific audience. The same applies to ai.txt, more bluntly. Outside creator-heavy verticals, adoption is small. robots.txt remains the only file in this set that genuinely controls something at scale.

What to Actually Do: A Pragmatic Setup

A framework that covers most operators:

Step 1: Decide your AI training posture. Content-first (publisher, blog, news, education)? You probably want to block training bots and allow search bots. SaaS or product-led? You probably want to be in the training data because it helps brand visibility in model outputs.

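Step 2: Write a deliberate robots.txt. Don't copy-paste from random gists. Pick from the user-agent table above and write the directives explicitly. Then test it. Keep in mind that curl -A "GPTBot" only shows whether your server or CDN rejects that user agent; robots.txt is advisory, so the server still returns the page. To confirm the rules themselves say what you intend, run the file through a robots.txt parser or validator.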

Step 3: Add ai.txt if licensing matters. Five minutes, zero cost. If you ever need to demonstrate that you expressed non-consent for training, an ai.txt on file is useful. If you don't care, skip it.

Step 4: Add llms.txt only if you have docs and an agent audience. Open-source library, developer-platform SaaS, or any product integrated into other people's code via AI assistants? Publish llms.txt and ideally llms-full.txt. Marketing site, content blog, non-technical SaaS? The file gives you nothing.

Step 5: If you're on Cloudflare, configure once at the edge. Their bot management gives you a centrally maintained block list. For most operators that's better than maintaining robots.txt by hand.

Step 6: Watch your logs. AI crawlers respect robots.txt mostly, but not perfectly. Periodically tail your access logs for the user agents above and confirm behavior matches your config. If a bot you blocked is still hitting you, file a complaint with the operator.
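The log check in Step 6 can be as simple as counting hits per AI user agent. The sketch below assumes access logs in the common "combined" format, where the user agent is the last double-quoted field; the sample lines are fabricated for illustration and the count_ai_hits helper is hypothetical. Google-Extended and Applebot-Extended are left out of the list on the assumption that they are robots.txt control tokens rather than strings that appear in request headers.

```python
import re
from collections import Counter

# Subset of the user-agent table above that shows up in request headers.
AI_AGENTS = [
    "GPTBot", "OAI-SearchBot", "ChatGPT-User", "ClaudeBot",
    "Claude-SearchBot", "CCBot", "Meta-ExternalAgent",
    "Bytespider", "PerplexityBot",
]

# In combined log format the user agent is the last double-quoted field.
UA_RE = re.compile(r'"([^"]*)"\s*$')


def count_ai_hits(log_lines):
    """Count requests per AI crawler, matching on the user-agent field."""
    counts = Counter()
    for line in log_lines:
        match = UA_RE.search(line)
        if not match:
            continue
        user_agent = match.group(1).lower()
        for agent in AI_AGENTS:
            if agent.lower() in user_agent:
                counts[agent] += 1
                break
    return counts


# Fabricated sample lines; IPs come from documentation-only ranges.
SAMPLE_LOG = [
    '203.0.113.5 - - [10/Jan/2026:10:00:00 +0000] "GET /post HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; GPTBot/1.2)"',
    '203.0.113.5 - - [10/Jan/2026:10:00:02 +0000] "GET /about HTTP/1.1" 200 300 "-" "Mozilla/5.0 (compatible; GPTBot/1.2)"',
    '198.51.100.7 - - [10/Jan/2026:10:01:00 +0000] "GET / HTTP/1.1" 200 900 "-" "Mozilla/5.0 (compatible; ClaudeBot/1.0)"',
    '192.0.2.9 - - [10/Jan/2026:10:02:00 +0000] "GET / HTTP/1.1" 200 900 "-" "Mozilla/5.0 (Windows NT 10.0) Firefox/120.0"',
]

if __name__ == "__main__":
    for agent, hits in count_ai_hits(SAMPLE_LOG).most_common():
        print(f"{agent:20} {hits}")
```

If a bot you disallowed shows a non-zero count on paths it should not touch, that is the signal to investigate further, bearing in mind that the user-agent string alone can be spoofed.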

What you don't need to do: agonize over llms.txt for SEO. It will not affect your AI search visibility. It will not make ChatGPT cite you.

Edge Cases: Cloudflare AI Audit, Pay-Per-Crawl, Verified Bots

A few features worth knowing about, mostly because they hint at where the ecosystem is going.

Cloudflare AI Audit. A dashboard view of which AI bots are hitting your site, how often, and where they're going. Free for Cloudflare customers. Useful for spotting a new bot you haven't seen before and for verifying that bots you blocked are actually staying out.

Cloudflare Pay-Per-Crawl. Announced in mid-2025, this lets site owners charge AI crawlers per request rather than block them outright. The model is early and adoption is limited, but it points at a future where access negotiation is automated rather than binary (block / allow).

Verified Bot program. Both Cloudflare and Google maintain registries that confirm a user-agent string actually belongs to the claimed operator. This matters because spoofing is common: a scraper can set User-Agent: GPTBot and pretend to be OpenAI. Verified bot programs check source IPs against the operator's published ranges. If you're seeing GPTBot traffic from non-OpenAI IPs, it's a spoofer, and blocking by IP is the right response.

The "agentic browse" question. When ChatGPT or Claude fetches a page on a user's behalf, that fetch uses a different user agent (ChatGPT-User, Claude-User). Blocking these means the model can't read links that users paste into it, which usually isn't what publishers want. Keep agentic-browse bots allowed unless you have a specific reason to block.
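The IP check behind verified-bot programs, described above, boils down to testing whether a request's source address falls inside the operator's published ranges. Here is a minimal sketch using Python's standard ipaddress module. The CIDR blocks below are placeholders from reserved documentation ranges, not any operator's real ranges, and ip_in_ranges is a hypothetical helper; in practice you would load the list the operator actually publishes.

```python
import ipaddress


def ip_in_ranges(ip, cidr_ranges):
    """Return True if ip falls inside any of the given CIDR blocks."""
    addr = ipaddress.ip_address(ip)
    return any(addr in ipaddress.ip_network(cidr) for cidr in cidr_ranges)


# Placeholder ranges (reserved documentation blocks), NOT real operator ranges.
PUBLISHED_RANGES = ["192.0.2.0/24", "198.51.100.0/28"]

if __name__ == "__main__":
    # Hypothetical source IPs of requests claiming to be GPTBot.
    for source_ip in ["192.0.2.44", "203.0.113.9"]:
        verdict = "matches published range" if ip_in_ranges(source_ip, PUBLISHED_RANGES) else "possible spoofer"
        print(source_ip, verdict)
```

Pairing this with a log scan gives a simple triage: requests that claim an AI user agent but come from outside the published ranges are the ones worth blocking by IP.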

Where This Is Headed

A few honest forecasts for the next 18 months:

A standard is forming, and it's not llms.txt. The IETF AI Preferences Working Group (AIPREF) is drafting a more comprehensive standard for AI training and usage preferences. It's likely to formalize the ai.txt-style "express your preferences" model with proper machine-readable semantics. Once it lands as an RFC, it'll probably absorb the use cases ai.txt is currently filling.

Pay-per-crawl spreads. Cloudflare won't be the only platform offering it. Expect Akamai, Fastly, and the cloud CDNs to launch similar mechanisms. The world where every AI crawler has a metered relationship with every site is plausible by 2027.

Bot lists go centralized. Maintaining your own list of AI user-agents was reasonable in 2023, with maybe a dozen names to track. It's now closer to 40 and growing. Most operators will end up trusting an infrastructure layer to keep the list current.

llms.txt persists in its niche. It won't go away. It also won't become a ranking factor. It'll continue serving the agentic-tool audience and likely formalize into a more standardized spec once enough agents support it.

The meta-pattern: the open-by-default web is being slowly replaced by a permission-based web for AI traffic, mediated by infrastructure platforms rather than per-site configs. robots.txt is the legacy interface to that world. ai.txt and llms.txt are early attempts at richer signaling. The IETF and CDN industry are quietly working on the version that'll actually scale.

Frequently Asked Questions

Does Google read my llms.txt file?

No. Gary Illyes from Google publicly stated in 2025 that Google has no plans to use llms.txt as an input for any product. Adding llms.txt does not affect Google Search, Gemini, or AI Overviews. If you want to influence Google's AI products, the relevant signal is the Google-Extended user agent in robots.txt and the standard search index, not llms.txt.

Should I block all AI crawlers via robots.txt?

It depends on what kind of site you run. Publishers and content-first sites often block training bots (GPTBot, ClaudeBot, Google-Extended, CCBot, Meta-ExternalAgent, Bytespider, Applebot-Extended) while allowing search and user-fetch bots (OAI-SearchBot, Claude-SearchBot, PerplexityBot, ChatGPT-User). SaaS and product sites usually leave everything open because being in training data helps brand visibility. A blanket block of every AI bot is rarely the right choice for non-publishers, because it costs you AI-driven discovery.

Is ai.txt actually supported by anyone?

Spawning AI honors it, as do a handful of dataset curators and ethical-AI projects. Major model trainers (OpenAI, Anthropic, Google, Meta) primarily honor robots.txt, not ai.txt. So ai.txt is a useful signaling layer for the "we expressed non-consent" posture, but it shouldn't be relied on as access control. Pair it with robots.txt for actual blocking.

What's the difference between llms.txt and llms-full.txt?

llms.txt is an index file: a short list of links pointing to markdown versions of your docs. llms-full.txt is the inlined version: all your docs concatenated into one large markdown file. The trade-off is bandwidth versus convenience. llms.txt is light to fetch but requires the agent to follow links. llms-full.txt is heavy but lets an agent load your entire docs into context with a single request. Most projects that publish one publish both.

If I block GPTBot in robots.txt, does that block ChatGPT browsing too?

No. GPTBot is OpenAI's training crawler. ChatGPT-User is the user agent ChatGPT uses when a user explicitly asks it to read a webpage. They are separate user agents in robots.txt. Blocking GPTBot opts you out of training. ChatGPT-User remains allowed unless you block it separately. Most publishers want this exact split: block training, allow user-initiated fetches.

Will llms.txt help me rank in ChatGPT or Perplexity?

No, not as a citation or ranking signal. ChatGPT and Perplexity surface content based on what they've indexed via their search crawlers (OAI-SearchBot, PerplexityBot) and on training data. llms.txt is read by coding agents like Cursor and Claude Code, not by the chat products. If you want to be cited in ChatGPT, the priorities are: (1) keep OAI-SearchBot unblocked in robots.txt, (2) publish content that answers specific questions clearly, and (3) earn citations from sources those models trust. llms.txt isn't on that list.

Closing Thoughts

What frustrates me about the current discourse around AI crawler control is how confidently bad the advice is. "Add llms.txt and you'll rank in ChatGPT." "Block everything via ai.txt." "robots.txt is dead, llms.txt is the future." Each of these is wrong in a different direction.

The truth is duller and more useful: robots.txt still does the real work. ai.txt expresses a preference that some operators honor. llms.txt is a developer-tool convenience for a specific audience. None of them is a magic ranking lever, and treating them like one wastes time you could spend on things that actually matter.

If you remember nothing else, remember the three jobs. robots.txt is the access gate. ai.txt is the licensing signal. llms.txt is the developer index. Configure each for what it actually does, ignore the rest of the noise, and you'll be ahead of most operators currently chasing trends without understanding them.

And keep an eye on AIPREF. The next year or two of AI crawler control is going to be shaped less by these three files and more by what the IETF and the CDN industry standardize next. The current state is a stopgap.