WHY MOST ROBOTS.TXT FILES ARE WRONG.
Most robots.txt files I see in audits are some combination of: copied from a 2023 template (so missing the 2024-2026 bots), mangled in copy-paste (broken line structure, so silently invalid), referencing deprecated bot names (so doing nothing), or carrying a blanket User-agent: * + Disallow: / left over from a staging environment that nobody updated for production.
The current 2026 list of AI-relevant user-agents is straightforward, but it has changed enough since 2024 that any older template is partially out of date. Here is the active list, the deprecated list, and four ready-to-paste configurations.
THE CURRENT BOT LIST.
The user-agents that matter in 2026, organised by purpose:
- Search / citation: OAI-SearchBot, ChatGPT-User, Claude-SearchBot, Claude-User, PerplexityBot, Perplexity-User, Google-Extended, Applebot-Extended, Bingbot, DuckDuckBot
- Training: GPTBot, ClaudeBot, cohere-ai, Bytespider, Meta-ExternalAgent, Amazonbot
- Mixed / general: FacebookBot, CCBot (Common Crawl; feeds many training datasets)
THE DEPRECATED IDENTIFIERS.
Still showing up in copy-pasted templates everywhere. Remove these from your config; they do nothing:
- Claude-Web: replaced by Claude-User in early 2026.
- anthropic-ai: replaced by the ClaudeBot + Claude-User split.
- Google-Extended applied as googlebot: Google-Extended is a separate identifier; do not confuse the two.
- Generic aibot, llmbot: never were real identifiers; just pattern-matching attempts that catch nothing.
CONFIG 01 - ALLOW ALL.
If you want maximum AI visibility and accept training-data exposure as the cost, this is the simplest correct config. Explicit allows make the intent obvious; the default for unmatched bots is also allow.
User-agent: *
Allow: /
Sitemap: https://yoursite.com/sitemap.xml
This is the right default for content-driven sites whose business model benefits from being cited.
CONFIG 02 - SEARCH/CITATION ONLY.
Allow the bots that drive citations; block the bots that train models. This is the most common policy for sites whose content is editorially protected or licensed.
- Allow: OAI-SearchBot, ChatGPT-User, Claude-SearchBot, Claude-User, PerplexityBot, Perplexity-User, Google-Extended, Bingbot, Applebot-Extended
- Disallow: GPTBot, ClaudeBot, cohere-ai, Bytespider, Meta-ExternalAgent, CCBot
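Assembled as a full file, the policy above can be sketched like this. Grouping several User-agent lines over one rule set is valid under the Robots Exclusion Protocol (RFC 9309); yoursite.com is a placeholder.

User-agent: OAI-SearchBot
User-agent: ChatGPT-User
User-agent: Claude-SearchBot
User-agent: Claude-User
User-agent: PerplexityBot
User-agent: Perplexity-User
User-agent: Google-Extended
User-agent: Bingbot
User-agent: Applebot-Extended
Allow: /

User-agent: GPTBot
User-agent: ClaudeBot
User-agent: cohere-ai
User-agent: Bytespider
User-agent: Meta-ExternalAgent
User-agent: CCBot
Disallow: /

User-agent: *
Allow: /

Sitemap: https://yoursite.com/sitemap.xml

The trailing User-agent: * group makes the default explicit for any bot not named in either list.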
CONFIG 03 - NAMED BOTS ONLY.
Block everything by default; allow only an explicit named list. Highest control, requires maintenance every time a new AI bot launches.
Block everything by default:
User-agent: *
Disallow: /
Then add an explicit User-agent: <name> + Allow: / block for each bot you want to allow. This is the right policy for sites with strict content-licensing constraints (paywalled news, regulated industries with publication restrictions).
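A minimal sketch of this policy, assuming OAI-SearchBot and Claude-SearchBot are the two bots you choose to allow (swap in your own list):

User-agent: *
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: Claude-SearchBot
Allow: /

Note that the catch-all Disallow also covers traditional search crawlers, so add Bingbot and friends to the named list if you still want regular search traffic.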
CONFIG 04 - BLOCK ALL AI.
Block every AI bot explicitly. Note: this also blocks AI search citations entirely.
- Disallow each named AI bot explicitly. Wildcard * patterns are unreliable for AI bots; use named identifiers.
- Keep User-agent: Bingbot and the other traditional search-engine bots explicitly allowed if you still want Google/Bing search traffic.
- Right for sites with explicit no-AI policies for legal or licensing reasons. Rare. Verify the business case before applying.
MAINTENANCE SCHEDULE.
AI bots change quarterly. Set a calendar reminder every 90 days to: re-check the active bot list, remove deprecated identifiers from your config, add any new named bots, and re-run the curl detection from article N. 09 to verify the policy is actually being enforced. Five minutes of work that saves you the kind of silent-failure debt that kills AI visibility over time.
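Between quarterly checks, you can sanity-check what a config actually says with Python's stdlib robots.txt parser. A minimal sketch, using a trimmed Config-02-style policy and placeholder URLs:

```python
from urllib import robotparser

# Trimmed Config-02-style policy: block one training bot, allow everyone else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# The training bot should be blocked...
print(rp.can_fetch("GPTBot", "https://yoursite.com/article"))        # False
# ...while a bot not named in the file falls through to the * allow.
print(rp.can_fetch("OAI-SearchBot", "https://yoursite.com/article")) # True
```

This only verifies what the file declares; whether a given bot actually honors it is the separate enforcement question the curl detection covers.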