BOT ACCESS POLICY.

There is no single right answer to whether you should allow AI bots. What follows is a decision framework built on five inputs: business type, content sensitivity, training-data tolerance, desired citation outcome, and bandwidth cost.

Policy tiers: 4 · Decision inputs: 5 · Default: allow · Edge cases: several

THERE IS NO SINGLE RIGHT ANSWER.

The default SEO advice is "allow all AI bots." The default privacy advice is "block all AI bots." Both are wrong as universal answers; both are right for specific business types. The correct policy depends on five inputs, weighed in combination.

This is the framework I use in audits when a client asks "should we open up to AI?" The decision is rarely obvious until the inputs are explicit.

Input 01

BUSINESS TYPE.

Content publishers (media, blogs, research, B2B SaaS marketing): AI citations are top-of-funnel traffic. Allow is the default.

E-commerce: AI citations drive consideration-stage traffic. Allow is the default for marketing pages; block or disallow cart, checkout, and account paths.

Service businesses (law, medical, accounting, consulting): AI citations help discovery. Allow content pages; block or disallow client-portal areas.

Paywalled / subscription content: complex. Most major paywalled publishers (NYT, Bloomberg, FT) negotiate licensing deals separately and block training while allowing search citation. Pattern: allow OAI-SearchBot, ChatGPT-User, Claude-SearchBot, Claude-User; block GPTBot, ClaudeBot, CCBot, Bytespider.
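
A minimal robots.txt sketch of that pattern, using the bot tokens named above (adjust paths to your paywall structure):

    # Allow search / citation fetchers
    User-agent: OAI-SearchBot
    Allow: /

    User-agent: ChatGPT-User
    Allow: /

    User-agent: Claude-SearchBot
    Allow: /

    User-agent: Claude-User
    Allow: /

    # Block training crawlers
    User-agent: GPTBot
    Disallow: /

    User-agent: ClaudeBot
    Disallow: /

    User-agent: CCBot
    Disallow: /

    User-agent: Bytespider
    Disallow: /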

Regulated content: legal/medical/financial advice with publication restrictions. Often requires explicit licensing review before allowing AI training. Default to block training; allow citation if professional rules permit.

Input 02

CONTENT SENSITIVITY.

Different sensitivity tiers map to different policies:

  • Public marketing / educational: low sensitivity. Allow all by default.
  • Customer data, account information: never AI-accessible. Block all bots; serve auth-required pages with proper auth, not just noindex.
  • Pricing, internal tools, beta features: medium sensitivity. Disallow in robots.txt; rely on access control as the real protection (sketch after this list).
  • Forum / user-generated content: depends on your TOS with users. Most large platforms allow AI scraping; some offer per-user opt-outs.
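
A sketch of the medium-sensitivity tier in robots.txt (the paths are illustrative; substitute your own cart, account, and internal routes):

    # Advisory only: this keeps honest bots out of these paths,
    # but authentication is the real protection.
    User-agent: *
    Disallow: /cart/
    Disallow: /checkout/
    Disallow: /account/
    Disallow: /internal/
    Disallow: /beta/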

Input 03

TRAINING-DATA TOLERANCE.

How comfortable are you with your content training future AI models?

Comfortable: allow GPTBot, ClaudeBot, CCBot, Bytespider, Meta-ExternalAgent. Default for most content publishers.

Mixed (allow citation, block training): the most common policy for premium content. Block training-only bots, allow citation/search bots. See article No. 10, Config 02, for the specific lines.

Block all training: required for content with strict licensing, regulated content, or content under unresolved litigation. Use a named-bot blocklist (article No. 10, Config 03); a sketch follows.
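
A hedged sketch of such a blocklist, assuming the bots named in this article (illustrative, not the referenced Config 03; note that Google-Extended and Applebot-Extended are opt-out tokens read by Google's and Apple's existing crawlers rather than separate bots):

    # Named training crawlers
    User-agent: GPTBot
    Disallow: /

    User-agent: ClaudeBot
    Disallow: /

    User-agent: CCBot
    Disallow: /

    User-agent: Bytespider
    Disallow: /

    User-agent: Meta-ExternalAgent
    Disallow: /

    # Training opt-out tokens
    User-agent: Google-Extended
    Disallow: /

    User-agent: Applebot-Extended
    Disallow: /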

Input 04

CITATION OUTCOME.

What do you want to happen when an AI cites you?

  • Maximum citation visibility: allow all citation/search bots (OAI-SearchBot, Claude-SearchBot, PerplexityBot, ChatGPT-User, Claude-User, Perplexity-User, Google-Extended, Applebot-Extended), even while blocking training crawlers. One caveat: Google-Extended and Applebot-Extended also govern training use for Google and Apple, so allowing them partially trades away a training block.
  • Selective citation: allow only the surfaces your audience uses; a sketch follows this list. (Western B2B audience? Likely ChatGPT + Perplexity matter most. East Asian audience? Bytespider + Doubao matter more.)
  • Zero citation visibility: rare. Block all AI bots, citation bots included. Verify there is a real reason: blocking citation bots removes your site from AI answers without removing the topic, and "we don't want our content cited" is sometimes a misunderstanding of how AI citations work.
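
A selective-citation sketch for the hypothetical Western B2B case (keep the ChatGPT and Perplexity surfaces, name-block the Anthropic fetchers; which tokens to keep is an audience judgment, not a fixed recipe):

    User-agent: OAI-SearchBot
    Allow: /

    User-agent: ChatGPT-User
    Allow: /

    User-agent: PerplexityBot
    Allow: /

    User-agent: Perplexity-User
    Allow: /

    User-agent: Claude-SearchBot
    Disallow: /

    User-agent: Claude-User
    Disallow: /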

Input 05

BANDWIDTH AND COST.

AI bots can be aggressive crawlers, especially Bytespider and PerplexityBot. For high-traffic sites the extra load is noise; for low-traffic sites on tight bandwidth budgets, the cost matters.

The fix is rarely "block them." The fix is rate-limiting at the Cloudflare or origin level: 60-120 requests/minute per IP is plenty for any AI bot's normal crawl. Hard 429s on bursts cause the bot to back off without cutting visibility.
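
A minimal origin-level sketch, assuming nginx and the thresholds above (zone name, bot list, and burst size are illustrative; the map and limit_req_zone directives go in the http block):

    # Key only AI-crawler traffic by client IP; everyone else gets an
    # empty key and is never rate-limited.
    map $http_user_agent $ai_bot_key {
        default                                               "";
        "~*(GPTBot|ClaudeBot|CCBot|Bytespider|PerplexityBot)" $binary_remote_addr;
    }

    limit_req_zone $ai_bot_key zone=aibots:10m rate=100r/m;  # ~100 requests/minute per IP
    limit_req_status 429;                                    # hard 429 so bots back off

    server {
        location / {
            limit_req zone=aibots burst=20 nodelay;          # small burst allowance, then 429
            # ... normal site config ...
        }
    }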

THE FOUR POLICIES.

Synthesised from the five inputs, the four tiers:

  • Allow all: every AI bot, training and citation alike. The right default for most content publishers.
  • Allow citation, block training: the premium-content pattern; citation/search bots in, training crawlers out.
  • Named allowlist: only the bots serving the surfaces your audience actually uses; everything else name-blocked.
  • Block all: every AI bot blocked, citation bots included. Reserved for strict licensing, regulation, or litigation.

THE BOTTOM LINE.

There is no universal correct policy. Run the five inputs as a checklist before changing your robots.txt. "Allow all" is the right default for ~70% of sites; "allow citation, block training" is right for ~20%; the named-allowlist and block-all policies are correct for the remaining ~10% combined. Knowing which bucket you are in is the entire question.

Stop Guessing What AI Sees

MEASURE THE LEVERS THAT ACTUALLY EXIST.

If you want this methodology applied to your specific site - your real logs, your real citation data, your real fix list - the audit is the productized way to do it.