AI Crawler User-Agent Reference & Markdown-Serving Strategy

Tue Apr 14 2026 · Nitin Bansal

Compiled May 2026 from SEJ, Cloudflare Radar, Evil Martians, CaptainDNS, and live server-log analysis.

Part 1: The 19+ Most Common AI Crawlers

OpenAI Family

Token (robots.txt)	Full User-Agent	Purpose	Crawl Rate	IP Verify
GPTBot	`Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.3; +https://openai.com/gptbot)`	Training GPT models	~100 pages/hr	openai.com/gptbot.json
ChatGPT-User	`Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; ChatGPT-User/1.0; +https://openai.com/bot`	User-triggered browsing from ChatGPT	~2400 pages/hr	openai.com/chatgpt-user.json
OAI-SearchBot	`Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36; compatible; OAI-SearchBot/1.3; +https://openai.com/searchbot`	ChatGPT Search indexing	~150 pages/hr	openai.com/searchbot.json
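The "IP Verify" column points at published address lists, which matter because a User-Agent string is trivial to spoof. A minimal sketch of the check follows. It assumes the list has the shape `{ prefixes: [{ ipv4Prefix: "a.b.c.d/n" }] }` (modeled on Google's googlebot.json; confirm each vendor's actual format), handles IPv4 only, and uses hypothetical function names. The sample addresses are RFC 5737 documentation ranges, not real crawler IPs.

```javascript
// Convert a dotted-quad IPv4 address to an unsigned 32-bit integer.
function ipv4ToInt(ip) {
  const parts = ip.split('.');
  const valid = parts.length === 4 &&
    parts.every((p) => /^\d{1,3}$/.test(p) && Number(p) <= 255);
  if (!valid) throw new Error(`Not an IPv4 address: ${ip}`);
  const [a, b, c, d] = parts.map(Number);
  return (((a << 24) >>> 0) + (b << 16) + (c << 8) + d) >>> 0;
}

// True when `ip` falls inside the CIDR block, e.g. "192.0.2.0/24".
function ipv4InCidr(ip, cidr) {
  const [base, bitsText] = cidr.split('/');
  const bits = bitsText === undefined ? 32 : Number(bitsText);
  if (!Number.isInteger(bits) || bits < 0 || bits > 32) {
    throw new Error(`Bad prefix length in ${cidr}`);
  }
  // JavaScript shifts are modulo 32, so a /0 mask needs a special case.
  const mask = bits === 0 ? 0 : (0xffffffff << (32 - bits)) >>> 0;
  return ((ipv4ToInt(ip) & mask) >>> 0) === ((ipv4ToInt(base) & mask) >>> 0);
}

// ASSUMPTION: rangeDoc looks like { prefixes: [{ ipv4Prefix }, { ipv6Prefix }] }.
// IPv6 entries are skipped in this sketch.
function isPublishedCrawlerIp(ip, rangeDoc) {
  return rangeDoc.prefixes
    .filter((p) => typeof p.ipv4Prefix === 'string')
    .some((p) => ipv4InCidr(ip, p.ipv4Prefix));
}

// Usage with made-up documentation ranges.
const sampleRanges = {
  prefixes: [{ ipv4Prefix: '192.0.2.0/24' }, { ipv4Prefix: '198.51.100.0/28' }],
};
console.log(isPublishedCrawlerIp('192.0.2.77', sampleRanges));
console.log(isPublishedCrawlerIp('198.51.100.16', sampleRanges));
```

In practice the range file would be fetched and cached periodically rather than hard-coded, since vendors update these lists.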

Anthropic Family

Token (robots.txt)	Full User-Agent	Purpose	Crawl Rate	IP Verify
ClaudeBot	`Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; ClaudeBot/1.0; +claudebot@anthropic.com)`	Training Claude models	~500 pages/hr	docs.claude.com/en/api/ip-addresses
Claude-User	`Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Claude-User/1.0; +Claude-User@anthropic.com)`	User-triggered browsing from Claude	<10 pages/hr	Not published
Claude-SearchBot	`Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Claude-SearchBot/1.0; +https://www.anthropic.com)`	Claude Search indexing	<10 pages/hr	Not published
anthropic-ai	`Mozilla/5.0 (compatible; anthropic-ai/1.0; +https://www.anthropic.com)`	Legacy training crawler	Variable	Not published

Perplexity Family

Token (robots.txt)	Full User-Agent	Purpose	Crawl Rate	IP Verify
PerplexityBot	`Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; PerplexityBot/1.0; +https://perplexity.ai/perplexitybot)`	Answer engine indexing	~150 pages/hr	perplexity.com/perplexitybot.json
Perplexity-User	`Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Perplexity-User/1.0; +https://perplexity.ai/perplexity-user)`	User-triggered browsing	<10 pages/hr	perplexity.com/perplexity-user.json

Google Family

Token (robots.txt)	Full User-Agent	Purpose	IP Verify
Google-Extended	(Uses Googlebot UA — this is a purpose token, not a separate crawler)	Controls Gemini AI training use of Googlebot data	googlebot.json
Google-CloudVertexBot	`Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMP29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/141.0.7390.122 Mobile Safari/537.36 (compatible; Google-CloudVertexBot; +https://cloud.google.com/enterprise-search)`	Vertex AI Agent Builder	Same IP range
Gemini-Deep-Research	`Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Gemini-Deep-Research; +https://gemini.google/overview/deep-research/) Chrome/135.0.0.0 Safari/537.36`	Gemini deep research agent	Same IP range
GoogleAgent-Mariner	`GoogleAgent-Mariner/1.0`	Project Mariner agentic browser	Same IP range

Meta Family

Token (robots.txt)	Full User-Agent	Purpose	Crawl Rate
meta-externalagent	`meta-externalagent/1.1 (+https://developers.facebook.com/docs/sharing/webmasters/crawler)`	AI training for Llama models	~1100 pages/hr
Meta-WebIndexer	`meta-webindexer/1.1 (+https://developers.facebook.com/docs/sharing/webmasters/crawler)`	Meta AI search indexing	<10 pages/hr

Other Major Players

Token (robots.txt)	Full User-Agent	Company	Purpose
Bytespider	`Mozilla/5.0 (Linux; Android 5.0) AppleWebKit/537.36 (KHTML, like Gecko) Mobile Safari/537.36 (compatible; Bytespider; https://zhanzhang.toutiao.com/)`	ByteDance	TikTok AI training
Amazonbot	`Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Amazonbot/0.1; +https://developer.amazon.com/support/amazonbot) Chrome/119.0.6045.214 Safari/537.36`	Amazon	Alexa/AI training
Applebot-Extended	`Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.4 Safari/605.1.15 (Applebot/0.1; +http://www.apple.com/go/applebot)`	Apple	Apple Intelligence training
DuckAssistBot	`DuckAssistBot/1.2; (+http://duckduckgo.com/duckassistbot.html)`	DuckDuckGo	AI answer feature
CCBot	`CCBot/2.0 (https://commoncrawl.org/faq/)`	Common Crawl	Open training dataset
cohere-ai	`cohere-ai/1.0`	Cohere	Model training
AI2Bot	`AI2Bot/1.0`	Allen Institute	Semantic Scholar
Diffbot	`Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1.2) Gecko/20090729 Firefox/3.5.2 (.NET CLR 3.5.30729; Diffbot/0.1; +http://www.diffbot.com)`	Diffbot	Structured data extraction
MistralAI-User	`Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; MistralAI-User/1.0; +https://docs.mistral.ai/robots)`	Mistral	Le Chat citations
YouBot	`YouBot/1.0`	You.com	AI search

Hard-to-Identify Crawlers

Entity	Issue
xAI Grok	Documented agents (`GrokBot/1.0`, `xAI-Grok/1.0`, `Grok-DeepSearch/1.0`) rarely seen; reportedly uses iPhone UA strings
ChatGPT Atlas	Uses standard Chrome UA — indistinguishable from real users
OpenAI Operator	No declared UA; appears as Chrome from remote browser
Bing Copilot	No declared crawler UA; uses Bingbot's data
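The tokens in the tables above can be matched against an incoming User-Agent header to label traffic. The sketch below covers only a subset of the listed tokens; the `role` labels and function name are illustrative assumptions, not vendor terminology. Agents in the hard-to-identify table fall through as unknown, and because User-Agent strings can be spoofed, the result should be treated as a hint.

```javascript
// Subset of the tokens listed above; extend as needed.
// The `role` values are this sketch's own labels.
const AI_AGENT_TOKENS = [
  { token: 'GPTBot', vendor: 'OpenAI', role: 'training' },
  { token: 'OAI-SearchBot', vendor: 'OpenAI', role: 'search' },
  { token: 'ChatGPT-User', vendor: 'OpenAI', role: 'user' },
  { token: 'ClaudeBot', vendor: 'Anthropic', role: 'training' },
  { token: 'Claude-SearchBot', vendor: 'Anthropic', role: 'search' },
  { token: 'Claude-User', vendor: 'Anthropic', role: 'user' },
  { token: 'PerplexityBot', vendor: 'Perplexity', role: 'search' },
  { token: 'Perplexity-User', vendor: 'Perplexity', role: 'user' },
  { token: 'meta-externalagent', vendor: 'Meta', role: 'training' },
  { token: 'Bytespider', vendor: 'ByteDance', role: 'training' },
  { token: 'CCBot', vendor: 'Common Crawl', role: 'training' },
  { token: 'MistralAI-User', vendor: 'Mistral', role: 'user' },
];

// Returns the matching entry, or null for unknown or browser-like agents.
function classifyUserAgent(userAgent) {
  const ua = String(userAgent ?? '').toLowerCase();
  return AI_AGENT_TOKENS.find((e) => ua.includes(e.token.toLowerCase())) ?? null;
}

const gptbotUa =
  'Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.3; +https://openai.com/gptbot)';
console.log(classifyUserAgent(gptbotUa));
```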

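Part 2: Crawler Traffic Share

Ranked by share of all AI and search crawler traffic in May 2025, with year-over-year change in percentage points (pp):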

Rank	Bot	Share May '25	YoY Change
1	Googlebot	50%	+20pp
2	Bingbot	8.7%	-1.3pp
3	GPTBot	7.7%	+5.5pp (+305% req)
4	ClaudeBot	5.4%	-6.3pp
5	GoogleOther	4.3%	-0.1pp
6	Amazonbot	4.2%	-3.4pp
7	Googlebot-Image	3.3%	-1.2pp
8	Bytespider	2.9%	-19.8pp (-85%)
9	Yandex	2.2%	-0.7pp
10	ChatGPT-User	1.3%	+1.2pp (+2825%)
11	Applebot	1.2%	-0.7pp
14	PerplexityBot	0.2%	+157,490%

Part 3: Accept Header & Content Negotiation

The `Accept: text/markdown` Standard

This is the correct, standards-compliant way to serve Markdown to AI agents. HTTP content negotiation (specified today in RFC 9110 and part of HTTP/1.1 since RFC 2068 in 1997) lets clients request a preferred format:

Accept: text/markdown, text/html;q=0.9
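To make the q-values in that header concrete, here is a rough sketch of how a server might decide whether the client prefers Markdown. The function names are hypothetical, and one design choice is an assumption of this sketch rather than part of the HTTP spec: Markdown is served only when `text/markdown` is named explicitly, so that `*/*` clients keep receiving HTML.

```javascript
// Parse an Accept header into [{ type, q }] entries. A missing q defaults to 1.
function parseAccept(header) {
  return String(header ?? '')
    .split(',')
    .map((part) => part.trim())
    .filter(Boolean)
    .map((part) => {
      const [type, ...params] = part.split(';').map((s) => s.trim());
      const qParam = params.find((p) => p.toLowerCase().startsWith('q='));
      const q = qParam ? Number(qParam.slice(2)) : 1;
      return { type: type.toLowerCase(), q: Number.isFinite(q) ? q : 0 };
    });
}

// Quality for a media type: exact match first, then "type/*", then "*/*".
function qualityFor(entries, mediaType) {
  const [mainType] = mediaType.split('/');
  for (const candidate of [mediaType, `${mainType}/*`, '*/*']) {
    const hit = entries.find((e) => e.type === candidate);
    if (hit) return hit.q;
  }
  return 0;
}

// ASSUMPTION: only an explicit text/markdown entry with q > 0 counts.
function prefersMarkdown(acceptHeader) {
  const entries = parseAccept(acceptHeader);
  const md = entries.find((e) => e.type === 'text/markdown');
  if (!md || md.q <= 0) return false;
  return md.q >= qualityFor(entries, 'text/html');
}

console.log(prefersMarkdown('text/markdown, text/html;q=0.9'));
console.log(prefersMarkdown('text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'));
```

The first call corresponds to the header shown above and the second to a typical browser-style header.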

Known Clients That Send `Accept: text/markdown`

Real-world data from 44 days of measurement (Suganthan, Mar-Apr 2026):

Client	Requests	Notes
Chrome Headless (RAG pipelines)	639	Automated retrieval, not real browsers
Claude infrastructure	500	Anthropic's own stack (not ClaudeBot!)
axios (node pipelines)	211	Common in AI data pipelines
curl	13	Manual/scripted
markdown.new/1.0	6	Dedicated markdown fetcher
qodercli/1.0	3	AI coding tool (Qoder)
MarkdownWorker/1.0	1	Another dedicated markdown fetcher

Key finding: Tools like Claude Code, Cursor, and coding assistants actively send Accept: text/markdown. Claude's own infrastructure made up 35% of all markdown requests. The ecosystem is growing in real time.

Cloudflare Markdown for Agents

Cloudflare can auto-convert HTML → Markdown on the fly when a client sends Accept: text/markdown. Response includes:

Content-Type: text/markdown; charset=utf-8
Vary: Accept (for CDN caching separation)
x-markdown-tokens (estimated token count)
Content-Signal: ai-train=yes, search=yes, ai-input=yes

Limitations: Chunked transfer encoding (no Content-Length) or payload >2MB causes silent fallback to HTML.
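Because the fallback is silent, a client or monitoring script may want to confirm what actually came back. The following is a small sketch that inspects the response headers listed above; the function name is hypothetical, and treating `x-markdown-tokens` as a plain integer is an assumption.

```javascript
// Summarize whether a response was served as Markdown.
// `headers` is a Fetch API Headers object (global in Node 18+ and browsers).
function describeMarkdownResponse(headers) {
  const contentType = (headers.get('content-type') ?? '').toLowerCase();
  const tokenHeader = headers.get('x-markdown-tokens');
  const vary = (headers.get('vary') ?? '')
    .toLowerCase()
    .split(',')
    .map((s) => s.trim());
  return {
    isMarkdown: contentType.startsWith('text/markdown'),
    // ASSUMPTION: the header carries an integer estimate.
    estimatedTokens: tokenHeader === null ? null : Number(tokenHeader),
    variesOnAccept: vary.includes('accept'),
  };
}

// Usage with hand-built headers: a converted response and a silent HTML fallback.
const convertedHeaders = new Headers({
  'Content-Type': 'text/markdown; charset=utf-8',
  'Vary': 'Accept',
  'x-markdown-tokens': '725',
});
const fallbackHeaders = new Headers({ 'Content-Type': 'text/html; charset=utf-8' });
console.log(describeMarkdownResponse(convertedHeaders));
console.log(describeMarkdownResponse(fallbackHeaders));
```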

Traditional AI Crawler Accept Headers

Most major AI crawlers do not send Accept: text/markdown yet. They typically send standard browser-like Accept headers:

Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8

Major crawlers (GPTBot, ClaudeBot, PerplexityBot, etc.) have not been observed requesting /llms.txt or .md routes unprompted. User-initiated AI tools (someone pasting your URL into ChatGPT or Claude) and coding agents (Cursor, Claude Code) do use content negotiation, and that is where the value is today.

Part 4: How to Serve Markdown Instead of HTML to AI Clients

Strategy 1: Content Negotiation (Recommended, Standards-Compliant)

Serve Markdown when Accept: text/markdown is present, HTML otherwise:

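// Middleware example (Fetch API style). markdownContent and htmlContent are
// placeholders for the rendered bodies of the requested page.
export async function handleRequest(request, { markdownContent, htmlContent }) {
  const url = new URL(request.url);
  // Media types are case-insensitive, so normalize before matching.
  const accept = (request.headers.get('accept') ?? '').toLowerCase();

  // Simple substring check; it does not honor q-values such as text/markdown;q=0.
  if (accept.includes('text/markdown')) {
    return new Response(markdownContent, {
      headers: {
        'Content-Type': 'text/markdown; charset=utf-8',
        'Vary': 'Accept', // keeps caches from mixing the HTML and Markdown variants
      },
    });
  }

  // Strip trailing slashes so the Link target is not "/.md" or "/blog/.md".
  // Mapping the root path to "/index.md" is an assumption; adjust to your routes.
  const basePath = url.pathname.replace(/\/+$/, '') || '/index';
  return new Response(htmlContent, {
    headers: {
      'Content-Type': 'text/html; charset=utf-8',
      'Vary': 'Accept',
      'Link': `<${basePath}.md>; rel="alternate"; type="text/markdown"`,
    },
  });
}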

Strategy 2: Serve `.md` Routes

Every page gets a clean Markdown twin at the same URL with .md appended:

/blog/my-post      → HTML
/blog/my-post.md   → Markdown (Content-Type: text/markdown)
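A router needs to map each `.md` URL back to its page and pick the right Content-Type. The sketch below shows one way to do that; the function names are hypothetical, and mapping the site root to `/index.md` is an assumption of this sketch.

```javascript
const MD_SUFFIX = '.md';

// Path of the Markdown twin for an HTML page path.
// ASSUMPTION: the root page "/" is exposed as "/index.md".
function markdownTwin(pathname) {
  const trimmed = pathname.replace(/\/+$/, '');
  return trimmed === '' ? '/index.md' : `${trimmed}${MD_SUFFIX}`;
}

// Decide which format a request path asks for and which page it refers to.
function resolveRoute(pathname) {
  if (!pathname.endsWith(MD_SUFFIX)) {
    return { format: 'html', pagePath: pathname, contentType: 'text/html; charset=utf-8' };
  }
  const base = pathname.slice(0, -MD_SUFFIX.length);
  return {
    format: 'markdown',
    pagePath: base === '/index' ? '/' : base,
    contentType: 'text/markdown; charset=utf-8',
  };
}

console.log(markdownTwin('/blog/my-post'));
console.log(resolveRoute('/blog/my-post.md'));
```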

Strategy 3: `llms.txt` + `llms-full.txt`

Place at root (https://yoursite.com/llms.txt):

# Your Site Name

> One-line description for AI systems.

## Blog

- [Post Title](/blog/my-post): Brief description
- [Another Post](/blog/another): Brief description

## Documentation

- [API Reference](/docs/api): Full endpoint docs

Important: Serve llms.txt as direct 200 OK — no redirects. PerplexityBot doesn't follow redirects on this file at all.
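Sites with many pages will likely generate this file rather than maintain it by hand. A minimal sketch of a generator that reproduces the layout shown above follows; the input object shape and the function name are assumptions made for illustration.

```javascript
// Build llms.txt content from a site description.
// ASSUMED input shape:
//   { title, summary, sections: [{ heading, links: [{ title, url, description }] }] }
function buildLlmsTxt({ title, summary, sections }) {
  const lines = [`# ${title}`, '', `> ${summary}`];
  for (const section of sections) {
    lines.push('', `## ${section.heading}`, '');
    for (const link of section.links) {
      lines.push(`- [${link.title}](${link.url}): ${link.description}`);
    }
  }
  return lines.join('\n') + '\n';
}

const llmsTxt = buildLlmsTxt({
  title: 'Your Site Name',
  summary: 'One-line description for AI systems.',
  sections: [
    {
      heading: 'Blog',
      links: [{ title: 'Post Title', url: '/blog/my-post', description: 'Brief description' }],
    },
  ],
});
console.log(llmsTxt);
```

The generated string can be written to a static file at the site root so that it is served with a direct 200 response.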

Strategy 4: Discovery Signals

HTML <link> tag (in <head>):

<link rel="alternate" type="text/markdown" title="Markdown" href="/blog/my-post.md">

HTTP Link header:

Link: </blog/my-post.md>; rel="alternate"; type="text/markdown"

Hidden div for LLMs (when users paste URLs into chat):

<div class="visually-hidden" aria-hidden="true">
  A Markdown version of this page is available at
  https://yoursite.com/blog/my-post.md
</div>
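When pages are rendered from templates, the `<link>` tag and the Link header can be produced from the same path so they never drift apart. The helper below is a sketch with hypothetical names that emits both strings in the form shown above.

```javascript
// Escape a value for use inside a double-quoted HTML attribute.
function escapeHtmlAttr(value) {
  return String(value)
    .replace(/&/g, '&amp;')
    .replace(/"/g, '&quot;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;');
}

// Produce both discovery signals for one Markdown URL or path.
function discoverySignals(markdownPath) {
  return {
    linkHeader: `<${encodeURI(markdownPath)}>; rel="alternate"; type="text/markdown"`,
    linkTag:
      '<link rel="alternate" type="text/markdown" title="Markdown" ' +
      `href="${escapeHtmlAttr(markdownPath)}">`,
  };
}

const signals = discoverySignals('/blog/my-post.md');
console.log(signals.linkHeader);
console.log(signals.linkTag);
```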

Strategy 5: Server Log Monitoring

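See which AI crawlers are hitting your site. This assumes the combined log format; Google-Extended is left out because it is a robots.txt token and never appears in a User-Agent header:

grep -Ei "gptbot|oai-searchbot|chatgpt-user|claudebot|perplexitybot|bingbot|meta-externalagent|amazonbot|bytespider" access.log | awk -F'"' '{print $1, $2, $6}' | head -50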
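For a per-bot tally rather than raw lines, the same matching can be done in a short script. This is a sketch that assumes the combined log format, where the User-Agent is the last quoted field on each line; the function name and the sample log lines are made up for illustration.

```javascript
// Subset of crawler tokens to look for (case-insensitive).
const BOT_PATTERN =
  /gptbot|oai-searchbot|chatgpt-user|claudebot|perplexitybot|bingbot|meta-externalagent|amazonbot|bytespider/i;

// Count hits per crawler token, most frequent first.
// ASSUMPTION: combined log format, User-Agent is the last quoted field.
function countBotHits(logText) {
  const counts = new Map();
  for (const line of logText.split('\n')) {
    const quoted = line.match(/"([^"]*)"\s*$/);
    if (!quoted) continue;
    const bot = quoted[1].match(BOT_PATTERN);
    if (!bot) continue;
    const key = bot[0].toLowerCase();
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  return [...counts.entries()].sort((a, b) => b[1] - a[1]);
}

// Made-up sample lines using documentation IP addresses.
const sampleLog = [
  '203.0.113.5 - - [14/Apr/2026:10:00:00 +0000] "GET /blog/my-post HTTP/1.1" 200 5120 "-" "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.3; +https://openai.com/gptbot)"',
  '203.0.113.5 - - [14/Apr/2026:10:00:02 +0000] "GET /docs/api HTTP/1.1" 200 812 "-" "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.3; +https://openai.com/gptbot)"',
  '198.51.100.9 - - [14/Apr/2026:10:00:03 +0000] "GET / HTTP/1.1" 200 2048 "-" "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; ClaudeBot/1.0; +claudebot@anthropic.com)"',
  '192.0.2.44 - - [14/Apr/2026:10:00:04 +0000] "GET / HTTP/1.1" 200 2048 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36"',
].join('\n');

console.log(countBotHits(sampleLog));
```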

Redirect Best Practices for AI Crawlers

AI crawlers are less tolerant of redirect chains than Googlebot:

Crawler	Max Hops Tolerated
GPTBot (training)	5
ClaudeBot (training)	5
PerplexityBot (indexing)	5
OAI-SearchBot (real-time)	3
Claude-SearchBot (real-time)	3
Perplexity-User (on-demand)	3
Googlebot (reference)	10

Target: 1 hop maximum for pages you want cited in AI answers.
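To check a URL against that target, the redirect chain can be walked manually. The sketch below assumes Node 18+, where `fetch` with `redirect: 'manual'` exposes the 3xx status and `Location` header (browsers hide them). The function name and the injectable `fetchFn` option are hypothetical; the stub in the usage stands in for real network calls.

```javascript
// Follow redirects one hop at a time and report how many were needed.
async function countRedirectHops(startUrl, { fetchFn = fetch, maxHops = 10 } = {}) {
  let url = startUrl;
  for (let hops = 0; hops <= maxHops; hops += 1) {
    const res = await fetchFn(url, { method: 'HEAD', redirect: 'manual' });
    const location = res.headers.get('location');
    const isRedirect = res.status >= 300 && res.status < 400 && Boolean(location);
    if (!isRedirect) return { hops, finalUrl: url, status: res.status };
    // Location may be relative, so resolve it against the current URL.
    url = new URL(location, url).toString();
  }
  throw new Error(`More than ${maxHops} redirects starting at ${startUrl}`);
}

// Usage with a stubbed fetch: /a -> /b -> /c (two hops, over the 1-hop target).
const stubRoutes = {
  'https://example.com/a': { status: 301, location: '/b' },
  'https://example.com/b': { status: 302, location: 'https://example.com/c' },
  'https://example.com/c': { status: 200 },
};
const stubFetch = async (url) => {
  const route = stubRoutes[url];
  return {
    status: route.status,
    headers: { get: (name) => (name === 'location' ? route.location ?? null : null) },
  };
};

const hopCheck = countRedirectHops('https://example.com/a', { fetchFn: stubFetch });
hopCheck.then((result) => console.log(result));
```

Some servers answer HEAD differently from GET, so switching the method to GET may be needed for an accurate picture.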

Part 5: Quick Reference — robots.txt Tokens

# ——— TRAINING CRAWLERS (block if you don't want content used for training) ———
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: meta-externalagent
Disallow: /

User-agent: cohere-ai
Disallow: /

User-agent: AI2Bot
Disallow: /

# ——— SEARCH / CITATION CRAWLERS (allow if you want AI search visibility) ———
User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: DuckAssistBot
Allow: /

# ——— USER-TRIGGERED AGENTS (allow for chat citations) ———
User-agent: ChatGPT-User
Allow: /

User-agent: Perplexity-User
Allow: /

User-agent: MistralAI-User
Allow: /

User-agent: Claude-User
Allow: /
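If the policy changes often, the file can be generated from two lists instead of edited by hand. The following is a minimal sketch; the `block`/`allow` policy shape and the function name are assumptions, and it only emits site-wide `Disallow: /` or `Allow: /` rules like the ones above.

```javascript
// Build robots.txt groups from a simple policy object.
// ASSUMED shape: { block: [tokens to disallow], allow: [tokens to allow] }
function buildRobotsTxt(policy) {
  const groups = [];
  const rules = [
    ['Disallow', policy.block ?? []],
    ['Allow', policy.allow ?? []],
  ];
  for (const [directive, tokens] of rules) {
    for (const token of tokens) {
      groups.push(`User-agent: ${token}\n${directive}: /`);
    }
  }
  return groups.join('\n\n') + '\n';
}

const robotsTxt = buildRobotsTxt({
  block: ['GPTBot', 'ClaudeBot'],
  allow: ['OAI-SearchBot', 'ChatGPT-User'],
});
console.log(robotsTxt);
```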
