nb1t.sh

AI Crawler User-Agent Reference & Markdown-Serving Strategy

Tue Apr 14 2026 · Nitin Bansal

Compiled May 2026 from SEJ, Cloudflare Radar, Evil Martians, CaptainDNS, and live server-log analysis.


Part 1: The 19+ Most Common AI Crawlers

OpenAI Family

Token (robots.txt) | Full User-Agent | Purpose | Crawl Rate | IP Verify
GPTBot | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.3; +https://openai.com/gptbot) | Training GPT models | ~100 pages/hr | openai.com/gptbot.json
ChatGPT-User | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; ChatGPT-User/1.0; +https://openai.com/bot | User-triggered browsing from ChatGPT | ~2400 pages/hr | openai.com/chatgpt-user.json
OAI-SearchBot | Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36; compatible; OAI-SearchBot/1.3; +https://openai.com/searchbot | ChatGPT Search indexing | ~150 pages/hr | openai.com/searchbot.json

Anthropic Family

Token (robots.txt) | Full User-Agent | Purpose | Crawl Rate | IP Verify
ClaudeBot | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; ClaudeBot/1.0; +claudebot@anthropic.com) | Training Claude models | ~500 pages/hr | docs.claude.com/en/api/ip-addresses
Claude-User | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Claude-User/1.0; +Claude-User@anthropic.com) | User-triggered browsing from Claude | <10 pages/hr | Not published
Claude-SearchBot | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Claude-SearchBot/1.0; +https://www.anthropic.com) | Claude Search indexing | <10 pages/hr | Not published
anthropic-ai | Mozilla/5.0 (compatible; anthropic-ai/1.0; +https://www.anthropic.com) | Legacy training crawler | Variable | Not published

Perplexity Family

Token (robots.txt) | Full User-Agent | Purpose | Crawl Rate | IP Verify
PerplexityBot | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; PerplexityBot/1.0; +https://perplexity.ai/perplexitybot) | Answer engine indexing | ~150 pages/hr | perplexity.com/perplexitybot.json
Perplexity-User | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Perplexity-User/1.0; +https://perplexity.ai/perplexity-user) | User-triggered browsing | <10 pages/hr | perplexity.com/perplexity-user.json

Google Family

Token (robots.txt) | Full User-Agent | Purpose | IP Verify
Google-Extended | (Uses Googlebot UA; this is a purpose token, not a separate crawler) | Controls Gemini AI training use of Googlebot data | googlebot.json
Google-CloudVertexBot | Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMP29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/141.0.7390.122 Mobile Safari/537.36 (compatible; Google-CloudVertexBot; +https://cloud.google.com/enterprise-search) | Vertex AI Agent Builder | Same IP range
Gemini-Deep-Research | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Gemini-Deep-Research; +https://gemini.google/overview/deep-research/) Chrome/135.0.0.0 Safari/537.36 | Gemini deep research agent | Same IP range
GoogleAgent-Mariner | GoogleAgent-Mariner/1.0 | Project Mariner agentic browser | Same IP range
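
User-Agent strings are trivial to spoof, which is why the IP Verify column matters. The following is a minimal sketch of the check itself, assuming you have already downloaded one of the published lists and extracted its IPv4 prefixes into an array of CIDR strings (the JSON layout is not described in this article, so parsing is left out). The function names are illustrative, and the sample ranges are reserved documentation addresses, not real crawler ranges.

```javascript
// Convert a dotted IPv4 string to an unsigned 32-bit integer, or null if invalid.
function ipv4ToInt(ip) {
  const parts = String(ip).split('.');
  if (parts.length !== 4 || !parts.every((p) => /^\d{1,3}$/.test(p))) return null;
  const nums = parts.map(Number);
  if (nums.some((n) => n > 255)) return null;
  return ((nums[0] << 24) | (nums[1] << 16) | (nums[2] << 8) | nums[3]) >>> 0;
}

// True if `ip` falls inside the IPv4 CIDR block `cidr` (for example "192.0.2.0/24").
function ipInCidr(ip, cidr) {
  const [base, bitsText] = String(cidr).split('/');
  const bits = Number(bitsText);
  const ipInt = ipv4ToInt(ip);
  const baseInt = ipv4ToInt(base);
  if (ipInt === null || baseInt === null) return false;
  if (!Number.isInteger(bits) || bits < 0 || bits > 32) return false;
  // A shift by 32 wraps around in JavaScript, so /0 is handled separately.
  const mask = bits === 0 ? 0 : (0xffffffff << (32 - bits)) >>> 0;
  return ((ipInt & mask) >>> 0) === ((baseInt & mask) >>> 0);
}

// True if `ip` is inside any of the published prefixes.
function isVerifiedCrawlerIp(ip, prefixes) {
  return prefixes.some((cidr) => ipInCidr(ip, cidr));
}

// Placeholder prefixes (documentation ranges); substitute the real published list.
const samplePrefixes = ['192.0.2.0/24', '198.51.100.0/28'];
console.log(isVerifiedCrawlerIp('198.51.100.7', samplePrefixes));
```

This sketch covers IPv4 only; published lists may also contain IPv6 prefixes, which need separate handling.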

Meta Family

Token (robots.txt) | Full User-Agent | Purpose | Crawl Rate
meta-externalagent | meta-externalagent/1.1 (+https://developers.facebook.com/docs/sharing/webmasters/crawler) | AI training for Llama models | ~1100 pages/hr
Meta-WebIndexer | meta-webindexer/1.1 (+https://developers.facebook.com/docs/sharing/webmasters/crawler) | Meta AI search indexing | <10 pages/hr

Other Major Players

Token (robots.txt) | Full User-Agent | Company | Purpose
Bytespider | Mozilla/5.0 (Linux; Android 5.0) AppleWebKit/537.36 (KHTML, like Gecko) Mobile Safari/537.36 (compatible; Bytespider; https://zhanzhang.toutiao.com/) | ByteDance | TikTok AI training
Amazonbot | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Amazonbot/0.1; +https://developer.amazon.com/support/amazonbot) Chrome/119.0.6045.214 Safari/537.36 | Amazon | Alexa/AI training
Applebot-Extended | Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.4 Safari/605.1.15 (Applebot/0.1; +http://www.apple.com/go/applebot) | Apple | Apple Intelligence training
DuckAssistBot | DuckAssistBot/1.2; (+http://duckduckgo.com/duckassistbot.html) | DuckDuckGo | AI answer feature
CCBot | CCBot/2.0 (https://commoncrawl.org/faq/) | Common Crawl | Open training dataset
cohere-ai | cohere-ai/1.0 | Cohere | Model training
AI2Bot | AI2Bot/1.0 | Allen Institute | Semantic Scholar
Diffbot | Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1.2) Gecko/20090729 Firefox/3.5.2 (.NET CLR 3.5.30729; Diffbot/0.1; +http://www.diffbot.com) | Diffbot | Structured data extraction
MistralAI-User | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; MistralAI-User/1.0; +https://docs.mistral.ai/robots) | Mistral | Le Chat citations
YouBot | YouBot/1.0 | You.com | AI search

Hard-to-Identify Crawlers

Entity | Issue
xAI Grok | Documented agents (GrokBot/1.0, xAI-Grok/1.0, Grok-DeepSearch/1.0) rarely seen; reportedly uses iPhone UA strings
ChatGPT Atlas | Uses standard Chrome UA; indistinguishable from real users
OpenAI Operator | No declared UA; appears as Chrome from remote browser
Bing Copilot | No declared crawler UA; uses Bingbot's data
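
For the crawlers that do declare themselves, the tokens in the tables above can be matched against an incoming User-Agent header. The following is a rough sketch covering a subset of those tokens. The three purpose labels (training, search, user) are a simplification introduced here, and `identifyAiCrawler` is an illustrative name. A match only shows what the client claims to be, and the hard-to-identify agents listed above will come back as null.

```javascript
// Subset of the tokens listed in Part 1, tagged with a simplified purpose label.
const AI_CRAWLER_TOKENS = [
  ['GPTBot', 'training'],
  ['ChatGPT-User', 'user'],
  ['OAI-SearchBot', 'search'],
  ['ClaudeBot', 'training'],
  ['Claude-User', 'user'],
  ['Claude-SearchBot', 'search'],
  ['PerplexityBot', 'search'],
  ['Perplexity-User', 'user'],
  ['meta-externalagent', 'training'],
  ['Meta-WebIndexer', 'search'],
  ['Bytespider', 'training'],
  ['Amazonbot', 'training'],
  ['CCBot', 'training'],
  ['DuckAssistBot', 'search'],
  ['MistralAI-User', 'user'],
];

// Returns { token, purpose } for the first token found in the UA, else null.
function identifyAiCrawler(userAgent) {
  const ua = String(userAgent ?? '').toLowerCase();
  for (const [token, purpose] of AI_CRAWLER_TOKENS) {
    if (ua.includes(token.toLowerCase())) return { token, purpose };
  }
  return null;
}

console.log(identifyAiCrawler('meta-externalagent/1.1 (+https://developers.facebook.com/docs/sharing/webmasters/crawler)'));
```

Matching is case-insensitive because some crawlers send a different casing than their robots.txt token, as the Meta-WebIndexer row shows.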

Part 2: Cloudflare Radar — Real-World Crawl Share (May 2025)

Ranked by % of all AI+search crawl traffic:

Rank | Bot | Share May '25 | YoY Change
1 | Googlebot | 50% | +20pp
2 | Bingbot | 8.7% | -1.3pp
3 | GPTBot | 7.7% | +5.5pp (+305% req)
4 | ClaudeBot | 5.4% | -6.3pp
5 | GoogleOther | 4.3% | -0.1pp
6 | Amazonbot | 4.2% | -3.4pp
7 | Googlebot-Image | 3.3% | -1.2pp
8 | Bytespider | 2.9% | -19.8pp (-85%)
9 | Yandex | 2.2% | -0.7pp
10 | ChatGPT-User | 1.3% | +1.2pp (+2825%)
11 | Applebot | 1.2% | -0.7pp
14 | PerplexityBot | 0.2% | +157,490%

Part 3: Accept Header & Content Negotiation

The Accept: text/markdown Standard

This is the correct, standards-compliant way to serve Markdown to AI agents. HTTP content negotiation, part of HTTP/1.1 since RFC 2068 (1997) and now specified in RFC 9110, lets clients request a preferred format:

Accept: text/markdown, text/html;q=0.9
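
The q-values in that header express preference, so a server that wants to honor them has to parse the header and not just search it for a substring. The following is a minimal sketch of that parsing. `parseAccept` and `prefersMarkdown` are illustrative names, and the rule used here (serve Markdown only when text/markdown is listed explicitly with a quality at least as high as text/html) is one reasonable policy, not something the HTTP specification mandates.

```javascript
// Parse an Accept header into [{ type, q }] entries. A missing q defaults to 1.
function parseAccept(header) {
  return String(header ?? '')
    .split(',')
    .map((part) => part.trim())
    .filter(Boolean)
    .map((part) => {
      const [type, ...params] = part.split(';').map((s) => s.trim());
      let q = 1;
      for (const param of params) {
        const [name, value] = param.split('=').map((s) => s.trim());
        if (name.toLowerCase() === 'q') {
          const parsed = Number.parseFloat(value);
          if (!Number.isNaN(parsed)) q = parsed;
        }
      }
      return { type: type.toLowerCase(), q };
    });
}

// True when the client explicitly lists text/markdown and ranks it
// at least as high as text/html (wildcards count toward text/html only).
function prefersMarkdown(header) {
  const entries = parseAccept(header);
  const find = (type) => entries.find((e) => e.type === type);
  const markdown = find('text/markdown');
  if (!markdown || markdown.q <= 0) return false;
  const html = find('text/html') ?? find('text/*') ?? find('*/*');
  return markdown.q >= (html ? html.q : 0);
}

console.log(prefersMarkdown('text/markdown, text/html;q=0.9'));
```

Wildcards such as `*/*` are deliberately not treated as a request for Markdown, so generic HTTP clients and ordinary browsers keep receiving HTML.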

Known Clients That Send Accept: text/markdown

Real-world data from 44 days of measurement (Suganthan, Mar-Apr 2026):

Client | Requests | Notes
Chrome Headless (RAG pipelines) | 639 | Automated retrieval, not real browsers
Claude infrastructure | 500 | Anthropic's own stack (not ClaudeBot!)
axios (node pipelines) | 211 | Common in AI data pipelines
curl | 13 | Manual/scripted
markdown.new/1.0 | 6 | Dedicated markdown fetcher
qodercli/1.0 | 3 | AI coding tool (Qoder)
MarkdownWorker/1.0 | 1 | Another dedicated markdown fetcher

Key finding: Tools like Claude Code, Cursor, and other coding assistants actively send Accept: text/markdown. Claude's own infrastructure made up 35% of all markdown requests. The ecosystem is growing in real time.

Cloudflare Markdown for Agents

Cloudflare can auto-convert HTML → Markdown on the fly when a client sends Accept: text/markdown. Response includes:

  • Content-Type: text/markdown; charset=utf-8
  • Vary: Accept (for CDN caching separation)
  • x-markdown-tokens (estimated token count)
  • Content-Signal: ai-train=yes, search=yes, ai-input=yes

Limitations: Chunked transfer encoding (no Content-Length) or payload >2MB causes silent fallback to HTML.
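
Because the fallback is silent, a client has to inspect the response to know which variant it received. The following is a small sketch that reads the headers listed above. `inspectNegotiatedResponse` is an illustrative name, and it accepts anything with a `get(name)` method, such as the `Headers` object on a fetch response. The header names come from the list above; everything else is an assumption.

```javascript
// Summarize whether a response was actually served as Markdown.
// `headers` is any object with a get(name) method, e.g. fetch's response.headers.
function inspectNegotiatedResponse(headers) {
  const contentType = String(headers.get('content-type') ?? '').toLowerCase();
  const tokenHeader = headers.get('x-markdown-tokens');
  const tokens = tokenHeader == null ? NaN : Number.parseInt(tokenHeader, 10);
  const vary = String(headers.get('vary') ?? '')
    .toLowerCase()
    .split(',')
    .map((s) => s.trim());
  return {
    // False here means the server fell back to HTML (or never negotiated).
    servedMarkdown: contentType.startsWith('text/markdown'),
    // Estimated token count if the server reported one, else null.
    estimatedTokens: Number.isNaN(tokens) ? null : tokens,
    // Without "Vary: Accept" a shared cache may mix the two variants.
    variesOnAccept: vary.includes('accept'),
  };
}
```

With fetch, this would be called as `inspectNegotiatedResponse(response.headers)` after a request that sends `Accept: text/markdown`.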

Traditional AI Crawler Accept Headers

Most major AI crawlers do not send Accept: text/markdown yet. They typically send standard browser-like Accept headers:

Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8

The evidence is clear: major crawlers (GPTBot, ClaudeBot, PerplexityBot, etc.) do not request /llms.txt or .md routes unprompted. But user-initiated AI tools (someone pasting your URL into ChatGPT/Claude) and coding agents (Cursor, Claude Code) do use content negotiation. This is where the value is today.


Part 4: How to Serve Markdown Instead of HTML to AI Clients

Strategy 1: Content Negotiation

Serve Markdown when Accept: text/markdown is present, HTML otherwise:

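// Middleware example: pick the response format from the Accept header.
// `loadPage` is a hypothetical helper that returns { markdown, html }
// for a path; replace it with however your app renders a page.
export async function handleRequest(request, loadPage) {
  const url = new URL(request.url);
  const accept = (request.headers.get('accept') ?? '').toLowerCase();
  const { markdown, html } = await loadPage(url.pathname);

  // Simple substring check; it does not honor q-values such as "text/markdown;q=0".
  if (accept.includes('text/markdown')) {
    return new Response(markdown, {
      headers: {
        'Content-Type': 'text/markdown; charset=utf-8',
        // Tell caches that the body depends on the Accept header.
        'Vary': 'Accept',
      },
    });
  }

  return new Response(html, {
    headers: {
      'Content-Type': 'text/html; charset=utf-8',
      'Vary': 'Accept',
      // Advertise the Markdown twin. Note that "/" would yield "/.md",
      // so handle the site root separately if it has a Markdown version.
      'Link': `<${url.pathname}.md>; rel="alternate"; type="text/markdown"`,
    },
  });
}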

Strategy 2: Serve .md Routes

Every page gets a clean Markdown twin at the same URL with .md appended:

/blog/my-post      → HTML
/blog/my-post.md   → Markdown (Content-Type: text/markdown)
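
The mapping between a page and its Markdown twin can be kept in one place so that routes, Link headers, and link tags all agree. The following is a minimal sketch under two assumptions not stated above: trailing slashes are ignored, and the site root maps to /index.md. Both function names are illustrative.

```javascript
// Map an HTML route to its Markdown twin by appending ".md".
function markdownTwinPath(pathname) {
  if (pathname.endsWith('.md')) return pathname; // already a Markdown route
  const trimmed = pathname.replace(/\/+$/, ''); // drop trailing slashes
  // Assumption: the site root has no natural file name, so use /index.md.
  return trimmed === '' ? '/index.md' : `${trimmed}.md`;
}

// Reverse mapping: find the HTML route that a ".md" request refers to.
function htmlPathFor(markdownPath) {
  if (!markdownPath.endsWith('.md')) return markdownPath;
  if (markdownPath === '/index.md') return '/';
  return markdownPath.slice(0, -'.md'.length);
}

console.log(markdownTwinPath('/blog/my-post')); // "/blog/my-post.md"
```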

Strategy 3: llms.txt + llms-full.txt

Place at root (https://yoursite.com/llms.txt):

# Your Site Name

> One-line description for AI systems.

## Blog

- [Post Title](/blog/my-post): Brief description
- [Another Post](/blog/another): Brief description

## Documentation

- [API Reference](/docs/api): Full endpoint docs

Important: Serve llms.txt as a direct 200 OK response with no redirects. PerplexityBot doesn't follow redirects on this file at all.
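
Since llms.txt is a plain list of links, it can be generated from whatever content index a site already has instead of being maintained by hand. The following is a minimal sketch that produces the layout shown above. `buildLlmsTxt` and the shape of its input object are assumptions made for illustration.

```javascript
// Build an llms.txt body: title, one-line summary, then one section per group of links.
function buildLlmsTxt({ title, summary, sections }) {
  const lines = [`# ${title}`, '', `> ${summary}`];
  for (const { heading, links } of sections) {
    lines.push('', `## ${heading}`, '');
    for (const { label, href, note } of links) {
      lines.push(`- [${label}](${href}): ${note}`);
    }
  }
  return lines.join('\n') + '\n';
}

// Example input mirroring the sample file above.
const llmsTxt = buildLlmsTxt({
  title: 'Your Site Name',
  summary: 'One-line description for AI systems.',
  sections: [
    {
      heading: 'Blog',
      links: [
        { label: 'Post Title', href: '/blog/my-post', note: 'Brief description' },
        { label: 'Another Post', href: '/blog/another', note: 'Brief description' },
      ],
    },
    {
      heading: 'Documentation',
      links: [{ label: 'API Reference', href: '/docs/api', note: 'Full endpoint docs' }],
    },
  ],
});
console.log(llmsTxt);
```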

Strategy 4: Discovery Signals

HTML <link> tag (in <head>):

<link rel="alternate" type="text/markdown" title="Markdown" href="/blog/my-post.md">

HTTP Link header:

Link: </blog/my-post.md>; rel="alternate"; type="text/markdown"

Hidden div for LLMs (when users paste URLs into chat):

<div class="visually-hidden" aria-hidden="true">
  A Markdown version of this page is available at
  https://yoursite.com/blog/my-post.md
</div>
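
The link tag and the Link header carry the same information, so they can be derived from a single path to keep them from drifting apart. The following is a minimal sketch, assuming the ".md appended" convention from Strategy 2. `discoverySignals` and `escapeAttr` are illustrative names.

```javascript
// Escape a value for use inside a double-quoted HTML attribute.
function escapeAttr(value) {
  return String(value)
    .replace(/&/g, '&amp;')
    .replace(/"/g, '&quot;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;');
}

// Build both discovery signals for a page path such as "/blog/my-post".
function discoverySignals(pagePath) {
  const mdPath = `${pagePath}.md`;
  return {
    // Goes inside <head>.
    linkTag: `<link rel="alternate" type="text/markdown" title="Markdown" href="${escapeAttr(mdPath)}">`,
    // Value for the HTTP Link response header.
    linkHeader: `<${encodeURI(mdPath)}>; rel="alternate"; type="text/markdown"`,
  };
}

console.log(discoverySignals('/blog/my-post').linkHeader);
```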

Strategy 5: Server Log Monitoring

See which AI crawlers are hitting your site:

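grep -Ei "gptbot|oai-searchbot|chatgpt-user|claudebot|perplexitybot|bingbot|meta-externalagent|amazonbot|bytespider" access.log | awk -F'"' '{split($1, a, " "); split($2, r, " "); print a[1], a[4], r[2], $6}' | head -50

This assumes the combined log format and prints the client IP, timestamp, request path, and full user agent. Google-Extended is omitted because it is a robots.txt token only and never appears in a User-Agent string.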
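
For a per-crawler tally instead of a raw listing, the same idea can be expressed in a few lines of JavaScript. The following is a minimal sketch, assuming log lines in the combined log format where the user agent is the last quoted field. `countCrawlerHits` is an illustrative name and the pattern reuses crawler tokens from the grep command above; Google-Extended is left out because, as noted in Part 1, it is a robots.txt token and never appears in a User-Agent.

```javascript
// Crawler tokens to look for, matched case-insensitively.
const CRAWLER_PATTERN =
  /gptbot|oai-searchbot|chatgpt-user|claudebot|perplexitybot|bingbot|meta-externalagent|amazonbot|bytespider/i;

// Count hits per crawler. Returns [name, count] pairs, busiest first.
function countCrawlerHits(logLines) {
  const counts = new Map();
  for (const line of logLines) {
    // The user agent is the last double-quoted field on the line.
    const quoted = line.match(/"([^"]*)"\s*$/);
    if (!quoted) continue;
    const hit = quoted[1].match(CRAWLER_PATTERN);
    if (!hit) continue;
    const name = hit[0].toLowerCase();
    counts.set(name, (counts.get(name) ?? 0) + 1);
  }
  return [...counts.entries()].sort((a, b) => b[1] - a[1]);
}

// Made-up sample lines using documentation IP addresses.
const sampleLog = [
  '192.0.2.1 - - [14/Apr/2026:10:00:00 +0000] "GET /blog/my-post HTTP/1.1" 200 512 "-" "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.3; +https://openai.com/gptbot)"',
  '192.0.2.2 - - [14/Apr/2026:10:00:01 +0000] "GET /docs/api HTTP/1.1" 200 900 "-" "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; ClaudeBot/1.0; +claudebot@anthropic.com)"',
  '192.0.2.1 - - [14/Apr/2026:10:00:02 +0000] "GET /blog/another HTTP/1.1" 200 640 "-" "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.3; +https://openai.com/gptbot)"',
  '192.0.2.9 - - [14/Apr/2026:10:00:03 +0000] "GET / HTTP/1.1" 200 2048 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/131.0.0.0 Safari/537.36"',
];
console.log(countCrawlerHits(sampleLog));
```

To run it against a real file, read the log with `fs.readFileSync(path, 'utf8').split('\n')` and pass the resulting array in.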

Redirect Best Practices for AI Crawlers

AI crawlers are less tolerant of redirect chains than Googlebot:

Crawler | Max Hops Tolerated
GPTBot (training) | 5
ClaudeBot (training) | 5
PerplexityBot (indexing) | 5
OAI-SearchBot (real-time) | 3
Claude-SearchBot (real-time) | 3
Perplexity-User (on-demand) | 3
Googlebot (reference) | 10

Target: 1 hop maximum for pages you want cited in AI answers.
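
One way to audit this is to walk your own redirect rules and count the hops before a crawler ever has to. The following is a minimal sketch, assuming the rules are available as a Map from source path to target path. The hop limits are copied from the table above, and the function names are illustrative.

```javascript
// Hop limits copied from the table above.
const MAX_HOPS = {
  'GPTBot': 5,
  'ClaudeBot': 5,
  'PerplexityBot': 5,
  'OAI-SearchBot': 3,
  'Claude-SearchBot': 3,
  'Perplexity-User': 3,
  'Googlebot': 10,
};

// Follow a redirect map from `startPath` and count the hops.
// Returns { hops, finalPath, loop }; finalPath is null when a loop is found.
function countRedirectHops(startPath, redirects) {
  const seen = new Set([startPath]);
  let current = startPath;
  let hops = 0;
  while (redirects.has(current)) {
    current = redirects.get(current);
    hops += 1;
    if (seen.has(current)) return { hops, finalPath: null, loop: true };
    seen.add(current);
  }
  return { hops, finalPath: current, loop: false };
}

// Names of crawlers whose tolerated hop count is exceeded.
function crawlersThatGiveUp(hops) {
  return Object.entries(MAX_HOPS)
    .filter(([, limit]) => hops > limit)
    .map(([name]) => name);
}

// Hypothetical four-hop chain: /a -> /b -> /c -> /d -> /e
const chain = new Map([['/a', '/b'], ['/b', '/c'], ['/c', '/d'], ['/d', '/e']]);
const audit = countRedirectHops('/a', chain);
console.log(audit.hops, crawlersThatGiveUp(audit.hops));
```

In this example the four-hop chain is still within the limit for the training crawlers but exceeds the three-hop limit listed for the real-time and on-demand agents.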


Part 5: Quick Reference — robots.txt Tokens

# ——— TRAINING CRAWLERS (block if you don't want content used for training) ———
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: meta-externalagent
Disallow: /

User-agent: cohere-ai
Disallow: /

User-agent: AI2Bot
Disallow: /

# ——— SEARCH / CITATION CRAWLERS (allow if you want AI search visibility) ———
User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: DuckAssistBot
Allow: /

# ——— USER-TRIGGERED AGENTS (allow for chat citations) ———
User-agent: ChatGPT-User
Allow: /

User-agent: Perplexity-User
Allow: /

User-agent: MistralAI-User
Allow: /

User-agent: Claude-User
Allow: /
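
A file like the one above is repetitive enough that it can be generated from a short policy description, which makes it easier to move a crawler from one group to another. The following is a minimal sketch. `buildRobotsTxt` and the shape of the `groups` input are assumptions made for illustration, and it only emits whole-site Allow or Disallow rules like those shown above.

```javascript
// Build robots.txt text from groups of { comment, tokens, rule },
// where rule is "Allow" or "Disallow" and applies to the whole site.
function buildRobotsTxt(groups) {
  const lines = [];
  for (const { comment, tokens, rule } of groups) {
    lines.push(`# ${comment}`);
    for (const token of tokens) {
      lines.push(`User-agent: ${token}`, `${rule}: /`, '');
    }
  }
  return lines.join('\n').trimEnd() + '\n';
}

// Example policy using a few tokens from the reference above.
const robotsTxt = buildRobotsTxt([
  { comment: 'Training crawlers', tokens: ['GPTBot', 'ClaudeBot'], rule: 'Disallow' },
  { comment: 'Search crawlers', tokens: ['OAI-SearchBot'], rule: 'Allow' },
]);
console.log(robotsTxt);
```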

Sources