Compiled May 2026 from SEJ, Cloudflare Radar, Evil Martians, CaptainDNS, and live server-log analysis.
Part 1: The 19+ Most Common AI Crawlers
OpenAI Family
| Token (robots.txt) | Full User-Agent | Purpose | Crawl Rate | IP Verify |
|---|---|---|---|---|
| GPTBot | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.3; +https://openai.com/gptbot) | Training GPT models | ~100 pages/hr | openai.com/gptbot.json |
| ChatGPT-User | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; ChatGPT-User/1.0; +https://openai.com/bot | User-triggered browsing from ChatGPT | ~2400 pages/hr | openai.com/chatgpt-user.json |
| OAI-SearchBot | Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36; compatible; OAI-SearchBot/1.3; +https://openai.com/searchbot | ChatGPT Search indexing | ~150 pages/hr | openai.com/searchbot.json |
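The IP Verify column points at published range files, which matter because a User-Agent string is trivial to spoof. The following is a minimal sketch of the check itself, assuming the range file has already been fetched and reduced to a list of IPv4 CIDR strings. The function names and the example ranges (documentation-only addresses from RFC 5737) are hypothetical, and the JSON layout of the real files is deliberately not assumed here.

```javascript
// Convert dotted-quad IPv4 text to an unsigned 32-bit integer.
function ipv4ToInt(ip) {
  const parts = ip.split('.').map(Number);
  if (parts.length !== 4 || parts.some((p) => !Number.isInteger(p) || p < 0 || p > 255)) {
    throw new Error(`Not an IPv4 address: ${ip}`);
  }
  return ((parts[0] << 24) | (parts[1] << 16) | (parts[2] << 8) | parts[3]) >>> 0;
}

// True when `ip` falls inside `cidr`, e.g. "192.0.2.0/24".
function inCidr(ip, cidr) {
  const [base, bitsText] = cidr.split('/');
  const bits = Number(bitsText);
  // A shift by 32 is a no-op in JavaScript, so /0 is handled separately.
  const mask = bits === 0 ? 0 : (0xffffffff << (32 - bits)) >>> 0;
  return ((ipv4ToInt(ip) & mask) >>> 0) === ((ipv4ToInt(base) & mask) >>> 0);
}

// ASSUMPTION: `cidrs` was extracted from a vendor's published range file.
function isFromPublishedRange(ip, cidrs) {
  return cidrs.some((cidr) => inCidr(ip, cidr));
}

// Placeholder ranges for illustration only, not real crawler ranges.
const exampleRanges = ['192.0.2.0/24', '198.51.100.16/28'];
console.log(isFromPublishedRange('192.0.2.77', exampleRanges));  // true
console.log(isFromPublishedRange('203.0.113.5', exampleRanges)); // false
```

A request that claims to be GPTBot but fails this check against OpenAI's published ranges is best treated as an impersonator. IPv6 ranges would need a separate comparison.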
Anthropic Family
| Token (robots.txt) | Full User-Agent | Purpose | Crawl Rate | IP Verify |
|---|---|---|---|---|
| ClaudeBot | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; ClaudeBot/1.0; +claudebot@anthropic.com) | Training Claude models | ~500 pages/hr | docs.claude.com/en/api/ip-addresses |
| Claude-User | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Claude-User/1.0; +Claude-User@anthropic.com) | User-triggered browsing from Claude | <10 pages/hr | Not published |
| Claude-SearchBot | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Claude-SearchBot/1.0; +https://www.anthropic.com) | Claude Search indexing | <10 pages/hr | Not published |
| anthropic-ai | Mozilla/5.0 (compatible; anthropic-ai/1.0; +https://www.anthropic.com) | Legacy training crawler | Variable | Not published |
Perplexity Family
| Token (robots.txt) | Full User-Agent | Purpose | Crawl Rate | IP Verify |
|---|---|---|---|---|
| PerplexityBot | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; PerplexityBot/1.0; +https://perplexity.ai/perplexitybot) | Answer engine indexing | ~150 pages/hr | perplexity.com/perplexitybot.json |
| Perplexity-User | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Perplexity-User/1.0; +https://perplexity.ai/perplexity-user) | User-triggered browsing | <10 pages/hr | perplexity.com/perplexity-user.json |
Google Family
| Token (robots.txt) | Full User-Agent | Purpose | IP Verify |
|---|---|---|---|
| Google-Extended | (Uses Googlebot UA — this is a purpose token, not a separate crawler) | Controls Gemini AI training use of Googlebot data | googlebot.json |
| Google-CloudVertexBot | Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMP29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/141.0.7390.122 Mobile Safari/537.36 (compatible; Google-CloudVertexBot; +https://cloud.google.com/enterprise-search) | Vertex AI Agent Builder | Same IP range |
| Gemini-Deep-Research | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Gemini-Deep-Research; +https://gemini.google/overview/deep-research/) Chrome/135.0.0.0 Safari/537.36 | Gemini deep research agent | Same IP range |
| GoogleAgent-Mariner | GoogleAgent-Mariner/1.0 | Project Mariner agentic browser | Same IP range |
Meta Family
| Token (robots.txt) | Full User-Agent | Purpose | Crawl Rate |
|---|---|---|---|
| meta-externalagent | meta-externalagent/1.1 (+https://developers.facebook.com/docs/sharing/webmasters/crawler) | AI training for Llama models | ~1100 pages/hr |
| Meta-WebIndexer | meta-webindexer/1.1 (+https://developers.facebook.com/docs/sharing/webmasters/crawler) | Meta AI search indexing | <10 pages/hr |
Other Major Players
| Token (robots.txt) | Full User-Agent | Company | Purpose |
|---|---|---|---|
| Bytespider | Mozilla/5.0 (Linux; Android 5.0) AppleWebKit/537.36 (KHTML, like Gecko) Mobile Safari/537.36 (compatible; Bytespider; https://zhanzhang.toutiao.com/) | ByteDance | TikTok AI training |
| Amazonbot | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Amazonbot/0.1; +https://developer.amazon.com/support/amazonbot) Chrome/119.0.6045.214 Safari/537.36 | Amazon | Alexa/AI training |
| Applebot-Extended | Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.4 Safari/605.1.15 (Applebot/0.1; +http://www.apple.com/go/applebot) | Apple | Apple Intelligence training |
| DuckAssistBot | DuckAssistBot/1.2; (+http://duckduckgo.com/duckassistbot.html) | DuckDuckGo | AI answer feature |
| CCBot | CCBot/2.0 (https://commoncrawl.org/faq/) | Common Crawl | Open training dataset |
| cohere-ai | cohere-ai/1.0 | Cohere | Model training |
| AI2Bot | AI2Bot/1.0 | Allen Institute | Semantic Scholar |
| Diffbot | Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1.2) Gecko/20090729 Firefox/3.5.2 (.NET CLR 3.5.30729; Diffbot/0.1; +http://www.diffbot.com) | Diffbot | Structured data extraction |
| MistralAI-User | Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; MistralAI-User/1.0; +https://docs.mistral.ai/robots) | Mistral | Le Chat citations |
| YouBot | YouBot/1.0 | You.com | AI search |
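The tokens in these tables can be matched against an incoming User-Agent header. The following is a minimal sketch, assuming a case-insensitive substring match is good enough for a first pass; `AI_TOKENS` and `classifyUserAgent` are hypothetical names. Google-Extended and Applebot-Extended are left out on purpose: per the tables, the first is a purpose token on Googlebot and the second's listed User-Agent only carries `Applebot/0.1`, so neither shows up as a distinct string in request headers.

```javascript
// Tokens copied from the tables above, lowercased for comparison.
const AI_TOKENS = [
  'gptbot', 'chatgpt-user', 'oai-searchbot',
  'claudebot', 'claude-user', 'claude-searchbot', 'anthropic-ai',
  'perplexitybot', 'perplexity-user',
  'google-cloudvertexbot', 'gemini-deep-research', 'googleagent-mariner',
  'meta-externalagent', 'meta-webindexer',
  'bytespider', 'amazonbot', 'duckassistbot', 'ccbot', 'cohere-ai',
  'ai2bot', 'diffbot', 'mistralai-user', 'youbot',
];

// Return the first matching token, or null for anything unrecognised.
function classifyUserAgent(userAgent) {
  const ua = (userAgent ?? '').toLowerCase();
  return AI_TOKENS.find((token) => ua.includes(token)) ?? null;
}

console.log(classifyUserAgent(
  'Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.3; +https://openai.com/gptbot)'
)); // "gptbot"
```

A match only says what the client claims to be. Pair it with an IP check where ranges are published.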
Hard-to-Identify Crawlers
| Entity | Issue |
|---|---|
| xAI Grok | Documented agents (GrokBot/1.0, xAI-Grok/1.0, Grok-DeepSearch/1.0) rarely seen; reportedly uses iPhone UA strings |
| ChatGPT Atlas | Uses standard Chrome UA — indistinguishable from real users |
| OpenAI Operator | No declared UA; appears as Chrome from remote browser |
| Bing Copilot | No declared crawler UA; uses Bingbot's data |
Part 2: Cloudflare Radar — Real-World Crawl Share (May 2025)
Ranked by % of all AI+search crawl traffic:
| Rank | Bot | Share May '25 | YoY Change |
|---|---|---|---|
| 1 | Googlebot | 50% | +20pp |
| 2 | Bingbot | 8.7% | -1.3pp |
| 3 | GPTBot | 7.7% | +5.5pp (+305% req) |
| 4 | ClaudeBot | 5.4% | -6.3pp |
| 5 | GoogleOther | 4.3% | -0.1pp |
| 6 | Amazonbot | 4.2% | -3.4pp |
| 7 | Googlebot-Image | 3.3% | -1.2pp |
| 8 | Bytespider | 2.9% | -19.8pp (-85%) |
| 9 | Yandex | 2.2% | -0.7pp |
| 10 | ChatGPT-User | 1.3% | +1.2pp (+2825%) |
| 11 | Applebot | 1.2% | -0.7pp |
| 14 | PerplexityBot | 0.2% | +157,490% |
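The YoY column mixes percentage-point changes in share with percentage changes in parentheses, and the two are easy to conflate. As a small worked example, the prior-year share can be recovered by subtracting the percentage-point change from the current share. `priorShare` is a hypothetical helper, and the rows are three entries copied from the table.

```javascript
// Current share (%) and year-over-year change in percentage points.
const rows = [
  { bot: 'GPTBot', share: 7.7, changePp: 5.5 },
  { bot: 'ClaudeBot', share: 5.4, changePp: -6.3 },
  { bot: 'Bytespider', share: 2.9, changePp: -19.8 },
];

// Prior-year share = current share minus the percentage-point change,
// rounded to one decimal to match the table's precision.
function priorShare({ share, changePp }) {
  return Number((share - changePp).toFixed(1));
}

for (const row of rows) {
  console.log(`${row.bot}: ${priorShare(row)}% -> ${row.share}%`);
}
// GPTBot: 2.2% -> 7.7%
// ClaudeBot: 11.7% -> 5.4%
// Bytespider: 22.7% -> 2.9%
```

The parenthesised figures such as +305% presumably describe request volume rather than share, so they cannot be derived from these two columns alone.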
Part 3: Accept Header & Content Negotiation
The Accept: text/markdown Standard
This is the standards-compliant way to serve Markdown to AI agents. HTTP content negotiation (now specified in RFC 9110, and part of HTTP/1.1 since RFC 2068 in 1997) lets clients request a preferred format:
Accept: text/markdown, text/html;q=0.9
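The `q` values in that header are weights, so a server that wants to honour them needs more than a substring test. The following is a minimal sketch of parsing the header and deciding whether Markdown is preferred. `parseAccept` and `prefersMarkdown` are hypothetical names, and the policy (Markdown must be listed explicitly with q > 0 and rank at least as high as text/html; wildcards are ignored) is an assumption chosen to keep browsers on HTML.

```javascript
// Parse an Accept header into [{ type, q }], highest weight first.
function parseAccept(header) {
  return (header ?? '')
    .split(',')
    .map((part) => part.trim())
    .filter(Boolean)
    .map((part) => {
      const [type, ...params] = part.split(';').map((s) => s.trim());
      const qParam = params.find((p) => p.toLowerCase().startsWith('q='));
      const q = qParam ? Number(qParam.slice(2)) : 1; // default weight is 1
      return { type: type.toLowerCase(), q: Number.isNaN(q) ? 0 : q };
    })
    .sort((a, b) => b.q - a.q);
}

// True when text/markdown is listed with q > 0 and is weighted at least
// as high as text/html. Wildcards such as */* do not count.
function prefersMarkdown(header) {
  const entries = parseAccept(header);
  const qOf = (type) => entries.find((e) => e.type === type)?.q ?? 0;
  const markdownQ = qOf('text/markdown');
  return markdownQ > 0 && markdownQ >= qOf('text/html');
}

console.log(prefersMarkdown('text/markdown, text/html;q=0.9')); // true
console.log(prefersMarkdown('text/html, text/markdown;q=0.5')); // false
```

Ignoring wildcards is a design choice: a browser sending `*/*;q=0.8` would technically accept Markdown, but serving it there would surprise human visitors.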
Known Clients That Send Accept: text/markdown
Real-world data from 44 days of measurement (Suganthan, Mar-Apr 2026):
| Client | Requests | Notes |
|---|---|---|
| Chrome Headless (RAG pipelines) | 639 | Automated retrieval, not real browsers |
| Claude infrastructure | 500 | Anthropic's own stack (not ClaudeBot!) |
| axios (node pipelines) | 211 | Common in AI data pipelines |
| curl | 13 | Manual/scripted |
| markdown.new/1.0 | 6 | Dedicated markdown fetcher |
| qodercli/1.0 | 3 | AI coding tool (Qoder) |
| MarkdownWorker/1.0 | 1 | Another dedicated markdown fetcher |
Key finding: Tools like Claude Code, Cursor, and other coding assistants actively send Accept: text/markdown. Claude's own infrastructure made up about 35% of all markdown requests. The ecosystem is growing in real time.
Cloudflare Markdown for Agents
Cloudflare can auto-convert HTML → Markdown on the fly when a client sends Accept: text/markdown. Response includes:
- Content-Type: text/markdown; charset=utf-8
- Vary: Accept (for CDN caching separation)
- x-markdown-tokens (estimated token count)
- Content-Signal: ai-train=yes, search=yes, ai-input=yes
Limitations: Chunked transfer encoding (no Content-Length) or payload >2MB causes silent fallback to HTML.
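Because the fallback is silent, a client or monitoring script has to inspect the response headers to know which variant it received. The following is a minimal sketch, assuming the headers have been copied into a plain object with lowercase keys. `describeResponse` and that object shape are hypothetical; the header names are the ones listed above.

```javascript
// Classify a response from its headers (plain object, lowercase keys).
function describeResponse(headers) {
  const contentType = (headers['content-type'] ?? '').toLowerCase();
  const isMarkdown = contentType.startsWith('text/markdown');
  const tokens = Number.parseInt(headers['x-markdown-tokens'] ?? '', 10);
  return {
    isMarkdown,
    // HTML in reply to a Markdown request indicates the fallback path.
    fellBackToHtml: !isMarkdown && contentType.startsWith('text/html'),
    estimatedTokens: isMarkdown && Number.isFinite(tokens) ? tokens : null,
  };
}

console.log(describeResponse({
  'content-type': 'text/markdown; charset=utf-8',
  'x-markdown-tokens': '1234',
}));
// { isMarkdown: true, fellBackToHtml: false, estimatedTokens: 1234 }
```

With `fetch`, the input object can be built with `Object.fromEntries(response.headers)`, since iterating a `Headers` object yields lowercase names.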
Traditional AI Crawler Accept Headers
Most major AI crawlers do not send Accept: text/markdown yet. They typically send standard browser-like Accept headers:
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
The evidence is clear: major crawlers (GPTBot, ClaudeBot, PerplexityBot, etc.) do not request /llms.txt or .md routes unprompted. But user-initiated AI tools (someone pasting your URL into ChatGPT/Claude) and coding agents (Cursor, Claude Code) do use content negotiation. This is where the value is today.
Part 4: How to Serve Markdown Instead of HTML to AI Clients
Strategy 1: Content Negotiation (Recommended, Standards-Compliant)
Serve Markdown when Accept: text/markdown is present, HTML otherwise:
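// Middleware example. markdownContent and htmlContent are placeholders for
// the two renderings of the requested page, produced elsewhere.
export async function handleRequest(request) {
  const url = new URL(request.url);
  const accept = request.headers.get('accept') ?? '';

  // Serve Markdown to clients that ask for it explicitly.
  if (accept.includes('text/markdown')) {
    return new Response(markdownContent, {
      headers: {
        'Content-Type': 'text/markdown; charset=utf-8',
        'Vary': 'Accept', // keep caches from mixing the two variants
      },
    });
  }

  // Everyone else gets HTML, plus a pointer to the Markdown twin.
  return new Response(htmlContent, {
    headers: {
      'Content-Type': 'text/html; charset=utf-8',
      'Vary': 'Accept',
      'Link': `<${url.pathname}.md>; rel="alternate"; type="text/markdown"`,
    },
  });
}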
Strategy 2: Serve .md Routes
Every page gets a clean Markdown twin at the same URL with .md appended:
/blog/my-post → HTML
/blog/my-post.md → Markdown (Content-Type: text/markdown)
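The mapping between a page and its Markdown twin is simple enough to express as a pair of helpers. The following is a minimal sketch. `toMarkdownPath` and `toHtmlPath` are hypothetical names, and mapping the site root to `/index.md` is an assumption, since appending `.md` to `/` would not yield a usable route.

```javascript
// Map an HTML route to its Markdown twin by appending ".md".
function toMarkdownPath(pathname) {
  const trimmed = pathname.replace(/\/+$/, ''); // drop trailing slashes
  if (trimmed === '') return '/index.md';       // ASSUMPTION: root maps to /index.md
  return trimmed.endsWith('.md') ? trimmed : `${trimmed}.md`;
}

// Map a ".md" route back to the HTML route it mirrors.
function toHtmlPath(pathname) {
  if (pathname === '/index.md') return '/';
  return pathname.endsWith('.md') ? pathname.slice(0, -3) : pathname;
}

console.log(toMarkdownPath('/blog/my-post'));  // "/blog/my-post.md"
console.log(toHtmlPath('/blog/my-post.md'));   // "/blog/my-post"
```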
Strategy 3: llms.txt + llms-full.txt
Place at root (https://yoursite.com/llms.txt):
# Your Site Name
> One-line description for AI systems.
## Blog
- [Post Title](/blog/my-post): Brief description
- [Another Post](/blog/another): Brief description
## Documentation
- [API Reference](/docs/api): Full endpoint docs
Important: Serve llms.txt as direct 200 OK — no redirects. PerplexityBot doesn't follow redirects on this file at all.
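An llms.txt file like the one above is usually generated from the site's own page list rather than maintained by hand. The following is a minimal sketch of such a generator. `buildLlmsTxt` and the shape of its input object are hypothetical, and the blank lines between blocks follow ordinary Markdown convention.

```javascript
// Build llms.txt text from a title, a one-line summary, and link sections.
function buildLlmsTxt({ title, summary, sections }) {
  const lines = [`# ${title}`, '', `> ${summary}`];
  for (const { heading, links } of sections) {
    lines.push('', `## ${heading}`, '');
    for (const { label, href, note } of links) {
      lines.push(`- [${label}](${href}): ${note}`);
    }
  }
  return lines.join('\n') + '\n';
}

console.log(buildLlmsTxt({
  title: 'Your Site Name',
  summary: 'One-line description for AI systems.',
  sections: [
    {
      heading: 'Blog',
      links: [{ label: 'Post Title', href: '/blog/my-post', note: 'Brief description' }],
    },
  ],
}));
```

Writing the result to a static file at the site root keeps the response a plain 200 with no redirect involved.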
Strategy 4: Discovery Signals
HTML <link> tag (in <head>):
<link rel="alternate" type="text/markdown" title="Markdown" href="/blog/my-post.md">
HTTP Link header:
Link: </blog/my-post.md>; rel="alternate"; type="text/markdown"
Hidden div for LLMs (when users paste URLs into chat):
<div class="visually-hidden" aria-hidden="true">
A Markdown version of this page is available at
https://yoursite.com/blog/my-post.md
</div>
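The `<link>` tag and the Link header carry the same information, so both can be derived from one Markdown path. The following is a minimal sketch. `discoverySignals` and `escapeAttr` are hypothetical helpers, and the output strings follow the two examples above.

```javascript
// Escape characters that would break out of an HTML attribute value.
function escapeAttr(value) {
  return value.replace(/&/g, '&amp;').replace(/"/g, '&quot;').replace(/</g, '&lt;');
}

// Produce the <link> tag for <head> and the value of the HTTP Link header.
function discoverySignals(markdownPath) {
  return {
    linkTag: `<link rel="alternate" type="text/markdown" title="Markdown" href="${escapeAttr(markdownPath)}">`,
    linkHeader: `<${encodeURI(markdownPath)}>; rel="alternate"; type="text/markdown"`,
  };
}

const signals = discoverySignals('/blog/my-post.md');
console.log(signals.linkTag);
console.log(signals.linkHeader);
```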
Strategy 5: Server Log Monitoring
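See which AI crawlers are hitting your site. The command assumes the combined log format, where the User-Agent is the last quoted field, and prints client IP, timestamp, request line, and User-Agent. Google-Extended is omitted because it is a robots.txt token that never appears as a User-Agent:
grep -Ei "gptbot|oai-searchbot|chatgpt-user|claudebot|perplexitybot|bingbot|meta-externalagent|amazonbot|bytespider" access.log | awk -F'"' '{split($1, f, " "); print f[1], f[4], $2, $6}' | head -50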
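For a per-crawler count rather than raw lines, the same filter can be expressed as a small script. The following is a minimal sketch, assuming the combined log format where the User-Agent is the last double-quoted field. `TOKENS`, `userAgentOf`, and `tallyCrawlers` are hypothetical names, and the sample log lines are invented for illustration. The token list mirrors the grep command minus google-extended, which the Google table describes as a purpose token rather than a crawler.

```javascript
const TOKENS = [
  'gptbot', 'oai-searchbot', 'chatgpt-user', 'claudebot', 'perplexitybot',
  'bingbot', 'meta-externalagent', 'amazonbot', 'bytespider',
];

// Extract the last double-quoted field of a log line (the User-Agent).
function userAgentOf(line) {
  const quoted = line.match(/"([^"]*)"/g) ?? [];
  return quoted.length ? quoted[quoted.length - 1].slice(1, -1) : '';
}

// Count log lines per crawler token.
function tallyCrawlers(logText) {
  const counts = {};
  for (const line of logText.split('\n')) {
    const ua = userAgentOf(line).toLowerCase();
    const token = TOKENS.find((t) => ua.includes(t));
    if (token) counts[token] = (counts[token] ?? 0) + 1;
  }
  return counts;
}

// Invented sample lines; real input would come from access.log.
const sample = [
  '192.0.2.1 - - [10/May/2026:10:00:00 +0000] "GET /blog HTTP/1.1" 200 512 "-" "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.3; +https://openai.com/gptbot)"',
  '192.0.2.2 - - [10/May/2026:10:00:01 +0000] "GET /docs HTTP/1.1" 200 900 "-" "Mozilla/5.0 (Windows NT 10.0) Chrome/131.0.0.0 Safari/537.36"',
  '192.0.2.1 - - [10/May/2026:10:00:02 +0000] "GET /about HTTP/1.1" 200 300 "-" "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.3; +https://openai.com/gptbot)"',
].join('\n');

console.log(tallyCrawlers(sample)); // { gptbot: 2 }
```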
Redirect Best Practices for AI Crawlers
AI crawlers are less tolerant of redirect chains than Googlebot:
| Crawler | Max Hops Tolerated |
|---|---|
| GPTBot (training) | 5 |
| ClaudeBot (training) | 5 |
| PerplexityBot (indexing) | 5 |
| OAI-SearchBot (real-time) | 3 |
| Claude-SearchBot (real-time) | 3 |
| Perplexity-User (on-demand) | 3 |
| Googlebot (reference) | 10 |
Target: 1 hop maximum for pages you want cited in AI answers.
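To audit a site against these budgets, redirect chains can be walked offline from a list of known redirects. The following is a minimal sketch, assuming the redirects have been exported as a source-to-target map. `MAX_HOPS` copies the table above, while `countHops`, `crawlersThatGiveUp`, and the example paths are hypothetical.

```javascript
// Hop budgets from the table above, keyed by crawler.
const MAX_HOPS = {
  GPTBot: 5, ClaudeBot: 5, PerplexityBot: 5,
  'OAI-SearchBot': 3, 'Claude-SearchBot': 3, 'Perplexity-User': 3,
  Googlebot: 10,
};

// Count redirect hops from `start` using a Map of source -> target.
// Returns Infinity when the chain loops back on itself.
function countHops(start, redirects) {
  const seen = new Set([start]);
  let current = start;
  let hops = 0;
  while (redirects.has(current)) {
    current = redirects.get(current);
    if (seen.has(current)) return Infinity;
    seen.add(current);
    hops += 1;
  }
  return hops;
}

// List the crawlers whose budget a chain of `hops` redirects exceeds.
function crawlersThatGiveUp(hops) {
  return Object.keys(MAX_HOPS).filter((bot) => hops > MAX_HOPS[bot]);
}

const redirects = new Map([['/a', '/b'], ['/b', '/c'], ['/c', '/d'], ['/d', '/e']]);
const hops = countHops('/a', redirects);
console.log(hops);                     // 4
console.log(crawlersThatGiveUp(hops)); // ["OAI-SearchBot", "Claude-SearchBot", "Perplexity-User"]
```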
Part 5: Quick Reference — robots.txt Tokens
# ——— TRAINING CRAWLERS (block if you don't want content used for training) ———
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: anthropic-ai
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: Bytespider
Disallow: /
User-agent: Applebot-Extended
Disallow: /
User-agent: meta-externalagent
Disallow: /
User-agent: cohere-ai
Disallow: /
User-agent: AI2Bot
Disallow: /
# ——— SEARCH / CITATION CRAWLERS (allow if you want AI search visibility) ———
User-agent: OAI-SearchBot
Allow: /
User-agent: PerplexityBot
Allow: /
User-agent: Claude-SearchBot
Allow: /
User-agent: DuckAssistBot
Allow: /
# ——— USER-TRIGGERED AGENTS (allow for chat citations) ———
User-agent: ChatGPT-User
Allow: /
User-agent: Perplexity-User
Allow: /
User-agent: MistralAI-User
Allow: /
User-agent: Claude-User
Allow: /
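A file with this many near-identical groups is easier to keep consistent when generated from two token lists. The following is a minimal sketch. `buildRobotsTxt` and its `block`/`allow` parameters are hypothetical, and it only emits whole-site rules like the ones above.

```javascript
// Emit one "User-agent" group per token, blocked tokens first.
function buildRobotsTxt({ block = [], allow = [] }) {
  const group = (token, rule) => `User-agent: ${token}\n${rule}: /`;
  return [
    ...block.map((token) => group(token, 'Disallow')),
    ...allow.map((token) => group(token, 'Allow')),
  ].join('\n\n') + '\n';
}

console.log(buildRobotsTxt({
  block: ['GPTBot', 'ClaudeBot', 'CCBot'],
  allow: ['OAI-SearchBot', 'ChatGPT-User'],
}));
```

Keeping training crawlers and search or user-triggered agents in separate lists mirrors the three-way split in the reference above.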
Sources
- Search Engine Journal — AI Crawler User-Agent List (Dec 2025)
- Cloudflare Radar — From Googlebot to GPTBot (Jul 2025)
- Evil Martians — How to Make Your Website Visible to LLMs
- CaptainDNS — AI Crawlers Redirects Handling
- Cloudflare Docs — Markdown for Agents
- Suganthan — 44 Days of Markdown-for-Agents Data (Apr 2026)
- Momentic Marketing — AI Crawlers List (Nov 2025)
- ai.robots.txt — GitHub Community List
- llmstxt.org — Specification