
AI Crawlers Ignore llms.txt — But AI Agents Don't

9 min read
Bart Waardenburg

AI Agent Readiness Expert & Founder

Dries Buytaert, founder of Drupal, recently published a data-driven analysis of llms.txt and markdown adoption by AI crawlers. His conclusion: zero AI crawlers accessed his llms.txt file, markdown pages increased total crawl traffic by 7%, and no crawler used HTTP content negotiation. He called llms.txt "a solution looking for a problem."

The data is solid. The conclusion is wrong, because he measured the wrong thing.

What the Data Actually Shows

Dries analyzed his Cloudflare logs after making all his pages available as markdown files. The findings are worth taking seriously:

  • AI crawlers accessing llms.txt: 0
  • Crawl traffic increase from .md pages: +7%
  • Crawlers using content negotiation: 0
  • Pages crawled per citation sent back: 1,241

Across Acquia's entire hosting infrastructure, one of the largest Drupal hosting platforms, llms.txt represented just 0.001% of 400 million requests. All 52 requests to llms.txt came from SEO audit tools, not AI systems.

Leon Furze ran a similar experiment on his WordPress blog. Same result: markdown and HTML pages crawled at roughly the same rate, no measurable traffic difference, and llms.txt made no visible impact on crawler behavior.

The data is clear: AI crawlers don't use llms.txt. But that's like measuring how many trucks use your bike lane and concluding bike lanes are useless.

Crawlers and Agents Are Fundamentally Different

Dries' analysis has a blind spot: it looks at only one half of the equation. Crawling for training data is not the only way AI systems interact with web content. The distinction that matters:

|                  | AI Crawlers                              | AI Agents                                      |
|------------------|------------------------------------------|------------------------------------------------|
| Purpose          | Scrape content for training data         | Complete a task for a specific user            |
| Behavior         | Mass crawl, grab everything              | Targeted fetch, get what's needed              |
| Token efficiency | Irrelevant: data is preprocessed offline | Critical: every token costs time and money     |
| Content format   | HTML is fine, they strip it anyway       | Markdown saves 80% of tokens                   |
| Discovery        | Sitemaps, link crawling                  | llms.txt, content negotiation, tool manifests  |
| Examples         | GPTBot, ClaudeBot, Google-Extended       | Claude Code, Cursor, Windsurf, Bun             |

AI crawlers are built to hoover up the web. They have established pipelines optimized for HTML scraping, built years ago. They'd be silly to change that setup just because a few sites now offer raw markdown.

AI agents are the opposite. They fetch specific pages to solve a specific task, and every token counts. A blog post that's 20% content and 80% navigation HTML? Wasteful. Markdown and llms.txt solve that problem directly.

Coding Agents Are Already Using These Standards

Look beyond crawler logs and there's already concrete agent-side adoption happening:

Claude Code

Anthropic's coding agent sends Accept headers that prefer markdown when fetching documentation. It also looks for llms.txt to discover relevant content on a site.
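Conceptually, a markdown-preferring fetch looks like the exchange below. This is an illustrative sketch of HTTP content negotiation, not a capture of Claude Code's actual headers; the path, host, and q-values are placeholders:

```http
GET /docs/getting-started HTTP/1.1
Host: docs.example.com
Accept: text/markdown;q=1.0, text/html;q=0.8

HTTP/1.1 200 OK
Content-Type: text/markdown; charset=utf-8
Vary: Accept
```

A server that doesn't understand the markdown preference simply serves HTML as usual, which is what makes this pattern safe to adopt incrementally.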

Bun

The JavaScript runtime started sending content negotiation headers when fetching documentation pages, preferring markdown when available.

Cursor & Windsurf

AI-powered code editors fetch documentation to help developers. They benefit directly from markdown versions that preserve structure without HTML noise.

Cloudflare

Now offers content negotiation and markdown transformation in its paid tiers — a clear signal that platform providers see demand from the agent side.

Some documentation platforms have already started putting "agent directives" on pages pointing agents to llms.txt for content discovery. The pattern is clear: content negotiation and llms.txt adoption is being driven by the agentic developer tooling space. Not by the training pipeline.

Adoption Is Industry-Specific

Another factor Dries' analysis misses: llms.txt and markdown adoption is heavily skewed toward developer documentation. Dries runs a personal blog, not a docs site. The use case is different.

Developer documentation is where coding agents spend most of their time. When Claude Code needs to understand a library API, or Cursor needs to look up a framework's configuration options, they're fetching documentation pages. Exactly the pages where:

  • Markdown versions save the most tokens (docs pages are heavy on navigation and sidebars)
  • llms.txt provides a curated entry point to the most relevant pages
  • Content negotiation allows agents to get clean content without the UI chrome

Vercel, Cloudflare, Stripe, and other developer-facing companies have already implemented these standards. The Vercel State of AEO report explicitly recommends llms.txt as part of a comprehensive AI visibility strategy. Vercel even built AEO tracking for coding agents to measure this adoption.

Why Crawlers Will Probably Never Use llms.txt

Understanding why crawlers ignore llms.txt makes the distinction even clearer:

  • Scale economics. Crawlers process billions of pages. Adding a curated discovery step per domain adds complexity for minimal gain. They already have sitemaps and link graphs
  • Training incentives. More data is better for training. A curated llms.txt that points to 20 key pages is the opposite of what a training pipeline wants
  • Existing infrastructure. HTML scraping pipelines are mature and battle-tested. There's no business case to rebuild them for markdown
  • Content control concerns. Why would they bother with a curated list? They get more context if they take everything. The incentives are misaligned

This is not a failure of llms.txt. It's confirmation that llms.txt was never meant for crawlers in the first place.

Readiness Is Not About Today's ROI

Dries' article concludes with practical advice: focus on "clear writing, authoritative content, and timely publishing" rather than llms.txt. That advice isn't wrong. But it's incomplete.

The same argument was made about mobile optimization in 2010, about HTTPS in 2014, and about structured data in 2018. Every time, early adopters who invested before the wave hit were rewarded when adoption tipped. The sites that waited got to scramble.

The agent ecosystem is growing fast. Coding agents are becoming the default way developers interact with documentation, and AI-powered browsing agents like ChatGPT Search and Claude Search are maturing. Sites that are already machine-readable will have a structural advantage.

What You Should Actually Implement

Based on where agent adoption actually is, not where crawler adoption is, here's what matters:

1. llms.txt

Create a curated entry point for agents. List your most important pages with brief descriptions. Low effort, high signal for any agent that looks for it.
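A minimal llms.txt, following the format from the llms.txt proposal (an H1 title, a blockquote summary, and H2 sections of annotated links). The URLs and descriptions below are placeholders:

```markdown
# Example Project

> Short one-line description of what this site or product does.

## Docs

- [Getting Started](https://example.com/docs/getting-started.md): Installation and first steps
- [API Reference](https://example.com/docs/api.md): Complete endpoint reference

## Optional

- [Changelog](https://example.com/changelog.md): Release history
```

The file lives at the site root (`/llms.txt`), and linking to markdown versions of each page gives agents the token-efficient variant directly.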

2. Content Negotiation

Serve markdown when agents request it via Accept headers. Cloudflare offers this out of the box. Saves agents 80% of token overhead.
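The server-side decision reduces to comparing q-values in the Accept header. A minimal sketch in Python, assuming a simplified q-value parse (a production implementation should use a full Accept-header parser per RFC 9110):

```python
def prefers_markdown(accept_header: str) -> bool:
    """Return True if the client ranks text/markdown at least as
    high as text/html (simplified q-value parsing)."""
    weights = {}
    for part in accept_header.split(","):
        fields = part.strip().split(";")
        media = fields[0].strip()
        q = 1.0  # default quality per the HTTP spec
        for param in fields[1:]:
            name, _, value = param.strip().partition("=")
            if name == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0
        weights[media] = max(q, weights.get(media, 0.0))
    md = weights.get("text/markdown", weights.get("*/*", 0.0))
    html = weights.get("text/html", weights.get("*/*", 0.0))
    return md > 0 and md >= html

# An agent that prefers markdown:
print(prefers_markdown("text/markdown, text/html;q=0.8"))  # True
# A typical browser header:
print(prefers_markdown("text/html,application/xhtml+xml;q=0.9,*/*;q=0.8"))  # False
```

When this returns True, serve the `.md` variant with `Content-Type: text/markdown` and a `Vary: Accept` header so caches keep the two representations apart.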

3. Structured Data

JSON-LD, Schema.org types, and FAQPage schema help both crawlers and agents understand your content. This is table stakes: structured data correlates with an 8x visibility difference for ChatGPT.
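A minimal FAQPage block in JSON-LD, embedded in the page head; the question and answer text here are placeholders:

```html
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "FAQPage",
  "mainEntity": [{
    "@type": "Question",
    "name": "What does this product do?",
    "acceptedAnswer": {
      "@type": "Answer",
      "text": "A one-paragraph answer an agent can quote directly."
    }
  }]
}
</script>
```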

4. Crawler Access

Allow AI crawlers in robots.txt. Block training bots if you want, but keep search bots open. This is the baseline. No access means no visibility.
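An illustrative robots.txt sketch of that split, blocking one training bot while leaving search bots and everything else open. The user-agent tokens shown are examples; check each vendor's current documentation before relying on them:

```text
# Block a training crawler (example policy)
User-agent: GPTBot
Disallow: /

# Keep search-oriented bots open
User-agent: OAI-SearchBot
Allow: /

# Default: allow everything else
User-agent: *
Allow: /
```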

The first two are agent-specific. The last two help both crawlers and agents. Together, they cover the full spectrum of how AI systems interact with your content.

The Bottom Line

Dries' data is accurate: AI crawlers don't use llms.txt. But measuring llms.txt adoption by crawler behavior is like measuring the success of an API by how many web browsers access it. The audience is different.

AI agents (coding assistants, browsing agents, task automation tools) are the actual consumers of llms.txt and content negotiation. They're smaller in volume than crawlers but growing fast. They represent the future of how software interacts with web content.

"Do AI crawlers use llms.txt today?" is the wrong question. The right one: when agents become the primary way users interact with your content, will your site be ready?
