
Cloudflare /crawl Endpoint: One API Call to Crawl Any Website

9 min read
Bart Waardenburg

AI Agent Readiness Expert & Founder

Cloudflare just launched a new /crawl endpoint for its Browser Rendering service. One POST request, one URL, and Cloudflare crawls your entire site — returning HTML, Markdown, or AI-extracted structured JSON. It's in open beta as of March 10, 2026, available on both free and paid Workers plans.

For AI agent readiness, this is a big deal. Cloudflare is building the infrastructure that makes it trivially easy for anyone to build crawling agents. Your site's machine-readability just became testable at scale.

How the /crawl Endpoint Works

The endpoint is asynchronous. You submit a starting URL, get back a job ID, and poll for results as pages complete. It's designed for full-site crawls, not single-page fetches.

# 1. Start a crawl
curl -X POST \
  https://api.cloudflare.com/client/v4/accounts/{account_id}/browser-rendering/crawl \
  -H "Authorization: Bearer {token}" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com", "limit": 100, "formats": ["markdown"]}'

# Response: {"success": true, "result": "c7f8s2d9-a8e7-..." }

# 2. Poll for results
curl https://api.cloudflare.com/client/v4/accounts/{account_id}/browser-rendering/crawl/{job_id}
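In practice you'll want a poll loop with a timeout rather than hammering the job endpoint by hand. Here's a minimal sketch in Python. The `fetch_status` callable stands in for the GET request above, and the `"status"` field with values `"running"`/`"complete"` is a placeholder assumption, not the documented Cloudflare response schema — check the actual poll response before relying on it.

```python
import time
from typing import Callable

def wait_for_crawl(fetch_status: Callable[[], dict],
                   interval: float = 5.0,
                   timeout: float = 300.0) -> dict:
    """Poll a crawl job until it reports completion or the timeout passes.

    fetch_status is any callable that GETs the job endpoint and returns
    the parsed JSON body. The "status" field and its values are
    assumptions -- verify against the real /crawl poll response.
    """
    deadline = time.monotonic() + timeout
    while True:
        body = fetch_status()
        if body.get("status") == "complete":
            return body
        if time.monotonic() > deadline:
            raise TimeoutError("crawl job did not finish in time")
        time.sleep(interval)

# Example with a fake fetcher that completes on the third poll:
responses = iter([{"status": "running"},
                  {"status": "running"},
                  {"status": "complete", "pages": 42}])
result = wait_for_crawl(lambda: next(responses), interval=0.01)
```

Injecting the fetcher as a callable keeps the retry logic testable without network access; in production it would wrap the authenticated GET shown above.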

The configuration options are where it gets interesting for agent readiness:

  • Max pages per crawl: 100,000
  • Output formats: 3
  • Max job runtime: 7 days
  • Result retention: 14 days

Three Output Formats — And Why Markdown Matters

The /crawl endpoint supports three output formats, each serving a different use case:

HTML

Raw page HTML including all markup. Useful for traditional scraping, archiving, or when you need the full DOM structure.

Markdown

Clean content stripped of navigation, headers, and boilerplate. Ideal for AI agents and LLM context windows — saves up to 80% of tokens.

JSON (AI-extracted)

Structured data extracted by Workers AI using a custom prompt. Define your schema and let the model extract exactly the fields you need.

The Markdown option is the most relevant for AI agent readiness. When an agent crawls your site with formats: ["markdown"], it gets clean content that fits efficiently into an LLM's context window. Sites with clear semantic HTML, proper heading hierarchy, and meaningful content structure will produce better markdown output than sites heavy on JavaScript-rendered widgets and nested divs.

The JSON format takes it further — using Workers AI to extract structured data with a custom prompt and schema. This is essentially automated structured data extraction at crawl scale:

{
  "url": "https://example.com/products",
  "limit": 500,
  "formats": ["json"],
  "jsonOptions": {
    "prompt": "Extract product name, price, and availability",
    "response_format": {
      "type": "json_schema",
      "json_schema": {
        "name": "product",
        "properties": {
          "name": "string",
          "price": "number",
          "inStock": "boolean"
        }
      }
    }
  }
}

Crawl Scope and Discovery

The endpoint offers fine-grained control over what gets crawled and how URLs are discovered:

Parameter        Default  Purpose
limit            10       Max pages to crawl (up to 100,000)
depth            100,000  Max link depth from the starting URL
source           all      Discovery method: sitemaps, links, or all
render           true     Execute JavaScript (false = fast static HTML fetch)
includePatterns  (none)   Wildcard patterns to include (e.g. /blog/**)
excludePatterns  (none)   Patterns to skip (takes priority over includes)

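Putting those options together, a tightly scoped crawl request body might look like the sketch below. The parameter names come from the table above; the values are illustrative, and whether the pattern fields accept arrays is an assumption to verify against the API reference:

```json
{
  "url": "https://example.com",
  "limit": 1000,
  "depth": 3,
  "source": "sitemaps",
  "render": false,
  "formats": ["markdown"],
  "includePatterns": ["/blog/**"],
  "excludePatterns": ["/blog/drafts/**"]
}
```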
The source parameter is particularly telling. Setting it to sitemaps means the crawler discovers pages through your XML sitemaps only — exactly how search engine crawlers work. Sites with comprehensive, up-to-date sitemaps will be crawled more completely. Sites without sitemaps fall back to link discovery, which may miss orphaned pages.

The render toggle is equally important. Setting render: false skips the headless browser entirely and fetches static HTML. This is faster and cheaper, but it means JavaScript-rendered content is invisible. Sites relying on client-side rendering for their main content will return empty pages in static mode. Server-side rendered sites work perfectly.

robots.txt Is Your First Line of Defense — And Your Biggest Opportunity

The /crawl endpoint respects robots.txt fully, including crawl-delay directives. URLs blocked by robots.txt appear in results with "status": "disallowed". This means:

  • If you block AI crawlers, the /crawl endpoint won't access those pages. You control what gets indexed
  • If your robots.txt is misconfigured, you might be blocking legitimate agent access without knowing it. Many sites accidentally block all bots to prevent training data scraping, losing agent visibility in the process
  • If you set crawl-delay, the endpoint honors it. This gives you rate control over automated access
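You can check the second bullet yourself before a crawler does. Python's standard `urllib.robotparser` applies the same user-agent group matching that compliant crawlers use. The bot names below are illustrative; swap in the agents you actually care about:

```python
from urllib import robotparser

# A robots.txt that blocks one training bot while keeping access
# (and a crawl-delay) for every other compliant crawler.
robots_txt = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Crawl-delay: 2
Allow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

blocked = rp.can_fetch("GPTBot", "https://example.com/blog/post")       # False
allowed = rp.can_fetch("SomeCrawler", "https://example.com/blog/post")  # True
delay = rp.crawl_delay("SomeCrawler")                                   # 2
```

Running this against your live robots.txt (via `rp.set_url(...)` and `rp.read()`) is a quick way to catch an accidental blanket `Disallow: /` before it costs you agent visibility.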

This is the first major crawling service that explicitly operates as a signed bot — it identifies itself as automated and cannot bypass bot detection, CAPTCHAs, or Cloudflare protection. It's the kind of compliant crawler that robots.txt was designed for.

What This Means for AI Agent Readiness

Cloudflare's /crawl endpoint is infrastructure, not an agent itself. But it dramatically lowers the barrier for building agent systems that need to understand entire websites. Here's why this matters:

Democratized crawling

Anyone with a Cloudflare account can now crawl up to 100,000 pages with one API call. Building a RAG pipeline, knowledge base, or competitive analysis tool just got trivial. Your site will be crawled.

Markdown as default

The markdown output option signals that clean, structured content is the expected format for AI consumption. Sites with good semantic HTML produce better markdown automatically.

Structured data extraction

The AI-powered JSON extraction means your content's structure directly affects what data can be extracted. Schema.org markup, clear headings, and consistent patterns make extraction more accurate.

Compliant by design

Unlike scraping libraries, this is a bot that respects robots.txt and crawl-delay. The agent readiness signals you set up — crawler directives, rate limiting, access policies — actually work here.

Agent Readiness Checklist for the /crawl Era

With crawling infrastructure this accessible, here's what to prioritize:

  1. Audit your robots.txt. Make sure you're not accidentally blocking compliant AI crawlers. Block training bots if you want, but keep agent access open
  2. Maintain your sitemap. The source: "sitemaps" option means your sitemap is a direct input for how completely your site gets crawled
  3. Use server-side rendering. The render: false option is faster and cheaper. Sites that work without JavaScript will be crawled more efficiently
  4. Improve semantic HTML. Clean heading hierarchy, proper landmarks, descriptive link text — all of this produces better markdown output when your site is crawled
  5. Add structured data. JSON-LD and Schema.org types help both the HTML-to-markdown conversion and AI-powered JSON extraction produce accurate results
  6. Serve llms.txt. While the /crawl endpoint uses sitemaps and links for discovery, agents that consume the crawled content often start with llms.txt to understand what a site offers
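For item 6, an llms.txt file (per the llmstxt.org proposal) is just Markdown served at your site root: an H1 title, a one-line blockquote summary, then sections of annotated links. The file below is an illustrative sketch, not a required format:

```markdown
# Example Site

> Developer tooling for auditing AI agent readiness.

## Docs

- [Getting started](https://example.com/docs/start.md): install and run a first scan
- [API reference](https://example.com/docs/api.md): endpoints and authentication

## Optional

- [Changelog](https://example.com/changelog.md)
```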

The Bigger Picture

Cloudflare's /crawl endpoint joins a pattern of major platforms investing in agent infrastructure. Cloudflare already offers MCP integration through its Workers AI platform, and supports Playwright MCP for browser automation. The /crawl endpoint adds the missing piece: scalable, compliant, full-site crawling with AI-native output formats.

We're moving from a web where crawlers scraped HTML for search indexes to one where agents crawl for understanding. The output isn't a search ranking — it's a knowledge base, a RAG pipeline, a structured dataset. Sites that are already optimized for machine readability will naturally produce better results in this new paradigm.

The question isn't whether your site will be crawled by AI-powered tools. It's whether the output will accurately represent what your site offers.

