
Cloudflare /crawl Endpoint: One API Call to Crawl Any Website

9 min read
Bart Waardenburg

AI Agent Readiness Expert & Founder

Cloudflare just launched a new /crawl endpoint for its Browser Rendering service. One POST request, one URL, and Cloudflare crawls your entire site — returning HTML, Markdown, or AI-extracted structured JSON. It's in open beta as of March 10, 2026, available on both free and paid Workers plans.

For AI agent readiness, this is a big deal. Cloudflare is building the infrastructure that makes it trivially easy for anyone to build crawling agents. Your site's machine-readability just became testable at scale.

How the /crawl Endpoint Works

The endpoint is asynchronous. You submit a starting URL, get back a job ID, and poll for results as pages complete. It's designed for full-site crawls, not single-page fetches.

# 1. Start a crawl
curl -X POST \
  https://api.cloudflare.com/client/v4/accounts/{account_id}/browser-rendering/crawl \
  -H "Authorization: Bearer {token}" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com", "limit": 100, "formats": ["markdown"]}'

# Response: {"success": true, "result": "c7f8s2d9-a8e7-..." }

# 2. Poll for results
curl https://api.cloudflare.com/client/v4/accounts/{account_id}/browser-rendering/crawl/{job_id}
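In practice you'll want a poll loop with a timeout rather than hammering the job endpoint by hand. Here's a minimal sketch in Python. The `fetch_status` callable stands in for the GET request above, and the `"status"` field with values `"running"`/`"complete"` is a placeholder assumption, not the documented Cloudflare response schema — check the actual poll response before relying on it.

```python
import time
from typing import Callable

def wait_for_crawl(fetch_status: Callable[[], dict],
                   interval: float = 5.0,
                   timeout: float = 300.0) -> dict:
    """Poll a crawl job until it reports completion or the timeout passes.

    fetch_status is any callable that GETs the job endpoint and returns
    the parsed JSON body. The "status" field and its values are
    assumptions -- verify against the real /crawl poll response.
    """
    deadline = time.monotonic() + timeout
    while True:
        body = fetch_status()
        if body.get("status") == "complete":
            return body
        if time.monotonic() > deadline:
            raise TimeoutError("crawl job did not finish in time")
        time.sleep(interval)

# Example with a fake fetcher that completes on the third poll:
responses = iter([{"status": "running"},
                  {"status": "running"},
                  {"status": "complete", "pages": 42}])
result = wait_for_crawl(lambda: next(responses), interval=0.01)
```

Injecting the fetcher as a callable keeps the retry logic testable without network access; in production it would wrap the authenticated GET shown above.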

The configuration options are where it gets interesting for agent readiness:

  • Max pages per crawl: 100,000
  • Output formats: 3
  • Max job runtime: 7 days
  • Result retention: 14 days

Three Output Formats — And Why Markdown Matters

The /crawl endpoint supports three output formats, each serving a different use case:

HTML

Raw page HTML including all markup. Useful for traditional scraping, archiving, or when you need the full DOM structure.

Markdown

Clean content stripped of navigation, headers, and boilerplate. Ideal for AI agents and LLM context windows — saves up to 80% of tokens.

JSON (AI-extracted)

Structured data extracted by Workers AI using a custom prompt. Define your schema and let the model extract exactly the fields you need.

The Markdown option is the most relevant for AI agent readiness. When an agent crawls your site with formats: ["markdown"], it gets clean content that fits efficiently into an LLM's context window. Sites with clear semantic HTML, proper heading hierarchy, and meaningful content structure will produce better markdown output than sites heavy on JavaScript-rendered widgets and nested divs.

The JSON format takes it further — using Workers AI to extract structured data with a custom prompt and schema. This is essentially automated structured data extraction at crawl scale:

{
  "url": "https://example.com/products",
  "limit": 500,
  "formats": ["json"],
  "jsonOptions": {
    "prompt": "Extract product name, price, and availability",
    "response_format": {
      "type": "json_schema",
      "json_schema": {
        "name": "product",
        "properties": {
          "name": "string",
          "price": "number",
          "inStock": "boolean"
        }
      }
    }
  }
}

Crawl Scope and Discovery

The endpoint offers fine-grained control over what gets crawled and how URLs are discovered:

Parameter        Default  Purpose
limit            10       Max pages to crawl (up to 100,000)
depth            100,000  Max link depth from the starting URL
source           all      Discovery method: sitemaps, links, or all
render           true     Execute JavaScript (false = fast static HTML fetch)
includePatterns  (none)   Wildcard patterns to include (e.g. /blog/**)
excludePatterns  (none)   Patterns to skip (takes priority over includes)

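Putting those options together, a tightly scoped crawl request body might look like the sketch below. The parameter names come from the table above; the values are illustrative, and whether the pattern fields accept arrays is an assumption to verify against the API reference:

```json
{
  "url": "https://example.com",
  "limit": 1000,
  "depth": 3,
  "source": "sitemaps",
  "render": false,
  "formats": ["markdown"],
  "includePatterns": ["/blog/**"],
  "excludePatterns": ["/blog/drafts/**"]
}
```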
The source parameter is particularly telling. Setting it to sitemaps means the crawler discovers pages through your XML sitemaps only — exactly how search engine crawlers work. Sites with comprehensive, up-to-date sitemaps will be crawled more completely. Sites without sitemaps fall back to link discovery, which may miss orphaned pages.

The render toggle is equally important. Setting render: false skips the headless browser entirely and fetches static HTML. This is faster and cheaper, but it means JavaScript-rendered content is invisible. Sites relying on client-side rendering for their main content will return empty pages in static mode. Server-side rendered sites work perfectly.

robots.txt Is Your First Line of Defense — And Your Biggest Opportunity

The /crawl endpoint respects robots.txt fully, including crawl-delay directives. URLs blocked by robots.txt appear in results with "status": "disallowed". This means:

  • If you block AI crawlers, the /crawl endpoint won't access those pages. You control what gets indexed
  • If your robots.txt is misconfigured, you might be blocking legitimate agent access without knowing it. Many sites accidentally block all bots to prevent training data scraping, losing agent visibility in the process
  • If you set crawl-delay, the endpoint honors it. This gives you rate control over automated access
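You can check the second bullet yourself before a crawler does. Python's standard `urllib.robotparser` applies the same user-agent group matching that compliant crawlers use. The bot names below are illustrative; swap in the agents you actually care about:

```python
from urllib import robotparser

# A robots.txt that blocks one training bot while keeping access
# (and a crawl-delay) for every other compliant crawler.
robots_txt = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Crawl-delay: 2
Allow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

blocked = rp.can_fetch("GPTBot", "https://example.com/blog/post")       # False
allowed = rp.can_fetch("SomeCrawler", "https://example.com/blog/post")  # True
delay = rp.crawl_delay("SomeCrawler")                                   # 2
```

Running this against your live robots.txt (via `rp.set_url(...)` and `rp.read()`) is a quick way to catch an accidental blanket `Disallow: /` before it costs you agent visibility.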

This is the first major crawling service that explicitly operates as a signed bot — it identifies itself as automated and cannot bypass bot detection, CAPTCHAs, or Cloudflare protection. It's the kind of compliant crawler that robots.txt was designed for.

What This Means for AI Agent Readiness

Cloudflare's /crawl endpoint is infrastructure, not an agent itself. But it dramatically lowers the barrier for building agent systems that need to understand entire websites. Here's why this matters:

Democratized crawling

Anyone with a Cloudflare account can now crawl up to 100,000 pages with one API call. Building a RAG pipeline, knowledge base, or competitive analysis tool just got trivial. Your site will be crawled.

Markdown as default

The markdown output option signals that clean, structured content is the expected format for AI consumption. Sites with good semantic HTML produce better markdown automatically.

Structured data extraction

The AI-powered JSON extraction means your content's structure directly affects what data can be extracted. Schema.org markup, clear headings, and consistent patterns make extraction more accurate.

Compliant by design

Unlike scraping libraries, this is a bot that respects robots.txt and crawl-delay. The agent readiness signals you set up — crawler directives, rate limiting, access policies — actually work here.

Agent Readiness Checklist for the /crawl Era

With crawling infrastructure this accessible, here's what to prioritize:

  1. Audit your robots.txt. Make sure you're not accidentally blocking compliant AI crawlers. Block training bots if you want, but keep agent access open
  2. Maintain your sitemap. The source: "sitemaps" option means your sitemap is a direct input for how completely your site gets crawled
  3. Use server-side rendering. The render: false option is faster and cheaper. Sites that work without JavaScript will be crawled more efficiently
  4. Improve semantic HTML. Clean heading hierarchy, proper landmarks, descriptive link text — all of this produces better markdown output when your site is crawled
  5. Add structured data. JSON-LD and Schema.org types help both the HTML-to-markdown conversion and AI-powered JSON extraction produce accurate results
  6. Serve llms.txt. While the /crawl endpoint uses sitemaps and links for discovery, agents that consume the crawled content often start with llms.txt to understand what a site offers
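For item 6, an llms.txt file (per the llmstxt.org proposal) is just Markdown served at your site root: an H1 title, a one-line blockquote summary, then sections of annotated links. The file below is an illustrative sketch, not a required format:

```markdown
# Example Site

> Developer tooling for auditing AI agent readiness.

## Docs

- [Getting started](https://example.com/docs/start.md): install and run a first scan
- [API reference](https://example.com/docs/api.md): endpoints and authentication

## Optional

- [Changelog](https://example.com/changelog.md)
```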

The Bigger Picture

Cloudflare's /crawl endpoint joins a pattern of major platforms investing in agent infrastructure. Cloudflare already offers MCP integration through its Workers AI platform, and supports Playwright MCP for browser automation. The /crawl endpoint adds the missing piece: scalable, compliant, full-site crawling with AI-native output formats.

We're moving from a web where crawlers scraped HTML for search indexes to one where agents crawl for understanding. The output isn't a search ranking — it's a knowledge base, a RAG pipeline, a structured dataset. Sites that are already optimized for machine readability will naturally produce better results in this new paradigm.

The question isn't whether your site will be crawled by AI-powered tools. It's whether the output will accurately represent what your site offers.

