Cloudflare /crawl Endpoint: One API Call to Crawl Any Website
Cloudflare just launched a new /crawl endpoint for its Browser Rendering service. One POST request, one URL, and Cloudflare crawls your entire site — returning HTML, Markdown, or AI-extracted structured JSON. It's in open beta as of March 10, 2026, available on both free and paid Workers plans.
For AI agent readiness, this is a big deal. Cloudflare is building the infrastructure that makes it trivially easy for anyone to build crawling agents. Your site's machine-readability just became testable at scale.
How the /crawl Endpoint Works
The endpoint is asynchronous. You submit a starting URL, get back a job ID, and poll for results as pages complete. It's designed for full-site crawls, not single-page fetches.
# 1. Start a crawl
curl -X POST \
https://api.cloudflare.com/client/v4/accounts/{account_id}/browser-rendering/crawl \
-H "Authorization: Bearer {token}" \
-H "Content-Type: application/json" \
-d '{"url": "https://example.com", "limit": 100, "formats": ["markdown"]}'
# Response: {"success": true, "result": "c7f8s2d9-a8e7-..." }
# 2. Poll for results
curl https://api.cloudflare.com/client/v4/accounts/{account_id}/browser-rendering/crawl/{job_id}
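The submit-then-poll flow above can be sketched in Python. The helper names are hypothetical, the HTTP calls are injected so the sketch stays self-contained, and the shape of the poll response (a `status` field on `result`) is an assumption beyond what the curl examples show:

```python
import time

API_BASE = "https://api.cloudflare.com/client/v4/accounts/{account_id}/browser-rendering"

def start_crawl(http_post, account_id, url, limit=100, formats=("markdown",)):
    """Submit a crawl job and return the job ID from the response envelope."""
    resp = http_post(
        API_BASE.format(account_id=account_id) + "/crawl",
        {"url": url, "limit": limit, "formats": list(formats)},
    )
    return resp["result"]  # the job ID, per the curl example above

def wait_for_crawl(http_get, account_id, job_id, interval=5.0, max_polls=120):
    """Poll the job endpoint until it reports a terminal status."""
    status_url = API_BASE.format(account_id=account_id) + f"/crawl/{job_id}"
    for _ in range(max_polls):
        result = http_get(status_url)["result"]
        if result.get("status") in ("completed", "failed"):
            return result
        time.sleep(interval)
    raise TimeoutError(f"crawl {job_id} did not finish after {max_polls} polls")
```

In practice you would pass thin wrappers around your HTTP client (e.g. `requests.post`/`requests.get` with the bearer token header) as `http_post` and `http_get`.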
The configuration options are where things get interesting for agent readiness.
Three Output Formats — And Why Markdown Matters
The /crawl endpoint supports three output formats, each serving a different use case:
HTML
Raw page HTML including all markup. Useful for traditional scraping, archiving, or when you need the full DOM structure.
Markdown
Clean content stripped of navigation, headers, and boilerplate. Ideal for AI agents and LLM context windows — saves up to 80% of tokens.
JSON (AI-extracted)
Structured data extracted by Workers AI using a custom prompt. Define your schema and let the model extract exactly the fields you need.
The Markdown option is the most relevant for AI agent readiness. When an agent crawls your site with formats: ["markdown"], it gets clean content that fits efficiently into an LLM's context window. Sites with clear semantic HTML, proper heading hierarchy, and meaningful content structure will produce better markdown output than sites heavy on JavaScript-rendered widgets and nested divs.
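A markdown-only crawl request is just the body from the curl example above with the formats list narrowed. The field names (`url`, `limit`, `formats`) come from that example; the validation here is purely illustrative:

```python
import json

KNOWN_FORMATS = {"html", "markdown", "json"}  # the three formats described above

def markdown_crawl_body(start_url, limit=100):
    """Build a request body for a markdown-only crawl."""
    body = {"url": start_url, "limit": limit, "formats": ["markdown"]}
    if not set(body["formats"]) <= KNOWN_FORMATS:
        raise ValueError("unknown output format")
    return json.dumps(body)
```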
The JSON format takes it further — using Workers AI to extract structured data with a custom prompt and schema. This is essentially automated structured data extraction at crawl scale:
{
  "url": "https://example.com/products",
  "limit": 500,
  "formats": ["json"],
  "jsonOptions": {
    "prompt": "Extract product name, price, and availability",
    "response_format": {
      "type": "json_schema",
      "json_schema": {
        "name": "product",
        "properties": {
          "name": "string",
          "price": "number",
          "inStock": "boolean"
        }
      }
    }
  }
}
Crawl Scope and Discovery
The endpoint offers fine-grained control over what gets crawled and how URLs are discovered:
| Parameter | Default | Purpose |
|---|---|---|
| `limit` | 10 | Max pages to crawl (up to 100,000) |
| `depth` | 100,000 | Max link depth from starting URL |
| `source` | `all` | Discovery method: `sitemaps`, `links`, or `all` |
| `render` | `true` | Execute JavaScript (`false` = fast static HTML fetch) |
| `includePatterns` | — | Wildcard patterns to include (e.g. `/blog/**`) |
| `excludePatterns` | — | Patterns to skip (take priority over includes) |
The `source` parameter is particularly telling. Setting it to `sitemaps` means the crawler discovers pages through your XML sitemaps only — exactly how search engine crawlers work. Sites with comprehensive, up-to-date sitemaps will be crawled more completely. Sites without sitemaps fall back to link discovery, which may miss orphaned pages.
The `render` toggle is equally important. Setting `render: false` skips the headless browser entirely and fetches static HTML. This is faster and cheaper, but it means JavaScript-rendered content is invisible. Sites relying on client-side rendering for their main content will return empty pages in static mode. Server-side rendered sites work perfectly.
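A rough sketch of how the include/exclude wildcards might gate discovered URLs, using Python's `fnmatch` globbing as a stand-in. Cloudflare's exact wildcard semantics are not documented here, so treat this as an approximation; excludes win, as the parameter table states:

```python
from fnmatch import fnmatchcase
from urllib.parse import urlparse

def url_in_scope(url, include_patterns=(), exclude_patterns=()):
    """Decide whether a discovered URL falls inside the crawl scope.
    Patterns match against the URL path, e.g. "/blog/**"."""
    path = urlparse(url).path
    if any(fnmatchcase(path, p) for p in exclude_patterns):
        return False  # excludePatterns take priority over includes
    if not include_patterns:
        return True   # no include patterns means everything is in scope
    return any(fnmatchcase(path, p) for p in include_patterns)
```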
robots.txt Is Your First Line of Defense — And Your Biggest Opportunity
The /crawl endpoint respects robots.txt fully, including `crawl-delay` directives. URLs blocked by robots.txt appear in results with `"status": "disallowed"`. This means:
- If you block AI crawlers, the /crawl endpoint won't access those pages. You control what gets indexed
- If your robots.txt is misconfigured, you might be blocking legitimate agent access without knowing it. Many sites accidentally block all bots to prevent training data scraping, losing agent visibility in the process
- If you set crawl-delay, the endpoint honors it. This gives you rate control over automated access
This is the first major crawling service that explicitly operates as a signed bot — it identifies itself as automated and cannot bypass bot detection, CAPTCHAs, or Cloudflare protection. It's the kind of compliant crawler that robots.txt was designed for.
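Python's standard `urllib.robotparser` reads robots.txt the same way a compliant crawler does, so you can preview how your directives will be interpreted. The user agent token below is a placeholder, not Cloudflare's actual crawler identity:

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 2
Disallow: /private/
"""

def crawl_policy(robots_text, user_agent, url):
    """Return (allowed, crawl_delay) as a compliant crawler would read them."""
    rp = RobotFileParser()
    rp.parse(robots_text.splitlines())
    return rp.can_fetch(user_agent, url), rp.crawl_delay(user_agent)
```

With the sample robots.txt above, `/private/` pages come back disallowed and the crawler is asked to wait two seconds between requests.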
What This Means for AI Agent Readiness
Cloudflare's /crawl endpoint is infrastructure, not an agent itself. But it dramatically lowers the barrier for building agent systems that need to understand entire websites. Here's why this matters:
Democratized crawling
Anyone with a Cloudflare account can now crawl up to 100,000 pages with one API call. Building a RAG pipeline, knowledge base, or competitive analysis tool just got trivial. Your site will be crawled.
Markdown as default
The markdown output option signals that clean, structured content is the expected format for AI consumption. Sites with good semantic HTML produce better markdown automatically.
Structured data extraction
The AI-powered JSON extraction means your content's structure directly affects what data can be extracted. Schema.org markup, clear headings, and consistent patterns make extraction more accurate.
Compliant by design
Unlike scraping libraries, this is a bot that respects robots.txt and crawl-delay. The agent readiness signals you set up — crawler directives, rate limiting, access policies — actually work here.
Agent Readiness Checklist for the /crawl Era
With crawling infrastructure this accessible, here's what to prioritize:
- Audit your robots.txt. Make sure you're not accidentally blocking compliant AI crawlers. Block training bots if you want, but keep agent access open
- Maintain your sitemap. The `source: "sitemaps"` option means your sitemap is a direct input for how completely your site gets crawled
- Use server-side rendering. The `render: false` option is faster and cheaper. Sites that work without JavaScript will be crawled more efficiently
- Improve semantic HTML. Clean heading hierarchy, proper landmarks, descriptive link text — all of this produces better markdown output when your site is crawled
- Add structured data. JSON-LD and Schema.org types help both the HTML-to-markdown conversion and AI-powered JSON extraction produce accurate results
- Serve llms.txt. While the /crawl endpoint uses sitemaps and links for discovery, agents that consume the crawled content often start with llms.txt to understand what a site offers
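The first two checklist items can be spot-checked offline against a copy of your robots.txt. This is a minimal sketch using the standard library's parser; the agent token and `example.com` origin are placeholders for your own values:

```python
from urllib.robotparser import RobotFileParser

def readiness_report(robots_text, agent_token="ExampleAgent"):
    """Check that a compliant crawler may fetch the site root and that
    robots.txt advertises at least one sitemap."""
    rp = RobotFileParser()
    rp.parse(robots_text.splitlines())
    return {
        "root_allowed": rp.can_fetch(agent_token, "https://example.com/"),
        "sitemaps": rp.site_maps() or [],  # Sitemap: lines, if any (Python 3.8+)
    }
```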
The Bigger Picture
Cloudflare's /crawl endpoint joins a pattern of major platforms investing in agent infrastructure. Cloudflare already offers MCP integration through its Workers AI platform, and supports Playwright MCP for browser automation. The /crawl endpoint adds the missing piece: scalable, compliant, full-site crawling with AI-native output formats.
We're moving from a web where crawlers scraped HTML for search indexes to one where agents crawl for understanding. The output isn't a search ranking — it's a knowledge base, a RAG pipeline, a structured dataset. Sites that are already optimized for machine readability will naturally produce better results in this new paradigm.
The question isn't whether your site will be crawled by AI-powered tools. It's whether the output will accurately represent what your site offers.
Sources
- Cloudflare Changelog: Browser Rendering /crawl Endpoint (Open Beta) — Official announcement, March 10, 2026
- Cloudflare Docs: /crawl Endpoint Technical Documentation — Full API reference with configuration options
- Cloudflare Browser Rendering Overview — Product overview and use cases
- IsAgentReady: AI Crawlers Ignore llms.txt — But AI Agents Don't — Why the crawler vs agent distinction matters