How ChatGPT Search Chooses Which Websites to Cite

10 min read
Bart Waardenburg

AI Agent Readiness Expert & Founder

ChatGPT Search has quickly changed how people find information online. No more scanning ten blue links. Millions of users now get a synthesized answer with inline citations, and the websites that get cited capture a new kind of organic traffic. But how does ChatGPT decide which sites make the cut?

I went through OpenAI's official documentation, multiple large-scale citation studies, and empirical data to find out. The short version: ChatGPT's source selection is surprisingly different from traditional search, and understanding those differences is a real competitive advantage. For a broader look at how AI optimization compares to traditional SEO, see the guide on SEO vs AEO.

The Search Infrastructure Behind ChatGPT

ChatGPT Search doesn't crawl the web in real time when you ask a question. It relies on a pre-built search index. And that's where things get interesting.

ChatGPT Search was originally built on Bing's search index. However, two independent studies from mid-2025 found evidence that ChatGPT's paid version also uses Google Search results. As of May 2025, OpenAI added Shopify as a third-party search provider for commerce queries. So ChatGPT isn't locked to a single search backend. It aggregates.

The Three Crawlers You Need to Know

OpenAI operates three web crawlers, each with a different purpose. Getting this right is the first gate to ChatGPT visibility. The breakdown from OpenAI's official crawler documentation:

OAI-SearchBot

Indexes content specifically for ChatGPT Search results. Blocking this bot means your site won't appear in ChatGPT search answers.

GPTBot

Crawls content for AI model training. You can block this without affecting your search visibility.

ChatGPT-User

Fetches pages on demand when a user explicitly asks ChatGPT to browse a specific URL.

A detail you might miss in the docs: OAI-SearchBot and GPTBot share crawl data. If your site allows both bots, OpenAI can use a single crawl for both purposes to avoid duplicate requests.

The recommended robots.txt configuration for maximum ChatGPT visibility:

# ChatGPT search indexing (required for search visibility)
User-agent: OAI-SearchBot
Allow: /

# AI model training (optional - blocking this does NOT affect search)
User-agent: GPTBot
Allow: /

# User-initiated browsing (robots.txt ignored since Dec 2025)
User-agent: ChatGPT-User
Allow: /

From OpenAI's Publishers FAQ: "Any public website can appear in ChatGPT search." So you can appear in search even if you opt out of AI training by blocking GPTBot. The robots.txt rules for each crawler apply independently.
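
To sanity-check a robots.txt file against these three user agents, a minimal sketch using Python's standard `urllib.robotparser` could look like the following. The helper name `openai_crawler_access`, the example policy, and the `example.com` URL are my own illustrations, not anything from OpenAI's documentation.

```python
from urllib.robotparser import RobotFileParser

# User-agent tokens as named in the crawler breakdown above.
OPENAI_AGENTS = ("OAI-SearchBot", "GPTBot", "ChatGPT-User")


def openai_crawler_access(robots_txt: str, url: str) -> dict[str, bool]:
    """Return, per OpenAI user agent, whether robots_txt permits fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in OPENAI_AGENTS}


# Hypothetical policy: stay in ChatGPT Search, opt out of model training.
EXAMPLE_ROBOTS = """\
User-agent: OAI-SearchBot
Allow: /

User-agent: GPTBot
Disallow: /
"""

access = openai_crawler_access(EXAMPLE_ROBOTS, "https://example.com/blog/post")
print(access)
```

This only evaluates robots.txt rules as Python's parser interprets them. It says nothing about CDN or firewall rules, which can block a crawler even when robots.txt allows it.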

Domain Authority Is the #1 Factor

The most comprehensive study on ChatGPT citations comes from Wellows (7,785 queries, 485,000+ citations). The conclusion is clear: the number of referring domains (backlinks) is the single strongest predictor of whether ChatGPT cites a website.

Sites with 2,500 referring domains averaged 1.6 citations per query. Sites with 350,000+ referring domains averaged 8.4, more than five times as many. Domain traffic is the second factor, but that correlation only shows up at very high volumes.

Search Engine Journal's analysis of the same dataset confirms this: domain-level authority outweighs page-level metrics. ChatGPT trusts the domain more than the individual page.

Google ranking position does correlate with ChatGPT citations. Positions 1-45 averaged 5 citations versus 3.1 for positions 64-75. But that's probably because the same signals (backlinks, authority) drive both.

The Long Tail Opportunity

Good news if you're not Wikipedia. The Wellows report found that the top 50 websites capture only 48% of all mentions. The remaining 52% goes to smaller, niche sites.

The Profound AI Search Shift study adds something encouraging: only a small portion of ChatGPT's citations match Google search results. ChatGPT maintains largely independent source selection. Don't rank well on Google? You can still get cited by ChatGPT if you have authority and good content structure.

How to Structure Content for ChatGPT Citations

A study by Search Engine Land analyzed 3 million ChatGPT responses and 30 million citations. The findings are worth your time:

Front-Load Your Key Information

44.2% of citations come from the first 30% of content in a consistent "ski ramp" pattern. Information at the top of your article gets cited far more than content buried at the bottom. This is the same idea as the "inverted pyramid" from journalism, or an encyclopedia-style opening: lead with the most important facts.

Use Q&A Heading Structure

Conversational Q&A structure doubles citation likelihood. 78.4% of citations tied to questions came from H2 headings. ChatGPT treats your H2s as prompts and the paragraph below as the answer. Write your headings as questions ("How does X work?" or "What is Y?") and you're directly matching how users query ChatGPT.
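
A quick way to audit your own pages for this is to list the H2 headings and flag which ones are phrased as questions. The sketch below uses only Python's standard `html.parser`. The class and function names are hypothetical, and "ends with a question mark" is my simplification of what counts as a question heading.

```python
from html.parser import HTMLParser


class H2Collector(HTMLParser):
    """Collect the text content of every <h2> element."""

    def __init__(self) -> None:
        super().__init__()
        self.headings: list[str] = []
        self._in_h2 = False
        self._buffer: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self._in_h2 = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_h2:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "h2" and self._in_h2:
            # Join text fragments and collapse runs of whitespace.
            self.headings.append(" ".join("".join(self._buffer).split()))
            self._in_h2 = False


def question_heading_report(html: str) -> dict[str, bool]:
    """Map each H2 heading to True if it ends with a question mark."""
    collector = H2Collector()
    collector.feed(html)
    collector.close()
    return {heading: heading.endswith("?") for heading in collector.headings}


sample = "<h2>How does X work?</h2><p>...</p><h2>Pricing <em>overview</em></h2>"
print(question_heading_report(sample))
```

Headings that come back `False` are candidates for rewriting as questions.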

Pack in Specific Entities

Cited text averaged 20.6% proper nouns (versus 5-8% in typical English). Specific brands, tools, people, and place names reduce ambiguity and make your content easier to verify and cite. Not "many companies use this approach," but "Stripe, Shopify, and HubSpot use this approach."
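
To get a rough feel for entity density in a draft, you can approximate the proper-noun share without any NLP library. This is a crude heuristic of my own, not the method used in the study: it counts words that are capitalized mid-sentence, so it misses proper nouns that open a sentence and miscounts words like "I".

```python
import re


def proper_noun_share(text: str) -> float:
    """Rough share of words that look like proper nouns (capitalized mid-sentence)."""
    total = 0
    proper = 0
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        words = re.findall(r"[A-Za-z][A-Za-z'-]*", sentence)
        total += len(words)
        # Skip the first word: it is capitalized regardless of word class.
        proper += sum(1 for word in words[1:] if word[0].isupper())
    return proper / total if total else 0.0


vague = "Many companies use this approach."
specific = "Teams at Stripe, Shopify, and HubSpot use this approach."
print(proper_noun_share(vague), proper_noun_share(specific))
```

For a real audit, a part-of-speech tagger would give a far more reliable number than this capitalization check.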

Strike the Right Tone

Cited text clustered at a subjectivity score of 0.47. Not dry fact, not emotional opinion. The sweet spot is analyst commentary: fact plus interpretation. On readability, ChatGPT-cited sources averaged a Flesch-Kincaid grade level of 16, while uncited sources averaged 19.1. Business-grade clarity wins over academic density.
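
If you want to estimate the grade level of your own copy, the standard Flesch-Kincaid formula is 0.39 × (words per sentence) + 11.8 × (syllables per word) − 15.59. The sketch below implements it with a naive vowel-group syllable counter, which is my own approximation. Dedicated readability tools count syllables more accurately, so treat the result as a ballpark figure.

```python
import re


def count_syllables(word: str) -> int:
    """Approximate syllables as vowel groups, dropping a trailing silent 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def flesch_kincaid_grade(text: str) -> float:
    """Flesch-Kincaid grade level using the naive syllable counter above."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(word) for word in words)
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59


print(round(flesch_kincaid_grade("The cat sat. The dog ran."), 2))
```

Very simple text can score below zero, as the sample does. Long sentences and polysyllabic words push the score up.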

Optimize Content and Section Length

More findings from the Wellows report:

  • Total length: Articles under 800 words averaged 3.2 citations; over 2,900 words averaged 5.1
  • Section length: 120-180 words between headings performed best (4.6 citations average)
  • Expert quotes: Pages with expert quotes averaged 4.1 citations versus 2.4 without
  • Statistical data: Content with 19+ data points averaged 5.4 citations versus 2.8 for minimal data
  • Freshness: Content updated within 30 days gets 3.2x more citations

Citations by article length: under 800 words = 3.2 citations; 1,500-2,000 words = highest density; 2,900+ words = 5.1 citations.
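
The 120-180 word guideline for sections is easy to check mechanically. Here is a minimal sketch, again with Python's standard `html.parser`, that counts the words under each heading and returns the sections outside the range. The names are hypothetical, and the counter treats all non-heading text as body copy, so strip scripts and navigation from the HTML first.

```python
from html.parser import HTMLParser

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class SectionWordCounter(HTMLParser):
    """Count body words under each heading in an HTML document."""

    def __init__(self) -> None:
        super().__init__()
        self.sections: list[list] = []  # each entry: [heading text, word count]
        self._in_heading = False

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS:
            self._in_heading = True
            self.sections.append(["", 0])

    def handle_endtag(self, tag):
        if tag in HEADING_TAGS:
            self._in_heading = False

    def handle_data(self, data):
        if not self.sections:
            return  # text before the first heading belongs to no section
        if self._in_heading:
            self.sections[-1][0] += data
        else:
            self.sections[-1][1] += len(data.split())


def sections_outside_range(html: str, low: int = 120, high: int = 180):
    """Return (heading, word_count) for sections shorter or longer than the range."""
    counter = SectionWordCounter()
    counter.feed(html)
    counter.close()
    return [(h.strip(), n) for h, n in counter.sections if not low <= n <= high]


page = "<h2>A</h2><p>" + " ".join(["word"] * 150) + "</p><h2>B</h2><p>too short</p>"
print(sections_outside_range(page))
```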

The FAQPage Schema Advantage

A study on ChatGPT visibility found a striking correlation between structured data and citation rates:

6.2% of ChatGPT-visible websites had FAQPage schema versus only 0.8% of non-visible websites. Nearly an 8x difference. JSON-LD helps LLMs understand content context: is this an expert article, a product with reviews, or a direct answer? Structured data is one of several key factors in AI agent readiness.
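
For reference, FAQPage markup follows the schema.org structure of a `FAQPage` whose `mainEntity` is a list of `Question` items, each with an `acceptedAnswer`. A minimal sketch that builds the JSON-LD from question/answer pairs is below. The function name and sample content are illustrative, and the output would normally go inside a `<script type="application/ld+json">` tag.

```python
import json


def faq_page_jsonld(pairs: list[tuple[str, str]]) -> str:
    """Build schema.org FAQPage JSON-LD from (question, answer) pairs."""
    data = {
        "@context": "https://schema.org",
        "@type": "FAQPage",
        "mainEntity": [
            {
                "@type": "Question",
                "name": question,
                "acceptedAnswer": {"@type": "Answer", "text": answer},
            }
            for question, answer in pairs
        ],
    }
    return json.dumps(data, indent=2)


markup = faq_page_jsonld(
    [("Does blocking GPTBot affect ChatGPT Search?", "No. OAI-SearchBot handles search indexing.")]
)
print(markup)
```

Keep the questions and answers in the markup identical to the visible text on the page.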

For e-commerce sites, OpenAI goes a step further. They accept structured product feeds (title, description, image, brand, SKU, price, availability, GTIN) for ChatGPT Shopping. This is a direct pipeline into ChatGPT's product recommendations.

Who Gets Cited the Most?

Multiple studies have looked at which domains dominate ChatGPT citations:

The Ahrefs study (9.6 million queries) found the top cited domains in the U.S. are Reddit, Wikipedia, Amazon, Forbes, and Business Insider. Wikipedia is cited by ChatGPT at 16.3% (versus 12.5% on Perplexity and 8.4% on Google AI Overviews).

The Profound study (730,000 conversations, Q4 2025) adds context:

  • Wikipedia appears in ~1 in 6 conversations with citations (18%)
  • Reddit appears in 13%
  • Reuters and NIH each at 4%
  • Turn 1 is 2.5x more likely to trigger citations than turn 10, and 4x more than turn 20

The Visual Capitalist / Ahrefs analysis (78.6 million searches) found Reddit leads across all AI models with 40.1% citation frequency, followed by Wikipedia at 26.3%.

How ChatGPT Search Differs from Google

Knowing the differences helps you optimize for both:

| Factor | Google Search | ChatGPT Search |
|---|---|---|
| Output format | List of 10 blue links | Synthesized answer with inline citations |
| Source concentration | Billions of indexed pages | Top 50 domains get 48% of mentions |
| Content freshness | Important for news | 3.2x more citations for content updated within 30 days |
| Source independence | N/A | Largely independent from Google rankings |
| Lower-ranked pages | Position 10 gets ~2.5% clicks | Position 10 gets ~4% citation rate (higher opportunity) |
| Content format | Rewards diverse formats | Prefers structured headings, bullet lists, tables |

What You Can Do Today

Based on the research, the highest-impact actions ranked by evidence strength:

1. ALLOW OAI-SEARCHBOT

This is the binary gate. No access = no citations. Check your robots.txt and CDN firewall rules.

2. ADD FAQPAGE SCHEMA

The strongest structured data signal. 6.2% of visible sites have it versus 0.8% of non-visible sites, a nearly 8x difference.

3. FRONT-LOAD CONTENT

Put your key information in the first third. 44% of citations come from the opening 30% of content.

4. USE Q&A HEADINGS

Write H2s as questions. ChatGPT treats H2s as prompts and the paragraph below as the answer.

  • Include expert quotes and statistics. Data-rich content gets nearly 2x more citations.
  • Keep sections 120-180 words. The optimal length between headings.
  • Update content regularly. Freshness within 30 days provides a 3.2x citation boost.
  • Use specific entities. Proper nouns, brand names, and tool names reduce ambiguity.
  • Server-side render your content. ChatGPT's crawlers cannot execute client-side JavaScript. Learn more about how AI agents see your website through the accessibility tree.
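
To check the server-side rendering point, look at what a non-JavaScript client receives: the raw HTML response. The sketch below tests whether a key phrase appears in the text of that HTML, ignoring anything inside `<script>`, `<style>`, or `<template>`. The class and function names are my own, fetching the HTML is left to you, and text split across inline tags may not match exactly.

```python
from html.parser import HTMLParser

SKIPPED_TAGS = {"script", "style", "template"}


class RawTextExtractor(HTMLParser):
    """Collect text that is present in the HTML without running any JavaScript."""

    def __init__(self) -> None:
        super().__init__()
        self.parts: list[str] = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def text_in_raw_html(html: str, phrase: str) -> bool:
    """True if phrase appears in the page text without executing scripts."""
    extractor = RawTextExtractor()
    extractor.feed(html)
    extractor.close()
    text = " ".join(" ".join(extractor.parts).split()).lower()
    return " ".join(phrase.split()).lower() in text


server_rendered = "<html><body><h1>Pricing plans</h1></body></html>"
client_rendered = (
    '<html><body><div id="root"></div>'
    '<script>render("Pricing plans")</script></body></html>'
)
print(text_in_raw_html(server_rendered, "pricing plans"))
print(text_in_raw_html(client_rendered, "pricing plans"))
```

If your most important sentences only show up after JavaScript runs, a crawler that doesn't execute scripts won't see them.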

Wrapping Up

ChatGPT's source selection rewards a specific combination: high domain authority, well-structured content with Q&A headings, rich structured data, and fresh updates. It's not a copy of Google rankings. Different game, different rules.

The sites investing in these signals today are building an advantage. And with 52% of citations going to sites outside the top 50 domains, there's real opportunity for specialized content to break through. For these findings in context, see the analysis of Vercel's 2026 AEO report.

Curious how other AI systems choose sources? Read the companion posts on how Claude selects sources to cite and how Google AI Overviews selects sources.
