How ChatGPT Search Chooses Which Websites to Cite

February 10, 2026 • 10 min read

Bart Waardenburg

AI Agent Readiness Expert & Founder

ChatGPT Search has quickly changed how people find information online. No more scanning ten blue links. Millions of users now get a synthesized answer with inline citations, and the websites that get cited capture a new kind of organic traffic. But how does ChatGPT decide which sites make the cut?

I went through OpenAI's official documentation, multiple large-scale citation studies, and empirical data to find out. The short version: ChatGPT's source selection is surprisingly different from traditional search, and understanding those differences is a real competitive advantage. For a broader look at how AI optimization compares to traditional SEO, see the guide on SEO vs AEO.

The Search Infrastructure Behind ChatGPT

ChatGPT Search doesn't crawl the web in real time when you ask a question. It relies on a pre-built search index. And that's where things get interesting.

ChatGPT Search was originally built on Bing's search index. However, two independent studies from mid-2025 found evidence that ChatGPT's paid version also uses Google Search results. As of May 2025, OpenAI had also added Shopify as a third-party search provider for commerce queries. So ChatGPT isn't locked to a single search backend. It aggregates.

The Three Crawlers You Need to Know

OpenAI operates three web crawlers, each with a different purpose. Getting this right is the first gate to ChatGPT visibility. Here is the breakdown from OpenAI's official crawler documentation:

OAI-SearchBot

Indexes content specifically for ChatGPT Search results. Blocking this bot means your site won't appear in ChatGPT search answers.

GPTBot

Crawls content for AI model training. You can block this without affecting your search visibility.

ChatGPT-User

Fetches pages on demand when a user explicitly asks ChatGPT to browse a specific URL.

A detail you might miss in the docs: OAI-SearchBot and GPTBot share crawl data. If your site allows both bots, OpenAI can use a single crawl for both purposes to avoid duplicate requests.

The recommended robots.txt configuration for maximum ChatGPT visibility:

robots.txt

# ChatGPT search indexing (required for search visibility)
User-agent: OAI-SearchBot
Allow: /

# AI model training (optional - blocking this does NOT affect search)
User-agent: GPTBot
Allow: /

# User-initiated browsing (robots.txt ignored since Dec 2025)
User-agent: ChatGPT-User
Allow: /

From OpenAI's Publishers FAQ: "Any public website can appear in ChatGPT search." So you can appear in search even if you opt out of AI training by blocking GPTBot. Each crawler is controlled independently in robots.txt.
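To sanity-check a configuration like this before deploying it, you can parse the rules locally. Here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content and the example.com URL are hypothetical, and this only tests robots.txt rules, not CDN or firewall blocks.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: allow search indexing, block model training.
ROBOTS_TXT = """\
User-agent: OAI-SearchBot
Allow: /

User-agent: GPTBot
Disallow: /
"""

OPENAI_AGENTS = ("OAI-SearchBot", "GPTBot", "ChatGPT-User")


def check_openai_access(robots_txt: str, url: str) -> dict[str, bool]:
    """Return, per OpenAI user agent, whether robots.txt permits fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in OPENAI_AGENTS}


result = check_openai_access(ROBOTS_TXT, "https://example.com/blog/post")
for agent, allowed in result.items():
    print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```

An agent with no matching group and no wildcard group is treated as allowed, which is why ChatGPT-User passes here without being listed.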

Domain Authority Is the #1 Factor

The most comprehensive study on ChatGPT citations comes from Wellows (7,785 queries, 485,000+ citations). The conclusion is clear: the number of referring domains (unique sites linking to you) is the single strongest predictor of whether ChatGPT cites a website.

Average citations per query: 1.6 at 2,500 referring domains; 8.4 at 350,000+ referring domains.

Sites with 2,500 referring domains averaged 1.6 citations per query. Sites with 350,000+ referring domains averaged 8.4, roughly five times as many. Domain traffic is the second-strongest factor, but that correlation only shows up at very high volumes.

Search Engine Journal's analysis of the same dataset confirms this: domain-level authority outweighs page-level metrics. ChatGPT trusts the domain more than the individual page.

Google ranking position does correlate with ChatGPT citations. Positions 1-45 averaged 5 citations versus 3.1 for positions 64-75. But that's probably because the same signals (backlinks, authority) drive both.

The Long Tail Opportunity

Good news if you're not Wikipedia. The Wellows report found that the top 50 websites capture only 48% of all mentions. The remaining 52% goes to smaller, niche sites.

Share of ChatGPT mentions: top 50 domains 48%, long tail 52%.

The Profound AI Search Shift study adds something encouraging: only a small portion of ChatGPT's citations match Google search results. ChatGPT maintains largely independent source selection. Don't rank well on Google? You can still get cited by ChatGPT if you have authority and good content structure.

How to Structure Content for ChatGPT Citations

A study by Kevin Indig / Growth Memo (published via Search Engine Land) analyzed 3 million ChatGPT responses and 30 million citations. These are correlations in what cited content tends to look like, not levers proven to cause citations:

Front-Load Your Key Information

44.2% of citations come from the first 30% of content, in a consistent "ski ramp" pattern. Information at the top of your article gets cited far more than content buried at the bottom. This is the "inverted pyramid" from journalism: lead with the key facts, like an encyclopedia-style opening, instead of saving the payoff for the end.

Use Q&A Heading Structure

Cited content skews heavily toward conversational Q&A structure. Among citations tied to questions, 78.4% came from H2 headings. ChatGPT appears to treat your H2s as prompts and the paragraph below as the answer. Write your headings as questions ("How does X work?" or "What is Y?") and you're directly matching how users query ChatGPT.
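If you want to audit an existing page against this pattern, a rough sketch like the one below counts how many H2 headings are phrased as questions. It uses only Python's standard `html.parser`. The sample HTML and the "ends with a question mark" test are my own simplifying assumptions, not something the study defines.

```python
from html.parser import HTMLParser


class H2Collector(HTMLParser):
    """Collect the text of every <h2> element in an HTML document."""

    def __init__(self) -> None:
        super().__init__()
        self.headings: list[str] = []
        self._in_h2 = False
        self._buffer: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self._in_h2 = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_h2:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "h2" and self._in_h2:
            self.headings.append("".join(self._buffer).strip())
            self._in_h2 = False


def question_heading_share(html: str) -> float:
    """Fraction of H2 headings that end with a question mark."""
    collector = H2Collector()
    collector.feed(html)
    collector.close()
    if not collector.headings:
        return 0.0
    questions = [h for h in collector.headings if h.endswith("?")]
    return len(questions) / len(collector.headings)


# Hypothetical page fragment with three H2s, two of them questions.
SAMPLE = """
<h1>Guide</h1>
<h2>How does X work?</h2><p>Body text.</p>
<h2>Pricing</h2><p>Body text.</p>
<h2>What is <em>Y</em>?</h2><p>Body text.</p>
"""
share = question_heading_share(SAMPLE)
print(f"{share:.0%} of H2 headings are questions")
```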

Pack in Specific Entities

Cited text averaged 20.6% proper nouns (versus 5-8% in typical English). Specific brands, tools, people, and place names reduce ambiguity and make your content easier to verify and cite. Not "many companies use this approach," but "Stripe, Shopify, and HubSpot use this approach."
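To get a feel for entity density in your own drafts, here is a crude heuristic sketch. It counts capitalized words that do not start a sentence, which is an assumption of mine and much rougher than the part-of-speech tagging a real study would use. It undercounts sentence-initial names and overcounts words like "I".

```python
import re


def proper_noun_share(text: str) -> float:
    """Approximate share of words that look like proper nouns.

    Heuristic (an assumption, not the study's method): a word counts if it
    is capitalized and is not the first word of its sentence.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    total = 0
    proper = 0
    for sentence in sentences:
        words = re.findall(r"[A-Za-z][A-Za-z'-]*", sentence)
        total += len(words)
        # Skip the first word: it is capitalized regardless of its type.
        proper += sum(1 for word in words[1:] if word[0].isupper())
    return proper / total if total else 0.0


vague = proper_noun_share("Many companies use this approach.")
specific = proper_noun_share("Stripe, Shopify, and HubSpot use this approach.")
print(f"vague: {vague:.1%}, specific: {specific:.1%}")
```

Note that "Stripe" is missed in the second sentence because it opens the sentence, so treat the number as a lower bound.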

Strike the Right Tone

Cited text clustered at a subjectivity score of 0.47: not dry fact, not emotional opinion. The sweet spot is analyst commentary, fact plus interpretation. Cited sources averaged a Flesch-Kincaid grade level of 16, versus 19.1 for uncited sources. Business-grade clarity wins over academic density.

Flesch-Kincaid grade level: ChatGPT-cited sources 16; uncited sources 19.1.
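For reference, the Flesch-Kincaid grade level is 0.39 × (words per sentence) + 11.8 × (syllables per word) − 15.59. The sketch below applies that formula with a rough vowel-group syllable counter, which is my own approximation, so expect scores to drift from those of dedicated readability tools.

```python
import re


def count_syllables(word: str) -> int:
    """Rough syllable estimate: vowel groups, minus a likely silent final 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith(("le", "ee")) and count > 1:
        count -= 1
    return max(count, 1)


def flesch_kincaid_grade(text: str) -> float:
    """Flesch-Kincaid grade level of text, using the heuristic syllable count."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(word) for word in words)
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59


simple = flesch_kincaid_grade("The cat sat on the mat. The dog ran.")
dense = flesch_kincaid_grade(
    "Organizational interdependencies necessitate comprehensive institutional coordination."
)
print(f"simple: {simple:.1f}, dense: {dense:.1f}")
```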

Optimize Content and Section Length

More findings from the Wellows report:

Total length: Articles under 800 words averaged 3.2 citations; over 2,900 words averaged 5.1
Section length: 120-180 words between headings performed best (4.6 citations average)
Expert quotes: Pages with expert quotes averaged 4.1 citations versus 2.4 without
Statistical data: Content with 19+ data points averaged 5.4 citations versus 2.8 for minimal data
Freshness: Content updated within 30 days is associated with about 3.2x more citations (Wellows)
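The section-length figure is easy to check mechanically. Here is a minimal sketch that counts words between headings and flags sections outside the 120-180 range. It assumes your draft is Markdown with `#`-style headings, and the function names and sample text are hypothetical.

```python
import re

TARGET_RANGE = (120, 180)  # words per section, from the Wellows figure above


def section_word_counts(markdown: str) -> list[tuple[str, int]]:
    """Return (heading, word count) pairs; text before the first heading is '(intro)'."""
    sections: list[tuple[str, int]] = []
    title, count = "(intro)", 0
    for line in markdown.splitlines():
        match = re.match(r"#{1,6}\s+(.*)", line)
        if match:
            sections.append((title, count))
            title, count = match.group(1).strip(), 0
        else:
            count += len(line.split())
    sections.append((title, count))
    return sections


def flag_sections(sections: list[tuple[str, int]]) -> list[str]:
    """Headings whose sections fall outside TARGET_RANGE."""
    low, high = TARGET_RANGE
    return [title for title, words in sections if not low <= words <= high]


sample = (
    "Intro words here.\n\n## First\n"
    + " ".join(["word"] * 150)
    + "\n\n## Second\nToo short.\n"
)
sections = section_word_counts(sample)
print(flag_sections(sections))
```

The intro is flagged too, since it is counted like any other section. Drop it from the list if your pages open with a short summary by design.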

Citations by article length: under 800 words, 3.2 citations; 1,500-2,000 words, highest density; 2,900+ words, 5.1 citations.

The FAQPage Schema Advantage

An Insightland study on structured data and ChatGPT visibility found a striking correlation between structured data and citation rates:


6.2% of ChatGPT-visible websites had FAQPage schema versus only 0.8% of non-visible websites, a roughly 8x gap. Read that as a correlation: cited sites are far more likely to carry FAQPage schema, which is not proof that schema gets them cited. What FAQPage does do is make your Q&A machine-readable. Structured data is one of several key factors in AI agent readiness.

A 2026 Ahrefs study of 1,885 pages found that adding schema did not measurably change AI-search citations, so treat this as a correlation (cited pages tend to be well-structured) rather than schema directly driving citations.
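If you do decide to add the markup, FAQPage is a small JSON-LD object: a `mainEntity` list of `Question` items, each with an `acceptedAnswer`. Here is a minimal sketch that generates it from question-and-answer pairs. The helper name and the sample Q&A are hypothetical, and the questions should mirror Q&A that is actually visible on the page.

```python
import json


def faq_jsonld(pairs: list[tuple[str, str]]) -> str:
    """Build schema.org FAQPage JSON-LD from (question, answer) pairs."""
    data = {
        "@context": "https://schema.org",
        "@type": "FAQPage",
        "mainEntity": [
            {
                "@type": "Question",
                "name": question,
                "acceptedAnswer": {"@type": "Answer", "text": answer},
            }
            for question, answer in pairs
        ],
    }
    return json.dumps(data, indent=2)


# Hypothetical Q&A pair for illustration.
markup = faq_jsonld([("How does X work?", "X works by doing Y.")])
print(f'<script type="application/ld+json">\n{markup}\n</script>')
```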

For e-commerce sites, OpenAI goes a step further. They accept structured product feeds (title, description, image, brand, SKU, price, availability, GTIN) for ChatGPT Shopping. This is a direct pipeline into ChatGPT's product recommendations.

Who Gets Cited the Most?

Multiple studies have looked at which domains dominate ChatGPT citations:

The Ahrefs study (9.6 million queries) found the top cited domains in the U.S. are Reddit, Wikipedia, Amazon, Forbes, and Business Insider. Wikipedia is cited by ChatGPT at 16.3% (versus 12.5% on Perplexity and 8.4% on Google AI Overviews).

The Profound study (730,000 conversations, Q4 2025) adds context:

Wikipedia appears in 18% of conversations with citations (a little over 1 in 6)
Reddit appears in 13%
Reuters and NIH each at 4%
Turn 1 is 2.5x more likely to trigger citations than turn 10, and 4x more than turn 20

The Visual Capitalist / Ahrefs analysis (78.6 million searches) found Reddit leads across all AI models with 40.1% citation frequency, followed by Wikipedia at 26.3%.

How ChatGPT Search Differs from Google

Knowing the differences helps you optimize for both:

Factor	Google Search	ChatGPT Search
Output format	List of 10 blue links	Synthesized answer with inline citations
Source concentration	Billions of indexed pages	Top 50 domains get 48% of mentions
Content freshness	Important for news	3.2x more citations for content updated within 30 days
Source independence	N/A	Largely independent from Google rankings
Lower-ranked pages	Position 10 gets ~2.5% clicks	Position 10 gets ~4% citation rate (higher opportunity)
Content format	Rewards diverse formats	Prefers structured headings, bullet lists, tables

What You Can Do Today

Based on the research, the highest-impact actions ranked by evidence strength:

1. ALLOW OAI-SEARCHBOT

This is the binary gate. No access = no citations. Check your robots.txt and CDN firewall rules.

2. ADD FAQPAGE SCHEMA

A strong correlation: cited sites carry it 8x more often (6.2% versus 0.8%). Schema didn't move citations in controlled tests, but it makes your Q&A machine-readable.

3. FRONT-LOAD CONTENT

Put your key information in the first third. 44% of citations come from the opening 30% of content.

4. USE Q&A HEADINGS

Write H2s as questions. ChatGPT appears to treat H2s as prompts and the paragraph below as the answer.

Include expert quotes and statistics. Data-rich content gets nearly 2x more citations.
Keep sections 120-180 words. This was the best-performing length between headings.
Update content regularly. Content updated within 30 days is associated with about 3.2x more citations.
Use specific entities. Proper nouns, brand names, and tool names reduce ambiguity.
Server-side render your content. ChatGPT's crawlers cannot execute client-side JavaScript. Learn more about how AI agents see your website through the accessibility tree.
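One quick way to test the last point is to check whether a key sentence exists in the raw HTML, before any JavaScript runs. Here is a minimal sketch with Python's standard `html.parser`. Both sample pages are hypothetical, and fetching your own page's HTML (with `urllib.request`, for example) is left out to keep it offline.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text a non-JavaScript client would see, skipping script/style."""

    SKIP = {"script", "style", "template"}

    def __init__(self) -> None:
        super().__init__()
        self.parts: list[str] = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def phrase_in_raw_html(html: str, phrase: str) -> bool:
    """True if phrase appears in the page text without executing any JavaScript."""
    extractor = TextExtractor()
    extractor.feed(html)
    extractor.close()
    text = " ".join(" ".join(extractor.parts).split())
    return phrase.lower() in text.lower()


# Hypothetical server-rendered page vs. client-rendered shell.
SSR_PAGE = "<html><body><main><h1>Pricing</h1><p>Plans start at $10.</p></main></body></html>"
CSR_PAGE = '<html><body><div id="root"></div><script>render("Plans start at $10.")</script></body></html>'

ssr_ok = phrase_in_raw_html(SSR_PAGE, "Plans start at $10.")
csr_ok = phrase_in_raw_html(CSR_PAGE, "Plans start at $10.")
print(f"server-rendered: {ssr_ok}, client-rendered: {csr_ok}")
```

If the phrase only shows up inside a script tag, a crawler that does not run JavaScript will likely never see it.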

Wrapping Up

ChatGPT's source selection rewards a specific combination: high domain authority, well-structured content with Q&A headings, rich structured data, and fresh updates. It's not a copy of Google rankings. Different game, different rules.

The sites investing in these signals today are building an advantage. And with 52% of citations going to sites outside the top 50 domains, there's real opportunity for specialized content to break through. For all of these findings in context, see the analysis of Vercel's 2026 AEO report.

Curious how other AI systems choose sources? Read the companion posts on how Claude selects sources to cite and how Google AI Overviews selects sources.

Sources

OpenAI Crawlers Documentation - Official crawler specifications and robots.txt guidance
OpenAI Publishers and Developers FAQ - Official guidance on ChatGPT Search visibility
OpenAI: Introducing ChatGPT Search - Official launch announcement
Wellows: 7K Queries, 485K Citations - The most comprehensive ChatGPT citation study
Search Engine Land: 3M ChatGPT Responses, 30M Citations - Content structure analysis
Search Engine Journal: Top Factors Influencing ChatGPT Citations - Domain authority analysis
Insightland: Structured Data and AI Search - FAQPage schema visibility correlation
Ahrefs: 100 Most Cited Domains in ChatGPT - 9.6 million queries analysis
Profound: How ChatGPT Sources the Web - 730,000 conversations citation analysis
Visual Capitalist: Most Cited Websites by AI Models - 78.6 million searches cross-platform analysis
Profound: AI Search Shift - ChatGPT source independence from Google
OpenAI: Product Feed Specification - ChatGPT Shopping structured data
