How ChatGPT Search Chooses Which Websites to Cite
ChatGPT Search has quickly changed how people find information online. No more scanning ten blue links. Millions of users now get a synthesized answer with inline citations, and the websites that get cited capture a new kind of organic traffic. But how does ChatGPT decide which sites make the cut?
I went through OpenAI's official documentation, multiple large-scale citation studies, and empirical data to find out. The short version: ChatGPT's source selection is surprisingly different from traditional search, and understanding those differences is a real competitive advantage. For a broader look at how AI optimization compares to traditional SEO, see the guide on SEO vs AEO.
The Search Infrastructure Behind ChatGPT
ChatGPT Search doesn't crawl the web in real-time when you ask a question. It relies on a pre-built search index. And that's where things get interesting.
ChatGPT Search was originally built on Bing's search index. However, two independent studies from mid-2025 found evidence that ChatGPT's paid version also uses Google Search results. As of May 2025, OpenAI added Shopify as a third-party search provider for commerce queries. So ChatGPT isn't locked to a single search backend. It aggregates.
The Three Crawlers You Need to Know
OpenAI operates three web crawlers, each with a different purpose. Getting this right is the first gate to ChatGPT visibility. The breakdown from OpenAI's official crawler documentation:
- OAI-SearchBot: Indexes content specifically for ChatGPT Search results. Blocking this bot means your site won't appear in ChatGPT search answers.
- GPTBot: Crawls content for AI model training. You can block this without affecting your search visibility.
- ChatGPT-User: Fetches pages on demand when a user explicitly asks ChatGPT to browse a specific URL.
A detail you might miss in the docs: OAI-SearchBot and GPTBot share crawl data. If your site allows both bots, OpenAI can use a single crawl for both purposes to avoid duplicate requests.
The recommended robots.txt configuration for maximum ChatGPT visibility:
# ChatGPT search indexing (required for search visibility)
User-agent: OAI-SearchBot
Allow: /
# AI model training (optional - blocking this does NOT affect search)
User-agent: GPTBot
Allow: /
# User-initiated browsing (robots.txt ignored since Dec 2025)
User-agent: ChatGPT-User
Allow: /
From OpenAI's Publishers FAQ: "Any public website can appear in ChatGPT search." So you can appear in search even if you opt out of AI training by blocking GPTBot. Each crawler is controlled by its own robots.txt rule, so blocking one does not block the others.
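To sanity-check a robots.txt file before deploying it, a minimal sketch using Python's standard `urllib.robotparser` is shown below. The sample rules and the `example.com` URL are illustrative assumptions, and Python's parser may not match OpenAI's own robots.txt handling in every edge case. The sample blocks GPTBot on purpose to show that the search crawler's rule is evaluated separately.

```python
from urllib.robotparser import RobotFileParser

# Illustrative rules: allow the search crawler, block the training crawler.
ROBOTS_TXT = """\
User-agent: OAI-SearchBot
Allow: /

User-agent: GPTBot
Disallow: /
"""


def allowed_agents(robots_txt: str, url: str, agents: list[str]) -> dict[str, bool]:
    """Return, per user agent, whether the given robots.txt permits fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


result = allowed_agents(
    ROBOTS_TXT,
    "https://example.com/blog/post",  # hypothetical page URL
    ["OAI-SearchBot", "GPTBot"],
)
print(result)
```

This only covers robots.txt. A CDN or firewall rule that blocks the bot's requests would not show up in this check.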
Domain Authority Is the #1 Factor
The most comprehensive study on ChatGPT citations comes from Wellows (7,785 queries, 485,000+ citations). The conclusion is clear: the number of referring domains (backlinks) is the single strongest predictor of whether ChatGPT cites a website.
Sites with 2,500 referring domains averaged 1.6 citations per query. Sites with 350,000+ referring domains averaged 8.4, roughly five times more. Domain traffic is the second factor, but that correlation only shows up at very high volumes.
Search Engine Journal's analysis of the same dataset confirms it: domain-level authority outweighs page-level metrics. ChatGPT trusts the domain more than the individual page.
Google ranking position does correlate with ChatGPT citations. Positions 1-45 averaged 5 citations versus 3.1 for positions 64-75. But that's probably because the same signals (backlinks, authority) drive both.
The Long Tail Opportunity
Good news if you're not Wikipedia. The Wellows report found that the top 50 websites capture only 48% of all mentions. The remaining 52% goes to smaller, niche sites.
The Profound AI Search Shift study adds something encouraging: only a small portion of ChatGPT's citations match Google search results. ChatGPT maintains largely independent source selection. Don't rank well on Google? You can still get cited by ChatGPT if you have authority and good content structure.
How to Structure Content for ChatGPT Citations
A study by Search Engine Land analyzed 3 million ChatGPT responses and 30 million citations. The findings are worth your time:
Front-Load Your Key Information
44.2% of citations come from the first 30% of content in a consistent "ski ramp" pattern. Information at the top of your article gets cited far more than content buried at the bottom. This is the "inverted pyramid" from journalism, or an encyclopedia-style opening: the most important information comes first, not after a long build-up.
Use Q&A Heading Structure
Conversational Q&A structure doubles citation likelihood. 78.4% of citations tied to questions came from H2 headings. ChatGPT treats your H2s as prompts and the paragraph below as the answer. Write your headings as questions ("How does X work?" or "What is Y?") and you're directly matching how users query ChatGPT.
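One way to audit a page against this advice is to measure what share of its H2 headings are phrased as questions. The sketch below uses only Python's standard `html.parser`. Treating "ends with a question mark" as the definition of a question heading is a simplifying assumption, and the sample HTML is made up.

```python
from html.parser import HTMLParser


class H2Collector(HTMLParser):
    """Collects the text content of every <h2> element."""

    def __init__(self) -> None:
        super().__init__()
        self._in_h2 = False
        self._buffer: list[str] = []
        self.headings: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self._in_h2 = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_h2:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "h2" and self._in_h2:
            self.headings.append("".join(self._buffer).strip())
            self._in_h2 = False


def question_heading_share(html: str) -> float:
    """Fraction of H2 headings that end with a question mark (0.0 if none exist)."""
    collector = H2Collector()
    collector.feed(html)
    collector.close()
    if not collector.headings:
        return 0.0
    questions = [h for h in collector.headings if h.endswith("?")]
    return len(questions) / len(collector.headings)


SAMPLE_HTML = """
<h2>How does OAI-SearchBot work?</h2><p>...</p>
<h2>Crawler overview</h2><p>...</p>
"""
share = question_heading_share(SAMPLE_HTML)
print(f"{share:.0%} of H2 headings are questions")
```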
Pack in Specific Entities
Cited text averaged 20.6% proper nouns (versus 5-8% in typical English). Specific brands, tools, people, and place names reduce ambiguity and make your content easier to verify and cite. Not "many companies use this approach," but "Stripe, Shopify, and HubSpot use this approach."
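For a quick comparison of two drafts, a crude proxy for entity density can be computed without any NLP library. The sketch below counts capitalized words that are not sentence-initial. This is an assumption-laden heuristic, not the method used in the study: it undercounts names that start a sentence and miscounts words such as "I", so the result is only useful for comparing versions of the same text.

```python
import re


def capitalized_share(text: str) -> float:
    """Share of words that are capitalized and not the first word of a sentence."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    total = 0
    capitalized = 0
    for sentence in sentences:
        words = re.findall(r"[A-Za-z][A-Za-z'-]*", sentence)
        total += len(words)
        # Skip the first word: it is capitalized whether or not it is a name.
        capitalized += sum(1 for word in words[1:] if word[0].isupper())
    return capitalized / total if total else 0.0


vague = "Many companies use this approach."
specific = "Stripe, Shopify, and HubSpot use this approach."

print(round(capitalized_share(vague), 2))
print(round(capitalized_share(specific), 2))
```

A proper measurement would use a part-of-speech tagger or named-entity recognizer instead of capitalization.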
Strike the Right Tone
Cited text clustered at a subjectivity score of 0.47. Not dry fact, not emotional opinion. The sweet spot is analyst commentary: fact plus interpretation. The Flesch-Kincaid grade level of 16 outperformed dense academic prose at 19.1. Business-grade clarity wins over academic density.
Optimize Content and Section Length
More findings from the Wellows report:
- Total length: Articles under 800 words averaged 3.2 citations; over 2,900 words averaged 5.1
- Section length: 120-180 words between headings performed best (4.6 citations average)
- Expert quotes: Pages with expert quotes averaged 4.1 citations versus 2.4 without
- Statistical data: Content with 19+ data points averaged 5.4 citations versus 2.8 for minimal data
- Freshness: Content updated within 30 days gets 3.2x more citations
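The section-length finding is easy to check mechanically. The sketch below assumes the article has already been split into (heading, body) pairs, counts whitespace-separated words per section, and flags sections outside the 120-180 word range reported above. The function names and sample sections are hypothetical.

```python
def section_word_counts(sections: list[tuple[str, str]]) -> dict[str, int]:
    """Map each heading to the number of whitespace-separated words in its body."""
    return {heading: len(body.split()) for heading, body in sections}


def outside_range(counts: dict[str, int], low: int = 120, high: int = 180) -> list[str]:
    """Return headings whose section length falls outside [low, high] words."""
    return [heading for heading, n in counts.items() if not low <= n <= high]


# Made-up sections: one within the range, one far too short.
sections = [
    ("What is OAI-SearchBot?", "word " * 150),
    ("Why does it matter?", "word " * 40),
]

counts = section_word_counts(sections)
flagged = outside_range(counts)
print(counts)
print(flagged)
```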
The FAQPage Schema Advantage
A study on ChatGPT visibility found a striking correlation between structured data and citation rates:
6.2% of ChatGPT-visible websites had FAQPage schema versus only 0.8% of non-visible websites, a nearly 8x difference. JSON-LD helps LLMs understand content context: is this an expert article, a product with reviews, or a direct answer? Structured data is one of several key factors in AI agent readiness.
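For reference, FAQPage markup follows the schema.org vocabulary: a `FAQPage` whose `mainEntity` is a list of `Question` items, each with an `acceptedAnswer`. The sketch below generates that JSON-LD from question/answer pairs. The helper name and the sample pair are illustrative, and nothing here guarantees that any particular search system will use the markup.

```python
import json


def faq_jsonld(pairs: list[tuple[str, str]]) -> str:
    """Build schema.org FAQPage JSON-LD from (question, answer) pairs."""
    data = {
        "@context": "https://schema.org",
        "@type": "FAQPage",
        "mainEntity": [
            {
                "@type": "Question",
                "name": question,
                "acceptedAnswer": {"@type": "Answer", "text": answer},
            }
            for question, answer in pairs
        ],
    }
    return json.dumps(data, indent=2)


markup = faq_jsonld(
    [
        (
            "Which crawler indexes content for ChatGPT Search?",
            "OAI-SearchBot indexes content for ChatGPT Search results.",
        )
    ]
)
print(markup)
```

The resulting string is typically embedded in the page inside a `<script type="application/ld+json">` element.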
For e-commerce sites, OpenAI goes a step further. They accept structured product feeds (title, description, image, brand, SKU, price, availability, GTIN) for ChatGPT Shopping. This is a direct pipeline into ChatGPT's product recommendations.
Who Gets Cited the Most?
Multiple studies have looked at which domains dominate ChatGPT citations:
The Ahrefs study (9.6 million queries) found the top cited domains in the U.S. are Reddit, Wikipedia, Amazon, Forbes, and Business Insider. Wikipedia is cited by ChatGPT at 16.3% (versus 12.5% on Perplexity and 8.4% on Google AI Overviews).
The Profound study (730,000 conversations, Q4 2025) adds context:
- Wikipedia appears in ~1 in 6 conversations with citations (18%)
- Reddit appears in 13%
- Reuters and NIH each at 4%
- Turn 1 is 2.5x more likely to trigger citations than turn 10, and 4x more than turn 20
The Visual Capitalist / Ahrefs analysis (78.6 million searches) found Reddit leads across all AI models with 40.1% citation frequency, followed by Wikipedia at 26.3%.
How ChatGPT Search Differs from Google
Knowing the differences helps you optimize for both:
| Factor | Google Search | ChatGPT Search |
|---|---|---|
| Output format | List of 10 blue links | Synthesized answer with inline citations |
| Source concentration | Billions of indexed pages | Top 50 domains get 48% of mentions |
| Content freshness | Important for news | 3.2x more citations for content updated within 30 days |
| Source independence | N/A | Largely independent from Google rankings |
| Lower-ranked pages | Position 10 gets ~2.5% clicks | Position 10 gets ~4% citation rate (higher opportunity) |
| Content format | Rewards diverse formats | Prefers structured headings, bullet lists, tables |
What You Can Do Today
Based on the research, the highest-impact actions ranked by evidence strength:
1. Allow OAI-SearchBot
This is the binary gate. No access = no citations. Check your robots.txt and CDN firewall rules.
2. Add FAQPage Schema
The strongest structured data signal. 6.2% of visible sites have it versus 0.8% of non-visible sites, a nearly 8x difference.
3. Front-Load Content
Put your key information in the first third. 44% of citations come from the opening 30% of content.
4. Use Q&A Headings
Write H2s as questions. ChatGPT treats H2s as prompts and the paragraph below as the answer.
- Include expert quotes and statistics. Data-rich content gets nearly 2x more citations.
- Keep sections 120-180 words. The optimal length between headings.
- Update content regularly. Freshness within 30 days provides a 3.2x citation boost.
- Use specific entities. Proper nouns, brand names, and tool names reduce ambiguity.
- Server-side render your content. ChatGPT's crawlers cannot execute client-side JavaScript. Learn more about how AI agents see your website through the accessibility tree.
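The server-side rendering point in the list above can be tested by looking at the HTML exactly as the server returns it, before any JavaScript runs. The sketch below takes that raw HTML as a string, ignores `<script>` and `<style>` contents, and reports which key phrases are missing from the remaining text. The two sample pages are made up, and fetching the HTML (for example with `curl`) is left out so the example stays self-contained.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text outside <script> and <style> elements."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def missing_phrases(raw_html: str, phrases: list[str]) -> list[str]:
    """Return the phrases that do not appear in the non-script text of raw_html."""
    extractor = TextExtractor()
    extractor.feed(raw_html)
    extractor.close()
    text = " ".join(" ".join(extractor.chunks).split()).lower()
    return [phrase for phrase in phrases if phrase.lower() not in text]


# Server-rendered page: the content is present in the initial HTML.
SSR_PAGE = "<html><body><h1>Pricing guide</h1><p>Plans start at $10.</p></body></html>"
# Client-rendered page: the content only exists inside a script.
CSR_PAGE = '<html><body><div id="root"></div><script>render("Pricing guide")</script></body></html>'

print(missing_phrases(SSR_PAGE, ["Pricing guide"]))
print(missing_phrases(CSR_PAGE, ["Pricing guide"]))
```

If a phrase that matters for your page shows up as missing, a crawler that does not execute JavaScript would likely not see it either.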
Wrapping Up
ChatGPT's source selection rewards a specific combination: high domain authority, well-structured content with Q&A headings, rich structured data, and fresh updates. It's not a copy of Google rankings. Different game, different rules.
The sites investing in these signals today are building an advantage. And with 52% of citations going to sites outside the top 50 domains, there's real opportunity for specialized content to break through. For all of these findings in context, see the analysis of Vercel's 2026 AEO report.
Curious how other AI systems choose sources? Read the companion posts on how Claude selects sources to cite and how Google AI Overviews selects sources.
Sources
- OpenAI Crawlers Documentation - Official crawler specifications and robots.txt guidance
- OpenAI Publishers and Developers FAQ - Official guidance on ChatGPT Search visibility
- OpenAI: Introducing ChatGPT Search - Official launch announcement
- Wellows: 7K Queries, 485K Citations - The most comprehensive ChatGPT citation study
- Search Engine Land: 3M ChatGPT Responses, 30M Citations - Content structure analysis
- Search Engine Journal: Top Factors Influencing ChatGPT Citations - Domain authority analysis
- Insightland: Structured Data and AI Search - FAQPage schema visibility correlation
- Ahrefs: 100 Most Cited Domains in ChatGPT - 9.6 million queries analysis
- Profound: How ChatGPT Sources the Web - 730,000 conversations citation analysis
- Visual Capitalist: Most Cited Websites by AI Models - 78.6 million searches cross-platform analysis
- Profound: AI Search Shift - ChatGPT source independence from Google
- OpenAI: Product Feed Specification - ChatGPT Shopping structured data