Technical · 21 February 2026 · 15 min read

How AI Models Really Decide What to Say About Your Company

When a candidate asks ChatGPT "What's it like to work at [your company]?", the answer appears in seconds. Confident. Specific. Often wrong.

How did the AI arrive at that answer? Where did it get the salary figure? Why did it cite Reddit but not your careers page? And most importantly: where can you intervene to control what it says?

This is a technical deep-dive into the architecture behind AI responses — from training data to inference to citation selection.

If you're an employer brand professional, TA leader, or developer implementing AI visibility, this is how the machine actually works.

The Three-Layer Knowledge Stack
How Training Data Determines Base Knowledge
Retrieval Augmented Generation (RAG)
Entity Recognition & Knowledge Graphs
Citation Selection Logic
Model-Specific Differences
Where You Can Intervene

1. The Three-Layer Knowledge Stack

When an AI model answers a question about your company, it pulls from three distinct knowledge layers:

Layer 1: Base Model (Pre-Training)

What it is: Facts learned during training on billions of web pages.

Characteristics:

Frozen — doesn't update unless the model is retrained
Cutoff date — typically 6-18 months old
Broad but shallow — knows about major companies, weak on specifics
Biased toward popular content — Wikipedia, news, Reddit get overrepresented

Example: If your company was mentioned in TechCrunch before GPT-4's training cutoff, that fact is "baked into" the model's base knowledge.

Implication: For established brands, base knowledge provides a foundation. For newer companies (<5 years old), the base model likely knows very little.

Layer 2: Retrieved Context (RAG)

What it is: Live search results or vector database lookups that the model performs at query time.

Characteristics:

Fresh — can pull content from the last few hours
Selective — model retrieves ~5-10 sources, not the entire web
Ranker-dependent — search algorithm determines what gets retrieved
User-invisible: most models don't show you the retrieved sources (Perplexity and Google AI Overviews are the exceptions)

Example: You updated your careers page yesterday. If the AI retrieves it via RAG, today's response will include that new information.

Implication: This is your highest-leverage intervention point. If AI retrieves your content, you control the narrative. If it doesn't, you're at the mercy of whatever third-party content gets retrieved instead.

Layer 3: Structured Knowledge (Entity Databases)

What it is: Verified facts from knowledge graphs and structured sources.

Characteristics:

High trust — model treats these as authoritative
Specific schemas — Organization, Person, Place, etc.
Cross-referenced — multiple sources confirm the same fact
Machine-readable — JSON-LD, Wikidata, schema.org

Example: If your company has a Wikipedia page with an infobox, that structured data is treated as "ground truth" by most AI models.

Implication: If you can get facts into this layer (via schema.org markup, Wikidata, or industry directories), AI trusts them more than any marketing copy.

How the Layers Interact

When a user asks "What's the salary for a senior engineer at Acme Corp?", the AI:

1. Checks base knowledge: "Do I know what Acme Corp is?"
   - If yes → provides context (industry, size, location)
   - If no → relies entirely on retrieval
2. Performs retrieval: searches for "Acme Corp senior engineer salary"
   - Retrieves 5-10 sources (job listings, forum posts, salary aggregators)
   - Ranks them by relevance and authority
3. Cross-references structured data: "Does Acme's schema.org markup include salary data?"
   - If yes → treats that as the most reliable figure
   - If no → synthesizes from retrieved sources
4. Generates answer: combines all three layers, weighted by confidence

The critical insight: If you control Layer 2 (retrieval) and Layer 3 (structured data), you control the answer — even if the base model (Layer 1) knows nothing about you.
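One way to picture that precedence is as a simple lookup order. The sketch below is a deliberate simplification: the `Fact` record, the layer names, and the fixed priority list are all hypothetical, and real models blend sources probabilistically rather than following an explicit rule like this.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Fact:
    value: str
    layer: str  # "structured", "retrieved", or "base"


# Hypothetical precedence: structured data, then retrieval, then base knowledge.
LAYER_PRIORITY = ["structured", "retrieved", "base"]


def resolve_fact(candidates: list[Fact]) -> Optional[Fact]:
    """Return the candidate from the highest-priority layer, or None if empty."""
    for layer in LAYER_PRIORITY:
        for fact in candidates:
            if fact.layer == layer:
                return fact
    return None


candidates = [
    Fact("Series B", "base"),       # stale pre-training knowledge
    Fact("Series C", "retrieved"),  # fresh page pulled in at query time
]
print(resolve_fact(candidates).value)  # Series C
```

With no structured fact available, the retrieved value wins over the stale base value, which is the behaviour the three-layer model predicts.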

2. How Training Data Determines Base Knowledge

All AI models start with pre-training — learning patterns from massive text corpora.

What Gets Included in Training Data

For GPT-4, Claude, and similar models:

Common Crawl (billions of web pages)
Wikipedia (all languages)
Books (Project Gutenberg, public domain, licensed content)
Reddit (via licensing deals)
GitHub (code repositories)
News archives (AP, Reuters, etc.)
Academic papers (arXiv, PubMed)

What's typically EXCLUDED:

Content behind paywalls
Dynamically generated pages (JavaScript-heavy sites)
Pages that block crawlers (robots.txt Disallow)
Content from after the cutoff date

Employer-Specific Implications

High training data representation:

Large public companies (e.g., Google, Amazon)
Companies with extensive Wikipedia coverage
Companies frequently discussed on Reddit, Hacker News
Companies with heavy media coverage

Low training data representation:

Startups <5 years old
B2B companies with minimal consumer presence
Regional companies with limited online discussion
Companies that block web crawlers

Example:

Google: 100,000+ mentions in training data → base model knows extensive details
Series A startup: 50 mentions → base model knows almost nothing, relies on retrieval

Training Data Decay

Problem: Training data becomes outdated.

If GPT-4's cutoff is April 2023, and you:

Rebranded in June 2023 → base model uses the old name
Raised a Series C in 2024 → base model thinks you're Series B
Moved headquarters in 2025 → base model has the old location

Solution: This is why Layer 2 (RAG) exists. Fresh retrieval can override stale base knowledge — if the model retrieves your updated content.

3. Retrieval Augmented Generation (RAG)

RAG is the key to understanding modern AI responses. Here's how it works:

Step 1: Query Analysis

User asks: "What's it like to work at Acme Corp?"

The AI:

Identifies the query intent (information-seeking, employer research)
Extracts the entity ("Acme Corp")
Determines what type of information is needed (culture, benefits, environment)

Step 2: Search Query Construction

The AI generates multiple search queries to retrieve relevant context:

"Acme Corp company culture"
"Acme Corp employee reviews"
"Working at Acme Corp"
"Acme Corp benefits"
"Acme Corp careers"

These are sent to:

A search engine API (often Bing or Google)
A vector database of indexed content
An internal knowledge graph
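The query fan-out in Step 2 can be sketched as simple template expansion. This is an illustration only: the function name and the template list are assumptions based on the example queries above, and production systems typically have the model write the queries itself.

```python
def build_search_queries(entity: str) -> list[str]:
    """Expand one entity name into several search queries (illustrative templates)."""
    templates = [
        "{entity} company culture",
        "{entity} employee reviews",
        "Working at {entity}",
        "{entity} benefits",
        "{entity} careers",
    ]
    return [template.format(entity=entity) for template in templates]


for query in build_search_queries("Acme Corp"):
    print(query)
```

The practical takeaway is that one candidate question becomes several searches, so your content needs to match more than one phrasing.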

Step 3: Document Retrieval & Ranking

The AI receives hundreds of candidate documents and ranks them by:

Relevance (semantic similarity to the query)
Authority (domain trust score, backlinks)
Freshness (publication/update date)
Structure (how easy it is to extract facts)

Top 5-10 documents are selected for the context window.
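A weighted score is one plausible way to combine those four signals. The sketch below assumes each signal is already normalised to a 0-1 range; the weights, the `Candidate` type, and the example scores are made up for illustration, since no vendor publishes its actual ranking formula.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    url: str
    relevance: float  # semantic similarity to the query, 0-1
    authority: float  # domain trust, 0-1
    freshness: float  # recency, 0-1
    structure: float  # ease of fact extraction, 0-1


# Hypothetical weights; real rankers are not public.
WEIGHTS = {"relevance": 0.4, "authority": 0.3, "freshness": 0.2, "structure": 0.1}


def score(candidate: Candidate) -> float:
    """Weighted sum of the four ranking signals."""
    return sum(weight * getattr(candidate, name) for name, weight in WEIGHTS.items())


def top_k(candidates: list[Candidate], k: int = 5) -> list[Candidate]:
    """Keep the k highest-scoring candidates for the context window."""
    return sorted(candidates, key=score, reverse=True)[:k]


candidates = [
    Candidate("acmecorp.com/careers", relevance=0.9, authority=0.2, freshness=0.9, structure=0.8),
    Candidate("reddit.com/r/jobs/acme-thread", relevance=0.8, authority=0.8, freshness=0.6, structure=0.4),
    Candidate("acmecorp.com/press/2019", relevance=0.4, authority=0.6, freshness=0.1, structure=0.5),
]
for candidate in top_k(candidates, k=2):
    print(candidate.url, round(score(candidate), 2))
```

In this made-up example the Reddit thread outranks the official careers page despite being less relevant and less fresh, because the authority gap outweighs both.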

Step 4: Context Injection

The AI's actual prompt becomes:

You are a helpful assistant. Answer the user's question using the provided context.

Context:
[Document 1: Acme Corp Wikipedia page]
[Document 2: Glassdoor reviews for Acme Corp]
[Document 3: Reddit thread about Acme Corp salaries]
[Document 4: Acme Corp careers page]
[Document 5: TechCrunch article about Acme's Series B]

User question: What's it like to work at Acme Corp?

Your answer:

The AI then generates a response based primarily on these 5 documents, not on the entirety of its training data.
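Assembling that prompt is mostly string formatting. Here is a minimal sketch, assuming retrieved documents arrive as (title, text) pairs; the `build_prompt` helper and the sample document text are hypothetical, and real systems add truncation, deduplication, and source metadata.

```python
def build_prompt(question: str, documents: list[tuple[str, str]]) -> str:
    """Inject retrieved documents into the prompt ahead of the user question."""
    context_blocks = [
        f"[Document {index}: {title}]\n{text}"
        for index, (title, text) in enumerate(documents, start=1)
    ]
    return (
        "You are a helpful assistant. Answer the user's question "
        "using the provided context.\n\n"
        "Context:\n"
        + "\n\n".join(context_blocks)
        + f"\n\nUser question: {question}\n\nYour answer:"
    )


documents = [
    ("Acme Corp careers page", "Acme Corp offers hybrid working from its London office."),
    ("Reddit thread about Acme Corp salaries", "Senior engineers report a wide range of offers."),
]
print(build_prompt("What's it like to work at Acme Corp?", documents))
```

Anything that is not in `documents` never reaches the model at answer time, which is why retrieval matters so much.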

Step 5: Answer Synthesis

The AI:

Extracts relevant facts from the top documents
Cross-references them (e.g., "salary mentioned in both doc 3 and doc 4")
Weights by source authority (Wikipedia > Reddit)
Generates a natural language response

Critical insight: If your careers page isn't in those top 5-10 retrieved documents, it doesn't influence the answer — no matter how well-written it is.
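The cross-referencing step can be approximated as authority-weighted voting. The sketch below is a toy model, not how any specific product works: the authority scores and salary figures are invented, and a language model does this implicitly rather than with an explicit tally.

```python
from collections import defaultdict


def pick_supported_value(claims: list[tuple[str, float]]) -> str:
    """Return the value with the highest total authority across sources.

    claims: (claimed_value, source_authority) pairs, one per retrieved document.
    """
    support: dict[str, float] = defaultdict(float)
    for value, authority in claims:
        support[value] += authority
    return max(support, key=support.get)


claims = [
    ("85k", 0.4),  # forum thread (hypothetical authority score)
    ("95k", 0.5),  # official careers page
    ("85k", 0.3),  # salary aggregator
]
print(pick_supported_value(claims))  # 85k
```

Two moderately trusted sources that agree can outweigh one more trusted source that disagrees, so consistency across platforms matters as much as what your own page says.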

Why Your Careers Page Might Not Be Retrieved

Common reasons AI doesn't retrieve your official content:

1. Low Domain Authority

Your website has few backlinks compared to competitors
Google/Bing rank you low in search results
AI's ranker mirrors this

Fix: Build backlinks through PR, guest posts, partnerships.

2. Poor Semantic Match

Your careers page uses marketing jargon AI doesn't associate with candidate queries
Example: Page says "dynamic ecosystem of innovation" instead of "fast-paced environment"

Fix: Use the exact language candidates use when searching. Check Google Search Console for actual query terms.
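A crude way to spot a language mismatch is to measure how many query words appear on your page. The sketch below uses plain word overlap as a stand-in; real retrieval relies on embeddings that capture some synonyms, so treat this as a rough proxy. The function names and sample sentences are assumptions.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercase word set, splitting on anything that is not a letter or digit."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def query_coverage(query: str, page_text: str) -> float:
    """Fraction of query words that also appear in the page text."""
    query_words = tokens(query)
    if not query_words:
        return 0.0
    return len(query_words & tokens(page_text)) / len(query_words)


query = "fast-paced work environment"
jargon_page = "A dynamic ecosystem of innovation."
plain_page = "We offer a fast-paced work environment with flexible hours."

print(query_coverage(query, jargon_page))  # 0.0
print(query_coverage(query, plain_page))   # 1.0
```

Running this against the real queries from Google Search Console gives a quick list of candidate phrases your page never uses.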

3. Rendering Issues

Your careers page requires JavaScript to render
AI crawlers see an empty page or loading spinner

Fix: Server-side rendering (SSR) or static site generation (SSG). Use Next.js, Gatsby, or similar.
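To see roughly what a non-rendering crawler sees, count the visible words in the raw HTML before any JavaScript runs. This is a minimal sketch using only the standard library; the sample pages are invented, and it assumes you have already fetched the HTML as a string.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect words outside <script> and <style> elements."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.words: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.words.extend(data.split())


def visible_word_count(html: str) -> int:
    parser = VisibleText()
    parser.feed(html)
    return len(parser.words)


client_rendered = '<html><body><div id="root"></div><script src="/app.js"></script></body></html>'
server_rendered = "<html><body><h1>Careers at Acme Corp</h1><p>We offer 25 days of holiday.</p></body></html>"

print(visible_word_count(client_rendered))  # 0
print(visible_word_count(server_rendered))  # 10
```

A count near zero on your careers page is a strong hint that crawlers without a JavaScript engine see nothing useful.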

4. Blocking AI Crawlers

Your robots.txt disallows GPTBot, ClaudeBot, etc.
AI can't index your content for retrieval

Fix: Allow AI crawlers unless legally required not to.

5. No Clear Answers

Page is full of prose, no structured Q&A
AI can't easily extract facts to quote

Fix: Reformat as FAQ. Make facts extractable.

4. Entity Recognition & Knowledge Graphs

AI models don't just search for text — they recognize entities and query knowledge graphs.

What is an Entity?

An entity is a uniquely identifiable thing: a person, company, place, product.

When you mention "Acme Corp", the AI:

Recognizes it as an entity (type: Organization)
Looks it up in knowledge graphs (Wikidata, Google Knowledge Graph, proprietary DBs)
Retrieves structured facts (founded, headquarters, CEO, industry)

Knowledge Graph Structure

Entity: Acme Corp
Type: Organization
Properties:
  - foundedDate: 2015
  - headquarters: London, UK
  - industry: Financial Technology
  - numberOfEmployees: 250
  - website: acmecorp.com
  - sameAs: [Wikipedia URL, LinkedIn URL, Crunchbase URL]

How Schema.org Markup Feeds Knowledge Graphs

When you add JSON-LD to your homepage:

{
  "@context": "https://schema.org",
  "@type": "Organization",
  "name": "Acme Corp",
  "foundingDate": "2015",
  "numberOfEmployees": 250,
  "address": {
    "@type": "PostalAddress",
    "addressLocality": "London",
    "addressCountry": "GB"
  }
}

What happens:

Google crawls your page and extracts the schema
Google adds/updates your entity in the Knowledge Graph
AI models query the Knowledge Graph when answering questions
Your verified data gets priority over unstructured mentions

Evidence: Companies with complete schema.org markup had 40% higher factual accuracy in AI responses.
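To check what a crawler could extract from your own page, you can pull the JSON-LD out of the HTML yourself. The sketch below uses the standard library's `html.parser`; it illustrates the general idea and is not a description of how Google's pipeline works. The class name and sample page are assumptions.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the parsed contents of <script type="application/ld+json"> blocks."""

    def __init__(self) -> None:
        super().__init__()
        self._in_jsonld = False
        self._buffer: list[str] = []
        self.items: list[dict] = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.items.append(json.loads("".join(self._buffer)))
            self._in_jsonld = False


page = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Organization",
 "name": "Acme Corp", "foundingDate": "2015", "numberOfEmployees": 250}
</script>
</head><body><h1>Acme Corp</h1></body></html>
"""

extractor = JsonLdExtractor()
extractor.feed(page)
print(extractor.items[0]["name"], extractor.items[0]["foundingDate"])  # Acme Corp 2015
```

If `items` comes back empty, or `json.loads` raises an error, crawlers are unlikely to read your markup either.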

5. Citation Selection Logic

Different AI models handle citations differently. Here's how each decides what to cite:

ChatGPT (GPT-4)

Citation behavior: Rarely cites sources explicitly. Synthesizes from multiple sources without attribution.

When it does cite:

User asks "What are your sources?"
Information is controversial or surprising
Direct quotes

What it prefers to cite:

High-authority domains (Wikipedia, major news)
Recent content (last 12 months)
Structured data (schema.org, Wikidata)

Employer implication: ChatGPT will use your data without citing you explicitly. That's fine — factual accuracy matters more than attribution.

Google AI Overviews

Citation behavior: Always cites sources with clickable links.

What it prefers to cite:

Pages already ranking in top 10 Google search results
Reddit (weighted heavily in recent algorithm updates)
YouTube (owned by Google)
LinkedIn (high trust for employer info)
Government/education domains (.gov, .edu)

Selection logic:

Runs a Google search for the query
Takes top 3-5 results
Extracts relevant facts
Generates answer with inline citations

Employer implication: If you want to be cited in AI Overviews, you first need to rank in traditional Google search. GEO depends on SEO.

Perplexity

Citation behavior: Always cites, shows sources above the answer, allows users to click through.

What it prefers to cite:

Reddit (heavily weighted — 6.6% of all citations)
YouTube (2.0%)
Wikipedia (1.8%)
LinkedIn (1.4%)
Company websites (1.1%)

Selection logic:

Searches multiple sources (web + Reddit + news)
Ranks by relevance + freshness
Prioritizes content with clear answers
Shows top 5-8 sources

Employer implication: To get cited by Perplexity, be present on Reddit (participate in relevant discussions) and LinkedIn (post company updates).

Claude (Anthropic)

Citation behavior: Rarely cites unless asked. Relies heavily on training data.

When retrieval is enabled (Claude with search):

Similar to ChatGPT — synthesizes without explicit citation
Prefers authoritative sources
Strong bias toward recent content

Employer implication: Claude is harder to influence post-training. Focus on getting into training data (PR, backlinks, Wikipedia) and using RAG-enabled versions (e.g., Claude on Perplexity).

6. Model-Specific Differences

Each AI model has architectural quirks that affect what it says about you:

Model	Training Cutoff	RAG Enabled	Citation Style	Preferred Sources
GPT-4 (ChatGPT)	Apr 2023	Yes (browsing mode)	Rare	Wikipedia, Reddit, news
GPT-4o	Oct 2023	Yes (always on)	Occasional	Same as GPT-4
Claude 3.5	Jul 2023	Varies (plugin-dependent)	Rare	Training data-heavy
Google AI	Continuous	Yes (always on)	Always	Google search results, Reddit
Perplexity	Continuous	Yes (core feature)	Always	Reddit, YouTube, Wikipedia
Meta AI	Aug 2023	Yes	Occasional	Instagram, Facebook, web

Practical Implications

For maximum AI visibility:

Wikipedia presence → helps ALL models
Reddit activity → critical for Perplexity, Google AI
Schema.org markup → helps all retrieval-based models
Recent content updates → affects GPT-4o, Google AI, Perplexity
Traditional SEO → required for Google AI Overviews

7. Where You Can Intervene

Based on the architecture above, here are the 7 intervention points where you can influence what AI says:

Intervention Point 1: Training Data (Low Control)

What it is: Getting mentioned in sources that feed training data.

How to do it:

Wikipedia article (if you meet notability guidelines)
News coverage (TechCrunch, Reuters, industry press)
Reddit presence (participate authentically in relevant subreddits)
GitHub (open-source projects, engineering blog)

Timeline: 6-18 months (until next model retraining)

Difficulty: Hard (requires PR, media relations, community building)

Impact: High (becomes "ground truth" in base knowledge)

Intervention Point 2: Schema.org Markup (High Control)

What it is: Adding JSON-LD structured data to your website.

How to do it:

<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Organization",
  "name": "Acme Corp",
  "description": "Fintech company building payment infrastructure",
  "foundingDate": "2015",
  "numberOfEmployees": 250,
  "address": {
    "@type": "PostalAddress",
    "addressLocality": "London",
    "addressCountry": "GB"
  },
  "sameAs": [
    "https://www.linkedin.com/company/acmecorp",
    "https://twitter.com/acmecorp"
  ]
}
</script>

Timeline: Immediate (takes effect once crawled)

Difficulty: Easy (one-time implementation)

Impact: Medium-High (30-40% improvement in factual accuracy)

Intervention Point 3: Content Structure (High Control)

What it is: Formatting your careers page for extractability.

How to do it:

FAQ format (H2 questions, paragraph answers)
Tables (e.g., salary ranges by role)
Bulleted lists (benefits, perks)
Semantic HTML (proper heading hierarchy)

Timeline: Immediate (once reindexed)

Difficulty: Medium (requires content rewrite)

Impact: Medium (20-25% increase in citation rate)

Intervention Point 4: Crawler Access (High Control)

What it is: Allowing AI bots to crawl your website.

How to do it: Check robots.txt, ensure these are NOT blocked:

User-agent: GPTBot
User-agent: ClaudeBot
User-agent: Google-Extended
User-agent: CCBot
Allow: /

Timeline: Immediate

Difficulty: Easy (one line of code)

Impact: Medium (15-20% increase in retrieval)
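You can verify this programmatically with Python's built-in `urllib.robotparser`. The sketch below parses a robots.txt string and reports which of the four bots are blocked for a given URL; the sample file and the careers URL are hypothetical, and in practice you would load your live robots.txt instead.

```python
from urllib.robotparser import RobotFileParser

AI_BOTS = ["GPTBot", "ClaudeBot", "Google-Extended", "CCBot"]


def blocked_bots(robots_txt: str, url: str) -> list[str]:
    """Return the AI user agents that robots.txt forbids from fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [bot for bot in AI_BOTS if not parser.can_fetch(bot, url)]


# Example file that blocks GPTBot only (an assumption for illustration).
robots_txt = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

print(blocked_bots(robots_txt, "https://acmecorp.com/careers"))  # ['GPTBot']
```

An empty list means none of the four bots is blocked for that URL.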

Intervention Point 5: Multi-Platform Presence (Medium Control)

What it is: Maintaining verified profiles across platforms AI trusts.

How to do it:

Complete LinkedIn Company Page
Claim Glassdoor profile (even if you can't control reviews)
Claim Indeed company page
Wikipedia (if eligible)
Crunchbase (for tech companies)

Timeline: Immediate (once profiles are populated)

Difficulty: Medium (requires ongoing maintenance)

Impact: Medium (15-18% increase in accurate citations)

Intervention Point 6: Freshness (Medium Control)

What it is: Regularly updating your content.

How to do it:

Add dateModified to schema markup
Update careers page quarterly
Publish blog posts, company news
Refresh job listings regularly

Timeline: Ongoing

Difficulty: Medium (requires process)

Impact: Low-Medium (8-12% boost)
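If your schema markup is generated from code, stamping `dateModified` at publish time keeps it honest. This is a minimal sketch: the helper name and the careers URL path are assumptions, and it uses a `WebPage` object because `dateModified` is a property of pages and other creative works in schema.org, not of `Organization`.

```python
import json
from datetime import date


def with_date_modified(page: dict, modified: date) -> dict:
    """Return a copy of the JSON-LD object with dateModified in ISO 8601 form."""
    stamped = dict(page)
    stamped["dateModified"] = modified.isoformat()
    return stamped


careers_page = {
    "@context": "https://schema.org",
    "@type": "WebPage",
    "name": "Careers at Acme Corp",
    "url": "https://acmecorp.com/careers",  # hypothetical path
}

stamped = with_date_modified(careers_page, date(2026, 2, 21))
print(json.dumps(stamped, indent=2))
```

Only update the date when the content genuinely changes; a date that moves without any content change is a weak signal at best.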

Intervention Point 7: Domain Authority (Low Control)

What it is: Building backlinks and brand search volume.

How to do it:

PR and media coverage
Guest posts on industry blogs
Partnerships and co-marketing
Speaking at conferences

Timeline: Months to years

Difficulty: Hard (requires sustained effort)

Impact: High (14-18% correlation with citation frequency)

The Priority Stack

If you can only do three things, do these:

1. Implement Schema.org Markup

Why: Highest ROI. One-time effort, immediate impact, works across all models.

What to add:

Organization schema (homepage)
FAQPage schema (careers page)
JobPosting schema (job listings with salary data)

Time: 4-6 hours for a developer

Impact: 30-40% improvement in factual accuracy
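The FAQPage schema is the least familiar of the three, so here is a sketch of generating it from question-and-answer pairs. The `Question` and `acceptedAnswer` structure follows schema.org; the helper name and the sample questions and answers are placeholders to replace with your own content.

```python
import json


def faq_page_jsonld(pairs: list[tuple[str, str]]) -> dict:
    """Build a schema.org FAQPage object from (question, answer) pairs."""
    return {
        "@context": "https://schema.org",
        "@type": "FAQPage",
        "mainEntity": [
            {
                "@type": "Question",
                "name": question,
                "acceptedAnswer": {"@type": "Answer", "text": answer},
            }
            for question, answer in pairs
        ],
    }


# Placeholder content for illustration only.
faq = faq_page_jsonld([
    ("Does Acme Corp offer remote work?", "Yes, most roles are hybrid."),
    ("How many people work at Acme Corp?", "Around 250."),
])
print(json.dumps(faq, indent=2))
```

Paste the output into a `<script type="application/ld+json">` tag on the careers page, and keep the visible FAQ text identical to the markup.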

2. Reformat Careers Page as FAQ

Why: Dramatically improves retrievability and extractability.

What to change:

Replace prose with Q&A structure
Add tables (salary ranges, benefits)
Use bulleted lists
Clear heading hierarchy

Time: 1-2 days for a content writer

Impact: 20-25% increase in citation rate

3. Allow AI Crawlers

Why: If AI can't see your site, you have zero control.

What to check:

robots.txt doesn't block AI bots
sitemap.xml includes careers pages
Site doesn't require JavaScript to render

Time: 30 minutes

Impact: 15-20% increase in retrieval
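The sitemap check is easy to automate. The sketch below parses a sitemap string with the standard library and looks for careers URLs; the sample sitemap and the `/careers` path fragment are assumptions, so adjust them to your own site structure.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(sitemap_xml: str) -> list[str]:
    """Return every <loc> URL listed in a sitemap document."""
    root = ET.fromstring(sitemap_xml)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


def has_careers_url(sitemap_xml: str, fragment: str = "/careers") -> bool:
    """True if any listed URL contains the given path fragment."""
    return any(fragment in url for url in sitemap_urls(sitemap_xml))


sitemap = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://acmecorp.com/</loc></url>
  <url><loc>https://acmecorp.com/careers</loc></url>
</urlset>
"""

print(has_careers_url(sitemap))  # True
```

This only handles a single sitemap file; if your site uses a sitemap index, each child sitemap needs the same check.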

Conclusion: The Architecture is Your Advantage

Most employer brand teams treat AI visibility like SEO from 2005 — throw keywords at the wall and hope.

But AI isn't a search engine. It's a knowledge synthesis machine with a specific architecture: base knowledge + retrieval + structured data.

If you understand the architecture, you can intervene at each layer:

Training data: Build presence on Wikipedia, Reddit, news
Retrieval: Optimize for search, allow crawlers, structure content
Structured knowledge: Implement schema.org, cross-platform presence

The employers who win in the AI era won't be those with the biggest budgets. They'll be those who understand the machine.

FAQ

Q: Can I pay AI companies to prioritize my content?
A: No. There's no "sponsored content" model for AI responses (yet). Visibility is earned through authority, structure, and relevance.

Q: How often do AI models retrain?
A: Base models are typically retrained every 6-18 months. Retrieval happens in real time. Focus your effort on Layer 2 (retrieval) for fastest impact.

Q: If I block AI crawlers, will I still be mentioned?
A: Possibly, if you're famous enough (like Glassdoor). But for most employers, blocking = invisibility.

Q: Do AI models "learn" from user conversations?
A: Indirectly. User feedback influences future training data, but individual conversations don't immediately update the model.

Q: What's the best way to monitor what AI says about me?
A: Run an OpenRole audit monthly. Track changes over time. Alert on major inaccuracies.
