Skip to content

Design a System to Detect and Prevent LLM Hallucinations


What We're Building

A customer-facing generative AI product (support assistant, copilot, or Q&A surface) where wrong but confident answers create legal, reputational, and user-trust risk. We need an end-to-end reliability layer that detects, prevents where possible, and contains hallucinations — not a single trick, but defense in depth: grounding with citations, statistical checks, fact verification, calibrated confidence, guardrails, human review, and continuous measurement.

Scope: The system sits around the LLM — ingestion for knowledge, retrieval for grounding, post-generation verification, routing to humans when uncertain, and feedback loops into models and prompts.

Why This Problem Is Hard

Challenge Description
Open-ended outputs Unlike classification, every completion can introduce new unsupported claims
Plausible fluency Models sound authoritative even when wrong; users over-trust fluent text
No single oracle Ground truth is partial (KB stale), expensive (human labels), or absent (novel queries)
Latency vs. depth Strong verification (search, NLI, multi-sample) adds hundreds of ms to seconds
Calibration mismatch Token probabilities are often poorly calibrated for “truth”; low perplexity ≠ factual
Long contexts More retrieved text increases grounding opportunity and contradiction / lost-in-middle risk
Adversarial & edge cases Jailbreaks, leading questions, and domain drift break naive guardrails
Measurement “Hallucination rate” must be defined, sampled, and decomposed (intrinsic vs. extrinsic)

Real-World Scale

Metric Scale (illustrative enterprise / consumer product)
Generations / day 5M–50M (support bot, copilot, or in-app assistant)
Peak generation QPS 200–2,000 (regional peaks 3–5× average)
Knowledge documents 1M–20M chunks in vector + lexical indexes
Human reviewers (FTE-equivalent) 50–500 (queue depth SLA-driven)
Verification calls (NLI + search) 2–10× claim count per answer on strict tiers
Target end-to-end latency (P95) 2–8 s (tiered: fast path vs. high-assurance path)
False omission tolerance Low for regulated answers; product accepts more “I don’t know” in high-risk intents

Warning

No single technique eliminates hallucination. Interviews reward candidates who articulate layered controls, explicit abstention, measurable residual risk, and when to spend compute vs. human time.


Key Concepts Primer

Hallucination Types (Useful Taxonomy)

Type Definition Typical mitigation
Intrinsic Contradicts user prompt or retrieved context Grounding prompts, NLI vs. context, citation enforcement
Extrinsic Not supported by world / KB Search, KG lookup, fact-check pipeline, abstain
Confabulation Fabricated specifics (names, numbers, URLs) Regex/structured validators, allowlists, retrieval-first for facts
flowchart TB subgraph gen["Generation Path"] U[User Query] --> RAG[RAG / Grounding Layer] RAG --> LLM[LLM Generate] LLM --> OUT[Draft Answer] end subgraph verify["Verification Stack"] OUT --> CE[Claim Extraction] CE --> VE[Verification Engine] VE --> CS[Confidence Scorer] CS --> GR[Guardrails] GR --> ROUTE{Policy Router} ROUTE -->|high confidence| SERVE[Serve + Citations] ROUTE -->|low / fail| HITL[Human Queue / Safe Fallback] end

Grounding via RAG and Citations

Grounding ties each substantive statement to evidence spans in retrieved documents. Citations are the UX and audit layer; NLI or entailment checks are the enforcement layer.

Tip

In production, separate “model cited a source” from “source entails the claim.” Citation markers are easy to game; automatic claim–evidence verification catches silent hallucinations.

Self-Consistency

Self-consistency: sample N independent answers (different temperatures or prompts), then compare — e.g., vote on claims, measure pairwise ROUGE/BERTScore, or ask a judge model. High disagreement ⇒ epistemic uncertainty ⇒ downgrade confidence or escalate.

Fact-Checking Pipeline (Claims → Verify)

  1. Segment answer into sentences or atomic claims (subject–predicate–object style or short propositions).
  2. For each claim, retrieve candidate evidence (KB, web search API, KG).
  3. Classify support: supported / refuted / not enough evidence (NLI, retrieval score threshold, or LLM judge with constraints).
  4. Aggregate into answer-level decision: block, rewrite, add disclaimer, or route to human.

Confidence from Log Probabilities

Token logprobs sum (or average per token) to a sequence score. Useful as a cheap signal, not a calibrated probability of truth.

Common aggregates:

  • Mean log-likelihood per token: \(\frac{1}{T} \sum_t \log p(x_t \mid x_{<t})\)
  • Min-token logprob: flags single-token brittleness (rare entities)
  • Per-span aggregation: highlight low-confidence spans for UI or downstream NLI

Note

Open/closed-book behavior differs: low logprob may mean “model is guessing” or “model is memorizing an incorrect fact confidently.” Always combine with external verification for high-stakes claims.

Guardrails

Class Example
Regex / grammar Block fake phone formats, invented SKUs when schema known
Structured KB “If product_id not in catalog API → do not assert price”
LLM judge Second pass: “Is this claim entailed by CONTEXT?” (costly; use on sampled or risky intents)

Human-in-the-Loop (HITL)

Queue tasks with: draft answer, risk score, intent, user tier, SLA, and evidence bundle (retrieved chunks, verification outputs). Review outcomes feed supervised data for calibration, prompt tuning, and fine-tuning.

Calibration and “I Don’t Know”

Fine-tuning / preference optimization (DPO, RLHF) can reward abstention when evidence is weak — e.g., train on (question, context, correct refusal) pairs. Combine with policy: if max(verification_score) < τ, return templated refusal instead of creative completion.


Step 1: Requirements Clarification

Questions to Ask

Question Why It Matters
Which domains (medical, legal, shopping, internal IT)? Drives risk tiering and evidence sources
Is web search allowed for verification? Latency, compliance, and licensing
Latency SLA by intent? Determines which checks run inline vs. async
Languages and locales? NLI models, search indexes, reviewer pools
Audit / retention requirements? Logging of prompts, claims, and reviewer actions
User-visible behavior on failure? Silent downgrade vs. explicit “we’re checking”
Knowledge freshness? KG vs. search vs. crawl; stale KB ⇒ extrinsic hallucinations
Attack model (jailbreak, prompt injection)? Input sanitization, tool isolation

Functional Requirements

Requirement Priority Description
RAG grounding + citations Must have Retrieve evidence; require citation markers for factual intents
Claim extraction Must have Decompose answer into checkable units
Verification vs. KB / search Must have Support / refute / unknown per claim
Confidence scoring Must have Answer- and span-level scores for routing
Self-consistency (optional path) Should have N-sample check for high-risk or low-margin decisions
Guardrails (regex + LLM) Must have Block known failure patterns; consistency checks
Human review queue Must have Escalate low-confidence or policy triggers
Feedback loop Should have Reviewer labels → eval sets → training / prompts
Calibration / abstention policy Should have Template refusals when evidence insufficient
Hallucination rate metrics Must have Online proxies + offline gold sets

Non-Functional Requirements

Requirement Target Rationale
P95 latency (standard tier) 2–4 s Competitive UX for chat
P95 latency (high-assurance tier) 6–12 s acceptable Deep verification + optional HITL async
Verification precision High on “block” path False blocks erode trust; tune thresholds
Availability 99.9%+ Degrade to safe answers if verifiers fail
Cost per answer Bounded Cap NLI/search calls per request; cache results
Privacy Region / tenant isolation Evidence bundles may contain PII

API Design

# POST /v1/assist
# Request
{
    "session_id": "sess-8f3a",
    "user_id": "usr-1029",
    "tenant_id": "acme-corp",
    "query": "What is the warranty period for SKU-4491 in the EU?",
    "tier": "standard",  # standard | high_assurance
    "locale": "en-GB",
    "allow_web_verify": false,
    "stream": true
}

# Response (non-streaming aggregate; streaming sends deltas + final envelope)
{
    "answer": "For SKU-4491, the EU warranty is 24 months from date of purchase...",
    "citations": [
        {"id": 1, "doc_id": "pol-warranty-2024", "span": "EU commercial warranty: 24 months...", "url": "..."}
    ],
    "claims": [
        {
            "id": "c1",
            "text": "EU warranty for SKU-4491 is 24 months from purchase.",
            "verification": "supported",
            "evidence_ids": [1],
            "scores": {"nli_entailment": 0.94, "retrieval": 0.88}
        }
    ],
    "confidence": {
        "answer": 0.91,
        "low_confidence_spans": [{"start": 120, "end": 145, "token_min_logprob": -8.2}]
    },
    "routing": "served",
    "self_consistency": {"n_samples": 5, "pairwise_agreement": 0.86},
    "review_task_id": null,
    "policy_flags": []
}

Step 2: Back-of-Envelope Estimation

Traffic

Daily generations:              10,000,000
Avg peak factor:                4×
Seconds per day:                86,400

Average QPS:                    10M / 86,400 ≈ 116
Peak QPS:                       ~460

High-assurance tier:            10% of traffic → ~46 peak QPS
Claims per answer (avg):        4
NLI inferences / answer (std):  4 claims × 2 passages avg = 8
Peak NLI calls / sec (std):     460 × 8 ≈ 3,700 (batched on GPU)

Storage

Chunk store (text + metadata):   5M chunks × 1.5 KB ≈ 7.5 GB
Vector index (768-dim float32):  5M × 768 × 4 B ≈ 15 GB
Verification cache (hot):        1M entries × 2 KB ≈ 2 GB (TTL 24–72h)
Human review metadata + bundles: 500K tasks × 50 KB ≈ 25 GB / month (compressed, tiered)

Audit log (sampled payloads):     Assume 200 GB / month → cold storage after 7 days

Compute

LLM generation:                  Dominated by GPU serving fleet (separate sizing)
NLI / entailment model:          4–8 ms / pair on GPU at batch 32 → 3.7K QPS needs
                                 ~15–30 GPU workers (with batching + queueing)

Search / KG:                     Mostly network + CPU; cache cuts repeat queries by 40–60%
Self-consistency (N=5):          5× LLM cost for that path — reserve for ≤10% of traffic

Cost (Rough Monthly, Illustrative)

Component Assumption Order of magnitude
LLM tokens 10M answers × 800 out + 2K in Largest line item; varies by model
NLI GPU 30× A10G equivalent @ partial util \(15K–\)50K
Search API 20M queries × $0.001 $20K
Human review 2% escalated × 2 min × loaded wage \(80K–\)400K (highly variable)

Note

Human review often dominates fully loaded cost. The design should minimize escalations via good retrieval and calibrated routing — not eliminate them.


Step 3: High-Level Design

Architecture Overview

flowchart TB subgraph clients["Clients"] WEB[Web / Mobile App] API[Partner API] end subgraph ingest["Ingestion Pipeline"] CONN[Connectors] PARSE[Parser / OCR] CHUNK[Chunker] EMB[Embedding Service] META[Metadata & ACL] end subgraph stores["Knowledge Stores"] VDB[(Vector Index)] LEX[(Lexical / BM25)] DOC[(Document Store)] KG[(Knowledge Graph<br/>optional)] end subgraph ground["Grounding Layer"] RET[Hybrid Retriever] RERANK[Reranker] PB[Prompt Builder<br/>cite-only policy] end subgraph gen["Generation"] LLM[LLM Service] end subgraph post["Post-Generation Quality"] CE[Claim Extraction Service] VE[Verification Engine] CS[Confidence Scorer] GR[Guardrails Service] SC[Self-Consistency<br/>Orchestrator] end subgraph hitl["Human Loop"] RQ[Review Queue] RV[Reviewer UI] FB[Feedback Store] end subgraph ops["Ops & Learning"] MET[Metrics / Hallucination KPIs] EVAL[Offline Eval Pipeline] FT[Calibration FT / DPO Jobs] end WEB --> RET API --> RET ingest --> stores RET --> VDB RET --> LEX RET --> DOC RET --> KG RERANK --> PB --> LLM LLM --> CE --> VE VE --> VDB VE --> LEX VE --> KG VE --> CS --> GR GR -->|pass| clients GR -->|escalate| RQ RV --> FB FB --> EVAL --> FT SC -.->|optional pre-merge| LLM MET --> CE MET --> VE MET --> GR

Component Responsibilities

Component Role
Ingestion pipeline Keeps KB fresh; versioned chunks; ACLs; dedup
Grounding layer Retrieves evidence; builds constrained prompts; optional query decomposition
LLM service Produces draft with citation scaffolding
Claim extraction Sentence/claim segmentation + NER-style predicates for checkability
Verification engine NLI vs. retrieved passages, KG triple match, search corroboration
Confidence scorer Fuses logprobs, verifier scores, retrieval margin, self-consistency
Guardrails Regex, allowlists, injection checks, final LLM judge (tiered)
Human review queue SLA-based assignment; structured disposition codes
Feedback loop Gold errors → prompts, thresholds, and training

Step 4: Deep Dive

4.1 Data Models

from __future__ import annotations

from dataclasses import dataclass, field
from enum import Enum
from typing import Literal


class VerificationLabel(str, Enum):
    SUPPORTED = "supported"
    REFUTED = "refuted"
    INSUFFICIENT = "insufficient_evidence"


@dataclass
class EvidenceSpan:
    doc_id: str
    chunk_id: str
    text: str
    score: float


@dataclass
class Claim:
    id: str
    text: str
    char_span: tuple[int, int]
    entities: list[str] = field(default_factory=list)


@dataclass
class ClaimVerification:
    claim_id: str
    label: VerificationLabel
    evidence: list[EvidenceSpan]
    nli_scores: dict[str, float]  # e.g. entailment, neutral, contradiction


@dataclass
class AnswerArtifact:
    request_id: str
    query: str
    draft_text: str
    citations: list[EvidenceSpan]
    claims: list[Claim]
    verifications: list[ClaimVerification]
    token_logprobs: list[float] | None
    routing: Literal["served", "rewritten", "blocked", "human_pending"]

4.2 Claim Extraction and Verification Pipeline

flowchart LR subgraph in["Input"] A[Draft Answer] CTX[Retrieved Context] end subgraph extract["Claim Extraction"] SEG[Segmenter] ATOM[Atomic Claimifier<br/>LLM or rules] FILTER[Checkability Filter] end subgraph evid["Evidence Acquisition"] LOC[Locate in cited chunks] EXP[Expand retrieval<br/>if needed] SEA[Search / KG Lookup] end subgraph decide["Verification"] NLI[NLI / Entailment] XREF[Cross-reference<br/>agreement] AGG[Label + score] end A --> SEG --> ATOM --> FILTER FILTER --> LOC LOC -->|gap| EXP --> SEA LOC --> NLI SEA --> NLI NLI --> XREF --> AGG CTX --> NLI

Algorithms (claim extraction):

  1. Syntactic split on sentence boundaries (spaCy, blingfire, or language-specific).
  2. Atomic claimification: small LLM prompt: “Split into self-contained factual claims; one JSON per line.”
  3. Checkability filter: drop opinions (“I think”), pure pleasantries, and claims already marked as direct quotes from user.
  4. NLI: premise = concatenated top-k evidence strings; hypothesis = claim; label = argmax(entailment, neutral, contradiction).
# Claim extraction + NLI verification (Python)
from typing import Protocol


class NLIHead(Protocol):
    def predict(self, premise: str, hypothesis: str) -> dict[str, float]:
        """Returns class probabilities, keys: entailment, neutral, contradiction."""
        ...


def segment_sentences(text: str) -> list[str]:
    # Production: use a robust NLP library; placeholder splits on .?!
    import re

    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def claims_from_sentences(sentences: list[str], llm_decompose) -> list[str]:
    """Optionally break compound sentences into atomic claims via LLM."""
    atomic: list[str] = []
    for s in sentences:
        sub = llm_decompose(s)  # returns list[str]
        atomic.extend(sub)
    return atomic


def verify_claim(
    claim: str,
    evidence_block: str,
    nli: NLIHead,
    entailment_threshold: float = 0.7,
    contradiction_threshold: float = 0.5,
) -> tuple[VerificationLabel, dict[str, float]]:
    probs = nli.predict(premise=evidence_block[:8000], hypothesis=claim)
    ent = probs.get("entailment", 0.0)
    con = probs.get("contradiction", 0.0)

    if con >= contradiction_threshold:
        return VerificationLabel.REFUTED, probs
    if ent >= entailment_threshold:
        return VerificationLabel.SUPPORTED, probs
    return VerificationLabel.INSUFFICIENT, probs

Java example — verification orchestration sketch (service layer):

public record ClaimVerificationResult(
    String claimId,
    String label,
    double entailment,
    double contradiction
) {}

public final class VerificationService {
    private final NliClient nli;
    private final EvidenceRetriever retriever;

    public ClaimVerificationResult verify(String claim, String tenantId) {
        var evidence = retriever.fetchTopK(claim, tenantId, 5);
        String premise = String.join("\n", evidence.passages());
        var p = nli.score(premise, claim);
        String label = p.contradiction() >= 0.5 ? "refuted"
            : p.entailment() >= 0.7 ? "supported"
            : "insufficient_evidence";
        return new ClaimVerificationResult(
            hashClaim(claim), label, p.entailment(), p.contradiction());
    }
}

Go example — circuit breaker wrapper for NLI RPC:

type NLIClient interface {
    Score(ctx context.Context, premise, hypothesis string) (Probs, error)
}

type BreakerNLI struct {
    inner  NLIClient
    cb     *CircuitBreaker // counts failures, opens circuit
    fallback Probs         // return neutral on open circuit
}

func (b *BreakerNLI) Score(ctx context.Context, premise, hypothesis string) (Probs, error) {
    if b.cb.State() == Open {
        return b.fallback, nil
    }
    p, err := b.inner.Score(ctx, premise, hypothesis)
    if err != nil {
        b.cb.RecordFailure()
        return b.fallback, err
    }
    b.cb.RecordSuccess()
    return p, nil
}

4.3 Verification Strategies

Strategy When to use Pros Cons
Cited-chunk NLI Default; claim maps to [n] Fast, uses already-fetched evidence Misses if citation wrong or missing
Expanded retrieval Low entailment on cited chunk Improves recall Extra latency; may pull wrong doc
KG triple lookup Structured facts (SKU, policy IDs) High precision Coverage gaps; maintenance
Search corroboration Extrinsic facts; fresh data Broad coverage Noise, ranking bias, compliance
Cross-reference High stakes Multiple independent sources agree Cost; agreement ≠ truth (echo chamber)

Cross-reference check: map claim to normalized fact fingerprint (entities + relation type). Require ≥2 independent URLs or ≥1 KG fact + 1 document for auto-approve tier.

4.4 Confidence Scoring

Fusion formula (illustrative): combine verifier, retrieval, and (optional) self-consistency into a single score in [0,1].

Let:

  • \(e^*\) = max entailment probability over evidence passages for the claim
  • \(r^*\) = max retrieval / reranker score for the best passage
  • \(a\) = pairwise agreement among N self-consistency samples (fraction of claim-level matches)
  • \(m\) = min normalized token logprob over answer span (mapped to [0,1] via sigmoid)

Example:

\[ \text{conf}_{\text{claim}} = 0.45 \cdot e^* + 0.25 \cdot r^* + 0.20 \cdot a + 0.10 \cdot m \]

Answer-level confidence: harmonic mean (punishes one weak claim) or min over claims for strict intents.

import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def token_confidence_from_logprobs(logprobs: list[float]) -> float:
    """Map average logprob to [0,1] — tune scale on held-out data."""
    if not logprobs:
        return 0.5
    avg_lp = sum(logprobs) / len(logprobs)
    # Example calibration: assume avg_lp typically in [-0.5, -8]
    return sigmoid((avg_lp + 3.0) / 2.0)


def span_min_token_flag(
    tokens: list[str],
    logprobs: list[float],
    threshold: float = -6.0,
) -> list[tuple[int, int]]:
    """Return token index ranges where logprob below threshold."""
    bad: list[tuple[int, int]] = []
    for i, lp in enumerate(logprobs):
        if lp < threshold:
            bad.append((i, i + 1))
    return bad


def fuse_claim_confidence(
    entailment: float,
    retrieval: float,
    consistency_agreement: float,
    token_conf: float,
    weights: tuple[float, float, float, float] = (0.45, 0.25, 0.20, 0.10),
) -> float:
    w_e, w_r, w_a, w_t = weights
    return w_e * entailment + w_r * retrieval + w_a * consistency_agreement + w_t * token_conf

4.5 Confidence Scoring Flow

flowchart TB LP[Token Logprobs] --> NORM[Normalize / calibrate] NLI[NLI Entailment<br/>per claim] --> AGG[Per-claim score] RET[Retrieval margin] --> AGG SC[Self-consistency<br/>agreement] --> AGG NORM --> SPAN[Low-confidence<br/>span detection] AGG --> AC[Answer-level<br/>aggregate] SPAN --> AC AC --> POL[Policy thresholds] POL -->|serve| OK[Return answer] POL -->|rewrite| RW[Constrained regen<br/>or strip claims] POL -->|block| BL[Safe refusal] POL -->|queue| HQ[Human review]

4.6 Self-Consistency Checker

import hashlib
from collections import Counter


def normalize_claim_text(c: str) -> str:
    return " ".join(c.lower().split())


def self_consistency_vote(
    samples: list[str],
    claim_extractor,
    nli: NLIHead,
    context: str,
) -> dict:
    """samples: N independent model answers (same prompt, T>0 or diverse prompts)."""
    claim_sets: list[set[str]] = []
    for s in samples:
        claims = claim_extractor(s)
        claim_sets.append({normalize_claim_text(c) for c in claims})

    # Simple agreement: how often each normalized claim appears across samples
    counts: Counter[str] = Counter()
    for cs in claim_sets:
        for c in cs:
            counts[c] += 1

    n = len(samples)
    agreement = {c: counts[c] / n for c in counts}

    # Verify only claims that appear in majority of samples (cost saving)
    verified = {}
    for c, ratio in agreement.items():
        if ratio < 0.5:
            continue
        label, probs = verify_claim(c, context, nli)
        verified[c] = {"support_ratio": ratio, "label": label, "nli": probs}

    pairwise = sum(agreement.values()) / max(1, len(agreement))
    return {"per_claim": verified, "mean_support_ratio": pairwise}

4.7 Guardrails Service

import re
from dataclasses import dataclass


@dataclass
class GuardrailResult:
    ok: bool
    reasons: list[str]


class GuardrailPipeline:
    def __init__(self, patterns: list[tuple[str, str]], llm_judge=None):
        # (name, regex) — e.g. inventing URLs when domain allowlist enforced
        self.patterns = [(n, re.compile(p)) for n, p in patterns]
        self.llm_judge = llm_judge

    def check(self, answer: str, context: str, intent: str) -> GuardrailResult:
        reasons: list[str] = []
        for name, rx in self.patterns:
            if rx.search(answer):
                reasons.append(f"regex_hit:{name}")

        if self.llm_judge and intent in {"medical", "legal"}:
            verdict = self.llm_judge(
                f"CONTEXT:\n{context[:6000]}\n\nANSWER:\n{answer}\n"
                "Reply ONLY yes/no: Is every factual claim entailed by CONTEXT?"
            )
            if "no" in verdict.lower():
                reasons.append("llm_judge:not_entailed")

        return GuardrailResult(ok=len(reasons) == 0, reasons=reasons)

4.8 Fact Verification Pipeline (End-to-End)

class FactVerificationPipeline:
    def __init__(
        self,
        retriever,
        nli: NLIHead,
        search=None,
        kg=None,
    ):
        self.retriever = retriever
        self.nli = nli
        self.search = search
        self.kg = kg

    def run(self, query: str, draft: str, citations: list[EvidenceSpan], tenant: str):
        sentences = segment_sentences(draft)
        claims = []
        for s in sentences:
            claims.extend(atomic_claims_heuristic(s))  # or LLM

        results = []
        for claim in claims:
            block = "\n".join(c.text for c in citations)
            label, probs = verify_claim(claim, block, self.nli)

            if label == VerificationLabel.INSUFFICIENT and self.search:
                hits = self.search.query(claim, tenant=tenant)
                label2, probs2 = verify_claim(claim, "\n".join(h.text for h in hits[:3]), self.nli)
                label, probs = label2, probs2

            if self.kg and looks_structured(claim):
                triple_match = self.kg.lookup(claim)
                if triple_match.conflicts:
                    label = VerificationLabel.REFUTED

            results.append((claim, label, probs))
        return results


def looks_structured(claim: str) -> bool:
    return bool(re.search(r"\bSKU-\d+\b|\bpolicy\s+id\b", claim, re.I))

4.9 Human-in-the-Loop Workflow

sequenceDiagram participant U as User participant API as API Gateway participant V as Verifier Stack participant Q as Review Queue participant R as Reviewer participant L as Learning Pipeline U->>API: query API->>V: draft + evidence V->>V: claims + scores alt confidence OK V->>U: answer else escalate V->>Q: enqueue(task) Q->>R: assign(SLA) R->>Q: disposition(edit/regen/reject) Q->>L: labeled bundle L->>V: threshold / prompt updates Q->>U: async notification (if async path) end

Queue priorities: priority = risk_tier * (1 - confidence) * business_value with WDRR scheduling so low-value traffic cannot starve regulated intents.

4.10 Caching and Circuit Breakers

Cache key TTL Invalidation
(tenant, normalized_claim, evidence_version) 24–72 h KB bump / doc version
NLI logits for (premise_hash, hypothesis_hash) 7 d Model version change
Search results for claim_fingerprint 1–6 h freshness SLO

Circuit breakers: per dependency (NLI GPU, search API, KG). On open circuit: fail closed for “auto-approve” (downgrade to refusal or human queue), never fail open to unverified claims for high-risk intents.


Step 5: Scaling & Production

Failure Handling

Failure Detection Mitigation
NLI service timeout p99 latency SLO breach Skip to retrieval-only score; escalate
Search quota exceeded 429 / error rate KG-only path; cached answers
Retrieval empty recall probe Template “no evidence”; no free-form facts
LLM streaming abort partial JSON Discard partial; retry or safe apology
Review queue backlog depth / age Auto-message; widen abstention; staff shift

Monitoring

Metric Description
Hallucination proxy rate Fraction of answers with any REFUTED or high-contradiction claim
Abstention rate Refusals / total (track by intent)
Escalation rate Human queue inserts / total
Time-to-resolution (HITL) P95 reviewer latency
NLI / search error budget Burn rate for circuit breakers
Self-consistency spread Distribution of pairwise disagreement (drift alert)
Citation–claim alignment % claims with valid citation index

Trade-offs

Decision Option A Option B Guidance
Verification depth Always full pipeline Tier by intent Tiered is standard
Self-consistency Every request High-risk only Control cost
LLM judge guardrail Always on Sampled / escalations only Balance cost & recall
User experience Block silently Explain uncertainty Regulated domains favor transparency
Metric definition Single “hallscore” Decomposed KPIs Decomposed drives actionable fixes

Production Measurement of Hallucination Rate

Offline (gold set):

  • Human annotators label each claim: factual / unsupported / contradicted vs. evidence set.
  • Report claim-level precision/recall for “supported” and answer-level strict pass (all claims supported).

Online (without labels):

  • NLI-based proxy: rate of INSUFFICIENT or REFUTED after full verification.
  • User signals: thumbs down + “incorrect” reason; reviewer sampling.
  • Contrast sets: periodic red-team prompts with known answers.
\[ \text{hallucination\_proxy} \approx \frac{\#\text{answers with ≥1 refuted or high-risk insufficient}}{\#\text{answers}} \]

Warning

Proxy metrics drift when NLI or retrieval changes. Version models and re-calibrate thresholds on a fixed eval suite every release.


Interview Tips

Tip

Strong answers often hit these follow-up themes: 1. Defense in depth — which layer catches which failure mode?
2. Calibration honesty — logprobs are not probabilities of truth; what do you combine them with?
3. Cost cap — how many NLI/search calls per request? Caching keys?
4. Product policy — when do you refuse vs. show with disclaimer vs. human?
5. Evaluation — separate retrieval failures from generation hallucinations.
6. Latency — async HITL vs. synchronous blocking.
7. Security — prompt injection via retrieved docs poisoning verification context.

Common follow-up questions:

  • How do you handle conflicting sources?
  • What if the user asks for a guess?
  • How do you A/B test a new verifier without harming users?
  • Multilingual — one NLI model or per locale?
  • How do you prove improvement to legal/compliance?

Hypothetical Interview Transcript

Note

Simulated 45-minute Google-style system design conversation. Candidate is targeting L5/L6; focus is reliability, metrics, and trade-offs.


Interviewer: Design a system that detects and prevents hallucinations in a customer-facing LLM product. You can assume we already have an LLM and some kind of knowledge base.

Candidate: I’ll treat “hallucination” as unsupported or contradicted factual claims in model outputs, relative to allowed evidence — retrieved docs, KG, or approved search — depending on policy. I’ll walk through defense in depth: grounding, automatic verification, confidence routing, guardrails, humans, and measurement.

First, clarifying questions: What domains matter — are we in regulated advice? Is web search allowed as evidence? What latency do we need for the default path, and do we have a high-assurance mode? Also, single-tenant enterprise vs. consumer scale?

Interviewer: Mix of shopping support and internal policy Q&A. Web search is allowed only for shopping SKU facts, not for HR policy. P95 3 seconds on default; we can go async for human review. Think 10M generations per day.

Candidate: Got it. I’ll split routing by intent: policy answers use enterprise KB + citations only; shopping uses KB + product API + optional search for freshness.

At a high level: ingestion keeps chunked text, embeddings, and metadata fresh. On each request, a grounding layer retrieves evidence and builds a strict prompt: answer only from context, cite sources, abstain if missing. The model produces a draft. Then a claim extraction step splits the answer into checkable claims. For each claim, a verification engine runs NLI against the cited passages first; if entailment is low, we expand retrieval or call search (shopping) or KG for structured attributes. A confidence scorer fuses entailment scores, retrieval margin, optional self-consistency across a few samples on high-risk queries, and token logprobs mostly for highlighting uncertain spans, not as a sole truth signal. Guardrails catch structured issues — fake URL patterns, SKU formats that don’t exist in catalog API — before we serve. If overall confidence is below policy thresholds, we rewrite with stricter instructions, block with a safe template, or enqueue human review with the evidence bundle.

Interviewer: Why not rely on RAG and citations alone?

Candidate: Citations reduce unsourced hallucinations but don’t guarantee faithfulness. Models miscite, blend chunks, or paraphrase into something not entailed. So we need post-hoc verification — at least NLI between each atomic claim and the evidence we think supports it. That’s the standard enterprise pattern beyond “please cite your sources.”

Interviewer: Explain self-consistency and when you’d use it.

Candidate: We sample N completions — different temperature or prompt jitter — extract claims from each, and measure agreement. If the model is epistemically uncertain, independent samples often disagree on specifics. I’d use this only on high-risk or low-margin classifications because it’s generation cost. It’s a useful signal to combine with NLI, not replace external verification.

Interviewer: How do log probabilities fit in?

Candidate: They’re cheap from the inference engine. I’d compute per-token logprobs, flag low minima or low average spans for UI shading and for down-weighting in a fused confidence score. But many models are overconfident on false facts, so I would never gate compliance-critical decisions on logprobs alone — only as one feature alongside NLI and retrieval strength.

Interviewer: Walk me through the human review path.

Candidate: When fused confidence is below τ, or guardrails fire, or NLI finds contradiction, we create a review task: user query, draft answer, extracted claims with verifier outputs, retrieved passages, and model versions. Reviewers pick a disposition: approve as-is, edit, replace with template, or escalate to policy. Outcomes feed an offline eval set and threshold tuning. For UX, if we’re async, we return: “We’re double-checking this” with a ticket id; for sync low-latency paths, we might refuse immediately instead of blocking the user for minutes.

Interviewer: How do you measure hallucinations in production?

Candidate: Three layers: offline human-labeled claim judgments on a frozen golden set — report precision/recall of supported claims and strict answer pass rate. Online proxies: rate of refuted or high-contradiction claims after verification, abstention rate by intent, and user thumbs-down with reason codes. Red-team suites with known facts to catch regressions. I’d version verifier models and re-tune thresholds when NLI changes, because proxies drift.

Interviewer: What about failures in the verifier stack?

Candidate: Circuit breakers on NLI and search. For high-risk intents, fail closed: if we can’t verify, we don’t present speculative facts — we fall back to human queue or safe refusal. For low-risk chitchat, we might fail open to a generic answer without factual assertions — policy-dependent. Caches keyed by claim hash + evidence version save cost and stabilize latency.

Interviewer: Mention fine-tuning briefly.

Candidate: We can DPO/RLHF toward calibrated abstention: reward the model for saying “not enough information” when retrieval scores are weak or when a verifier head — possibly distilled from NLI — predicts low support. That reduces blatant confabulation but doesn’t remove the need for the external stack; it aligns the prior with our policy.

Interviewer: Sounds good. Biggest trade-off?

Candidate: Automation vs. human cost. Aggressive verification and HITL improve correctness but hurt latency and margin. The product win is tiered assurance: default fast path with solid NLI on cited chunks; deeper checks only when intent or user value warrants it — and metrics that prove each layer actually moves hallucination proxies down.

Interviewer: Thanks — that’s a wrap.


Summary

This design frames hallucination control as a layered production system: RAG grounding and citations set the evidentiary perimeter; claim extraction and NLI-based verification enforce faithfulness to that perimeter; search, KG, and cross-reference address extrinsic facts where allowed; self-consistency and token logprobs add cheap uncertainty signals; guardrails catch structured failure modes; human-in-the-loop handles residual risk and supplies labeled feedback; calibration-oriented training aligns the generator with abstention policies. Circuit breakers, caching, and intent-based tiering keep the system available and affordable. Finally, decomposed metrics — offline human eval, online verifier proxies, and red-team regressions — make hallucination rate measurable and actionable over time.