GenAI System Design¶
Production system design for Generative AI — 15 designs covering chatbots, RAG, document Q&A, agents, code assistants, hallucination detection, fine-tuning, evaluation pipelines, prompt management, image generation, and more. Each includes a hypothetical Google-style interview transcript.
Why GenAI System Design Is a Separate Category¶
Traditional ML system design focuses on classification, ranking, and retrieval. GenAI introduces fundamentally different challenges:
| Traditional ML | GenAI/LLM Systems |
|---|---|
| Fixed output schema (class, score) | Open-ended text/image/code generation |
| Millisecond inference | Seconds-long autoregressive decoding |
| MBs per model | GBs–TBs per model (GPU clusters) |
| Feature engineering dominant | Prompt engineering + retrieval dominant |
| Train once, serve forever | Continuous alignment, RLHF, safety tuning |
| Deterministic evaluation | Subjective, multi-dimensional evaluation |
Warning
Google, Meta, and Anthropic interviews increasingly include GenAI-specific design questions. Saying "just call the OpenAI API" will not pass. You need to demonstrate understanding of inference optimization, safety, grounding, cost control, and evaluation at scale.
Recommended Study Order¶
Tip
Follow this progression. Each design builds on concepts from earlier ones. All designs include a full hypothetical interview transcript.
Phase 1: Core GenAI Patterns¶
| Order | Design | New Concepts Introduced | Why First |
|---|---|---|---|
| 1 | LLM-Powered Chatbot | KV-cache, PagedAttention, streaming, safety | Foundation for all LLM serving |
| 2 | Enterprise RAG System | Chunking, hybrid retrieval, citations, ACLs | #1 production LLM pattern |
| 3 | LLM Gateway | Multi-model routing, semantic caching, cost control | Infra layer used by all LLM apps |
Phase 2: Specialized Applications¶
| Order | Design | New Concepts Introduced | Builds On |
|---|---|---|---|
| 4 | Document Q&A System | PDF parsing, OCR, cross-encoder re-ranking, citations | RAG (retrieval) |
| 5 | AI Code Assistant | FIM, repo context, speculative decoding | Chatbot (serving) + RAG (retrieval) |
| 6 | AI Agent System | ReAct, tool use, planning, memory, multi-agent | Chatbot + RAG + Gateway |
| 7 | LLM Content Moderation | Cascade, adversarial robustness, human-in-the-loop | Chatbot (safety pipeline) |
| 8 | Hallucination Detection | Claim extraction, NLI verification, confidence scoring | RAG + Moderation |
Phase 3: Advanced GenAI¶
| Order | Design | New Concepts Introduced | Builds On |
|---|---|---|---|
| 9 | Text-to-Image Generation | Diffusion models, CFG, latent space, safety | Chatbot (GPU serving) + Moderation |
| 10 | Multi-Modal Search | CLIP/SigLIP, cross-modal retrieval, video | RAG (retrieval) + Image generation |
| 11 | ML Training Platform | Gang scheduling, checkpointing, GPU clusters | All (training infrastructure for all models) |
| 12 | LLM Fine-Tuning Platform | LoRA/QLoRA, VPC data plane, eval gates, blue-green | RAG + Training Platform |
Phase 4: LLM Operations & Infrastructure¶
| Order | Design | New Concepts Introduced | Builds On |
|---|---|---|---|
| 13 | LLM Evaluation Pipeline | LLM-as-judge, Elo rating, benchmarks, A/B testing | All (quality assurance for all LLM systems) |
| 14 | Prompt Management & Versioning | Prompt registry, templating, environment promotion | Gateway + Evaluation Pipeline |
Available Designs¶
LLM-Powered Chatbot¶
Conversational AI
Design a production chatbot like Gemini/ChatGPT — multi-turn conversation, streaming, safety, and serving at Google scale.
Key concepts: KV-cache, PagedAttention, speculative decoding, RLHF, guardrails, conversation memory, streaming SSE, GPU auto-scaling
Difficulty: ⭐⭐⭐⭐⭐ Very Hard
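A quick illustration of the KV-cache idea from the concept list above: without a cache, every decode step re-projects the entire prefix; with one, each step projects only the newest token and attention reads the cached history. This is a toy sketch, not a real transformer; `project_kv` and the scalar "keys" are illustrative stand-ins:

```python
def project_kv(token_id: int) -> tuple[float, float]:
    """Stand-in for the key/value projections of one attention layer."""
    return (token_id * 0.1, token_id * 0.2)

class KVCache:
    """Append-only per-sequence cache of past keys and values."""
    def __init__(self) -> None:
        self.keys: list[float] = []
        self.values: list[float] = []

    def append(self, token_id: int) -> None:
        k, v = project_kv(token_id)
        self.keys.append(k)
        self.values.append(v)

def decode_step(cache: KVCache, new_token: int) -> int:
    # With a cache, each step projects ONLY the new token;
    # attention then spans the full cached history.
    cache.append(new_token)
    return len(cache.keys)  # number of positions attention can read

cache = KVCache()
for t in [101, 7, 42]:
    span = decode_step(cache, t)
print(span)  # prints 3: three tokens cached after three steps
```

PagedAttention extends this idea by storing the cache in fixed-size blocks so memory is allocated on demand instead of pre-reserved per sequence.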
Enterprise RAG System¶
Knowledge Grounding
Design a retrieval-augmented generation system for enterprise knowledge bases with citations, access control, and hallucination mitigation.
Key concepts: Chunking strategies, hybrid retrieval (BM25 + dense), re-ranking, citation grounding, ACL-aware retrieval, query routing, evaluation (faithfulness, relevance)
Difficulty: ⭐⭐⭐⭐ Hard
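Hybrid retrieval needs a way to merge the BM25 and dense rankings listed above. Reciprocal Rank Fusion (RRF) is one common choice; a minimal sketch, with illustrative doc IDs and the conventional k=60 constant:

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists: score(d) = sum over lists of 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["doc_a", "doc_b", "doc_c"]   # lexical ranking
dense_hits = ["doc_b", "doc_d", "doc_a"]  # embedding ranking
print(rrf_fuse([bm25_hits, dense_hits]))  # ['doc_b', 'doc_a', 'doc_d', 'doc_c']
```

RRF needs no score normalization across retrievers, which is why it is a popular default before a cross-encoder re-ranking stage.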
LLM Gateway¶
AI Infrastructure
NEW
Design an LLM gateway/proxy that handles routing, fallback, semantic caching, rate limiting, cost tracking, and observability across multiple LLM providers.
Key concepts: Semantic caching, intelligent model routing, token-based rate limiting, circuit breaker per provider, PII scrubbing, cost attribution, unified API normalization
Difficulty: ⭐⭐⭐⭐ Hard
Document Q&A System¶
Enterprise Knowledge
NEW
Design a document Q&A system that handles 10,000+ PDFs — PDF parsing, table/image extraction, chunking, hybrid retrieval with re-ranking, and citation-grounded generation.
Key concepts: PDF parsing (PyMuPDF, Unstructured), OCR, recursive chunking, bi-encoder embeddings, HNSW, BM25 + dense hybrid retrieval, cross-encoder re-ranking, citation extraction, incremental indexing, ACL-aware retrieval
Difficulty: ⭐⭐⭐⭐ Hard
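The recursive chunking named above can be sketched as: try coarse separators first (paragraphs, then lines, then sentences), then stitch pieces back together with overlap so no answer-bearing span is cut in half. Separator order, sizes, and the naive character-level overlap below are illustrative defaults, not any specific library's behavior:

```python
def chunk(text: str, max_chars: int = 200, overlap: int = 40,
          separators: tuple[str, ...] = ("\n\n", "\n", ". ", " ")) -> list[str]:
    if len(text) <= max_chars:
        return [text]
    for sep in separators:
        parts = text.split(sep)
        if len(parts) > 1 and max(len(p) for p in parts) <= max_chars:
            chunks: list[str] = []
            current = ""
            for part in parts:
                candidate = (current + sep + part) if current else part
                if len(candidate) > max_chars:
                    chunks.append(current)
                    # Carry a tail of the previous chunk forward as overlap.
                    current = current[-overlap:] + sep + part
                else:
                    current = candidate
            if current:
                chunks.append(current)
            return chunks
    # No separator produced small-enough parts: hard-split with overlap.
    step = max_chars - overlap
    return [text[i:i + max_chars] for i in range(0, len(text), step)]

doc = ("A" * 50) + ". " + ("B" * 50) + ". " + ("C" * 50)
pieces = chunk(doc, max_chars=80, overlap=10)
print(len(pieces))  # 3 overlapping chunks, each within the size budget
```

Real pipelines usually measure size in tokens rather than characters and snap the overlap to sentence boundaries.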
Hallucination Detection & Prevention¶
Trust & Safety
NEW
Design a system to detect and prevent LLM hallucinations in customer-facing products — claim extraction, fact verification, confidence scoring, and defense-in-depth guardrails.
Key concepts: Claim extraction (NLI), fact verification (knowledge graph + search), self-consistency checking, token-level confidence scoring, output grounding, guardrails, human-in-the-loop review, calibration fine-tuning
Difficulty: ⭐⭐⭐⭐⭐ Very Hard
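One of the cheapest signals above, token-level confidence scoring, can be sketched as thresholding the geometric-mean token probability of a generation. The logprobs and the threshold below are illustrative; real systems calibrate the threshold on labeled hallucination data:

```python
import math

def sequence_confidence(token_logprobs: list[float]) -> float:
    """Geometric-mean token probability: exp(mean log p)."""
    return math.exp(sum(token_logprobs) / len(token_logprobs))

def needs_review(token_logprobs: list[float],
                 floor: float = math.exp(-1.5)) -> bool:
    """Route low-confidence generations to verification or a human."""
    return sequence_confidence(token_logprobs) < floor

confident = [-0.1, -0.2, -0.05, -0.3]  # model was sure of each token
hedgy = [-2.1, -1.8, -2.5, -1.9]       # low-probability tokens throughout
print(needs_review(confident), needs_review(hedgy))  # False True
```

This signal alone is weakly correlated with factuality, which is why the design layers it with claim extraction, NLI verification, and self-consistency checks.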
AI Code Assistant¶
Developer Tools
Design an AI code completion and chat system like Gemini Code Assist / GitHub Copilot — IDE integration, repository-aware context, and low-latency suggestions.
Key concepts: Fill-in-the-middle (FIM), tree-sitter AST, repository-level context, speculative decoding, streaming, telemetry-driven evaluation, code safety
Difficulty: ⭐⭐⭐⭐⭐ Very Hard
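Fill-in-the-middle (FIM) prompting can be sketched in a few lines. The `<PRE>`/`<SUF>`/`<MID>` sentinels follow the convention used by infilling-trained code models; the exact token strings vary per model and are illustrative here:

```python
def fim_prompt(prefix: str, suffix: str) -> str:
    # The model generates the "middle" after <MID>, conditioned on both
    # the code before the cursor (prefix) and after it (suffix).
    return f"<PRE>{prefix}<SUF>{suffix}<MID>"

before_cursor = "def area(r):\n    return "
after_cursor = "\n\nprint(area(2))\n"
print(fim_prompt(before_cursor, after_cursor))
```

Training on this reordered format is what lets completion models use code to the right of the cursor, which plain left-to-right prompting cannot do.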
AI Agent System¶
Autonomous AI
NEW
Design an autonomous AI agent system that can plan, use tools, maintain memory, and execute multi-step tasks — like Google's AI agents or Anthropic's computer use.
Key concepts: ReAct pattern, tool calling, task decomposition, working + semantic memory, multi-agent orchestration, sandboxed execution, human-in-the-loop
Difficulty: ⭐⭐⭐⭐⭐ Very Hard
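The ReAct pattern named above is a thought/action/observation loop. A minimal runnable sketch, with a scripted stub standing in for the model's step-by-step output and hypothetical tool names:

```python
TOOLS = {
    "calculator": lambda expr: str(eval(expr, {"__builtins__": {}})),
    "lookup": lambda q: {"capital of France": "Paris"}.get(q, "unknown"),
}

# Scripted (thought, action, action_input) steps standing in for the
# model's output; a real agent asks the LLM for each next step.
SCRIPTED_STEPS = [
    ("I need the capital first.", "lookup", "capital of France"),
    ("Now compute 2 + 3.", "calculator", "2 + 3"),
    ("I have everything I need.", "finish", "Paris; 5"),
]

def run_agent(steps) -> str:
    scratchpad = []                       # working memory of the episode
    for thought, action, arg in steps:
        if action == "finish":
            return arg                    # final answer to the user
        observation = TOOLS[action](arg)  # act, then observe
        scratchpad.append((thought, action, arg, observation))
    raise RuntimeError("agent exceeded its step budget")

print(run_agent(SCRIPTED_STEPS))  # Paris; 5
```

The full design wraps this loop with sandboxed tool execution, a step budget, and human approval gates for irreversible actions.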
LLM Content Moderation¶
Trust & Safety
Design a content moderation system using LLMs for text, image, and video — policy enforcement, appeals, and human-in-the-loop at scale.
Key concepts: Multi-modal classifiers, policy-as-code, cascade architecture (fast→accurate), adversarial robustness, human review queues, appeals, regulatory compliance
Difficulty: ⭐⭐⭐⭐ Hard
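The cascade architecture (fast→accurate) above can be sketched as threshold-based escalation: a cheap classifier decides clear-cut cases, and only the grey zone pays for the expensive model or a human. Both classifiers below are stubs and the thresholds are illustrative:

```python
def cheap_classifier(text: str) -> float:
    """Stand-in fast model: returns P(violation)."""
    if "banned-word" in text:
        return 0.95
    return 0.5 if "borderline" in text else 0.02

def expensive_llm_judge(text: str) -> float:
    """Stand-in accurate model, far costlier per call."""
    return 0.9 if "borderline" in text else 0.1

def moderate(text: str, lo: float = 0.1, hi: float = 0.9) -> str:
    score = cheap_classifier(text)
    if score >= hi:
        return "block"          # confident violation: no escalation
    if score <= lo:
        return "allow"          # confident safe
    score = expensive_llm_judge(text)   # grey zone: escalate
    if lo < score < hi:
        return "human_review"   # still uncertain: human-in-the-loop
    return "block" if score >= hi else "allow"

print(moderate("hello world"),
      moderate("banned-word here"),
      moderate("borderline case"))  # allow block block
```

Tuning `lo`/`hi` trades cost against accuracy: widening the grey zone sends more traffic to the expensive tier.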
Text-to-Image Generation¶
Generative Media
NEW
Design a text-to-image generation system like Imagen / DALL-E 3 / Midjourney — from text prompts to high-quality images with safety and copyright controls.
Key concepts: Diffusion models, latent diffusion, classifier-free guidance, CLIP/T5 conditioning, ControlNet, LoRA, super-resolution cascades, content provenance (C2PA)
Difficulty: ⭐⭐⭐⭐⭐ Very Hard
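Classifier-free guidance (CFG) reduces to one line of arithmetic per denoising step: predict the noise with and without the text conditioning, then extrapolate past the conditional prediction. The vectors below are toy stand-ins for the model's noise predictions:

```python
def cfg_noise(eps_uncond: list[float], eps_cond: list[float],
              guidance_scale: float) -> list[float]:
    """eps = eps_uncond + w * (eps_cond - eps_uncond)."""
    return [u + guidance_scale * (c - u)
            for u, c in zip(eps_uncond, eps_cond)]

eps_u = [0.1, 0.2]  # unconditional (empty-prompt) prediction
eps_c = [0.3, 0.1]  # text-conditioned prediction
print(cfg_noise(eps_u, eps_c, 7.5))  # w around 7-8 is a common default
```

Higher guidance scales make images follow the prompt more literally at the cost of diversity; w = 1 recovers the plain conditional prediction.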
Multi-Modal Search¶
Multi-Modal AI
Design a multi-modal search system like Google Lens — search across text, images, video, and audio with a unified embedding space.
Key concepts: CLIP/SigLIP embeddings, unified vector index, cross-modal retrieval, OCR integration, late-interaction models, query understanding, multi-modal fusion
Difficulty: ⭐⭐⭐⭐ Hard
ML Training Platform¶
ML Infrastructure
Design an ML training platform like Vertex AI / SageMaker — job scheduling, distributed training, experiment tracking, and GPU cluster management.
Key concepts: GPU scheduling (gang scheduling), checkpointing, elastic training, experiment tracking, hyperparameter tuning, multi-tenancy, cost attribution
Difficulty: ⭐⭐⭐⭐⭐ Very Hard
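Gang scheduling, named above, means all-or-nothing admission: a distributed training job's workers must start together, because a partial allocation wastes GPUs while deadlocking the job. A minimal sketch with illustrative job sizes (the skip-ahead behavior here is a simple form of backfilling):

```python
def gang_schedule(jobs: list[tuple[str, int]], free_gpus: int) -> list[str]:
    """Admit each queued job only if its FULL GPU request fits."""
    admitted = []
    for name, gpus_needed in jobs:      # FIFO queue, smaller jobs may backfill
        if gpus_needed <= free_gpus:    # all-or-nothing admission
            free_gpus -= gpus_needed
            admitted.append(name)
        # else: the job waits; never start a partial gang
    return admitted

queue = [("llm-pretrain", 64), ("finetune-a", 8), ("finetune-b", 4)]
print(gang_schedule(queue, free_gpus=16))  # 64-GPU job waits; small jobs run
```

Real schedulers add reservations so large jobs are not starved forever by a stream of small backfilled ones.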
LLM Fine-Tuning Platform¶
GenAI Training
Design a VPC-native LLM fine-tuning platform for private enterprise data — curation, LoRA/QLoRA, differential privacy, experiment tracking, evaluation vs. base, and blue-green deployment.
Key concepts: Instruction SFT formatting, deduplication, LoRA/QLoRA, gradient accumulation, checkpointing, model merging, MLflow/W&B, differential privacy, A/B and shadow eval, catastrophic forgetting
Difficulty: ⭐⭐⭐⭐⭐ Very Hard
LLM Evaluation Pipeline¶
LLM Operations
NEW
Design an evaluation pipeline for LLM-based products — automated benchmarks, LLM-as-judge, human evaluation with Elo ratings, safety testing, A/B testing, and golden dataset regression.
Key concepts: MMLU/HumanEval/GSM8K benchmarks, BLEU/ROUGE/pass@k metrics, LLM-as-judge with calibration, Elo rating computation, red-teaming, golden dataset regression, statistical significance testing, online vs offline evaluation
Difficulty: ⭐⭐⭐⭐ Hard
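Elo rating computation from pairwise preferences (human or LLM-as-judge) can be sketched directly: each comparison shifts both models' ratings toward the observed outcome. K = 32 and the 1500 starting rating are conventional defaults borrowed from chess, used here for illustration:

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a: float, r_b: float, a_won: bool, k: float = 32.0):
    """Zero-sum update: A gains exactly what B loses."""
    e_a = expected_score(r_a, r_b)
    s_a = 1.0 if a_won else 0.0
    delta = k * (s_a - e_a)
    return r_a + delta, r_b - delta

ratings = {"model-x": 1500.0, "model-y": 1500.0}
# model-x wins three straight pairwise comparisons:
for _ in range(3):
    ratings["model-x"], ratings["model-y"] = elo_update(
        ratings["model-x"], ratings["model-y"], a_won=True)
print(round(ratings["model-x"]), round(ratings["model-y"]))
```

Note the diminishing gains: each successive win against the same opponent moves the rating less, because the expected score rises with the rating gap.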
Prompt Management & Versioning¶
LLM Operations
NEW
Design a prompt management and versioning system — prompt registry, Jinja templating, A/B testing, golden dataset evaluation, environment promotion, and one-click rollback.
Key concepts: Prompt registry, version DAG, Jinja templating, semantic diff, A/B traffic splitting, golden dataset evaluation, environment promotion (dev → staging → prod), rollback, audit trail, prompt composition and chaining
Difficulty: ⭐⭐⭐⭐ Hard
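The registry/promotion/rollback flow above can be sketched with a tiny in-memory store. Here `string.Template` stands in for the Jinja templating the design assumes, and all prompt and environment names are illustrative:

```python
from string import Template

class PromptRegistry:
    def __init__(self) -> None:
        self.versions: dict[str, list[str]] = {}    # name -> version bodies
        self.pins: dict[tuple[str, str], int] = {}  # (env, name) -> version

    def publish(self, name: str, body: str) -> int:
        """Append an immutable new version; returns its 1-based number."""
        self.versions.setdefault(name, []).append(body)
        return len(self.versions[name])

    def promote(self, env: str, name: str, version: int) -> None:
        """Pin an environment to a specific version (also how rollback works)."""
        self.pins[(env, name)] = version

    def render(self, env: str, name: str, **variables: str) -> str:
        version = self.pins[(env, name)]
        return Template(self.versions[name][version - 1]).substitute(**variables)

reg = PromptRegistry()
v1 = reg.publish("summarize", "Summarize: $doc")
v2 = reg.publish("summarize", "Summarize in 3 bullets: $doc")
reg.promote("prod", "summarize", v1)     # prod stays on v1
reg.promote("staging", "summarize", v2)  # staging tries v2
print(reg.render("prod", "summarize", doc="Q3 report"))
reg.promote("prod", "summarize", v1)     # rollback is just re-pinning
```

Because versions are immutable and environments only hold pins, promotion and rollback are both O(1) metadata writes with a full audit trail.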
Vector Database¶
AI Infrastructure
NEW
Design a vector database purpose-built for AI applications — HNSW, IVF-PQ indexing, hybrid search with metadata filtering, and billion-scale ANN at sub-10ms latency.
Key concepts: HNSW graph traversal, IVF-PQ quantization, distance metrics (cosine, L2, IP), hybrid pre/post-filtering, vector-space-aware sharding, memory-mapped indexes, tiered storage
Difficulty: ⭐⭐⭐⭐⭐ Very Hard
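Before HNSW or IVF-PQ, the correctness baseline is a brute-force flat index; it also shows why metadata pre-filtering (filter before scoring) beats post-filtering, which can starve the top-k. Vectors and the metadata schema below are illustrative:

```python
import math

class FlatIndex:
    """Exact nearest-neighbor baseline that ANN indexes approximate."""
    def __init__(self) -> None:
        self.items: list[tuple[str, list[float], dict]] = []

    def add(self, doc_id: str, vec: list[float], meta: dict) -> None:
        self.items.append((doc_id, vec, meta))

    @staticmethod
    def cosine(a: list[float], b: list[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a)) *
                      math.sqrt(sum(x * x for x in b)))

    def search(self, query: list[float], k: int, where: dict) -> list[str]:
        # Pre-filter: drop non-matching docs BEFORE scoring, so the
        # top-k is never starved by a selective filter.
        candidates = [(doc_id, vec) for doc_id, vec, meta in self.items
                      if all(meta.get(key) == val for key, val in where.items())]
        ranked = sorted(candidates,
                        key=lambda c: self.cosine(query, c[1]), reverse=True)
        return [doc_id for doc_id, _ in ranked[:k]]

idx = FlatIndex()
idx.add("a", [1.0, 0.0], {"lang": "en"})
idx.add("b", [0.9, 0.1], {"lang": "de"})
idx.add("c", [0.0, 1.0], {"lang": "en"})
print(idx.search([1.0, 0.0], k=1, where={"lang": "en"}))  # ['a']
```

An HNSW or IVF-PQ index replaces the O(n) scan with an approximate traversal, trading a little recall for orders-of-magnitude lower latency at billion scale.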
Quick Reference: System Comparison¶
| System | Latency Target | Key Challenge | Primary Metric |
|---|---|---|---|
| LLM Chatbot | TTFT < 500ms | GPU cost, safety | User satisfaction, Helpfulness |
| Enterprise RAG | < 3s end-to-end | Hallucination, ACLs | Faithfulness, Recall@K |
| LLM Gateway | < 20ms overhead | Multi-provider resilience | Availability, cost savings |
| AI Code Assistant | < 200ms (completion) | Context window, accuracy | Acceptance rate, Keystroke savings |
| AI Agent System | < 5min per task | Planning, tool reliability | Task completion rate |
| Content Moderation | < 500ms | Adversarial attacks, fairness | Precision, Recall, FPR |
| Text-to-Image | < 10s per image | Safety, quality | FID, CLIP score, Human preference |
| Vector Database | < 10ms P99 | Billion-scale ANN, recall | Recall@K, QPS |
| Multi-Modal Search | < 300ms | Cross-modal alignment | NDCG@K, Recall@K |
| Document Q&A | < 3s end-to-end | PDF parsing, multi-doc queries | Answer accuracy, Citation precision |
| Hallucination Detection | < 1s overhead | Claim verification at scale | Hallucination rate, Precision/Recall |
| ML Training Platform | N/A (throughput) | GPU utilization, fault tolerance | MFU, Job completion rate |
| LLM Fine-Tuning Platform | N/A (training) / inherits serving latency | Data residency, eval gates, forgetting | Win-rate vs base, ε budget, rollback time |
| LLM Evaluation Pipeline | < 30min per eval run | Subjective quality, no ground truth | Benchmark scores, Elo ratings |
| Prompt Management | < 10ms resolution | Version drift, A/B correctness | Prompt win-rate, Rollback time |
GenAI System Design Framework¶
Use this adapted framework in your interviews:
1. Problem Setup (5 min)¶
- What modality? (text, image, video, code, multi-modal)
- What is the generation/retrieval task?
- Latency, throughput, cost constraints?
- Safety and compliance requirements?
- Success metrics (both ML and business)
2. Model & Data Strategy (10 min)¶
- Base model selection (size, architecture, open vs proprietary)
- Fine-tuning vs prompt engineering vs RAG
- Training data: collection, labeling, quality
- Alignment strategy (RLHF, DPO, Constitutional AI)
- Evaluation methodology
3. Serving Infrastructure (10 min)¶
- GPU cluster sizing and instance types
- Inference optimization (batching, quantization, KV-cache)
- Streaming architecture (SSE, WebSocket)
- Auto-scaling strategy
- Cost optimization (spot instances, model distillation)
4. Safety & Quality (5 min)¶
- Input/output guardrails
- Hallucination mitigation
- PII detection and filtering
- Adversarial robustness
- Human-in-the-loop workflows
5. Monitoring & Iteration (5 min)¶
- Online evaluation metrics
- A/B testing framework
- Feedback collection
- Retraining and model update pipeline
- Cost and latency dashboards
Interview Tips for GenAI Design¶
Tip
What differentiates a pass from a strong hire at Google:

- Pass: Correct high-level architecture with RAG or fine-tuning
- Strong hire: Deep discussion of inference optimization (PagedAttention, continuous batching), safety layering, evaluation methodology, and cost modeling
Common mistakes to avoid:

1. Treating LLM inference like traditional web service scaling (it's GPU-bound, not CPU-bound)
2. Ignoring safety and guardrails entirely
3. Not discussing how to evaluate generation quality
4. Assuming unlimited context windows solve all problems
5. Forgetting about cost — GPU inference is 100-1000x more expensive than traditional APIs
6. Not separating retrieval from generation in RAG systems
7. Ignoring multi-provider resilience (single provider = single point of failure)
8. Treating agents as simple chatbots (planning, memory, tool use are distinct subsystems)
How These Connect¶
GenAI/ML Fundamentals GenAI System Design Questions
───────────────────── ─────────────────────────────
Model Serving ──────► LLM Chatbot, Code Assistant, LLM Gateway
LLM Systems ──────► Enterprise RAG, Chatbot, AI Agents, Document Q&A
Feature Stores ──────► Content Moderation
Distributed Training ──────► ML Training Platform, Text-to-Image, Fine-Tuning Platform
Data Pipelines ──────► All GenAI Systems
LLM Evaluation ──────► LLM Evaluation Pipeline, Hallucination Detection
RLHF & Alignment ──────► Fine-Tuning Platform, Prompt Management
ML System Design GenAI System Design Questions
──────────────── ─────────────────────────────
Image Search ──────► Multi-Modal Search, Document Q&A
Search Ranking ──────► Enterprise RAG (retrieval), Document Q&A
Fraud Detection ──────► Content Moderation (cascade), Hallucination Detection
Ads Ranking ──────► LLM Gateway (cost optimization)
Note
Master the GenAI/ML Fundamentals building blocks first, then apply them here in full end-to-end system designs.