GenAI System Design

Production system design for Generative AI — 15 designs covering chatbots, RAG, document Q&A, agents, code assistants, hallucination detection, fine-tuning, evaluation pipelines, prompt management, image generation, and more. Each includes a hypothetical Google-style interview transcript.


Why GenAI System Design Is a Separate Category

Traditional ML system design focuses on classification, ranking, and retrieval. GenAI introduces fundamentally different challenges:

| Traditional ML | GenAI/LLM Systems |
|---|---|
| Fixed output schema (class, score) | Open-ended text/image/code generation |
| Millisecond inference | Seconds-long autoregressive decoding |
| MBs per model | GBs–TBs per model (GPU clusters) |
| Feature engineering dominant | Prompt engineering + retrieval dominant |
| Train once, serve forever | Continuous alignment, RLHF, safety tuning |
| Deterministic evaluation | Subjective, multi-dimensional evaluation |

Warning

Google, Meta, and Anthropic interviews increasingly ask GenAI-specific design questions. Saying "just call the OpenAI API" will not pass. You need to demonstrate understanding of inference optimization, safety, grounding, cost control, and evaluation at scale.


Tip

Follow this progression. Each design builds on concepts from earlier ones. All designs include a full hypothetical interview transcript.

Phase 1: Core GenAI Patterns

| Order | Design | New Concepts Introduced | Why First |
|---|---|---|---|
| 1 | LLM-Powered Chatbot | KV-cache, PagedAttention, streaming, safety | Foundation for all LLM serving |
| 2 | Enterprise RAG System | Chunking, hybrid retrieval, citations, ACLs | #1 production LLM pattern |
| 3 | LLM Gateway | Multi-model routing, semantic caching, cost control | Infra layer used by all LLM apps |

Phase 2: Specialized Applications

| Order | Design | New Concepts Introduced | Builds On |
|---|---|---|---|
| 4 | Document Q&A System | PDF parsing, OCR, cross-encoder re-ranking, citations | RAG (retrieval) |
| 5 | AI Code Assistant | FIM, repo context, speculative decoding | Chatbot (serving) + RAG (retrieval) |
| 6 | AI Agent System | ReAct, tool use, planning, memory, multi-agent | Chatbot + RAG + Gateway |
| 7 | LLM Content Moderation | Cascade, adversarial robustness, human-in-the-loop | Chatbot (safety pipeline) |
| 8 | Hallucination Detection | Claim extraction, NLI verification, confidence scoring | RAG + Moderation |

Phase 3: Advanced GenAI

| Order | Design | New Concepts Introduced | Builds On |
|---|---|---|---|
| 9 | Text-to-Image Generation | Diffusion models, CFG, latent space, safety | Chatbot (GPU serving) + Moderation |
| 10 | Multi-Modal Search | CLIP/SigLIP, cross-modal retrieval, video | RAG (retrieval) + Image generation |
| 11 | ML Training Platform | Gang scheduling, checkpointing, GPU clusters | All (training infrastructure for all models) |
| 12 | LLM Fine-Tuning Platform | LoRA/QLoRA, VPC data plane, eval gates, blue-green | RAG + Training Platform |

Phase 4: LLM Operations & Infrastructure

| Order | Design | New Concepts Introduced | Builds On |
|---|---|---|---|
| 13 | LLM Evaluation Pipeline | LLM-as-judge, Elo rating, benchmarks, A/B testing | All (quality assurance for all LLM systems) |
| 14 | Prompt Management & Versioning | Prompt registry, templating, environment promotion | Gateway + Evaluation Pipeline |

Available Designs

LLM-Powered Chatbot

Conversational AI

Design a production chatbot like Gemini/ChatGPT — multi-turn conversation, streaming, safety, and serving at Google scale.

Key concepts: KV-cache, PagedAttention, speculative decoding, RLHF, guardrails, conversation memory, streaming SSE, GPU auto-scaling

Difficulty: ⭐⭐⭐⭐⭐ Very Hard
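The key concepts above lead with the KV-cache; a minimal counting sketch (toy work counts, no real tensors, function names are hypothetical) shows why it matters: without a cache, every decode step recomputes keys and values for the entire prefix.

```python
# Toy illustration of KV-caching in autoregressive decoding. Without a
# cache, step t recomputes keys/values for all t previous positions; with
# a cache, each step appends one new key/value pair and reuses the rest.

class KVCache:
    def __init__(self):
        self.keys: list[float] = []
        self.values: list[float] = []

    def append(self, k: float, v: float) -> None:
        self.keys.append(k)
        self.values.append(v)

def decode_work_with_cache(prompt_len: int, new_tokens: int) -> int:
    """Count key/value computations when caching (one per position, once)."""
    cache = KVCache()
    work = 0
    for pos in range(prompt_len + new_tokens):
        cache.append(float(pos), float(pos))  # compute K/V once per position
        work += 1
    return work

def decode_work_without_cache(prompt_len: int, new_tokens: int) -> int:
    """Count key/value computations when recomputing the prefix every step."""
    work = 0
    for step in range(new_tokens):
        work += prompt_len + step + 1  # recompute K/V for every position
    return work

print(decode_work_with_cache(1000, 100))     # 1100 computations
print(decode_work_without_cache(1000, 100))  # 105050 computations
```

PagedAttention is then the mechanism that manages where those cached key/value blocks live in GPU memory.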


Enterprise RAG System

Knowledge Grounding

Design a retrieval-augmented generation system for enterprise knowledge bases with citations, access control, and hallucination mitigation.

Key concepts: Chunking strategies, hybrid retrieval (BM25 + dense), re-ranking, citation grounding, ACL-aware retrieval, query routing, evaluation (faithfulness, relevance)

Difficulty: ⭐⭐⭐⭐ Hard
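Hybrid retrieval needs a way to merge the BM25 and dense rankings; reciprocal rank fusion (RRF) is one common, score-free way to do it. A minimal sketch (doc IDs are made up):

```python
def reciprocal_rank_fusion(rankings, k: int = 60):
    """Fuse ranked doc-id lists (best first) via RRF: each list contributes
    1/(k + rank) per document. k dampens the dominance of top ranks."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranking = ["doc_a", "doc_b", "doc_c"]
dense_ranking = ["doc_b", "doc_c", "doc_a"]
print(reciprocal_rank_fusion([bm25_ranking, dense_ranking]))
# ['doc_b', 'doc_a', 'doc_c']
```

RRF avoids having to normalize BM25 scores against cosine similarities, which live on incompatible scales.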


LLM Gateway

AI Infrastructure

NEW

Design an LLM gateway/proxy that handles routing, fallback, semantic caching, rate limiting, cost tracking, and observability across multiple LLM providers.

Key concepts: Semantic caching, intelligent model routing, token-based rate limiting, circuit breaker per provider, PII scrubbing, cost attribution, unified API normalization

Difficulty: ⭐⭐⭐⭐ Hard
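A sketch of the semantic-caching idea named above, with hand-made toy embeddings standing in for a real embedding model:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

class SemanticCache:
    """Return a cached response when a new query's embedding is close enough
    to a previously seen one, instead of requiring an exact string match."""
    def __init__(self, threshold: float = 0.95):
        self.threshold = threshold
        self.entries: list[tuple[list[float], str]] = []

    def get(self, query_emb):
        best, best_sim = None, -1.0
        for emb, response in self.entries:
            sim = cosine(query_emb, emb)
            if sim > best_sim:
                best, best_sim = response, sim
        return best if best_sim >= self.threshold else None

    def put(self, query_emb, response: str) -> None:
        self.entries.append((query_emb, response))

cache = SemanticCache(threshold=0.9)
cache.put([1.0, 0.0, 0.1], "cached answer")
print(cache.get([0.99, 0.02, 0.12]))  # near-duplicate query: cache hit
print(cache.get([0.0, 1.0, 0.0]))     # unrelated query: None
```

A production gateway would replace the linear scan with a vector index and tune the threshold against false-hit rates.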


Document Q&A System

Enterprise Knowledge

NEW

Design a document Q&A system that handles 10,000+ PDFs — PDF parsing, table/image extraction, chunking, hybrid retrieval with re-ranking, and citation-grounded generation.

Key concepts: PDF parsing (PyMuPDF, Unstructured), OCR, recursive chunking, bi-encoder embeddings, HNSW, BM25 + dense hybrid retrieval, cross-encoder re-ranking, citation extraction, incremental indexing, ACL-aware retrieval

Difficulty: ⭐⭐⭐⭐ Hard
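Recursive chunking, as named in the key concepts, tries coarse separators first and falls back to finer ones. A simplified sketch (separator text at chunk boundaries is dropped):

```python
def recursive_chunk(text: str, max_len: int = 200,
                    separators=("\n\n", "\n", ". ", " ")) -> list[str]:
    """Split text at the coarsest separator that keeps pieces under max_len,
    falling back to finer separators (paragraph -> line -> sentence -> word)."""
    if len(text) <= max_len:
        return [text] if text.strip() else []
    for sep in separators:
        if sep in text:
            chunks, current = [], ""
            for part in text.split(sep):
                candidate = current + sep + part if current else part
                if len(candidate) <= max_len:
                    current = candidate
                else:
                    if current:
                        chunks.append(current)
                    if len(part) > max_len:
                        # part itself too long: recurse with finer separators
                        chunks.extend(recursive_chunk(part, max_len, separators))
                        current = ""
                    else:
                        current = part
            if current:
                chunks.append(current)
            return chunks
    # no separator found anywhere: hard character split
    return [text[i:i + max_len] for i in range(0, len(text), max_len)]

chunks = recursive_chunk(("A" * 150) + "\n\n" + ("B" * 150), max_len=200)
print([len(c) for c in chunks])  # [150, 150]
```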


Hallucination Detection & Prevention

Trust & Safety

NEW

Design a system to detect and prevent LLM hallucinations in customer-facing products — claim extraction, fact verification, confidence scoring, and defense-in-depth guardrails.

Key concepts: Claim extraction (NLI), fact verification (knowledge graph + search), self-consistency checking, token-level confidence scoring, output grounding, guardrails, human-in-the-loop review, calibration fine-tuning

Difficulty: ⭐⭐⭐⭐⭐ Very Hard
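Self-consistency checking can be sketched in a few lines: sample the model several times at nonzero temperature and treat disagreement as low confidence. The answer strings below are hypothetical model samples:

```python
from collections import Counter

def self_consistency(samples: list[str]) -> tuple[str, float]:
    """Return (majority answer, agreement ratio) across N sampled answers.
    Low agreement is a cheap hallucination signal that can trigger the
    more expensive claim-extraction / fact-verification path."""
    counts = Counter(s.strip().lower() for s in samples)
    answer, count = counts.most_common(1)[0]
    return answer, count / len(samples)

answer, confidence = self_consistency(["Paris", "Paris", "paris", "Lyon"])
print(answer, confidence)  # paris 0.75
if confidence < 0.9:
    print("low agreement: route to fact verification")
```

Real systems compare normalized claims rather than raw strings, but the routing logic is the same.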


AI Code Assistant

Developer Tools

Design an AI code completion and chat system like Gemini Code Assist / GitHub Copilot — IDE integration, repository-aware context, and low-latency suggestions.

Key concepts: Fill-in-the-middle (FIM), tree-sitter AST, repository-level context, speculative decoding, streaming, telemetry-driven evaluation, code safety

Difficulty: ⭐⭐⭐⭐⭐ Very Hard
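A fill-in-the-middle (FIM) prompt is mostly string assembly around sentinel tokens. This sketch uses the StarCoder token names; other models use different sentinels (e.g. Code Llama's `<PRE>`/`<SUF>`/`<MID>`):

```python
def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Assemble a FIM prompt in PSM (prefix-suffix-middle) order; the model
    is trained to generate the middle segment after <fim_middle>."""
    return f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>"

# The cursor sits between prefix and suffix; the model fills the gap.
prefix = "def add(a, b):\n    return "
suffix = "\n\nprint(add(1, 2))\n"
print(build_fim_prompt(prefix, suffix))
```

The hard part in production is selecting the prefix/suffix windows and injecting repository-level context without blowing the latency budget.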


AI Agent System

Autonomous AI

NEW

Design an autonomous AI agent system that can plan, use tools, maintain memory, and execute multi-step tasks — like Google's AI agents or Anthropic's computer use.

Key concepts: ReAct pattern, tool calling, task decomposition, working + semantic memory, multi-agent orchestration, sandboxed execution, human-in-the-loop

Difficulty: ⭐⭐⭐⭐⭐ Very Hard
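The ReAct pattern named above can be reduced to a small loop. The `llm()` function here is a canned stub with a fixed policy, not a real model call:

```python
# Minimal ReAct-style loop: the model alternates Thought -> Action ->
# Observation until it emits a final answer; tools are plain functions
# looked up by name.

TOOLS = {
    "calculator": lambda expr: str(eval(expr, {"__builtins__": {}})),
}

def llm(transcript: str) -> str:
    """Stub standing in for a model call: computes 17 * 4, then finishes."""
    if "Observation:" not in transcript:
        return "Thought: I need to multiply.\nAction: calculator[17 * 4]"
    return "Thought: I have the result.\nFinal Answer: 68"

def run_agent(task: str, max_steps: int = 5) -> str:
    transcript = f"Task: {task}\n"
    for _ in range(max_steps):
        step = llm(transcript)
        transcript += step + "\n"
        if "Final Answer:" in step:
            return step.split("Final Answer:")[1].strip()
        if "Action:" in step:
            call = step.split("Action:")[1].strip()    # e.g. calculator[17 * 4]
            name, arg = call.split("[", 1)
            observation = TOOLS[name.strip()](arg.rstrip("]"))
            transcript += f"Observation: {observation}\n"
    return "gave up"

print(run_agent("What is 17 * 4?"))  # 68
```

The `max_steps` cap and the tool allowlist are the seeds of the sandboxing and human-in-the-loop controls a real agent system needs.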


LLM Content Moderation

Trust & Safety

Design a content moderation system using LLMs for text, image, and video — policy enforcement, appeals, and human-in-the-loop at scale.

Key concepts: Multi-modal classifiers, policy-as-code, cascade architecture (fast→accurate), adversarial robustness, human review queues, appeals, regulatory compliance

Difficulty: ⭐⭐⭐⭐ Hard
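A sketch of the cascade architecture: a fast, cheap model handles clear cases and only the uncertain band escalates. The keyword "classifier" and its thresholds are toy stand-ins:

```python
def fast_classifier(text: str) -> float:
    """Stub for a cheap, high-recall first-stage model returning p(violating)."""
    bad_words = {"scam", "attack"}
    hits = sum(w in text.lower() for w in bad_words)
    return min(1.0, 0.5 * hits)

def cascade_moderate(text: str,
                     allow_below: float = 0.1,
                     block_above: float = 0.9) -> str:
    """Cheap model decides clear cases; the uncertain middle band escalates
    to an expensive LLM, and still-ambiguous items go to human review."""
    p = fast_classifier(text)
    if p < allow_below:
        return "allow"
    if p > block_above:
        return "block"
    # In production this branch would call a larger model before humans.
    return "escalate_to_llm_then_human"

print(cascade_moderate("hello friend"))    # allow
print(cascade_moderate("free scam offer")) # escalate_to_llm_then_human
print(cascade_moderate("scam attack"))     # block
```

The economics come from the band widths: most traffic never touches the expensive model.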


Text-to-Image Generation

Generative Media

NEW

Design a text-to-image generation system like Imagen / DALL-E 3 / Midjourney — from text prompts to high-quality images with safety and copyright controls.

Key concepts: Diffusion models, latent diffusion, classifier-free guidance, CLIP/T5 conditioning, ControlNet, LoRA, super-resolution cascades, content provenance (C2PA)

Difficulty: ⭐⭐⭐⭐⭐ Very Hard
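Classifier-free guidance (CFG) is a one-line formula applied at each denoising step. A sketch on plain lists; real systems apply it element-wise to noise-prediction tensors:

```python
def classifier_free_guidance(eps_uncond, eps_cond, scale: float):
    """Combine unconditional and conditional noise predictions:
    eps = eps_uncond + scale * (eps_cond - eps_uncond).
    scale = 1 reproduces the conditional model; larger values trade sample
    diversity for prompt adherence (around 7.5 is a common default)."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]

print(classifier_free_guidance([0.1, 0.2], [0.3, 0.1], scale=7.5))
```

This is why serving diffusion models costs two forward passes per step: one conditioned on the prompt and one on an empty prompt.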


Multi-Modal Search

Multi-Modal AI

Design a multi-modal search system like Google Lens — search across text, images, video, and audio with a unified embedding space.

Key concepts: CLIP/SigLIP embeddings, unified vector index, cross-modal retrieval, OCR integration, late-interaction models, query understanding, multi-modal fusion

Difficulty: ⭐⭐⭐⭐ Hard
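Cross-modal retrieval reduces to nearest-neighbor search once text and images share an embedding space. The embeddings below are hand-made toys, not CLIP outputs:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

def cross_modal_search(text_emb, image_index, top_k: int = 2):
    """Rank image embeddings by cosine similarity to a text embedding.
    This only works because a CLIP/SigLIP-style model maps both modalities
    into one shared vector space."""
    ranked = sorted(image_index,
                    key=lambda item: cosine(text_emb, item[1]),
                    reverse=True)
    return [name for name, _ in ranked[:top_k]]

image_index = [
    ("dog.jpg",  [0.9, 0.1, 0.0]),
    ("car.jpg",  [0.0, 0.1, 0.9]),
    ("wolf.jpg", [0.8, 0.3, 0.1]),
]
print(cross_modal_search([1.0, 0.2, 0.0], image_index))
# ['dog.jpg', 'wolf.jpg']
```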


ML Training Platform

ML Infrastructure

Design an ML training platform like Vertex AI / SageMaker — job scheduling, distributed training, experiment tracking, and GPU cluster management.

Key concepts: GPU scheduling (gang scheduling), checkpointing, elastic training, experiment tracking, hyperparameter tuning, multi-tenancy, cost attribution

Difficulty: ⭐⭐⭐⭐⭐ Very Hard
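Gang scheduling, named above, can be illustrated with a FIFO sketch: a distributed job either receives its entire GPU request at once or waits, because partial allocations across jobs can deadlock each other:

```python
def gang_schedule(jobs, free_gpus: int):
    """FIFO gang scheduling sketch: start a job only if ALL of its requested
    GPUs are free right now; otherwise queue it whole."""
    started, queued = [], []
    for name, gpus_needed in jobs:
        if gpus_needed <= free_gpus:
            free_gpus -= gpus_needed
            started.append(name)
        else:
            queued.append(name)  # wait for a full gang, never start partially
    return started, queued

jobs = [("llm-pretrain", 64), ("lora-ft", 8), ("eval-run", 4)]
print(gang_schedule(jobs, free_gpus=16))
# (['lora-ft', 'eval-run'], ['llm-pretrain'])
```

Note the sketch lets small jobs jump ahead of the queued large one; real schedulers add backfill limits and reservations to avoid starving big jobs.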


LLM Fine-Tuning Platform

GenAI Training

Design a VPC-native LLM fine-tuning platform for private enterprise data — curation, LoRA/QLoRA, differential privacy, experiment tracking, evaluation vs. base, and blue-green deployment.

Key concepts: Instruction SFT formatting, deduplication, LoRA/QLoRA, gradient accumulation, checkpointing, model merging, MLflow/W&B, differential privacy, A/B and shadow eval, catastrophic forgetting

Difficulty: ⭐⭐⭐⭐⭐ Very Hard
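The LoRA parameter savings are easy to make concrete: for a frozen weight matrix W of shape d_out × d_in, LoRA trains only the factors of a rank-r update, W' = W + (alpha / r) · B @ A:

```python
def lora_trainable_params(d_in: int, d_out: int, r: int) -> tuple[int, int]:
    """LoRA freezes W (d_out x d_in) and trains A (r x d_in) and B (d_out x r).
    Returns (full fine-tune params, LoRA params) for one weight matrix."""
    full = d_out * d_in
    lora = r * d_in + d_out * r
    return full, lora

full, lora = lora_trainable_params(4096, 4096, r=8)
print(full, lora, f"{100 * lora / full:.2f}% of full")
# 16777216 65536 0.39% of full
```

That two-to-three-orders-of-magnitude reduction is what makes per-tenant adapters and cheap blue-green swaps feasible on a shared base model.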


LLM Evaluation Pipeline

LLM Operations

NEW

Design an evaluation pipeline for LLM-based products — automated benchmarks, LLM-as-judge, human evaluation with Elo ratings, safety testing, A/B testing, and golden dataset regression.

Key concepts: MMLU/HumanEval/GSM8K benchmarks, BLEU/ROUGE/pass@k metrics, LLM-as-judge with calibration, Elo rating computation, red-teaming, golden dataset regression, statistical significance testing, online vs offline evaluation

Difficulty: ⭐⭐⭐⭐ Hard
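Elo rating from pairwise human (or LLM-judge) preferences uses the standard chess update rule:

```python
def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """One pairwise comparison: score_a is 1 if model A's output was
    preferred, 0 if B's, 0.5 for a tie. Standard expected-score update."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    r_a_new = r_a + k * (score_a - expected_a)
    r_b_new = r_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return r_a_new, r_b_new

# Two models start at 1000; model A wins the comparison.
ra, rb = elo_update(1000.0, 1000.0, score_a=1.0)
print(ra, rb)  # 1016.0 984.0
```

Leaderboards typically run many such updates over randomized, blinded pairings and report confidence intervals rather than point ratings.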


Prompt Management & Versioning

LLM Operations

NEW

Design a prompt management and versioning system — prompt registry, Jinja templating, A/B testing, golden dataset evaluation, environment promotion, and one-click rollback.

Key concepts: Prompt registry, version DAG, Jinja templating, semantic diff, A/B traffic splitting, golden dataset evaluation, environment promotion (dev → staging → prod), rollback, audit trail, prompt composition and chaining

Difficulty: ⭐⭐⭐⭐ Hard
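Environment promotion and rollback can be sketched as environment pointers over an append-only version store. This is a toy in-memory stand-in, not any real product's API:

```python
class PromptRegistry:
    """Minimal versioned prompt registry: publish appends versions, each
    environment points at one version, and prod keeps a promotion history
    so rollback is a single pointer move."""
    def __init__(self):
        self.versions: list[str] = []           # append-only templates
        self.env: dict[str, int] = {}           # env name -> version index
        self.prod_history: list[int] = []       # previous prod versions

    def publish(self, template: str) -> int:
        self.versions.append(template)
        return len(self.versions) - 1

    def promote(self, env: str, version: int) -> None:
        if env == "prod" and "prod" in self.env:
            self.prod_history.append(self.env["prod"])
        self.env[env] = version

    def rollback_prod(self) -> None:
        self.env["prod"] = self.prod_history.pop()

    def resolve(self, env: str) -> str:
        return self.versions[self.env[env]]

reg = PromptRegistry()
v1 = reg.publish("Answer concisely: {question}")
reg.promote("prod", v1)
v2 = reg.publish("Answer concisely, citing sources: {question}")
reg.promote("prod", v2)
reg.rollback_prod()
print(reg.resolve("prod"))  # Answer concisely: {question}
```

Gating `promote("prod", ...)` on a golden-dataset evaluation pass is where this sketch connects to the evaluation pipeline.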


Vector Database

AI Infrastructure

NEW

Design a vector database purpose-built for AI applications — HNSW, IVF-PQ indexing, hybrid search with metadata filtering, and billion-scale ANN at sub-10ms latency.

Key concepts: HNSW graph traversal, IVF-PQ quantization, distance metrics (cosine, L2, IP), hybrid pre/post-filtering, vector-space-aware sharding, memory-mapped indexes, tiered storage

Difficulty: ⭐⭐⭐⭐⭐ Very Hard
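Hybrid pre-filtering, as named above, restricts candidates by metadata before the vector search runs. This sketch uses a brute-force scan where a real engine would use an ANN index such as HNSW or IVF-PQ:

```python
import math

def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def search_prefiltered(query, vectors, metadata, where, top_k: int = 2):
    """Pre-filter: keep only vectors whose metadata satisfies the predicate,
    then run nearest-neighbor search over the survivors."""
    candidates = [i for i, m in enumerate(metadata) if where(m)]
    candidates.sort(key=lambda i: l2(query, vectors[i]))
    return candidates[:top_k]

vectors = [[0.0, 0.0], [1.0, 1.0], [0.1, 0.0], [5.0, 5.0]]
metadata = [{"lang": "en"}, {"lang": "en"}, {"lang": "de"}, {"lang": "en"}]
print(search_prefiltered([0.0, 0.0], vectors, metadata,
                         where=lambda m: m["lang"] == "en"))  # [0, 1]
```

The pre- vs post-filtering trade-off in the key concepts is exactly about when that predicate runs: before the index (correct recall, but the index must support filtered traversal) or after (simple, but top-k can come back short).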


Quick Reference: System Comparison

| System | Latency Target | Key Challenge | Primary Metric |
|---|---|---|---|
| LLM Chatbot | TTFT < 500ms | GPU cost, safety | User satisfaction, Helpfulness |
| Enterprise RAG | < 3s end-to-end | Hallucination, ACLs | Faithfulness, Recall@K |
| LLM Gateway | < 20ms overhead | Multi-provider resilience | Availability, cost savings |
| AI Code Assistant | < 200ms (completion) | Context window, accuracy | Acceptance rate, Keystroke savings |
| AI Agent System | < 5min per task | Planning, tool reliability | Task completion rate |
| Content Moderation | < 500ms | Adversarial attacks, fairness | Precision, Recall, FPR |
| Text-to-Image | < 10s per image | Safety, quality | FID, CLIP score, Human preference |
| Vector Database | < 10ms P99 | Billion-scale ANN, recall | Recall@K, QPS |
| Multi-Modal Search | < 300ms | Cross-modal alignment | NDCG@K, Recall@K |
| Document Q&A | < 3s end-to-end | PDF parsing, multi-doc queries | Answer accuracy, Citation precision |
| Hallucination Detection | < 1s overhead | Claim verification at scale | Hallucination rate, Precision/Recall |
| ML Training Platform | N/A (throughput) | GPU utilization, fault tolerance | MFU, Job completion rate |
| LLM Fine-Tuning Platform | N/A (train) / inherits inference | Data residency, eval gates, forgetting | Win-rate vs base, ε budget, rollback time |
| LLM Evaluation Pipeline | < 30min per eval run | Subjective quality, no ground truth | Benchmark scores, Elo ratings |
| Prompt Management | < 10ms resolution | Version drift, A/B correctness | Prompt win-rate, Rollback time |

GenAI System Design Framework

Use this adapted framework in your interviews:

1. Problem Setup (5 min)

  • What modality? (text, image, video, code, multi-modal)
  • What is the generation/retrieval task?
  • Latency, throughput, cost constraints?
  • Safety and compliance requirements?
  • Success metrics (both ML and business)

2. Model & Data Strategy (10 min)

  • Base model selection (size, architecture, open vs proprietary)
  • Fine-tuning vs prompt engineering vs RAG
  • Training data: collection, labeling, quality
  • Alignment strategy (RLHF, DPO, Constitutional AI)
  • Evaluation methodology

3. Serving Infrastructure (10 min)

  • GPU cluster sizing and instance types
  • Inference optimization (batching, quantization, KV-cache)
  • Streaming architecture (SSE, WebSocket)
  • Auto-scaling strategy
  • Cost optimization (spot instances, model distillation)

4. Safety & Quality (5 min)

  • Input/output guardrails
  • Hallucination mitigation
  • PII detection and filtering
  • Adversarial robustness
  • Human-in-the-loop workflows

5. Monitoring & Iteration (5 min)

  • Online evaluation metrics
  • A/B testing framework
  • Feedback collection
  • Retraining and model update pipeline
  • Cost and latency dashboards

Interview Tips for GenAI Design

Tip

What differentiates a pass from a strong-hire at Google:

  • Pass: Correct high-level architecture with RAG or fine-tuning
  • Strong hire: Deep discussion of inference optimization (PagedAttention, continuous batching), safety layering, evaluation methodology, and cost modeling

Common mistakes to avoid:

  1. Treating LLM inference like traditional web service scaling (it's GPU-bound, not CPU-bound)
  2. Ignoring safety and guardrails entirely
  3. Not discussing how to evaluate generation quality
  4. Assuming unlimited context windows solve all problems
  5. Forgetting about cost — GPU inference is 100-1000x more expensive than traditional APIs
  6. Not separating retrieval from generation in RAG systems
  7. Ignoring multi-provider resilience (single provider = single point of failure)
  8. Treating agents as simple chatbots (planning, memory, tool use are distinct subsystems)


How These Connect

GenAI/ML Fundamentals              GenAI System Design Questions
─────────────────────              ─────────────────────────────
Model Serving          ──────►     LLM Chatbot, Code Assistant, LLM Gateway
LLM Systems            ──────►     Enterprise RAG, Chatbot, AI Agents, Document Q&A
Feature Stores         ──────►     Content Moderation
Distributed Training   ──────►     ML Training Platform, Text-to-Image, Fine-Tuning Platform
Data Pipelines         ──────►     All GenAI Systems
LLM Evaluation         ──────►     LLM Evaluation Pipeline, Hallucination Detection
RLHF & Alignment       ──────►     Fine-Tuning Platform, Prompt Management

ML System Design                   GenAI System Design Questions
────────────────                   ─────────────────────────────
Image Search           ──────►     Multi-Modal Search, Document Q&A
Search Ranking         ──────►     Enterprise RAG (retrieval), Document Q&A
Fraud Detection        ──────►     Content Moderation (cascade), Hallucination Detection
Ads Ranking            ──────►     LLM Gateway (cost optimization)

Note

Master the GenAI/ML Fundamentals building blocks first, then apply them here in full end-to-end system designs.