GenAI/ML Fundamentals¶
Core infrastructure topics for Generative AI and Machine Learning — the building blocks that appear in every ML system design interview.
Why a Separate Section?¶
Just as load balancing, caching, and databases are fundamentals for software system design, topics like model serving, feature stores, and distributed training are fundamentals for ML/GenAI system design. You need to master these before tackling full ML system design questions.
| Software System Fundamentals | GenAI/ML Fundamentals |
|---|---|
| Load balancing | Model serving |
| Caching | Feature stores |
| Databases | Vector databases |
| Data pipelines (ETL) | Data pipelines for ML |
| Networking | LLM APIs / RAG |
| Distributed systems | Distributed training |
Note
These topics are particularly important for interviews at companies building AI-native products (OpenAI, Anthropic, Google DeepMind, Meta AI) and for ML Platform / MLOps roles at any company.
Recommended Study Order¶
| Order | Topic | Time | Why This Order |
|---|---|---|---|
| 1 | Data Pipelines for ML | 2-3 hours | Data comes first — can't train without it |
| 2 | Feature Stores | 2-3 hours | Organize features for training and serving |
| 3 | Model Serving | 3-4 hours | Get models into production |
| 4 | Distributed Training | 3-4 hours | Scale training to large models |
| 5 | Large Language Models | 4-5 hours | The most complex — builds on everything above |
| 6 | LLM Evaluation | 2-3 hours | Benchmarks, LLM-as-judge, RAG eval, production metrics — needed for every GenAI design |
| 7 | RLHF & Alignment | 3-4 hours | PPO, DPO, Constitutional AI, safety alignment — the most important GenAI topic after LLMs |
What's Covered¶
Model Serving¶
Infrastructure
Design production model serving — REST/batch inference, versioning, A/B testing, drift detection, and GPU auto-scaling.
Key concepts: TorchServe, Triton, dynamic batching, canary deployments, ONNX quantization, model registry
Difficulty: ⭐⭐⭐ Medium-Hard
Feature Stores¶
Data Platform
Design a centralized feature management platform for training and serving consistency.
Key concepts: Train-serve skew, point-in-time joins, online/offline stores, Feast, feature versioning, stream materialization
Difficulty: ⭐⭐⭐ Medium-Hard
Data Pipelines for ML¶
Data Engineering
Design end-to-end data pipelines for ML training — ingestion, transformation, validation, and orchestration.
Key concepts: Medallion architecture, Airflow DAGs, Kubeflow, data validation, dataset versioning, pipeline monitoring
Difficulty: ⭐⭐⭐ Medium-Hard
Large Language Models¶
GenAI
Design production LLM systems — RAG, prompt engineering, fine-tuning, vector databases, and serving at scale.
Key concepts: RAG pipeline, chunking, vector DBs (Pinecone, Qdrant), LoRA fine-tuning, vLLM, guardrails, LLM evaluation
Difficulty: ⭐⭐⭐⭐ Hard
Distributed Training¶
Training Infrastructure
Design training infrastructure that scales deep learning across hundreds of GPUs.
Key concepts: Data parallelism (DDP), ZeRO, tensor parallelism, pipeline parallelism, DeepSpeed, mixed precision, fault tolerance
Difficulty: ⭐⭐⭐⭐ Hard
LLM Evaluation¶
Evaluation & Quality
Offline and online evaluation for LLMs — automatic metrics, LLM-as-judge, benchmarks (MMLU, Arena, HumanEval), RAGAS-style RAG metrics, production A/B and guardrails.
Key concepts: BLEU/ROUGE/BERTScore, human agreement, Elo/Arena, golden sets, faithfulness vs relevance, benchmark contamination
Difficulty: ⭐⭐⭐ Medium-Hard
RLHF & Alignment¶
Alignment & Safety
NEW
The complete alignment pipeline — from SFT through reward modeling to PPO/DPO. Constitutional AI, safety alignment, preference data collection, and production alignment loops.
Key concepts: SFT, reward model (Bradley-Terry), PPO with KL penalty, DPO loss, IPO/KTO/ORPO variants, Constitutional AI, red teaming, alignment tax, online RLHF
Difficulty: ⭐⭐⭐⭐ Hard
Quick Reference¶
| Topic | Focus | Key Challenge | Primary Language |
|---|---|---|---|
| Model Serving | Inference APIs | Latency, GPU utilization | Python |
| Feature Stores | Feature management | Train-serve consistency | Python |
| Data Pipelines | Data quality | Validation, lineage | Python |
| LLM Systems | GenAI applications | Hallucination, cost | Python |
| Distributed Training | Training at scale | Communication overhead | Python |
| LLM Evaluation | Quality & benchmarks | Subjective quality, RAG grounding | Python |
| RLHF & Alignment | Model alignment | Safety vs helpfulness trade-off | Python |
How These Connect to System Design Questions¶
GenAI/ML Fundamentals ML System Design Questions
───────────────────── ─────────────────────────
Model Serving ──────► Image Caption Generator, all ML systems
Feature Stores ──────► Recommendation System, Fraud Detection, Ads Ranking
Data Pipelines ──────► Fraud Detection, Recommendation System
GenAI/ML Fundamentals GenAI System Design Questions
───────────────────── ─────────────────────────────
LLM Systems ──────► LLM Chatbot, Enterprise RAG, Code Assistant, AI Agents
LLM Evaluation ──────► All GenAI systems (offline gates + online KPIs)
RLHF & Alignment ──────► LLM Chatbot (safety), Content Moderation, Agents
Model Serving ──────► LLM Chatbot, Code Assistant, LLM Gateway
Distributed Training ──────► ML Training Platform, Text-to-Image
Feature Stores ──────► Content Moderation, Ads Ranking
Data Pipelines ──────► All GenAI Systems
Tip
Master the fundamentals first, then apply them in the ML System Design (10 designs) and GenAI System Design (10 designs with interview transcripts) sections.