GenAI/ML Fundamentals

Core infrastructure topics for Generative AI and Machine Learning: the building blocks that recur across ML system design interviews.

Why a Separate Section?

Just as load balancing, caching, and databases are fundamentals for software system design, topics like model serving, feature stores, and distributed training are fundamentals for ML/GenAI system design. You need to master these before tackling full ML system design questions.

Software System Fundamentals	GenAI/ML Fundamentals
Load balancing	Model serving
Caching	Feature stores
Databases	Vector databases
Data pipelines (ETL)	Data pipelines for ML
Networking	LLM APIs / RAG
Distributed systems	Distributed training

Note

These topics are particularly important for interviews at companies building AI-native products (OpenAI, Anthropic, Google DeepMind, Meta AI) and for ML Platform / MLOps roles at any company.

Recommended Study Order

Order	Topic	Time	Why This Order
1	Data Pipelines for ML	2-3 hours	Data comes first — can't train without it
2	Feature Stores	2-3 hours	Organize features for training and serving
3	Model Serving	3-4 hours	Get models into production
4	Distributed Training	3-4 hours	Scale training to large models
5	Large Language Models	4-5 hours	The most complex — builds on everything above
6	LLM Evaluation	2-3 hours	Benchmarks, LLM-as-judge, RAG eval, production metrics — needed for every GenAI design
7	RLHF & Alignment	3-4 hours	PPO, DPO, Constitutional AI, safety alignment — the most important GenAI topic after LLMs

What's Covered

Model Serving

Infrastructure

Design production model serving — REST/batch inference, versioning, A/B testing, drift detection, and GPU auto-scaling.

Key concepts: TorchServe, Triton, dynamic batching, canary deployments, ONNX quantization, model registry

Difficulty: ⭐⭐⭐ Medium-Hard
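To make dynamic batching concrete, here is a minimal offline sketch of the grouping rule. It assumes a batch closes either when it is full or when the next request arrives after the oldest queued request has already waited `max_wait_ms`. The `Request` class, `form_batches`, and both parameters are hypothetical names for illustration. Servers such as Triton and TorchServe implement batching through their own configuration, which is not shown here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Request:
    request_id: str
    arrival_ms: float  # arrival time in milliseconds (hypothetical field)


def form_batches(
    requests: List[Request], max_batch_size: int, max_wait_ms: float
) -> List[List[Request]]:
    """Group requests into batches, simulating a dynamic batcher offline.

    A batch is closed when it reaches max_batch_size, or when the next
    request arrives more than max_wait_ms after the oldest queued request.
    """
    batches: List[List[Request]] = []
    current: List[Request] = []
    for req in sorted(requests, key=lambda r: r.arrival_ms):
        # Timeout rule: flush the queue before admitting a late arrival.
        if current and req.arrival_ms - current[0].arrival_ms > max_wait_ms:
            batches.append(current)
            current = []
        current.append(req)
        # Size rule: flush as soon as the batch is full.
        if len(current) == max_batch_size:
            batches.append(current)
            current = []
    if current:
        batches.append(current)
    return batches
```

The trade-off to discuss in an interview: a larger `max_wait_ms` improves GPU utilization through fuller batches but adds queueing latency to every request.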

Feature Stores

Data Platform

Design a centralized feature management platform for training and serving consistency.

Key concepts: Train-serve skew, point-in-time joins, online/offline stores, Feast, feature versioning, stream materialization

Difficulty: ⭐⭐⭐ Medium-Hard
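A point-in-time join can be sketched with the standard library alone. The idea is that each training example may only see the latest feature value recorded at or before its event timestamp, which is what prevents future data from leaking into training. The `FeatureLog` layout and the function name below are assumptions for illustration, not the API of Feast or any other feature store.

```python
from bisect import bisect_right
from typing import Dict, List, Optional, Tuple

# Hypothetical feature log: entity_id -> [(timestamp, value), ...],
# with each list sorted by timestamp in ascending order.
FeatureLog = Dict[str, List[Tuple[int, float]]]


def point_in_time_lookup(
    log: FeatureLog, entity_id: str, event_ts: int
) -> Optional[float]:
    """Return the latest feature value with timestamp <= event_ts.

    Returns None when the entity has no value recorded by event_ts,
    so values written after the event are never used.
    """
    history = log.get(entity_id, [])
    timestamps = [ts for ts, _ in history]
    # Index of the first timestamp strictly greater than event_ts.
    idx = bisect_right(timestamps, event_ts)
    if idx == 0:
        return None
    return history[idx - 1][1]
```

For example, with values recorded at timestamps 10, 20, and 30, a label event at timestamp 25 is joined to the value from timestamp 20, never the one from 30.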

Data Pipelines for ML

Data Engineering

Design end-to-end data pipelines for ML training — ingestion, transformation, validation, and orchestration.

Key concepts: Medallion architecture, Airflow DAGs, Kubeflow, data validation, dataset versioning, pipeline monitoring

Difficulty: ⭐⭐⭐ Medium-Hard
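As a rough illustration of the data validation step, the sketch below checks rows against a small schema before they reach training. The schema format, column names, and `validate_rows` are hypothetical. Production pipelines typically rely on a dedicated validation library rather than hand-written checks like these.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical schema: column name -> (expected Python type, nullable?)
Schema = Dict[str, Tuple[type, bool]]

EXAMPLE_SCHEMA: Schema = {
    "user_id": (str, False),
    "amount": (float, False),
    "country": (str, True),
}


def validate_rows(rows: List[Dict[str, Any]], schema: Schema) -> List[str]:
    """Return human-readable errors; an empty list means every row passed."""
    errors: List[str] = []
    for i, row in enumerate(rows):
        for column, (expected_type, nullable) in schema.items():
            value = row.get(column)
            if value is None:
                # Missing keys and explicit None are both treated as null.
                if not nullable:
                    errors.append(f"row {i}: {column} is missing")
            elif not isinstance(value, expected_type):
                errors.append(
                    f"row {i}: {column} expected {expected_type.__name__}, "
                    f"got {type(value).__name__}"
                )
    return errors
```

A pipeline would usually fail the run, or quarantine the offending rows, when the returned list is non-empty.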

Large Language Models

GenAI

Design production LLM systems — RAG, prompt engineering, fine-tuning, vector databases, and serving at scale.

Key concepts: RAG pipeline, chunking, vector DBs (Pinecone, Qdrant), LoRA fine-tuning, vLLM, guardrails, LLM evaluation

Difficulty: ⭐⭐⭐⭐ Hard
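Chunking is the first step of most RAG pipelines, so a small sketch may help. This one assumes fixed-size chunks measured in whitespace-separated words with a fixed overlap. Both are simplifications: real systems often chunk by model tokens or by document structure. The function name and parameters are hypothetical.

```python
from typing import List


def chunk_words(words: List[str], chunk_size: int, overlap: int) -> List[List[str]]:
    """Split a word list into fixed-size chunks that share `overlap` words.

    The final chunk may be shorter than chunk_size.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("require chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    chunks: List[List[str]] = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        # Stop once a chunk has reached the end of the input, so no
        # trailing chunk made up only of overlap words is emitted.
        if start + chunk_size >= len(words):
            break
    return chunks
```

Overlap trades index size for recall: sentences that straddle a chunk boundary stay retrievable, at the cost of storing and embedding some text twice.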

Distributed Training

Training Infrastructure

Design training infrastructure that scales deep learning across hundreds of GPUs.

Key concepts: Data parallelism (DDP), ZeRO, tensor parallelism, pipeline parallelism, DeepSpeed, mixed precision, fault tolerance

Difficulty: ⭐⭐⭐⭐ Hard
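A back-of-the-envelope memory estimate shows why ZeRO matters. The sketch below assumes mixed-precision training with Adam, using the accounting from the ZeRO paper: 2 bytes per parameter for fp16 weights, 2 bytes for fp16 gradients, and 12 bytes for fp32 optimizer state (master weights, momentum, variance). It ignores activations, temporary buffers, and fragmentation, so treat the result as a lower bound. The function name is hypothetical.

```python
def model_state_bytes_per_gpu(num_params: float, num_gpus: int, zero_stage: int) -> float:
    """Estimate per-GPU bytes for model states under ZeRO stages 0 to 3.

    Assumes mixed-precision Adam: 2 B/param fp16 weights, 2 B/param fp16
    gradients, 12 B/param fp32 optimizer state. Activations are excluded.
    """
    if zero_stage not in (0, 1, 2, 3):
        raise ValueError("zero_stage must be 0, 1, 2, or 3")
    weights = 2.0 * num_params
    grads = 2.0 * num_params
    optimizer = 12.0 * num_params
    if zero_stage >= 1:  # stage 1 partitions optimizer state
        optimizer /= num_gpus
    if zero_stage >= 2:  # stage 2 also partitions gradients
        grads /= num_gpus
    if zero_stage >= 3:  # stage 3 also partitions the weights
        weights /= num_gpus
    return weights + grads + optimizer
```

Under these assumptions, a 1-billion-parameter model on 8 GPUs needs about 16 GB of model state per GPU with plain data parallelism (stage 0) and about 2 GB per GPU with stage 3.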

LLM Evaluation

Evaluation & Quality

Offline and online evaluation for LLMs: automatic metrics, LLM-as-judge, benchmarks (MMLU, Arena, HumanEval), RAGAS-style RAG metrics, production A/B tests, and guardrails.

Key concepts: BLEU/ROUGE/BERTScore, human agreement, Elo/Arena, golden sets, faithfulness vs relevance, benchmark contamination

Difficulty: ⭐⭐⭐ Medium-Hard
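For the Elo concept, the classic pairwise update is short enough to sketch in full. The K-factor of 32 and the 400-point scale are the conventional chess defaults, used here as assumptions. Public leaderboards may fit ratings with a different procedure, so this illustrates the idea rather than any specific leaderboard's method.

```python
from typing import Tuple


def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the Elo model (400-point scale)."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


def update_elo(
    rating_a: float, rating_b: float, score_a: float, k: float = 32.0
) -> Tuple[float, float]:
    """Update both ratings after one comparison.

    score_a is 1.0 if A's answer was preferred, 0.0 if B's was,
    and 0.5 for a tie. The update is zero-sum.
    """
    expected_a = expected_score(rating_a, rating_b)
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta
```

Two models that start at the same rating each have an expected score of 0.5, so a single win moves the winner up by `k / 2` points and the loser down by the same amount.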

RLHF & Alignment

Alignment & Safety

The alignment pipeline from supervised fine-tuning (SFT) through reward modeling to PPO/DPO, plus Constitutional AI, safety alignment, preference data collection, and production alignment loops.

Key concepts: SFT, reward model (Bradley-Terry), PPO with KL penalty, DPO loss, IPO/KTO/ORPO variants, Constitutional AI, red teaming, alignment tax, online RLHF

Difficulty: ⭐⭐⭐⭐ Hard
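The DPO loss for a single preference pair can be written in a few lines of plain Python. The sketch assumes you already have the summed log-probabilities of the chosen and rejected responses under both the policy and a frozen reference model. The argument names and the default `beta=0.1` are illustrative choices. A real implementation would compute these over batches of tensors.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """DPO loss for one (chosen, rejected) pair.

    loss = -log(sigmoid(beta * (chosen log-ratio - rejected log-ratio))),
    where each log-ratio is policy log-prob minus reference log-prob.
    """
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_ratio - rejected_ratio)
    # -log(sigmoid(m)) equals softplus(-m); this form avoids overflow
    # for large positive or negative margins.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

When the policy equals the reference model the margin is zero and the loss is `log(2)`. The loss falls as the policy raises the chosen response's likelihood relative to the rejected one, with `beta` controlling how strongly the policy is held close to the reference.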

Quick Reference

Topic	Focus	Key Challenge	Primary Language
Model Serving	Inference APIs	Latency, GPU utilization	Python
Feature Stores	Feature management	Train-serve consistency	Python
Data Pipelines	Data quality	Validation, lineage	Python
LLM Systems	GenAI applications	Hallucination, cost	Python
Distributed Training	Training at scale	Communication overhead	Python
LLM Evaluation	Quality & benchmarks	Subjective quality, RAG grounding	Python
RLHF & Alignment	Model alignment	Safety vs helpfulness trade-off	Python

How These Connect to System Design Questions

GenAI/ML Fundamentals              ML System Design Questions
─────────────────────              ─────────────────────────
Model Serving          ──────►     Image Caption Generator, all ML systems
Feature Stores         ──────►     Recommendation System, Fraud Detection, Ads Ranking
Data Pipelines         ──────►     Fraud Detection, Recommendation System

GenAI/ML Fundamentals              GenAI System Design Questions
─────────────────────              ─────────────────────────────
LLM Systems            ──────►     LLM Chatbot, Enterprise RAG, Code Assistant, AI Agents
LLM Evaluation         ──────►     All GenAI systems (offline gates + online KPIs)
RLHF & Alignment       ──────►     LLM Chatbot (safety), Content Moderation, Agents
Model Serving          ──────►     LLM Chatbot, Code Assistant, LLM Gateway
Distributed Training   ──────►     ML Training Platform, Text-to-Image
Feature Stores         ──────►     Content Moderation, Ads Ranking
Data Pipelines         ──────►     All GenAI systems

Tip

Master the fundamentals first, then apply them in the ML System Design (10 designs) and GenAI System Design (10 designs with interview transcripts) sections.
