ML System Design
Design machine learning systems for production — 10 designs covering ranking, retrieval, personalization, NLP, speech, and ML infrastructure.
What Makes ML Design Different?
ML system design interviews focus on the full ML lifecycle, not just the model:
| Stage | Key Considerations |
|---|---|
| Data | Collection, storage, labeling, versioning |
| Training | Distributed training, experiment tracking |
| Serving | Latency, throughput, model updates |
| Monitoring | Drift detection, performance metrics |
Warning
A common mistake: focusing only on the model architecture. Interviewers want to see you design the entire system around it.
Recommended Study Order
Tip
Follow this order — each design introduces new concepts that build on earlier ones.
| Order | Design | New Concepts Introduced | Prerequisite |
|---|---|---|---|
| 1 | Image Caption Generator | Encoder-decoder, GPU serving, Triton | Model Serving fundamentals |
| 2 | Image Search | Embeddings, vector DBs, ANN indexes | Image Caption (embeddings) |
| 3 | Recommendation System | Two-Tower, collaborative filtering, cold start | Image Search (ANN) |
| 4 | Search Ranking | BM25, LambdaMART, retrieval + ranking | Recommendation (two-stage) |
| 5 | Fraud Detection | Real-time features, class imbalance, ensembles | Search Ranking (ranking) |
| 6 | Real-time Personalization | Session models, bandits, exploration | Recommendation + Fraud |
| 7 | Ads Ranking System | CTR prediction, auctions, budget pacing | All of the above |
| 8 | Real-time Feature Platform | Streaming features, PIT joins, drift | Infra for all above |
| 9 | Machine Translation | Transformer NMT, multilingual, QE, low-resource | Seq2Seq fundamentals |
| 10 | Speech Recognition | CTC/RNN-T, streaming ASR, diarization | Audio processing basics |
Available Designs
Image Caption Generator
Computer Vision
Design a system that generates descriptive captions for images using deep learning.
Key concepts: Encoder-decoder architecture, attention mechanisms, model serving (Triton), batching, caching, GPU optimization
Difficulty: ⭐⭐⭐ Medium-Hard
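As a rough illustration of the batching concept listed above, the sketch below groups request arrival times into batches that close when they hit a size cap or when the oldest request has waited too long. The name `form_batches` and its parameters are hypothetical. This is an offline simulation of the idea, not how Triton or any other server implements its batcher.

```python
def form_batches(arrival_times_ms, max_batch_size, max_wait_ms):
    """Group sorted request arrival times (ms) into batches.

    A batch is closed when it already holds `max_batch_size` requests, or
    when the next request arrives more than `max_wait_ms` after the first
    request in the batch. Hypothetical helper for illustration only.
    """
    batches, current = [], []
    for t in arrival_times_ms:
        batch_full = len(current) == max_batch_size
        waited_too_long = bool(current) and (t - current[0] > max_wait_ms)
        if batch_full or waited_too_long:
            batches.append(current)
            current = []
        current.append(t)
    if current:
        batches.append(current)  # flush the final partial batch
    return batches


if __name__ == "__main__":
    print(form_batches([0, 1, 2, 3, 20, 21], max_batch_size=3, max_wait_ms=5))
```

A larger `max_wait_ms` tends to improve GPU utilization at the cost of tail latency, which is the trade-off to discuss in an interview.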
Image Search System
Vector Search
Design a visual search system for finding similar images or searching by text description.
Key concepts: CLIP embeddings, vector search libraries and databases (FAISS, Pinecone), ANN indexes (IVF, HNSW), multi-modal search, indexing pipelines, re-ranking
Difficulty: ⭐⭐⭐ Medium-Hard
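ANN indexes trade a little accuracy for speed, so it helps to have an exact baseline to measure them against. The following is a minimal sketch of brute-force cosine search and a Recall@K check, assuming non-zero embedding vectors stored as plain Python lists. The function names are illustrative and not taken from FAISS or any vector database.

```python
import math


def cosine_top_k(query, vectors, k):
    """Return indices of the k vectors most similar to `query` (exact search)."""
    def norm(v):
        return math.sqrt(sum(x * x for x in v))

    q_norm = norm(query)
    scored = []
    for idx, v in enumerate(vectors):
        dot = sum(a * b for a, b in zip(query, v))
        scored.append((dot / (q_norm * norm(v)), idx))
    # Highest similarity first; ties broken by lower index for determinism.
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [idx for _, idx in scored[:k]]


def recall_at_k(approx_ids, exact_ids):
    """Fraction of the exact top-k that the approximate index also returned."""
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)


if __name__ == "__main__":
    vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
    exact = cosine_top_k([1.0, 0.1], vectors, k=2)
    print(exact, recall_at_k([0, 1], exact))
```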
Recommendation System
Personalization
Design a recommendation system for e-commerce or content platforms like Netflix/Amazon.
Key concepts: Collaborative filtering, content-based filtering, Two-Tower models, ANN retrieval (FAISS), ranking models, cold start, A/B testing
Difficulty: ⭐⭐⭐⭐ Hard
Search Ranking
Information Retrieval
Design an ML-powered search ranking system (learning-to-rank, retrieval + ranking + re-ranking).
Key concepts: BM25, dense retrieval, hybrid fusion, LambdaMART, cross-encoder re-ranking, position bias, NDCG, serving at scale
Difficulty: ⭐⭐⭐⭐ Hard
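NDCG is the offline metric named above, so it is worth being able to write out. This sketch assumes graded relevance labels listed in ranked order and uses the exponential gain form (2^rel - 1); a linear-gain variant is also common, so confirm which one your team uses.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k results (exponential gain)."""
    return sum(
        (2 ** rel - 1) / math.log2(rank + 2)  # rank is 0-based, so +2
        for rank, rel in enumerate(relevances[:k])
    )


def ndcg_at_k(relevances, k):
    """DCG normalized by the DCG of the ideal (sorted) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


if __name__ == "__main__":
    print(ndcg_at_k([3, 2, 1], k=3))  # already ideally ordered
    print(ndcg_at_k([1, 3], k=2))     # best document ranked second
```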
Real-time Fraud Detection
Low-latency ML
Design a system that detects fraudulent transactions in real-time (<100ms).
Key concepts: Feature engineering (velocity features), class imbalance, ensemble models, rules engine, decision thresholds, case management, drift detection
Difficulty: ⭐⭐⭐⭐ Hard
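A velocity feature such as "transactions on this card in the last 60 seconds" can be sketched with a per-key sliding window. The class below is a hypothetical in-memory illustration that assumes timestamps arrive in non-decreasing order per key. A production system would typically keep this state in a stream processor or a low-latency store.

```python
from collections import deque


class VelocityCounter:
    """Counts events per key within a trailing time window (in seconds)."""

    def __init__(self, window_seconds):
        self.window = window_seconds
        self.events = {}  # key -> deque of event timestamps

    def add_and_count(self, key, ts):
        """Record an event at time `ts` and return the count inside the window."""
        q = self.events.setdefault(key, deque())
        q.append(ts)
        # Evict events that fell out of the window (ts - window, ts].
        while q and q[0] <= ts - self.window:
            q.popleft()
        return len(q)


if __name__ == "__main__":
    counter = VelocityCounter(window_seconds=60)
    for t in (0, 30, 59, 61, 200):
        print(t, counter.add_and_count("card_123", t))
```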
Real-time Personalization
Session-Based ML
Design a real-time personalization system that adapts to user behavior within a session.
Key concepts: Session models (GRU4Rec, SASRec), contextual bandits (Thompson Sampling, LinUCB), real-time feature engineering, multi-task ranking (MMoE), exploration vs exploitation, drift detection
Difficulty: ⭐⭐⭐⭐ Hard
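To make the exploration vs exploitation idea concrete, here is a minimal sketch of Thompson Sampling for binary rewards (for example, click or no click) with a Beta(1, 1) prior per arm. It is non-contextual, so it is simpler than LinUCB or a contextual bandit. The class, the `simulate` helper, and the click rates in the demo are all made up for illustration.

```python
import random


class ThompsonSampler:
    """Beta-Bernoulli Thompson Sampling over a fixed set of arms."""

    def __init__(self, n_arms, seed=None):
        self.successes = [1] * n_arms  # Beta(1, 1) uniform prior
        self.failures = [1] * n_arms
        self.rng = random.Random(seed)

    def select_arm(self):
        # Sample a plausible reward rate per arm, then act greedily on the samples.
        samples = [
            self.rng.betavariate(a, b)
            for a, b in zip(self.successes, self.failures)
        ]
        return max(range(len(samples)), key=samples.__getitem__)

    def update(self, arm, reward):
        if reward:
            self.successes[arm] += 1
        else:
            self.failures[arm] += 1


def simulate(true_rates, rounds, seed=0):
    """Run the sampler against synthetic arms; returns pulls per arm."""
    env_rng = random.Random(seed)
    sampler = ThompsonSampler(len(true_rates), seed=seed)
    pulls = [0] * len(true_rates)
    for _ in range(rounds):
        arm = sampler.select_arm()
        sampler.update(arm, env_rng.random() < true_rates[arm])
        pulls[arm] += 1
    return pulls


if __name__ == "__main__":
    print(simulate([0.1, 0.8], rounds=2000))
```

Over time the sampler should concentrate pulls on the better arm while still occasionally trying the other one.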
Ads Ranking System
Revenue ML
Design an ads ranking system — the core revenue engine at Google, Meta, Amazon. Predict CTR/CVR, run auctions, manage advertiser budgets.
Key concepts: CTR prediction (DCN/DLRM), second-price/GSP auctions, budget pacing, position bias correction, calibration, exploration for new ads, near-real-time training
Difficulty: ⭐⭐⭐⭐⭐ Very Hard
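The interaction between CTR prediction and the auction can be sketched as a generalized second-price (GSP) auction: ads are ranked by bid × predicted CTR, and each winner pays the minimum cost per click needed to keep its slot. This is a simplified textbook form under stated assumptions (positive predicted CTRs, no budget pacing, no position effects); the function name and tuple layout are hypothetical.

```python
def run_gsp_auction(ads, num_slots, reserve_price=0.0):
    """Rank ads by bid * pCTR and charge GSP prices.

    ads: list of (ad_id, bid_per_click, predicted_ctr) with predicted_ctr > 0.
    Returns a list of (ad_id, price_per_click) for the filled slots.
    """
    ranked = sorted(ads, key=lambda a: a[1] * a[2], reverse=True)
    results = []
    for i, (ad_id, _bid, pctr) in enumerate(ranked[:num_slots]):
        if i + 1 < len(ranked):
            _, next_bid, next_pctr = ranked[i + 1]
            # Smallest bid that still outranks the next ad, floored at the reserve.
            price = max(reserve_price, next_bid * next_pctr / pctr)
        else:
            price = reserve_price  # no competitor below: pay the reserve
        results.append((ad_id, price))
    return results


if __name__ == "__main__":
    ads = [("a", 2.0, 0.10), ("b", 3.0, 0.05), ("c", 1.0, 0.08)]
    print(run_gsp_auction(ads, num_slots=2))
```

Because prices depend directly on predicted CTR, miscalibrated predictions change what advertisers pay, which is why calibration appears in the key concepts.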
Real-time Feature Platform
ML Infrastructure
Design a real-time feature platform that computes, stores, and serves ML features with single-digit-millisecond latency, addressing train-serve skew, point-in-time joins, and feature freshness at scale.
Key concepts: Batch vs streaming vs on-demand features, train-serve consistency, point-in-time joins, sliding window aggregations, feature drift monitoring, online/offline store architecture
Difficulty: ⭐⭐⭐⭐ Hard
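A point-in-time join attaches to each training label the feature value that was known at the label's timestamp, never a later one, which is what prevents leakage. The sketch below assumes a single feature whose history is a list of `(timestamp, value)` pairs sorted by timestamp; the helper names are hypothetical.

```python
from bisect import bisect_right


def point_in_time_lookup(feature_history, event_ts):
    """Return the latest feature value with timestamp <= event_ts, else None.

    feature_history: list of (timestamp, value) sorted by timestamp.
    """
    timestamps = [ts for ts, _ in feature_history]
    i = bisect_right(timestamps, event_ts)
    return feature_history[i - 1][1] if i > 0 else None


def build_training_rows(label_events, feature_history):
    """Join each (timestamp, label) event with the feature value as of that time."""
    return [
        (ts, point_in_time_lookup(feature_history, ts), label)
        for ts, label in label_events
    ]


if __name__ == "__main__":
    history = [(10, 1.0), (20, 2.0), (30, 3.0)]
    print(build_training_rows([(15, 0), (30, 1)], history))
```

Whether a feature written at exactly the event timestamp counts as "known" (`<=` versus `<`) is a design decision that should match what the online store would have served.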
Machine Translation
NLP
Design a machine translation system like Google Translate — 100+ languages, text/image/speech, quality estimation, and low-resource language support.
Key concepts: Transformer encoder-decoder, multilingual NMT, BPE/SentencePiece, quality estimation, back-translation, pivot languages, beam search, model distillation for serving
Difficulty: ⭐⭐⭐⭐ Hard
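Beam search keeps the top few partial translations at each step instead of committing to the single best token. The sketch below is a toy version under stated assumptions: `next_token_log_probs` is a hypothetical stand-in for a decoder that maps a prefix to token log-probabilities, and there is no length normalization, which real systems usually add.

```python
import math


def beam_search(next_token_log_probs, beam_width, max_len, eos="</s>"):
    """Return (tokens, log_prob) of the best hypothesis found.

    next_token_log_probs(prefix) -> {token: log_prob}; hypothetical model interface.
    """
    beams = [((), 0.0)]
    finished = []
    for _ in range(max_len):
        candidates = [
            (prefix + (token,), score + lp)
            for prefix, score in beams
            for token, lp in next_token_log_probs(prefix).items()
        ]
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_width]:
            # Hypotheses that emitted EOS leave the beam and are kept aside.
            (finished if seq[-1] == eos else beams).append((seq, score))
        if not beams:
            break
    finished.extend(beams)  # fall back to unfinished hypotheses at max_len
    return max(finished, key=lambda c: c[1])


def toy_model(prefix):
    """Tiny hand-written distribution where greedy decoding is suboptimal."""
    table = {
        (): {"a": 0.6, "b": 0.4},
        ("a",): {"x": 0.5, "y": 0.5},
        ("b",): {"z": 0.9, "w": 0.1},
    }
    probs = table.get(prefix, {"</s>": 1.0})
    return {tok: math.log(p) for tok, p in probs.items()}


if __name__ == "__main__":
    print(beam_search(toy_model, beam_width=1, max_len=5))  # greedy
    print(beam_search(toy_model, beam_width=2, max_len=5))
```

In the toy model, a beam width of 1 commits to "a" first and ends with a lower-probability sequence than a beam width of 2, which recovers the "b z" path.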
Speech Recognition
Audio/Speech
Design a speech recognition (ASR) system like Google Speech-to-Text or Whisper — real-time streaming, speaker diarization, 100+ languages.
Key concepts: Mel spectrograms, CTC/RNN-T, Conformer, streaming inference, speaker diarization, language model fusion, SpecAugment, on-device vs cloud deployment
Difficulty: ⭐⭐⭐⭐ Hard
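CTC models emit one label per audio frame, including a special blank, and decoding collapses that frame sequence into an output sequence. Below is a minimal sketch of greedy CTC decoding, assuming the per-frame argmax label IDs are already computed and that ID 0 is the blank (an assumption; the blank index varies by toolkit).

```python
def ctc_greedy_decode(frame_label_ids, blank_id=0):
    """Collapse repeated frame labels, then drop blanks.

    A blank between two identical labels keeps them as separate outputs,
    which is how CTC represents genuinely repeated characters.
    """
    decoded = []
    prev = None
    for label in frame_label_ids:
        if label != prev and label != blank_id:
            decoded.append(label)
        prev = label
    return decoded


if __name__ == "__main__":
    print(ctc_greedy_decode([1, 1, 0, 1, 2, 2, 0, 0, 3]))
```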
ML System Design Framework
Use this framework in your interviews:
1. Problem Setup (5 min)
- What are we predicting/generating?
- What data is available?
- What are the latency requirements?
- Online vs batch prediction?
- Success metrics (business + ML)
2. Data Pipeline (10 min)
- How is data collected and stored?
- Feature engineering approach
- Data validation and quality checks
- Training/serving data consistency
- Labeling strategy
3. Model Architecture (10 min)
- Model selection and justification
- Training strategy (pre-trained, fine-tuning)
- Evaluation metrics (offline)
- Handling edge cases (cold start, imbalance)
4. Serving Infrastructure (10 min)
- Model serving framework (TensorFlow Serving, TorchServe, Triton)
- Latency optimization (batching, caching, quantization)
- Scaling strategy (horizontal, GPU)
- A/B testing and gradual rollout
5. Monitoring & Iteration (5 min)
- Model performance monitoring
- Data drift detection
- Retraining triggers and pipelines
- Feedback loops
Key ML Concepts to Know
MODEL SERVING
├── Batch Inference → Process large datasets offline
├── Online Inference → Real-time predictions, low latency
├── Model Versioning → Track and rollback model versions
├── Dynamic Batching → Group requests for GPU efficiency
└── Shadow Mode → Test new models without affecting users
RETRIEVAL & RANKING
├── Two-Stage Pipeline → Retrieval (fast, broad) + Ranking (accurate)
├── Vector Indexes → HNSW, IVF index types for ANN search (e.g., in FAISS)
├── Feature Stores → Consistent features for training/serving
└── Re-ranking → Apply business rules, diversity
FEATURE ENGINEERING
├── Real-time Features → Computed on request (velocity, session)
├── Batch Features → Precomputed (user history, aggregates)
├── Feature Store → Feast, Tecton for consistency
└── Embeddings → Dense representations from neural nets
MONITORING
├── Data Drift → Input distribution changes
├── Concept Drift → Relationship between input/output changes
├── Model Degradation → Performance decline over time
├── Latency Metrics → P50, P95, P99 response times
└── Business Metrics → CTR, conversion, revenue impact
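One common way to put a number on data drift is the Population Stability Index (PSI), which compares the binned distribution of a feature at training time against what is seen in serving. The sketch below assumes a numeric feature and caller-supplied, sorted bin edges. The `eps` floor for empty bins and any alerting threshold are assumptions to tune per feature.

```python
import math


def population_stability_index(expected, actual, bin_edges, eps=1e-6):
    """PSI between two samples of one numeric feature.

    bin_edges: sorted cut points; values fall into len(bin_edges) + 1 bins.
    eps: floor applied to empty bins so the logarithm stays defined.
    """
    def bin_fractions(values):
        counts = [0] * (len(bin_edges) + 1)
        for v in values:
            # Bin index = number of edges the value is at or above.
            counts[sum(1 for edge in bin_edges if v >= edge)] += 1
        return [max(c / len(values), eps) for c in counts]

    e = bin_fractions(expected)
    a = bin_fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


if __name__ == "__main__":
    train = [1, 2, 3, 4]
    print(population_stability_index(train, train, bin_edges=[2.5, 4.5]))
    print(population_stability_index(train, [3, 4, 5, 6], bin_edges=[2.5, 4.5]))
```

PSI only measures input drift. Concept drift still needs labeled outcomes or proxy metrics to detect.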
Tip
For deep dives on Model Serving, Feature Stores, Data Pipelines, LLMs, and Distributed Training, see the GenAI/ML Fundamentals section.
Pattern Recognition
| Pattern | Where You'll See It |
|---|---|
| Two-stage retrieval+ranking | Recommendations, Search, Ads Ranking, Fraud Detection |
| Vector embeddings | Image Search, Recommendations, Ads (DLRM) |
| Feature stores | Fraud Detection, Recommendations, Ads Ranking |
| Dynamic batching | Image Captioning, all GPU-based serving |
| A/B testing | All ML systems |
| Ensemble models | Fraud Detection, Recommendations |
| Rules + ML hybrid | Fraud Detection, Content Moderation, Ads |
| Real-time aggregations | Fraud Detection (velocity), Personalization, Ads |
| Session modeling | Real-time Personalization, Recommendations |
| Multi-armed bandits | Personalization, Ads (new ad exploration) |
| Auction mechanics | Ads Ranking (unique to ads) |
| Encoder-decoder | Machine Translation, Image Captioning, Speech Recognition |
| Beam search/decoding | Machine Translation, Speech Recognition |
| Streaming inference | Speech Recognition, Real-time Personalization |
Note
Master these patterns and you can apply them to any new ML system design problem.
Quick Reference: System Comparison
| System | Latency | Key Challenge | Primary Metric |
|---|---|---|---|
| Image Captioning | ~500ms | Model optimization | BLEU, CIDEr |
| Recommendations | <100ms | Cold start, scale | CTR, Conversion |
| Fraud Detection | <100ms | Class imbalance | Precision-Recall |
| Image Search | <200ms | Index at scale | Recall@K, Latency |
| Search Ranking | <200ms | Retrieval + rank budgets | NDCG@K, CTR |
| Personalization | <50ms | Session modeling, cold start | CTR, Session Depth |
| Ads Ranking | <50ms | Revenue × relevance | AUC, Revenue, Calibration |
| Feature Platform | <5ms (serving) | Train-serve consistency | Feature freshness, Drift rate |
| Machine Translation | <200ms | Low-resource, quality | BLEU, Human eval |
| Speech Recognition | <300ms (RTF < 1) | Noise, streaming | WER, Latency |
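Of the primary metrics in the table, word error rate (WER) is simple enough to compute by hand: it is the word-level edit distance between hypothesis and reference, divided by the reference length. A minimal sketch, assuming whitespace tokenization and a non-empty reference (real evaluations usually normalize casing and punctuation first):

```python
def word_error_rate(reference, hypothesis):
    """(substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Rolling-row Levenshtein distance over words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            curr[j] = min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution or match
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    print(word_error_rate("the cat sat", "the dog sat"))
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.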
What's Next?
After mastering ML system design:
- Go deeper on GenAI with GenAI System Design — 10 LLM/GenAI systems with interview transcripts
- Review fundamentals in GenAI/ML Fundamentals — 7 building blocks including LLM Evaluation and RLHF & Alignment
- Practice with transcripts — the GenAI section includes full hypothetical interview walkthroughs
Note
Looking for LLM Chatbot, RAG, Code Assistant, AI Agents, or Text-to-Image designs? These GenAI-specific system designs live in the dedicated GenAI System Design section.