Inference and Serving
This section covers the engineering that determines latency, throughput, and cost when deploying LLMs in production: decoding algorithms, KV cache optimization, speculative decoding, continuous batching, and production serving systems.
Goals
After completing this section you will be able to:
- Implement greedy, beam search, top-k, top-p, and temperature decoding from scratch
- Calculate KV cache memory requirements for any model configuration
- Explain speculative decoding and prove it produces the same distribution as the target model
- Design a continuous batching system with PagedAttention for memory efficiency
- Choose the right serving framework for a given deployment scenario
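As a preview of the KV cache goal above, here is a minimal sketch of the usual back-of-the-envelope memory estimate. It assumes a standard decoder-only transformer that stores one key and one value vector per layer, per KV head, per token. The function name `kv_cache_bytes` and the example configuration (32 layers, head dimension 128, 4096-token context, fp16) are illustrative assumptions, not values taken from this section.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Estimate KV cache size in bytes (hypothetical helper).

    The leading factor of 2 accounts for storing both keys and values.
    bytes_per_elem=2 corresponds to fp16/bf16 storage.
    """
    return (2 * n_layers * n_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


# Illustrative config: multi-head attention with 32 KV heads.
mha = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=4096)

# Same config with grouped-query attention sharing 8 KV heads.
gqa = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=4096)

print(f"MHA: {mha / 2**30:.2f} GiB per sequence")
print(f"GQA: {gqa / 2**30:.2f} GiB per sequence")
```

Under these assumptions the cache scales linearly with the number of KV heads, which is why cutting 32 KV heads to 8 reduces the footprint by a factor of four.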
Topics
| # | Topic | What You Will Learn |
|---|---|---|
| 1 | Decoding Strategies | Greedy, beam search, top-k, top-p, temperature, repetition penalty |
| 2 | KV Cache | Memory layout, GQA savings, PagedAttention, compression |
| 3 | Speculative Decoding | Draft-verify paradigm, acceptance probability, speedup analysis |
| 4 | Continuous Batching | Iteration-level scheduling, PagedAttention, prefix caching |
| 5 | Quantization for Inference | GPTQ, AWQ, GGUF, llama.cpp, quality benchmarks |
| 6 | LLM Serving Systems | vLLM, TGI, TensorRT-LLM, Ollama, deployment patterns |
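To give a flavor of topic 1, the following is a rough sketch of how temperature, top-k, and top-p filtering can be combined before sampling a token. It uses only the Python standard library; the helper names `filtered_probs` and `sample` and the toy logits are assumptions made for illustration and are not taken from the topic pages.

```python
import math
import random


def filtered_probs(logits, temperature=1.0, top_k=None, top_p=None):
    """Return a renormalized distribution after temperature, top-k, top-p.

    temperature must be > 0; greedy decoding corresponds to top_k=1.
    """
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Token indices sorted from most to least probable.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

    if top_k is not None:
        order = order[:top_k]  # keep only the k most probable tokens

    if top_p is not None:
        kept, cumulative = [], 0.0
        for i in order:
            kept.append(i)
            cumulative += probs[i]
            if cumulative >= top_p:  # smallest prefix reaching mass top_p
                break
        order = kept

    mass = sum(probs[i] for i in order)
    out = [0.0] * len(probs)
    for i in order:
        out[i] = probs[i] / mass  # renormalize over the surviving tokens
    return out


def sample(logits, rng=random, **kwargs):
    """Draw one token index from the filtered distribution."""
    probs = filtered_probs(logits, **kwargs)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


toy_logits = [2.0, 1.0, 0.0]
print(filtered_probs(toy_logits, top_k=2))
print(sample(toy_logits, temperature=0.7, top_p=0.9))
```

Applying top-k before top-p is one common ordering; other implementations may order or combine the filters differently.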
Every page includes plain-English math walkthroughs, worked numerical examples, runnable Python code, and FAANG-level interview questions.