Inference and Serving

The engineering that determines latency, throughput, and cost when deploying LLMs in production. From decoding algorithms to KV cache optimization, speculative decoding, continuous batching, and production serving systems.


Goals

After completing this section you will be able to:

  • Implement greedy, beam search, top-k, top-p, and temperature decoding from scratch
  • Calculate KV cache memory requirements for any model configuration
  • Explain speculative decoding and prove it produces the same distribution as the target model
  • Design a continuous batching system with PagedAttention for memory efficiency
  • Choose the right serving framework for a given deployment scenario
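As a taste of the first goal above, here is a minimal sketch of temperature plus top-p (nucleus) sampling over a raw logit vector. The function name and defaults are illustrative choices, not a fixed API; production servers fuse these steps into GPU kernels.

```python
import numpy as np

def sample_top_p(logits, temperature=0.8, top_p=0.9, rng=None):
    """Temperature + nucleus (top-p) sampling from a logit vector.

    Illustrative sketch only; defaults are arbitrary example values.
    """
    rng = rng or np.random.default_rng()
    # Temperature: divide logits before softmax; <1 sharpens, >1 flattens.
    scaled = np.asarray(logits, dtype=np.float64) / temperature
    probs = np.exp(scaled - scaled.max())  # subtract max for stability
    probs /= probs.sum()
    # Nucleus: keep the smallest set of tokens whose mass reaches top_p.
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, top_p) + 1  # include boundary token
    keep = order[:cutoff]
    kept = probs[keep] / probs[keep].sum()  # renormalize over the nucleus
    return int(rng.choice(keep, p=kept))
```

Greedy decoding is the degenerate case: as temperature approaches 0 the softmax concentrates on the argmax token, and a small top_p keeps only the head of the distribution.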
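The KV cache sizing goal reduces to one product: 2 tensors (keys and values) x layers x KV heads x head dimension x sequence length x batch x bytes per element. A sketch, using a Llama-2-7B-style configuration (32 layers, 32 heads of dimension 128, fp16) as an assumed example:

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch=1, bytes_per_elem=2):
    # 2x for keys and values; fp16/bf16 -> 2 bytes per element.
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem

# Assumed Llama-2-7B-style config at a 4096-token context, batch 1:
full = kv_cache_bytes(32, 32, 128, 4096)  # full multi-head attention
gqa = kv_cache_bytes(32, 8, 128, 4096)    # GQA with 8 KV heads
print(full / 2**30, "GiB vs", gqa / 2**30, "GiB")  # → 2.0 GiB vs 0.5 GiB
```

Cutting KV heads from 32 to 8 via grouped-query attention shrinks the cache 4x, which is exactly the lever the KV Cache topic below explores.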

Topics

 #   Topic                        What You Will Learn
 1   Decoding Strategies          Greedy, beam search, top-k, top-p, temperature, repetition penalty
 2   KV Cache                     Memory layout, GQA savings, PagedAttention, compression
 3   Speculative Decoding         Draft-verify paradigm, acceptance probability, speedup analysis
 4   Continuous Batching          Iteration-level scheduling, PagedAttention, prefix caching
 5   Quantization for Inference   GPTQ, AWQ, GGUF, llama.cpp, quality benchmarks
 6   LLM Serving Systems          vLLM, TGI, TensorRT-LLM, Ollama, deployment patterns
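The Speculative Decoding topic's correctness claim can be checked numerically. In the draft-verify paradigm, a draft token x sampled from the draft distribution q is accepted with probability min(1, p[x]/q[x]); on rejection, the token is resampled from the residual max(p - q, 0), renormalized. A minimal sketch with assumed toy distributions (the guarantee is that the output matches the target distribution p exactly):

```python
import numpy as np

def speculative_step(p, q, rng):
    """One draft-verify step; returns a token distributed according to p.

    Assumes q > 0 wherever p > 0 (true for the toy distributions below).
    """
    x = rng.choice(len(q), p=q)              # draft model proposes a token
    if rng.random() < min(1.0, p[x] / q[x]):  # target model verifies
        return x
    residual = np.maximum(p - q, 0)           # rejection: resample residual
    residual /= residual.sum()
    return rng.choice(len(p), p=residual)

rng = np.random.default_rng(0)
p = np.array([0.5, 0.3, 0.2])  # assumed toy target distribution
q = np.array([0.2, 0.3, 0.5])  # assumed toy draft distribution
samples = [speculative_step(p, q, rng) for _ in range(100_000)]
freq = np.bincount(samples, minlength=3) / len(samples)
# Empirical frequencies converge to p, not q, despite sampling from q first.
```

The speedup comes from verifying several drafted tokens in one target-model forward pass; this demo only checks the distributional guarantee for a single step.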

Every page includes plain-English math walkthroughs, worked numerical examples, runnable Python code, and FAANG-level interview questions.