Core Architectures¶
How modern sequence models work at the tensor level — from the original Transformer through GPT, BERT, T5, Mixture of Experts, and state-space models. This is the most interview-critical section of LLMBase.
Goals¶
After completing this section you will be able to:
- Trace a token through every layer of a Transformer block and state the dimension at each step
- Explain Multi-Head, Grouped-Query, and Multi-Query Attention with memory cost trade-offs
- Compare sinusoidal, learned, RoPE, and ALiBi positional encodings and know when each is preferred
- Implement a minimal GPT from scratch and explain the KV cache
- Contrast BERT's masked language modeling with GPT's autoregressive objective
- Describe T5's text-to-text framing and span corruption pre-training
- Explain MoE routing, load balancing, and why sparse models scale better
- Describe how Mamba's selective state spaces achieve linear-time sequence modeling
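As a taste of the shape-tracing skill the first two goals describe, here is a minimal single-head scaled dot-product attention sketch in NumPy. All sizes (`seq_len = 4`, `d_model = 8`) are hypothetical toy values chosen for readability, not taken from any real model:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8  # hypothetical toy sizes

x = rng.standard_normal((seq_len, d_model))   # token embeddings: (4, 8)
W_q = rng.standard_normal((d_model, d_model))
W_k = rng.standard_normal((d_model, d_model))
W_v = rng.standard_normal((d_model, d_model))

q, k, v = x @ W_q, x @ W_k, x @ W_v           # each projection stays (4, 8)
scores = q @ k.T / np.sqrt(d_model)           # token-to-token scores: (4, 4)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)  # softmax: each row sums to 1
out = weights @ v                             # back to (4, 8), same as input

print(out.shape)  # (4, 8)
```

The full Transformer page walks through the same trace with multiple heads, where `d_model` is split into `num_heads` slices of size `d_model // num_heads`.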
Topics¶
| # | Topic | What You Will Learn |
|---|---|---|
| 1 | The Transformer | Full architecture walkthrough, residual stream, Pre-Norm |
| 2 | Self-Attention and MHA | Self vs cross attention, GQA, MQA, masking patterns |
| 3 | Positional Encoding | Sinusoidal, Learned, RoPE, ALiBi, extrapolation |
| 4 | GPT (Decoder-Only) | Causal attention, next-token prediction, KV cache, scaling |
| 5 | BERT (Encoder-Only) | MLM, fine-tuning, embeddings, BERT variants |
| 6 | T5 (Encoder-Decoder) | Text-to-text, span corruption, Flan-T5 |
| 7 | Mixture of Experts | Router design, load balancing, Mixtral, DeepSeek |
| 8 | State Space Models | S4, Mamba, selective gating, hybrid architectures |
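Several rows above (GPT decoding, GQA/MQA) hinge on what is cached between autoregressive steps. The sketch below is a deliberately stripped-down, single-head KV cache with a hypothetical `d_model = 8` and an identity query projection for brevity; it only illustrates the idea that past keys and values are stored rather than recomputed:

```python
import numpy as np

rng = np.random.default_rng(1)
d_model = 8  # hypothetical toy size
W_k = rng.standard_normal((d_model, d_model))
W_v = rng.standard_normal((d_model, d_model))

k_cache, v_cache = [], []  # grow by one row per generated token

def decode_step(x_t):
    """One autoregressive step: project only the new token, reuse cached K/V."""
    k_cache.append(x_t @ W_k)
    v_cache.append(x_t @ W_v)
    K = np.stack(k_cache)          # (t, d_model): past keys are never recomputed
    V = np.stack(v_cache)
    q = x_t                        # identity query projection, for brevity only
    scores = K @ q / np.sqrt(d_model)
    w = np.exp(scores - scores.max())
    w /= w.sum()                   # softmax over all cached positions
    return w @ V                   # attention output: (d_model,)

for _ in range(3):
    out = decode_step(rng.standard_normal(d_model))
print(len(k_cache))  # 3 cached keys after 3 steps
```

GQA and MQA then reduce the memory cost of exactly these cached `K`/`V` tensors by sharing them across query heads, which is the trade-off the attention page quantifies.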
Every page includes plain-English math walkthroughs, worked numerical examples, runnable Python code, and FAANG-level interview questions with expected answer depth.