Skip to content

Core Architectures

How modern sequence models work at the tensor level — from the original Transformer through GPT, BERT, T5, Mixture of Experts, and state-space models. This is the most interview-critical section of LLMBase.


Goals

After completing this section you will be able to:

  • Trace a token through every layer of a Transformer block and state the dimension at each step
  • Explain Multi-Head, Grouped-Query, and Multi-Query Attention with memory cost trade-offs
  • Compare sinusoidal, learned, RoPE, and ALiBi positional encodings and know when each is preferred
  • Implement a minimal GPT from scratch and explain the KV cache
  • Contrast BERT's masked language modeling with GPT's autoregressive objective
  • Describe T5's text-to-text framing and span corruption pre-training
  • Explain MoE routing, load balancing, and why sparse models scale better
  • Describe how Mamba's selective state spaces achieve linear-time sequence modeling

Topics

# Topic What You Will Learn
1 The Transformer Full architecture walkthrough, residual stream, Pre-Norm
2 Self-Attention and MHA Self vs cross attention, GQA, MQA, masking patterns
3 Positional Encoding Sinusoidal, Learned, RoPE, ALiBi, extrapolation
4 GPT (Decoder-Only) Causal attention, next-token prediction, KV cache, scaling
5 BERT (Encoder-Only) MLM, fine-tuning, embeddings, BERT variants
6 T5 (Encoder-Decoder) Text-to-text, span corruption, Flan-T5
7 Mixture of Experts Router design, load balancing, Mixtral, DeepSeek
8 State Space Models S4, Mamba, selective gating, hybrid architectures

Every page includes plain-English math walkthroughs, worked numerical examples, runnable Python code, and FAANG-level interview questions with expected answer depth.