Foundations of Language Modeling¶

The mathematical and conceptual bedrock that existed before the Transformer. Understanding these topics deeply is what separates candidates who can explain why things work from those who only know that they work.

What Changed in This Section

Every page now includes everyday analogies, "Think of it like..." callouts, and simplified introductions before diving into the math. The mathematical content is unchanged — we've added layers of explanation on top so that anyone with high school math can follow along. If a formula looks intimidating, read the paragraph right above it first.

Before You Start¶

Prerequisites¶

This section builds on the Deep Learning Fundamentals (Part 0). You should be comfortable with:

Perceptrons, MLPs, and forward passes — see The Perceptron and Feedforward Networks
Activation functions (sigmoid, tanh, ReLU, softmax) — see Activation Functions
Backpropagation and gradient descent — see Backpropagation and Gradient Descent
Cross-entropy loss — see Loss Functions and Regularization
Sequence modeling and RNN basics — see Sequence Modeling and RNNs

Mathematics used in this section: - Conditional probability and the chain rule — reviewed in Math Prerequisites - Vector notation \(\mathbf{v} \in \mathbb{R}^d\) — explained in Math Prerequisites - Logarithms and exponentials — reviewed in Math Prerequisites - Summation and product notation — explained in Math Prerequisites

Reading Strategy¶

This section progresses from classical language models (n-grams) → distributional semantics (embeddings) → recurrent networks → attention. If you're new to language modeling:

First Pass (Build Intuition): - Focus on language_modeling_basics.md and word_embeddings.md — these introduce core concepts - Read neural_language_models.md for LSTM/GRU gate intuition (skip spectral norm discussion) - Read sequence_to_sequence.md for the encoder-decoder pattern and attention motivation - Read information_theory.md for entropy, cross-entropy, and perplexity (skip forward/reverse KL deep dive) - Read attention_math.md for scaled dot-product attention derivation and the 4×4 worked example

Second Pass (Deepen Understanding): - Re-read with the "Deep Dive" sections included - Study GloVe matrix factorization, PMI, and noise contrastive estimation - Work through the full LSTM numerical trace - Study KL divergence mode-covering vs mode-seeking behavior - Analyze attention complexity (MAC counts) and multi-head projections

What to Skip on First Reading

Deep dives on PMI, noise contrastive estimation, and GloVe factorization (word_embeddings.md)
Spectral norm and Jacobian analysis in vanishing gradients (neural_language_models.md)
Scheduled sampling deep dive (sequence_to_sequence.md)
Forward vs reverse KL divergence, DPO, and Bradley-Terry model (information_theory.md)
FlashAttention and low-rank kernel approximation (attention_math.md)

Goals¶

After completing this section you will be able to:

Derive the chain rule decomposition of language models and explain the Markov assumption
Explain how Word2Vec, GloVe, and FastText learn vector representations of meaning
Trace data through an LSTM cell gate by gate with actual numbers
Describe the encoder-decoder framework and how attention solves the bottleneck problem
Connect entropy, cross-entropy, and KL divergence to LLM training objectives
Derive scaled dot-product attention from first principles and explain why we scale by \(\sqrt{d_k}\)

Topics¶

#	Topic	What You Will Learn
1	Language Modeling	N-grams, chain rule, smoothing, perplexity
2	Word Embeddings	Word2Vec, GloVe, FastText, embedding arithmetic
3	Neural Language Models	RNNs, vanishing gradients, LSTM gates, GRU
4	Sequence-to-Sequence	Encoder-decoder, Bahdanau attention, teacher forcing
5	Information Theory	Entropy, cross-entropy, KL divergence, perplexity
6	Attention Mathematics	Scaled dot-product attention, multi-head attention, masking

Hands-On Notebooks¶

Practice with interactive Jupyter notebooks — each combines toy examples (build from scratch with NumPy/PyTorch) with real-world usage (HuggingFace transformers, gensim):

Notebook	Covers Topics
Language Modeling & Embeddings	Language Modeling, Word Embeddings
Neural LM & Seq2Seq	Neural Language Models, Sequence-to-Sequence
Information Theory & Attention	Information Theory, Attention Mathematics

Every page includes:

Simple analogies — everyday comparisons to build intuition before any math
"In Plain English" — what each equation means in words
Worked examples — step-by-step calculations with real numbers
Runnable Python code — so you can verify the math yourself
FAANG-level interview questions — with expected answer depth

Reading Order for Beginners

If you're encountering these topics for the first time, we recommend reading in order (1 → 6). Each page builds on concepts from the previous one. The "Think of it like..." boxes at the start of each section give you the intuition before the math arrives.