GPT-2: Language Models are Unsupervised Multitask Learners¶
Authors: Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever
Year: 2019 | Venue: Technical report (OpenAI)
Link: OpenAI Blog
TL;DR¶
GPT-2 shows that a single Transformer language model trained on WebText (40GB of web text scraped from outbound Reddit links with at least 3 karma) can transfer to many tasks without fine-tuning, simply via prompting — formatting the input so the desired output is its natural continuation. The paper scales from 117M to 1.5B parameters using byte-level BPE tokenization, demonstrating that zero-shot task transfer emerges from language modeling alone.
Why This Paper Matters¶
GPT-2 is the bridge between "language models just predict the next word" and "language models can do tasks." It popularized:
- Decoder-only architectures as the default for generative LMs
- BPE tokenization (byte-level) as the standard vocabulary approach
- Prompting as a task interface — no task-specific heads needed
- The insight that scale + data can replace supervised training for many tasks
Every chat model you use today (ChatGPT, Claude, Gemini) descends from this lineage.
Key Concepts Explained Simply¶
Autoregressive Language Modeling¶
The model predicts one token at a time, left to right. At each step, it only sees previous tokens — never future ones. This is enforced by causal masking in the attention layers.
Think of it as completing a sentence: "The weather today is..." → the model assigns probabilities to every possible next token ("sunny", "rainy", "cold", etc.) and samples one.
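The sentence-completion loop above can be sketched in a few lines. This is a toy, not a trained model: `next_token_probs` is a hypothetical stand-in that returns a fake distribution over the vocabulary, so only the decoding loop itself is meaningful.

```python
import numpy as np

def next_token_probs(context_ids, vocab_size=100):
    """Stand-in for a trained LM: returns a (fake) distribution over the vocab."""
    rng = np.random.default_rng(len(context_ids))
    logits = rng.standard_normal(vocab_size)
    e = np.exp(logits - logits.max())
    return e / e.sum()

def generate(prompt_ids, n_new, vocab_size=100):
    """Autoregressive decoding: predict a distribution, sample one token, append, repeat."""
    ids = list(prompt_ids)
    rng = np.random.default_rng(0)
    for _ in range(n_new):
        probs = next_token_probs(ids, vocab_size)
        ids.append(int(rng.choice(vocab_size, p=probs)))
    return ids

out = generate([1, 2, 3], n_new=5)
print(out)  # 8 token ids: the 3-token prompt plus 5 sampled continuations
```

Note that each sampled token is fed back as context for the next prediction — that feedback loop is what "autoregressive" means.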
Zero-Shot Task Transfer¶
Instead of fine-tuning on labeled data, you describe the task in natural language as a prompt. The model's next-token prediction then "solves" the task:
- Translation: "Translate to French: Hello → Bonjour. Translate to French: Goodbye →"
- Summarization: "Article: [long text]. TL;DR:"
- QA: "Question: What is the capital of France? Answer:"
The key insight: if the training data contains examples of these patterns, the model learns them implicitly.
Byte-Level BPE¶
Traditional word-level vocabularies can't handle rare words, misspellings, or new terms. BPE builds a vocabulary of subword units by iteratively merging the most frequent byte pairs. Byte-level BPE starts from individual bytes (256 possible), so it can represent any text — no unknown tokens ever.
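The merge step can be illustrated with a toy example. This is a minimal sketch of the BPE training idea, not GPT-2's actual tokenizer: it starts from single characters (stand-ins for bytes) and performs two merges of the most frequent adjacent pair.

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Count adjacent pairs and return the most frequent one."""
    return Counter(zip(tokens, tokens[1:])).most_common(1)[0][0]

def merge_pair(tokens, pair):
    """Replace every occurrence of `pair` with a single merged token."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

tokens = list("low lower lowest")
for _ in range(2):
    tokens = merge_pair(tokens, most_frequent_pair(tokens))
print(tokens)  # "l"+"o" then "lo"+"w" have been merged into a "low" subword
```

After two merges the shared stem "low" becomes a single vocabulary unit, which is exactly how BPE discovers subwords shared across related words.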
The Math — Explained Step by Step¶
Autoregressive Factorization¶
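The formula being described (restored here in standard notation, with \(x_1, \dots, x_N\) the token sequence) is:

\[
p_\theta(x) = \prod_{t=1}^{N} p_\theta(x_t \mid x_1, \dots, x_{t-1})
\]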
The joint probability of a sequence decomposes into a product of conditional probabilities, each token conditioned on all previous tokens. This factorization is exact — no approximation or independence assumption is made.
Training Objective (Negative Log-Likelihood)¶
Minimize the negative log-likelihood of the training data. Lower loss = the model assigns higher probability to the actual next token in the training corpus.
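In symbols, the per-token objective averaged over \(N\) positions is:

\[
\mathcal{L} = -\frac{1}{N} \sum_{t=1}^{N} \log p_\theta(x_t \mid x_{<t})
\]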
Perplexity¶
Perplexity is the exponentiated loss — interpretable as "how many tokens is the model choosing between on average." A perplexity of 20 means the model is as uncertain as if it were choosing uniformly among 20 tokens at each step.
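Formally, perplexity is the exponential of the average negative log-likelihood:

\[
\mathrm{PPL} = \exp\!\left(-\frac{1}{N} \sum_{t=1}^{N} \log p_\theta(x_t \mid x_{<t})\right)
\]

A model choosing uniformly among \(k\) tokens at every step has \(\mathrm{PPL} = k\), which is what makes the "effective branching factor" reading valid.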
Python Implementation¶
```python
import numpy as np


def stable_softmax(logits):
    """Numerically stable softmax along the last axis."""
    z = logits - np.max(logits, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=-1, keepdims=True)


def causal_mask(seq_len):
    """Lower-triangular mask: position i can attend to positions 0..i."""
    return np.tril(np.ones((seq_len, seq_len)))


def layer_norm(x, gamma, eps=1e-6):
    """LayerNorm with a learned scale (bias omitted for brevity)."""
    mean = np.mean(x, axis=-1, keepdims=True)
    var = np.var(x, axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps)


def gelu(x):
    """Gaussian Error Linear Unit — smooth ReLU used in GPT-2 (tanh approximation)."""
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))


def causal_self_attention(x, W_q, W_k, W_v, W_o):
    """Single-head causal self-attention (GPT-style)."""
    seq_len, d = x.shape
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    scores = (Q @ K.T) / np.sqrt(d)
    mask = causal_mask(seq_len)
    scores = np.where(mask == 0, -1e9, scores)  # block attention to future positions
    attn = stable_softmax(scores)
    return (attn @ V) @ W_o


def gpt2_block(x, W_q, W_k, W_v, W_o, W1, b1, W2, b2, g1, g2):
    """Simplified GPT-2 block: LN -> Attention -> Residual -> LN -> FFN -> Residual."""
    # Pre-norm attention (GPT-2 uses pre-LN, unlike the original Transformer)
    h = layer_norm(x, g1)
    h = causal_self_attention(h, W_q, W_k, W_v, W_o)
    x = x + h
    # Pre-norm FFN with GELU
    h = layer_norm(x, g2)
    h = gelu(h @ W1 + b1) @ W2 + b2
    x = x + h
    return x


def compute_nll(logits, target_ids):
    """
    Compute per-token negative log-likelihood.

    logits: [seq_len, vocab_size] — model output logits
    target_ids: [seq_len] — next token at each position
    """
    probs = stable_softmax(logits)
    nll = 0.0
    for t in range(len(target_ids)):
        nll -= np.log(probs[t, target_ids[t]] + 1e-12)
    return nll / len(target_ids)


def perplexity(logits, target_ids):
    return np.exp(compute_nll(logits, target_ids))


def top_k_sampling(logits, k=10, temperature=1.0):
    """Sample from the top-k logits with temperature."""
    logits = logits / temperature
    top_indices = np.argsort(logits)[-k:]
    top_logits = logits[top_indices]
    probs = stable_softmax(top_logits)
    return np.random.choice(top_indices, p=probs)


def nucleus_sampling(logits, p=0.9, temperature=1.0):
    """Top-p (nucleus) sampling."""
    logits = logits / temperature
    probs = stable_softmax(logits)
    sorted_idx = np.argsort(-probs)
    cumsum = np.cumsum(probs[sorted_idx])
    cutoff = np.searchsorted(cumsum, p) + 1
    allowed = sorted_idx[:cutoff]
    allowed_probs = probs[allowed] / probs[allowed].sum()
    return np.random.choice(allowed, p=allowed_probs)


# --- Demo ---
if __name__ == "__main__":
    np.random.seed(42)
    seq_len, vocab_size = 8, 100
    fake_logits = np.random.randn(seq_len, vocab_size)
    target_ids = np.random.randint(0, vocab_size, seq_len)

    nll = compute_nll(fake_logits, target_ids)
    ppl = perplexity(fake_logits, target_ids)
    print(f"NLL: {nll:.4f}")
    print(f"Perplexity: {ppl:.2f}")

    last_logits = fake_logits[-1]
    print(f"\nTop-k sample: {top_k_sampling(last_logits, k=5)}")
    print(f"Nucleus sample: {nucleus_sampling(last_logits, p=0.9)}")
```
Interview Importance¶
GPT-2 is essential for understanding decoder-only LMs, prompting, and sampling strategies. Most interview questions about language model fundamentals trace to concepts this paper introduced.
Difficulty Level: ⭐⭐ (Medium)¶
Interview Questions & Answers¶
Q1: Define zero-shot vs. few-shot in the GPT family.¶
Answer:
- Zero-shot: The model receives only a task description/prompt with no examples, e.g. "Translate English to French: Hello →".
- Few-shot: The model receives a few input-output examples in the prompt before the query.
- GPT-2 is primarily a zero-shot model (it was evaluated zero-shot); GPT-3 later demonstrated strong few-shot ability by including examples in the context.
- Key difference from fine-tuning: no gradient updates happen. The model uses the prompt examples purely as context for its next-token predictions.
Q2: Why does byte-level BPE help with rare words and multiple languages?¶
Answer: Byte-level BPE starts from raw bytes (256 base units) rather than Unicode characters or words. Any text can be represented — no [UNK] tokens. For rare words like "pneumonoultramicroscopicsilicovolcanoconiosis," BPE breaks them into known subword pieces. For non-Latin scripts (Chinese, Arabic, emoji), byte-level encoding handles them without language-specific preprocessing.
Q3: Name a failure mode of relying on zero-shot transfer for production.¶
Answer: Zero-shot performance is unreliable and unpredictable. Small prompt changes can dramatically shift output quality. The model may:
- Generate plausible-sounding but factually wrong text
- Fail to follow formatting instructions consistently
- Produce biased or harmful content (no alignment training)
- Perform well on benchmarks but poorly on domain-specific tasks
This is why production systems use fine-tuning (SFT), RLHF, or RAG rather than pure zero-shot prompting.
Q4: What is the difference between GPT-2's pre-LN and the original Transformer's post-LN?¶
Answer: The original Transformer applies LayerNorm after the residual connection: LN(x + sublayer(x)). GPT-2 applies LayerNorm before the sublayer: x + sublayer(LN(x)). Pre-LN is more stable during training because gradients flow through the residual path without being modified by normalization. This allowed training deeper models without careful learning rate warmup.
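The two orderings can be contrasted in a minimal sketch. `ln` and `sublayer` are simplified stand-ins (no learned parameters, `tanh` in place of attention/FFN), so only the order of operations is the point here.

```python
import numpy as np

def ln(x, eps=1e-6):
    """Stand-in LayerNorm (no learned scale/shift)."""
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def sublayer(x):
    """Stand-in for attention or FFN."""
    return np.tanh(x)

def post_ln_block(x):
    # Original Transformer: normalize AFTER the residual add.
    return ln(x + sublayer(x))

def pre_ln_block(x):
    # GPT-2: normalize BEFORE the sublayer; the residual path stays unnormalized.
    return x + sublayer(ln(x))

x = np.random.default_rng(0).standard_normal((4, 8))
print(post_ln_block(x).shape, pre_ln_block(x).shape)  # (4, 8) (4, 8)
```

In the pre-LN version the identity path `x + ...` is never passed through a norm, which is why gradients flow through it unchanged as the answer above describes.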
Q5: Explain temperature, top-k, and top-p sampling.¶
Answer:
- Temperature \(\tau\): Divides logits by \(\tau\) before softmax. \(\tau < 1\) sharpens the distribution (more deterministic); \(\tau > 1\) flattens it (more random).
- Top-k: Only consider the \(k\) highest-probability tokens, renormalize, then sample. Prevents sampling from the long tail of unlikely tokens.
- Top-p (nucleus): Keep the smallest set of tokens whose cumulative probability exceeds \(p\). Adapts the number of candidates dynamically — for peaked distributions only a few tokens are considered; for flat distributions, many are.
Connections to Other Papers¶
- Transformer → GPT-2 uses the decoder stack with causal masking
- GPT-3 → Scales GPT-2 to 175B, adds few-shot in-context learning
- InstructGPT → Aligns GPT-3 with human preferences via RLHF
- BERT → Contrasting approach: bidirectional encoder, MLM objective
- LLaMA → Modern open-weight GPT-style model with RMSNorm, SwiGLU, RoPE
Key Takeaways for Quick Review¶
| Concept | Remember |
|---|---|
| Architecture | Decoder-only Transformer with causal masking |
| Pre-training | Next-token prediction (autoregressive NLL) |
| Key innovation | Zero-shot task transfer via prompting |
| Tokenizer | Byte-level BPE (no unknown tokens) |
| Norm placement | Pre-LN (before sublayer, not after) |
| Activation | GELU (smooth ReLU) |
| Sizes | 117M, 345M, 762M, 1.5B parameters |