GPT-2: Language Models are Unsupervised Multitask Learners

Authors: Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever
Year: 2019  |  Venue: Technical report (OpenAI)
Link: OpenAI Blog


TL;DR

GPT-2 shows that a single Transformer language model trained on WebText (40GB of text scraped from outbound Reddit links with at least 3 karma) can transfer to many tasks without fine-tuning, simply via prompting: formatting the input so the desired output is the natural continuation. The paper scales models from 117M to 1.5B parameters using byte-level BPE tokenization, demonstrating that zero-shot task transfer emerges from language modeling alone.


Why This Paper Matters

GPT-2 is the bridge between "language models just predict the next word" and "language models can do tasks." It popularized:

  1. Decoder-only architectures as the default for generative LMs
  2. BPE tokenization (byte-level) as the standard vocabulary approach
  3. Prompting as a task interface — no task-specific heads needed
  4. The insight that scale + data can replace supervised training for many tasks

Every chat model you use today (ChatGPT, Claude, Gemini) descends from this lineage.


Key Concepts Explained Simply

Autoregressive Language Modeling

The model predicts one token at a time, left to right. At each step, it only sees previous tokens — never future ones. This is enforced by causal masking in the attention layers.

Think of it as completing a sentence: "The weather today is..." → the model assigns probabilities to every possible next token ("sunny", "rainy", "cold", etc.) and samples one.

Zero-Shot Task Transfer

Instead of fine-tuning on labeled data, you describe the task in natural language as a prompt. The model's next-token prediction then "solves" the task:

  • Translation: "Translate to French: Hello → Bonjour. Translate to French: Goodbye →"
  • Summarization: "Article: [long text]. TL;DR:"
  • QA: "Question: What is the capital of France? Answer:"

The key insight: if the training data contains examples of these patterns, the model learns them implicitly.
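The uniform interface can be sketched in a few lines. `lm_continue` below is a hypothetical stand-in for a real model's generation call, not part of any GPT-2 API:

```python
# Hypothetical sketch: every task is phrased so the answer is a continuation.
# `lm_continue` stands in for a real language model's generation call.
def lm_continue(prompt: str) -> str:
    return "<model continuation>"  # placeholder; a real LM would decode tokens here

prompts = {
    "translation": "Translate to French: Hello → Bonjour. Translate to French: Goodbye →",
    "summarization": "Article: [long text]. TL;DR:",
    "qa": "Question: What is the capital of France? Answer:",
}

for task, prompt in prompts.items():
    print(f"{task}: {lm_continue(prompt)!r}")
```

The point is that there is one model and one call for every task; only the prompt template changes.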

Byte-Level BPE

Traditional word-level vocabularies can't handle rare words, misspellings, or new terms. BPE builds a vocabulary of subword units by iteratively merging the most frequent byte pairs. Byte-level BPE starts from individual bytes (256 possible), so it can represent any text — no unknown tokens ever.
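A toy version of the merge loop, using characters instead of raw bytes for readability (a sketch of the idea, not the actual GPT-2 tokenizer):

```python
from collections import Counter

def most_frequent_pair(tokens):
    # Count all adjacent pairs and return the most common one
    return Counter(zip(tokens, tokens[1:])).most_common(1)[0][0]

def merge_pair(tokens, pair):
    # Replace every occurrence of `pair` with a single merged token
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

# Start from individual units (raw bytes in real byte-level BPE)
tokens = list("low lower lowest")
for _ in range(3):
    tokens = merge_pair(tokens, most_frequent_pair(tokens))
print(tokens)
```

After a few merges, frequent substrings like "low" become single tokens. Note that merges can absorb the preceding space, which is why real GPT-2 tokens frequently begin with one.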


The Math — Explained Step by Step

Autoregressive Factorization

\[ P_\theta(x_1, x_2, \ldots, x_n) = \prod_{i=1}^{n} P_\theta(x_i \mid x_1, x_2, \ldots, x_{i-1}) \]

Breaking it down:

The joint probability of a sequence is decomposed into a product of conditional probabilities. Each token's probability depends on all previous tokens. This is exact — no approximation or independence assumption.
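A quick numeric check of the factorization, with made-up conditional probabilities for a three-token sequence:

```python
import numpy as np

# Made-up conditionals: P(x1) = 0.5, P(x2 | x1) = 0.2, P(x3 | x1, x2) = 0.1
cond = np.array([0.5, 0.2, 0.1])

joint = np.prod(cond)              # product of conditionals: 0.01
log_joint = np.sum(np.log(cond))   # equivalently, a sum of log-probs

assert np.isclose(joint, np.exp(log_joint))
print(joint)
```

Working in log space is what turns the training objective into a sum over positions rather than a product.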

Training Objective (Negative Log-Likelihood)

\[ \mathcal{L} = -\frac{1}{n} \sum_{i=1}^{n} \log P_\theta(x_i \mid x_{<i}) \]

Minimize the negative log-likelihood of the training data. Lower loss = the model assigns higher probability to the actual next token in the training corpus.

Perplexity

\[ \text{PPL} = \exp\left(-\frac{1}{n} \sum_{i=1}^{n} \log P_\theta(x_i \mid x_{<i})\right) = \exp(\mathcal{L}) \]

Perplexity is the exponentiated loss — interpretable as "how many tokens is the model choosing between on average." A perplexity of 20 means the model is as uncertain as if it were choosing uniformly among 20 tokens at each step.
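A sanity check on that interpretation (illustrative numbers, not model output): a model that is uniform over V tokens should have perplexity exactly V.

```python
import numpy as np

V, n = 20, 50                              # vocab size, sequence length
log_probs = np.full(n, np.log(1.0 / V))    # log P(x_i | x_<i) under a uniform model
nll = -np.mean(log_probs)                  # per-token negative log-likelihood
ppl = np.exp(nll)                          # perplexity = exp(NLL)
print(round(ppl, 6))  # 20.0
```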


Python Implementation

import numpy as np


def stable_softmax(logits):
    z = logits - np.max(logits, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=-1, keepdims=True)


def causal_mask(seq_len):
    """Lower-triangular mask: position i can attend to positions 0..i."""
    return np.tril(np.ones((seq_len, seq_len)))


def causal_self_attention(x, W_q, W_k, W_v, W_o):
    """Single-head causal self-attention (GPT-style)."""
    seq_len, d = x.shape
    Q, K, V = x @ W_q, x @ W_k, x @ W_v

    scores = (Q @ K.T) / np.sqrt(d)
    mask = causal_mask(seq_len)
    scores = np.where(mask == 0, -1e9, scores)

    attn = stable_softmax(scores)
    return (attn @ V) @ W_o


def gpt2_block(x, W_q, W_k, W_v, W_o, W1, b1, W2, b2, g1, g2):
    """Simplified GPT-2 block: LN -> Attention -> Residual -> LN -> FFN -> Residual."""
    # Pre-norm attention (GPT-2 uses pre-LN, unlike the original Transformer)
    h = layer_norm(x, g1)
    h = causal_self_attention(h, W_q, W_k, W_v, W_o)
    x = x + h

    # Pre-norm FFN with GELU
    h = layer_norm(x, g2)
    h = gelu(h @ W1 + b1) @ W2 + b2
    x = x + h
    return x


def layer_norm(x, gamma, eps=1e-6):
    mean = np.mean(x, axis=-1, keepdims=True)
    var = np.var(x, axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps)


def gelu(x):
    """Gaussian Error Linear Unit — smooth ReLU used in GPT-2."""
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))


def compute_nll(logits, target_ids):
    """
    Compute per-token negative log-likelihood.
    logits: [seq_len, vocab_size] — model output logits
    target_ids: [seq_len] — next token at each position
    """
    probs = stable_softmax(logits)
    nll = 0.0
    for t in range(len(target_ids)):
        nll -= np.log(probs[t, target_ids[t]] + 1e-12)
    return nll / len(target_ids)


def perplexity(logits, target_ids):
    return np.exp(compute_nll(logits, target_ids))


def top_k_sampling(logits, k=10, temperature=1.0):
    """Sample from the top-k logits with temperature."""
    logits = logits / temperature
    top_indices = np.argsort(logits)[-k:]
    top_logits = logits[top_indices]
    probs = stable_softmax(top_logits)
    chosen = np.random.choice(top_indices, p=probs)
    return chosen


def nucleus_sampling(logits, p=0.9, temperature=1.0):
    """Top-p (nucleus) sampling."""
    logits = logits / temperature
    probs = stable_softmax(logits)
    sorted_idx = np.argsort(-probs)
    sorted_probs = probs[sorted_idx]
    cumsum = np.cumsum(sorted_probs)

    cutoff = np.searchsorted(cumsum, p) + 1
    allowed = sorted_idx[:cutoff]
    allowed_probs = probs[allowed]
    allowed_probs = allowed_probs / allowed_probs.sum()

    return np.random.choice(allowed, p=allowed_probs)


# --- Demo ---
if __name__ == "__main__":
    np.random.seed(42)
    seq_len, vocab_size = 8, 100

    fake_logits = np.random.randn(seq_len, vocab_size)
    target_ids = np.random.randint(0, vocab_size, seq_len)

    nll = compute_nll(fake_logits, target_ids)
    ppl = perplexity(fake_logits, target_ids)
    print(f"NLL: {nll:.4f}")
    print(f"Perplexity: {ppl:.2f}")

    last_logits = fake_logits[-1]
    print(f"\nTop-k sample: {top_k_sampling(last_logits, k=5)}")
    print(f"Nucleus sample: {nucleus_sampling(last_logits, p=0.9)}")

Interview Importance

GPT-2 is essential for understanding decoder-only LMs, prompting, and sampling strategies. Most interview questions about language model fundamentals trace to concepts this paper introduced.

Difficulty Level: ⭐⭐ (Medium)


Interview Questions & Answers

Q1: Define zero-shot vs. few-shot in the GPT family.

Answer:

  • Zero-shot: The model receives only a task description/prompt with no examples, e.g. "Translate English to French: Hello →"
  • Few-shot: The model receives a few input-output examples in the prompt before the query. GPT-2 was evaluated primarily zero-shot; GPT-3 later demonstrated strong few-shot ability by including examples in the context.
  • Key difference from fine-tuning: No gradient updates happen. The model uses the prompt examples purely as context for its next-token predictions.

Q2: Why does byte-level BPE help with rare words and multiple languages?

Answer: Byte-level BPE starts from raw bytes (256 base units) rather than Unicode characters or words. Any text can be represented — no [UNK] tokens. For rare words like "pneumonoultramicroscopicsilicovolcanoconiosis," BPE breaks them into known subword pieces. For non-Latin scripts (Chinese, Arabic, emoji), byte-level encoding handles them without language-specific preprocessing.

Q3: Name a failure mode of relying on zero-shot transfer for production.

Answer: Zero-shot performance is unreliable and unpredictable. Small prompt changes can dramatically shift output quality. The model may:

  • Generate plausible-sounding but factually wrong text
  • Fail to follow formatting instructions consistently
  • Produce biased or harmful content (no alignment training)
  • Perform well on benchmarks but poorly on domain-specific tasks

This is why production systems use fine-tuning (SFT), RLHF, or RAG rather than pure zero-shot prompting.

Q4: What is the difference between GPT-2's pre-LN and the original Transformer's post-LN?

Answer: The original Transformer applies LayerNorm after the residual connection: LN(x + sublayer(x)). GPT-2 applies LayerNorm before the sublayer: x + sublayer(LN(x)). Pre-LN is more stable during training because gradients flow through the residual path without being modified by normalization. This allowed training deeper models without careful learning rate warmup.
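The two placements can be contrasted in a few lines (a sketch assuming a parameter-free layer norm, not GPT-2's actual modules):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def post_ln(x, sublayer):
    # Original Transformer: normalize AFTER the residual add
    return layer_norm(x + sublayer(x))

def pre_ln(x, sublayer):
    # GPT-2: normalize BEFORE the sublayer; the residual path is untouched
    return x + sublayer(layer_norm(x))

x = np.random.randn(4, 8)
zero_sublayer = lambda h: 0.0 * h   # a sublayer that contributes nothing

# With a zero sublayer, pre-LN returns x unchanged (clean residual path),
# while post-LN still normalizes x.
print(np.allclose(pre_ln(x, zero_sublayer), x))    # True
print(np.allclose(post_ln(x, zero_sublayer), x))   # False (almost surely)
```

The identity behavior of the pre-LN residual path is exactly what makes gradients flow cleanly through deep stacks.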

Q5: Explain temperature, top-k, and top-p sampling.

Answer:

  • Temperature \(\tau\): Divides logits by \(\tau\) before the softmax. \(\tau < 1\) sharpens the distribution (more deterministic); \(\tau > 1\) flattens it (more random).
  • Top-k: Only consider the \(k\) highest-probability tokens, renormalize, then sample. Prevents sampling from the long tail of unlikely tokens.
  • Top-p (nucleus): Keep the smallest set of tokens whose cumulative probability exceeds \(p\). Adapts the number of candidates dynamically: for peaked distributions only a few tokens are considered; for flat distributions, many are.


Connections to Other Papers

  • Transformer → GPT-2 uses the decoder stack with causal masking
  • GPT-3 → Scales GPT-2 to 175B, adds few-shot in-context learning
  • InstructGPT → Aligns GPT-3 with human preferences via RLHF
  • BERT → Contrasting approach: bidirectional encoder, MLM objective
  • LLaMA → Modern open-weight GPT-style model with RMSNorm, SwiGLU, RoPE

Key Takeaways for Quick Review

Concept        | Remember
---------------|--------------------------------------------------
Architecture   | Decoder-only Transformer with causal masking
Pre-training   | Next-token prediction (autoregressive NLL)
Key innovation | Zero-shot task transfer via prompting
Tokenizer      | Byte-level BPE (no unknown tokens)
Norm placement | Pre-LN (before sublayer, not after)
Activation     | GELU (smooth ReLU)
Sizes          | 117M, 345M, 762M, 1.5B parameters