GPT-3: Language Models are Few-Shot Learners

Authors: Tom B. Brown, Benjamin Mann, Nick Ryder, and 29 more
Year: 2020  |  Venue: NeurIPS
Link: arXiv:2005.14165


TL;DR

GPT-3 scales autoregressive Transformers to 175 billion parameters and demonstrates in-context learning: conditioning on a prompt with several input-output examples changes the model's behavior at inference time without any gradient updates. The paper maps scaling laws — performance improves predictably with model size, data, and compute — while documenting limitations such as factual errors, biases, and reasoning gaps.


Why This Paper Matters

GPT-3 is the paper that launched the API era of LLMs. It showed that a single model could perform translation, QA, code generation, arithmetic, and more — just by changing the prompt. It's also the reference point for:

  • Few-shot prompting as a production technique
  • Scaling laws and emergence debates
  • The motivation for RLHF (GPT-3 is capable but not aligned)
  • API products (OpenAI's API was built around GPT-3)

Key Concepts Explained Simply

In-Context Learning (ICL)

The model learns "what task to do" from examples provided in the prompt, not from training data. No weights are updated. The mechanism is still debated, but empirically, providing 3-5 examples dramatically improves accuracy on many tasks.

There are three regimes:

  • Zero-shot: Task description only → "Translate to French:"
  • One-shot: One example → "Hello → Bonjour. Goodbye →"
  • Few-shot: Multiple examples → 3-10 input-output pairs before the query
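These regimes are just different prompt layouts; a minimal sketch (the helper names and `->` separator are illustrative, not from the paper):

```python
# Illustrative prompt construction for the three regimes (strings only, no model call).

def zero_shot(task, query):
    # Task description only, no demonstrations.
    return f"{task}\n{query} ->"

def one_shot(task, example, query):
    # A single demonstration before the query.
    inp, out = example
    return f"{task}\n{inp} -> {out}\n{query} ->"

def few_shot(task, examples, query):
    # Several demonstrations concatenated before the query.
    demos = "\n".join(f"{i} -> {o}" for i, o in examples)
    return f"{task}\n{demos}\n{query} ->"

print(zero_shot("Translate English to French:", "Hello"))
print(one_shot("Translate English to French:", ("Hello", "Bonjour"), "Goodbye"))
print(few_shot("Translate English to French:",
               [("Hello", "Bonjour"), ("Goodbye", "Au revoir")], "Thank you"))
```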

Scaling Laws

The paper shows smooth, predictable relationships between:

  • Model size (N): More parameters → lower loss
  • Data size (D): More tokens → lower loss
  • Compute (C): More FLOPs → lower loss

These follow approximate power laws. The Chinchilla paper later revised the optimal N-vs-D trade-off.

Emergent Abilities

Some capabilities appear to "emerge" only at large scale — they're near-random at small sizes and suddenly work at large sizes. Examples include 3-digit arithmetic and analogical reasoning. This concept is controversial (some argue it's a measurement artifact).


The Math — Explained Step by Step

In-Context Learning Formulation

Given demonstration pairs \((x^{(1)}, y^{(1)}), \ldots, (x^{(k)}, y^{(k)})\) concatenated as context \(c\):

\[ P_\theta(y \mid x, c) = \prod_{t=1}^{|y|} P_\theta(y_t \mid x, c, y_{<t}) \]

Breaking it down:

  • The model treats the demonstrations + query as one long sequence
  • It predicts the answer token-by-token, conditioning on everything before it
  • No parameters change — the model uses the same weights as it would for any text completion
  • The demonstrations act as an implicit specification of the task
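A minimal sketch of this factorization, with a stand-in `toy_logprobs` (uniform over a ten-token vocabulary) in place of a real LM's next-token distribution:

```python
import math

def toy_logprobs(context):
    # Stand-in for a real LM: uniform log-probabilities over a tiny vocab.
    vocab = ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9"]
    return {tok: math.log(1.0 / len(vocab)) for tok in vocab}

def score_answer(context_tokens, answer_tokens, logprob_fn=toy_logprobs):
    # log P(y | x, c) = sum_t log P(y_t | x, c, y_<t): score the answer
    # token-by-token, appending each answer token to the context as we go.
    # The model's weights never change; only the conditioning context grows.
    total = 0.0
    ctx = list(context_tokens)
    for tok in answer_tokens:
        total += logprob_fn(ctx)[tok]
        ctx.append(tok)
    return total

lp = score_answer(["2", "+", "3", "="], ["5"])
print(f"log P(answer | prompt) = {lp:.3f}")
```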

Scaling Law (Simplified)

\[ L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N} \]

where \(N\) is the number of parameters, \(N_c\) is a constant, and \(\alpha_N \approx 0.076\). Loss decreases as a power law in model size. Similar laws hold for data and compute.
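Plugging GPT-3-family sizes into this law, with Kaplan et al.'s reported \(N_c \approx 8.8 \times 10^{13}\) taken as an approximate constant:

```python
def kaplan_loss(N, N_c=8.8e13, alpha_N=0.076):
    # L(N) = (N_c / N)^alpha_N : test loss as a power law in parameter count.
    return (N_c / N) ** alpha_N

# Loss decreases slowly but predictably as model size grows.
for N in [1.3e9, 13e9, 175e9]:
    print(f"{N/1e9:>6.1f}B params -> loss ~ {kaplan_loss(N):.3f}")
```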

FLOPs Estimate

For a Transformer with \(N\) parameters processing \(D\) tokens:

\[ C \approx 6 \cdot N \cdot D \]

The factor of 6 comes from 2 FLOPs per multiply-accumulate, times 3 forward-pass-equivalents of work (the backward pass costs roughly twice the forward pass, so forward + backward ≈ 3× forward). This is a useful rule of thumb for estimating training cost.


Python Implementation

import numpy as np


def stable_softmax(logits):
    z = logits - np.max(logits, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=-1, keepdims=True)


def few_shot_prompt(examples, query, task_description=""):
    """
    Construct a few-shot prompt from examples.

    examples: list of (input, output) tuples
    query: the input to get an answer for
    """
    parts = []
    if task_description:
        parts.append(task_description + "\n\n")
    for inp, out in examples:
        parts.append(f"Input: {inp}\nOutput: {out}\n\n")
    parts.append(f"Input: {query}\nOutput:")
    return "".join(parts)


def scaling_law_loss(N, D, A=406.4, alpha=0.34, B=410.7, beta=0.28, L_inf=1.69):
    """
    Approximate loss as a function of model params N and data tokens D,
    using the parametric form L(N, D) = A/N^alpha + B/D^beta + L_inf.
    Default constants are the Chinchilla (Hoffmann et al.) fit; the
    Kaplan et al. / GPT-3 observations follow a similar power-law form.
    """
    return A / (N ** alpha) + B / (D ** beta) + L_inf


def estimate_flops(N, D):
    """Approximate training FLOPs for a Transformer."""
    return 6 * N * D


def compute_pass_at_k(n, c, k):
    """
    pass@k metric (used in code evals, introduced later by Codex but
    relevant for evaluating GPT-3 on code-like tasks).
    n: total samples, c: correct samples, k: samples to consider
    """
    from math import comb
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def simulate_icl_accuracy(n_shots, base_accuracy=0.25, gain_per_shot=0.12,
                          saturation=0.85):
    """
    Simulated in-context learning: accuracy improves with more shots
    but saturates (diminishing returns).
    """
    acc = base_accuracy + (saturation - base_accuracy) * (1 - np.exp(-gain_per_shot * n_shots))
    return min(acc, saturation)


def compute_contamination_score(test_set, training_data_ngrams, n=13):
    """
    Simplified benchmark contamination check: what fraction of test
    examples have n-gram overlaps with training data?
    """
    contaminated = 0
    for example in test_set:
        tokens = example.split()
        for i in range(len(tokens) - n + 1):
            ngram = " ".join(tokens[i:i+n])
            if ngram in training_data_ngrams:
                contaminated += 1
                break
    return contaminated / len(test_set) if test_set else 0.0


# --- Demo ---
if __name__ == "__main__":
    prompt = few_shot_prompt(
        examples=[
            ("What is 2+3?", "5"),
            ("What is 10-4?", "6"),
            ("What is 7*3?", "21"),
        ],
        query="What is 15/3?",
        task_description="Answer the math question."
    )
    print("Few-shot prompt:")
    print(prompt)
    print()

    # Scaling law visualization
    sizes = [125e6, 350e6, 1.3e9, 6.7e9, 13e9, 175e9]
    D = 300e9
    print("Model Size → Loss:")
    for N in sizes:
        loss = scaling_law_loss(N, D)
        flops = estimate_flops(N, D)
        print(f"  {N/1e9:>6.1f}B params | Loss: {loss:.3f} | FLOPs: {flops:.2e}")

    print("\nICL accuracy vs. number of shots:")
    for k in [0, 1, 2, 4, 8, 16, 32]:
        acc = simulate_icl_accuracy(k)
        print(f"  {k:>2}-shot: {acc:.1%}")

Interview Importance

GPT-3 is the canonical reference for in-context learning, scaling, and prompt engineering. Almost every LLM interview touches on concepts from this paper.

Difficulty Level: ⭐⭐⭐ (Medium-High)


Interview Questions & Answers

Q1: What is in-context learning, and why is it NOT fine-tuning?

Answer: In-context learning (ICL) means the model changes its behavior based on examples provided in the prompt, but no gradient updates occur. The model weights are frozen. The mechanism likely works because pre-training on diverse text implicitly teaches the model to recognize and continue patterns. It's fundamentally different from fine-tuning where the weights are updated to optimize a task-specific loss.

Why it matters: ICL enables a single model to serve many tasks via API. Fine-tuning creates specialized models that need separate deployment. ICL is cheaper to try but less reliable for production use.

Q2: Discuss scaling laws at a high level. What trades off against what?

Answer: Scaling laws describe power-law relationships between model size (N), dataset size (D), compute budget (C), and test loss (L). Key insights:

  • All three (N, D, C) independently reduce loss, with diminishing returns
  • For a fixed compute budget \(C \propto N \times D\), there is an optimal allocation between N and D
  • GPT-3 favored large N with moderate D; Chinchilla later showed D should be much larger
  • These laws hold remarkably well across orders of magnitude but break down in edge cases (data quality, repeated epochs, domain shift)
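The N-vs-D trade-off can be made concrete under the Chinchilla rule of thumb of roughly 20 tokens per parameter (an assumption from later work, not from the GPT-3 paper): with \(C = 6ND\) and \(D = 20N\), a fixed budget gives \(N_{opt} \approx \sqrt{C/120}\).

```python
import math

def compute_optimal_allocation(C, tokens_per_param=20.0):
    # With C = 6*N*D and a Chinchilla-style ratio D = k*N,
    # C = 6*k*N^2  =>  N_opt = sqrt(C / (6*k)), D_opt = k * N_opt.
    N_opt = math.sqrt(C / (6.0 * tokens_per_param))
    return N_opt, tokens_per_param * N_opt

# GPT-3's actual budget: ~3.15e23 FLOPs (175B params x 300B tokens).
N_opt, D_opt = compute_optimal_allocation(3.15e23)
print(f"N_opt ~ {N_opt/1e9:.0f}B params, D_opt ~ {D_opt/1e9:.0f}B tokens")
```

Under this heuristic, GPT-3's compute budget would have favored a much smaller model trained on far more tokens, which is exactly Chinchilla's revision.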

Q3: How would you detect benchmark leakage or memorization?

Answer: Several approaches:

  1. N-gram overlap analysis: Check whether test examples (or long n-grams from them) appear verbatim in the training data
  2. Canary strings: Insert unique strings into test sets and check whether the model can complete them
  3. Held-out decontamination: Train on data explicitly filtered to exclude test-set-similar text, then compare performance
  4. Membership inference: Test whether the model assigns unusually high probability to test examples versus paraphrased versions
  5. Performance gap analysis: If a model performs suspiciously well on a public benchmark but poorly on a similar private one, contamination is likely
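A toy run of the n-gram overlap approach, using a hypothetical `build_ngram_set` helper and a 3-gram window so the overlap is easy to see (GPT-3's actual decontamination used 13-grams):

```python
def build_ngram_set(documents, n=13):
    # Collect every n-gram (as a space-joined string) from the training corpus.
    ngrams = set()
    for doc in documents:
        tokens = doc.split()
        for i in range(len(tokens) - n + 1):
            ngrams.add(" ".join(tokens[i:i + n]))
    return ngrams

train = ["the quick brown fox jumps over the lazy dog"]
grams = build_ngram_set(train, n=3)
test_set = ["the quick brown fox", "a completely novel sentence here"]

# An example counts as contaminated if any of its 3-grams appears in training.
hits = sum(1 for ex in test_set
           if any(" ".join(ex.split()[i:i + 3]) in grams
                  for i in range(len(ex.split()) - 2)))
print(f"contaminated fraction: {hits / len(test_set):.2f}")
```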

Q4: What are the limitations of GPT-3 that motivated RLHF?

Answer: GPT-3 optimizes the language modeling loss (predict the next token), not user utility. This means:

  • It may generate factually incorrect but fluent text (hallucinations)
  • It may produce harmful, biased, or toxic content (training data reflects the internet)
  • It doesn't inherently follow instructions — it predicts what text follows, not what the user wants
  • It can be verbose or evasive when a direct answer would be better

RLHF addresses this by fine-tuning the model to maximize a reward signal trained on human preferences.

Q5: How many parameters does GPT-3 have, and what's the rough FLOPs cost?

Answer: GPT-3 has 175B parameters, trained on approximately 300B tokens. Using the \(C \approx 6ND\) rule:

\(C \approx 6 \times 175 \times 10^9 \times 300 \times 10^9 = 3.15 \times 10^{23}\) FLOPs

At the time, this cost millions of dollars in compute. For context, at a single A100's ~312 TFLOPS peak, \(3.15 \times 10^{23}\) FLOPs works out to roughly 32 GPU-years at perfect utilization (realistically several times longer).
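The arithmetic can be checked in a few lines (peak A100 BF16 throughput taken as ~312 TFLOPS is an assumption about the hardware; real utilization is far lower):

```python
N = 175e9          # parameters
D = 300e9          # training tokens
C = 6 * N * D      # ~3.15e23 FLOPs via the C = 6*N*D rule of thumb

peak_flops = 312e12                     # assumed A100 peak BF16 FLOP/s
seconds = C / peak_flops
gpu_years = seconds / (365 * 24 * 3600)
print(f"C = {C:.2e} FLOPs  ->  ~{gpu_years:.0f} A100-years at peak throughput")
```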


Connections to Other Papers

  • GPT-2 → GPT-3 scales the same architecture 100× and adds few-shot ICL
  • InstructGPT → Aligns GPT-3 with RLHF to improve helpfulness/safety
  • Chinchilla → Revises GPT-3's scaling strategy (more data, smaller model)
  • PaLM → Google's 540B model, similar scaling with different infrastructure
  • Chain-of-Thought → Extends ICL with reasoning traces for complex tasks

Key Takeaways for Quick Review

Concept       | Remember
------------- | ----------------------------------------------------
Model size    | 175B parameters
Key insight   | In-context learning without gradient updates
Training data | ~300B tokens from filtered web text
Scaling law   | Loss ∝ power law in N, D, C
FLOPs rule    | C ≈ 6 × N × D
Limitation    | Optimizes LM loss, not user utility → motivated RLHF
Prompt types  | Zero-shot, one-shot, few-shot