Emergent Capabilities and Scaling Laws¶

Why This Matters for LLMs¶

Scaling laws govern the economics and planning of LLM development. They relate test loss (and, indirectly, downstream capability) to model size, dataset size, and training compute, which lets teams answer budget questions: given a fixed GPU cluster for six months, should you train a larger model for fewer steps or a smaller model on more tokens? The Chinchilla work showed that many early recipes were compute-suboptimal—oversized models trained on too few tokens—reshaping how frontier labs allocate data versus parameters.

Emergent capabilities—abilities that appear weak or absent below a threshold scale, then rise sharply—are among the most debated phenomena in modern AI. Practitioners care because roadmaps and safety assumptions often hinge on whether skills like multi-step arithmetic, in-context learning, or tool use smoothly improve with scale or jump at critical points. Interviewers frequently probe whether “emergence” is partly a measurement artifact (discrete metrics, thresholded scoring) versus a genuine qualitative change in internal representations.

Understanding test-time compute scaling completes the picture: systems like long chain-of-thought reasoning or search-with-verifier decouple thinking time from parameter count. In product terms, sometimes it is cheaper to run a smaller model with more inference-time search than to train a vastly larger model—especially when latency budgets and hardware footprints are fixed. Together, scaling laws, emergence, and test-time scaling frame research strategy, hiring narratives, and how you communicate trade-offs to leadership.

Core Concepts¶

Scaling Laws (Kaplan et al., 2020)¶

Empirical studies found that language modeling loss on held-out data often follows power-law trends when varying compute \(C\), model size \(N\), or dataset size \(D\) (each in isolation or along optimal slices). A simplified schematic relationship for loss as a function of non-embedding parameter count is:

\[ L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N} + L_\infty \]

where \(\alpha_N\) is a positive exponent (empirically on the order of \(10^{-2}\)), \(N_c\) is a constant from the fit, and \(L_\infty\) is an irreducible floor from Bayes error and noise.

In Plain English

Think of \(L(N)\) as “how surprised the model is on average” when you grow width/depth. Bigger \(N\) means more knobs to fit patterns, so average loss drops in a predictable straight line on log-log plots—until you hit the floor \(L_\infty\) where more parameters cannot squeeze more signal from the data.

Similarly, for dataset size \(D\) in tokens:

\[ L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D} + L_\infty . \]

In Plain English

More tokens means more examples of spelling, syntax, facts, and reasoning patterns. The model’s loss falls as the world in the data gets richer—again, often linearly on log-log axes—until redundancy and noise dominate.

For total training compute \(C\) (forward-plus-backward FLOPs, often summarized in log space), a compact schematic is:

\[ L(C) = \left(\frac{C_c}{C}\right)^{\alpha_C} + L_\infty . \]

In Plain English

Compute bundles “how big” and “how long you train.” If you move along an optimal frontier, spending more FLOPs buys lower loss in a stable way—if optimization and data quality keep up. The law is descriptive, not a guarantee your stack implements the frontier efficiently.

Implication: loss is partially predictable from budget, which supports capacity planning—but not alignment, truthfulness, or safety, which can diverge from raw loss.

Worked Example: Reading a Power-Law Fit

Suppose a lab fits \(\log(L - L_\infty) = b - \alpha_C \log C\) on several runs. They observe \(\alpha_C \approx 0.05\) on a log-log plot over a decade of compute. If \(L - L_\infty\) drops from \(0.40\) to \(0.32\) when compute increases from \(10^{23}\) to \(10^{24}\) FLOPs (holding data/model mix near optimal), check consistency: a multiplicative increase of \(10\times\) in \(C\) should shrink \((L-L_\infty)\) by roughly \(10^{-\alpha_C} \approx 10^{-0.05} \approx 0.89\). Starting from \(0.40\), \(0.40 \times 0.89 \approx 0.356\), not \(0.32\)—so either the fit region is non-stationary, the frontier is not optimal, or \(L_\infty\) shifted. This is how practitioners sanity-check scaling extrapolations before committing to a 10× spend.

Chinchilla Scaling (Hoffmann et al., 2022)¶

Chinchilla argues that for a fixed compute budget, many prior models were undertrained: too many parameters relative to tokens seen. A widely cited rule of thumb from the paper is that compute-optimal training uses on the order of ~20 tokens per parameter (not universal; depends on details and updated recipes).

Let \(N\) be parameter count and \(D\) the total training tokens. Define the data-to-parameter ratio \(r = D/N\). Chinchilla-style analysis implies many historical runs had \(r\) far below the compute-optimal frontier—e.g. GPT-3–scale comparisons often quote 175B parameters and 300B tokens, giving:

\[ r_{\text{GPT-3}} \approx \frac{300 \times 10^9}{175 \times 10^9} \approx 1.7\ \text{tokens/param}. \]

In Plain English

1.7 tokens per parameter means each weight saw only a couple of training tokens on average across the run—far fewer than the ~20× regime Chinchilla highlights for compute efficiency. Intuitively, the model is huge but underfed with gradients from diverse tokens relative to what’s optimal for that FLOP budget.

By contrast, Chinchilla 70B trained on ~1.4T tokens gives:

\[ r_{\text{Chinchilla 70B}} \approx \frac{1.4 \times 10^{12}}{70 \times 10^9} = 20 . \]

In Plain English

Here each parameter is updated in a context where the optimizer has seen many more distinct tokens per capacity unit—matching the story that smaller-but-longer-trained can beat bigger-but-shorter-trained under equal FLOPs.

A coarse forward–backward FLOP estimate ties parameters to tokens:

\[ C \approx 6 N D \]

(counting operations per token per parameter with standard Transformer training assumptions; the 6 is an architectural constant absorbed into logs in practice).

In Plain English

Doubling \(N\) or \(D\) both increase FLOPs linearly in this sketch—so there is a trade-off: spend compute on width/depth versus sequence exposure. Chinchilla rebalanced toward more \(D\) than many earlier frontier runs.

Worked Example: Comparing Two Budgets

Scenario A: \(N_A = 1.2 \times 10^{11}\) parameters, \(D_A = 2.4 \times 10^{11}\) tokens \(\Rightarrow r_A = 2\) tokens/param. Scenario B: \(N_B = 7 \times 10^{10}\) parameters, \(D_B = 1.4 \times 10^{12}\) tokens \(\Rightarrow r_B = 20\). Using \(C \approx 6ND\):
\(C_A \approx 6 \times 1.2 \times 10^{11} \times 2.4 \times 10^{11} = 1.728 \times 10^{23}\).
\(C_B \approx 6 \times 7 \times 10^{10} \times 1.4 \times 10^{12} = 5.88 \times 10^{23}\).
Scenario B uses more total FLOPs—not an apples-to-apples IsoFLOP comparison—but illustrates how higher \(r\) tracks more evidence per parameter. For equal \(C\), IsoFLOP analysis picks an interior optimum \((N^\star, D^\star)\); interview answers should mention IsoFLOP curves rather than only raw \(r\).

Emergent Capabilities¶

Emergent capabilities refer to task performance that is near zero (or random) for small models, then rises rapidly past a threshold—e.g. certain reasoning benchmarks, multi-digit arithmetic, or instruction following in some settings.

Let \(p_s(m)\) be the success probability of model size \(m\) on a binary skill benchmark. A sharp “emergent” profile might look like a sigmoid:

\[ p_s(m) = \sigma\big(\beta (\log m - \log m_0)\big) = \frac{1}{1 + e^{-\beta(\log m - \log m_0)}} . \]

In Plain English

Small changes in \(\log m\) can swing success probability from 5% to 70% if \(\beta\) is large—looks like a phase transition. In practice, some of this steepness may come from hard thresholds in scoring (exact match) or small test sets.

Debate: critics argue some emergence is an artifact of nonlinear metrics or test noise; defenders point to consistent thresholds across seeds and tasks. The responsible position: treat emergence as operationally real (sudden utility gains) while remaining skeptical of overfitted narratives.

Worked Example: Thresholds and Sample Size

Suppose on a 100-item exact-match benchmark, a small model gets 3/100 correct (3%), a medium gets 12/100 (12%), a large gets 45/100 (45%). The jump looks emergent. If you expand to 2,000 items, you might see smooth improvement: 4%, 9%, 22%, 41%—suggesting the “jump” was partially a small-\(n\) effect. Always tie emergence claims to confidence intervals and continuous auxiliary metrics (e.g. token-level edit distance).

In-Context Learning¶

In-context learning (ICL) is the phenomenon where the model adapts its predictions from examples embedded in the prompt, without weight updates. Let \(x\) be a query and \(e_1,\ldots,e_k\) be \(k\) demonstration input–output pairs concatenated into context \(z_k = [e_1;\ldots;e_k;x]\). The model implements:

\[ P(y \mid x) \approx P_\theta(y \mid z_k) . \]

In Plain English

The same weights \(\theta\) serve many tasks; the prompt selects a task by example. You are not calling optimizer.step()—you are conditioning the forward pass on a long context that acts like a soft program.

Shot regimes: zero-shot (no examples), few-shot (small \(k\)), many-shot (large \(k\) within context window \(L_{\max}\)). A token budget constraint:

\[ \underbrace{T_{\text{sys}} + T_{\text{instr}}}_{\text{fixed}} + k \cdot T_{\text{ex}} + T_x \le L_{\max} . \]

In Plain English

Every extra example steals tokens from instructions or from the user query—ICL trades breadth of demonstrations against room for the actual problem.

Chain-of-Thought Reasoning¶

Chain-of-thought (CoT) prompting asks the model to emit intermediate reasoning before the final answer. Empirically, self-generated scratch work improves multi-step math and logic at sufficient scale; tiny models may not benefit or may even degrade by rambling.

A stylized decomposition for answer correctness with CoT:

\[ P(\text{correct}) \approx P(\text{valid steps}) \times P(\text{correct final} \mid \text{valid steps}) . \]

In Plain English

CoT helps when the model can reliably produce checkable intermediate structure. If steps are gibberish, the final answer is still a guess—verifiers and self-consistency (sample many chains, majority vote) address this.

Variants: Tree-of-Thought (search over partial chains), self-consistency (vote over diverse chains), and process supervision (reward each step)—all increase effective compute at inference.

Test-Time Compute Scaling¶

The o1 / o3 paradigm emphasizes spending FLOPs during inference: repeated sampling, beam search, learned verifiers, or reinforcement-style lookahead. Given candidates \(\mathcal{A} = \{a_1,\ldots,a_N\}\) sampled from the policy LM, select:

\[ a^\star = \arg\max_{a \in \mathcal{A}} V(a \mid x) \]

for a verifier \(V\) (could be a second model, unit tests, or a reward model).

In Plain English

Instead of one greedy decode, you try multiple drafts and grade them—like submitting several proofs and submitting the one that checks out.

Best-of-\(N\) improves pass probability if individual samples are imperfect but independent enough. If single-shot success is \(p\), the probability at least one of \(N\) i.i.d. samples succeeds is:

\[ P_{\ge 1} = 1 - (1-p)^N . \]

In Plain English

Even modest \(p\) (say 0.25) grows quickly: \(N=8\) gives \(1 - 0.75^8 \approx 0.90\) if samples are independent and verification is perfect—real systems violate both assumptions.

Worked Example: Best-of-\(N\) with Imperfect Verifier

Let \(p = 0.3\) be the chance the model produces a correct solution, and let the verifier have 90% recall on correct answers and 95% precision on flagged wrong answers (numbers for illustration only). Then the probability a correct solution is accepted is \(0.3 \times 0.9 = 0.27\) per sample; wrong solutions might slip through with rate \((1-0.3) \times (1-0.95) = 0.035\) per sample if “accept” is the default—showing verification quality caps test-time gains. Increasing \(N\) helps discovery of correct chains, but cannot fix a systematically blind \(V\).

Deep Dive: IsoFLOP Frontiers and Updated Constants

IsoFLOP analysis holds total training FLOPs fixed and sweeps model/data schedules to find minimal loss. Reported exponents and token-parameter ratios change with architecture (MoE depth, attention kernels), tokenizer, and data mixture. Treat 20 tokens/param as a historical anchor from Chinchilla-scale dense models, not a universal law. Always cite measured curves on your actual stack.

Deep Dive: Emergence as Metric Geometry

Some papers argue apparent jumps arise because accuracy is a coarse function of a smooth underlying capability (e.g. continuous per-token likelihood). When a latent skill crosses a decision boundary, accuracy spikes. This does not deny utility—it reframes what is emergent (capabilities vs measured scores).

Predictability, Surprise, and Planning Horizons¶

Scaling laws give predictable average improvements in loss, yet surprise remains: new behaviors (e.g. stronger instruction following) may appear earlier on some architectures or data mixes than power-law extrapolation from tiny models would suggest. Teams should separate engineering predictability (loss down, perplexity down) from capability forecasting (passes code tests, safe refusals).

Let \(\Delta L\) be the change in validation loss from a baseline run. A useful but rough heuristic links perplexity \(\mathrm{PPL} = \exp(L)\) to relative improvement:

\[ \frac{\mathrm{PPL}_2}{\mathrm{PPL}_1} = \exp(L_2 - L_1) = \exp(\Delta L). \]

In Plain English

Cutting loss by 0.1 nats multiplies perplexity by \(e^{-0.1} \approx 0.90\)—a modest-looking loss delta can mean 10% lower average surprise per token. This is why tiny loss gaps at scale feel meaningless in logs but move human-facing quality.

Self-Consistency and Reasoning at Inference¶

Self-consistency samples multiple full answers (or chains) and aggregates by majority vote or ranking. If \(K\) chains are sampled i.i.d. with probability \(q\) each of reaching the correct final answer, the majority-vote success with odd \(K\) obeys a binomial tail—roughly, error probability decreases exponentially in \(K\) when \(q > 0.5\). When \(q < 0.5\), voting hurts, exposing the need for diverse sampling or better base policies.

\[ P(\text{majority correct} \mid K, q) = \sum_{i=\lceil K/2 \rceil}^{K} \binom{K}{i} q^i (1-q)^{K-i}. \]

In Plain English

Voting only works if individual chains are more often right than wrong. Otherwise you’re amplifying a shared mistake—diversity (temperature, prompt variants) matters as much as \(K\).

Worked Example: Majority Vote with \(K=5\), \(q=0.4\)

If each chain independently hits the correct answer with \(q = 0.4\), the majority of five is correct when 3, 4, or 5 chains are correct: [ P = \sum_{i=3}^{5} \binom{5}{i} (0.4)^i (0.6)^{5-i}. ] Compute: \(i=3\): \(10 \times 0.064 \times 0.1296 \approx 0.083\); \(i=4\): \(5 \times 0.0256 \times 0.216 \approx 0.028\); \(i=5\): \(0.01024 \approx 0.010\). Sum \(\approx 0.121\). Below single-chain \(0.4\)—majority vote degrades performance when \(q < 0.5\). Contrast with \(q=0.55\): majority success rises above \(0.55\), illustrating verifier/policy quality prerequisites.

Meta-Learning View of In-Context Learning¶

Some analyses interpret ICL as implicit optimization in the attention layers: the forward pass maps demonstrations to a task vector that steers predictions on \(x\). While informal, the picture helps explain why larger models ICL better—more layers provide a richer computational substrate for in-forward adaptation.

Code¶

The following scripts are self-contained: they use numpy and matplotlib for scaling visuals and a toy simulation for chain-of-thought vs direct answering (no API keys). Install dependencies with pip install numpy matplotlib.

"""
Scaling-law visualization and toy CoT vs direct comparison (emergent_capabilities.py).
Run: python emergent_capabilities.py
"""
from __future__ import annotations

import numpy as np
import matplotlib.pyplot as plt


def power_law_loss(n: np.ndarray, n_c: float, alpha: float, l_inf: float) -> np.ndarray:
    """L(N) = (N_c/N)^alpha + L_inf  (schematic Kaplan-style loss)."""
    return (n_c / n) ** alpha + l_inf


def chinchilla_style_flops(n: float, d_tokens: float) -> float:
    """Order-of-magnitude FLOPs estimate C ~ 6 N D (forward+backward sketch)."""
    return 6.0 * n * d_tokens


def best_of_n(p: float, n: int) -> float:
    """P(at least one success) for independent Bernoulli trials with success p."""
    return 1.0 - (1.0 - p) ** n


def toy_cot_vs_direct(
    rng: np.random.Generator,
    n_models: int = 12,
    base: float = 50.0,
) -> tuple[np.ndarray, np.ndarray, np.ndarray]:
    """
    Toy: 'capability' grows log-linearly with size; 'CoT gain' turns on past a threshold.
    Not empirical—illustrates staged gains for pedagogy.
    """
    sizes = np.logspace(1.0, 3.0, n_models)  # abstract "scale" units
    latent = np.clip(np.log(sizes / base), -2.0, 3.0)
    direct_acc = 1.0 / (1.0 + np.exp(-0.8 * (latent - 0.2)))
    cot_bonus = np.where(latent > 0.5, 0.25 * (latent - 0.5), 0.0)
    noise_d = rng.normal(0, 0.02, size=direct_acc.shape)
    noise_c = rng.normal(0, 0.015, size=cot_bonus.shape)
    return sizes, np.clip(direct_acc + noise_d, 0.0, 1.0), np.clip(
        direct_acc + cot_bonus + noise_c, 0.0, 1.0
    )


def main() -> None:
    rng = np.random.default_rng(0)

    # --- Plot 1: schematic loss vs parameters ---
    n_grid = np.logspace(8, 11, 200)
    loss = power_law_loss(n_grid, n_c=1e10, alpha=0.076, l_inf=1.5)
    plt.figure(figsize=(8, 5))
    plt.loglog(n_grid, loss - 1.5, label=r"$L - L_\infty$ (schematic)")
    plt.xlabel("Parameters N")
    plt.ylabel("Excess loss")
    plt.title("Schematic power-law scaling (illustrative exponents)")
    plt.grid(True, which="both", ls="--", alpha=0.4)
    plt.legend()
    plt.tight_layout()
    plt.savefig("scaling_loss_vs_N.png", dpi=150)
    plt.close()

    # --- Chinchilla-style FLOP comparison ---
    n_a, d_a = 1.2e11, 2.4e11
    n_b, d_b = 7e10, 1.4e12
    c_a = chinchilla_style_flops(n_a, d_a)
    c_b = chinchilla_style_flops(n_b, d_b)
    print(f"C_A ~ {c_a:.3e} FLOPs (r = {d_a/n_a:.2f} tok/param)")
    print(f"C_B ~ {c_b:.3e} FLOPs (r = {d_b/n_b:.2f} tok/param)")

    # --- Best-of-N ---
    p_single = 0.28
    for n in (1, 4, 8, 16):
        print(f"Best-of-{n:2d} success approx {best_of_n(p_single, n):.3f} (p={p_single})")

    # --- Toy CoT plot ---
    sizes, acc_d, acc_c = toy_cot_vs_direct(rng)
    plt.figure(figsize=(8, 5))
    plt.semilogx(sizes, acc_d, marker="o", label="Direct answer (toy)")
    plt.semilogx(sizes, acc_c, marker="s", label="With CoT bonus past threshold (toy)")
    plt.xlabel("Abstract model scale")
    plt.ylabel("Accuracy")
    plt.ylim(0.0, 1.05)
    plt.grid(True, ls="--", alpha=0.4)
    plt.legend()
    plt.title("Toy illustration: staged gains (not real benchmark data)")
    plt.tight_layout()
    plt.savefig("toy_cot_vs_direct.png", dpi=150)
    plt.close()
    print("Wrote scaling_loss_vs_N.png and toy_cot_vs_direct.png")


if __name__ == "__main__":
    main()

Interview Guide¶

FAANG-Level Questions

What are scaling laws, and what quantities do Kaplan-style power laws relate? Answer: Scaling laws are empirical fits showing that validation loss (often) decreases as a power law in model size \(N\), dataset size \(D\), or training compute \(C\)—straight lines on log–log plots over many orders of magnitude. They support capacity planning (how much compute buys how much loss reduction) but do not directly guarantee safety or downstream task wins. Exponents shift with architecture and data—treat fits as measured on your stack.
What did Chinchilla change about the conventional wisdom on model size versus data size? Answer: For a fixed FLOP budget, Chinchilla argued many models were over-parameterized and under-trained—optimal points use more tokens per parameter than earlier frontier recipes (e.g., very large models on modest data). The practical shift: invest in data and training steps, not only widening models, when chasing compute-optimal loss. IsoFLOP analysis formalizes the trade-off versus one-dimensional “bigger is always better.”
Give the order-of-magnitude relationship \(C \approx 6ND\) and explain what each symbol means. Answer: \(C\) is total training FLOPs (forward+backward, order-of-magnitude), \(N\) is non-embedding parameter count, and \(D\) is tokens processed—the 6 absorbs per-token forward/backward multiply-adds through a Transformer under standard counting assumptions. It is a planning identity: doubling data or parameters both linearly raise FLOPs in this sketch—Chinchilla searches the best \((N,D)\) pair on an iso-\(C\) slice.
Define emergent capabilities and one reason critics call emergence a metric artifact. Answer: Emergent capabilities are skills that appear near zero for small models then rise sharply past a scale threshold (e.g., multi-step reasoning on some benchmarks). Critics note nonlinear metrics (exact match) can turn smooth probability improvements into apparent jumps, and small test sets yield noisy curves—emergence may reflect measurement geometry as much as a phase change. Still, product teams may care about thresholds even when curves are partly metric-driven.
How does in-context learning differ from fine-tuning at the optimization level? Answer: ICL adapts only via the forward pass on a prompt—no gradient updates to weights; the model acts like it is doing task inference from examples in context. Fine-tuning runs SGD (or LoRA) to move \(\theta\) toward a task distribution. ICL is flexible and instant but limited by context length; fine-tuning persists skills but risks catastrophic forgetting and needs data governance.
Why does chain-of-thought prompting help more at larger scales than at small scales (qualitatively)? Answer: CoT asks the model to emit structured intermediate steps—larger models better follow the scaffold and recover from local errors, while small models may ramble or derail, hurting final accuracy. At scale, internal representations support reliable decomposition; below threshold, extra tokens add noise without verifiable structure. Verifiers and self-consistency amplify gains when step validity improves.
Write the formula for best-of-\(N\) success under independence and name two assumptions that break in production. Answer: With independent samples each succeeding with probability \(p\), \(P(\text{at least one success}) = 1 - (1-p)^N\). Assumptions that break: samples are often correlated (same failure mode, temperature too low) and verification is imperfect—false negatives discard good answers, false positives accept bad ones. Real systems need diversity (prompts, models) and calibrated judges.
What is test-time compute scaling, and how does it interact with verifier quality? Answer: Test-time scaling spends extra inference compute—sampling many chains, search, verifiers, refinement—rather than enlarging pretraining. It improves pass rates when marginal samples explore different reasoning paths, but gains cap at verifier recall/precision: a blind verifier cannot reward rare correct solutions; a lenient one admits hallucinations. Budget \(N\) trades latency/cost vs reliability.
How would you design an experiment to test whether a capability is truly “emergent” versus smoothly improving? Answer: Measure continuous metrics (token likelihood on steps, partial credit) alongside thresholded scores; sweep model sizes with tight CIs and multiple seeds; enlarge test sets to rule out small-\(n\) jumps; check alternative metrics (edit distance vs exact match). If smooth underlying curves explain “jumps” in accuracy, report both—utility may still threshold sharply for product.
When would you prefer investing FLOPs in training data versus test-time search? Answer: Prefer more/better pretraining or SFT data when failures are systematic (knowledge gaps, style) and latency budgets are tight—fixing the base model reduces repeated search tax. Prefer test-time search when tasks are verifiable (code tests, math checks) and a smaller model + verifier beats a larger static model under the same serving SLA. Cost profiles differ: training is amortized; search is per query.

Follow-up Probes

“If loss predicts downstream tasks poorly in your domain, what would you measure instead?”
“How does mixture-of-experts change FLOP accounting versus dense Chinchilla analysis?”
“What’s the failure mode of best-of-\(N\) when samples are correlated?”
“How would contamination of benchmarks bias emergence curves?”

Key Phrases to Use in Interviews

“IsoFLOP frontier: optimal \((N, D)\) pair for fixed training compute.”
“Compute-optimal training balances width/depth against tokens seen.”
“Emergence is operationally real for product thresholds even when metrics are coarse.”
“In-context learning is Bayesian conditioning through the prompt, not weight updates.”
“Test-time scaling trades latency for accuracy via search and verification.”

References¶

Kaplan, J., et al. (2020). Scaling Laws for Neural Language Models. arXiv:2001.08361.
Hoffmann, J., et al. (2022). Training Compute-Optimal Large Language Models (Chinchilla). arXiv:2203.15556.
Wei, J., et al. (2022). Emergent Abilities of Large Language Models. arXiv:2206.07682.
Schaeffer, R., et al. (2023). Are Emergent Abilities of Large Language Models a Mirage? arXiv:2304.15004.
Brown, T., et al. (2020). Language Models are Few-Shot Learners (GPT-3). NeurIPS.
Wei, J., et al. (2022). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. NeurIPS.
Snell, C., et al. (2024). Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters. arXiv:2408.03314.
OpenAI (2024). Learning to Reason with LLMs (o1 overview). OpenAI research blog.