Top 35 Research Papers in Large Language Models¶

This section provides in-depth documentation for the 35 landmark papers that shaped modern LLMs. Each paper page includes:

Simple math explanations — every equation broken down step by step
Python implementations — runnable code demonstrating the core ideas
Interview importance — why interviewers ask about it and difficulty level
Interview Q&A — specific questions with detailed answers
Connections — how each paper relates to others

How to Use This Section¶

Quick review: Read the TL;DR and Key Takeaways table at the bottom of each page
Deep study: Work through the math section and run the Python code
Interview prep: Focus on the Interview Q&A section — practice explaining these aloud
Build connections: Follow the "Connections to Other Papers" links to see how ideas evolved

Papers by Category¶

Architecture Foundations¶

#	Paper	Year	Key Contribution
1	Attention Is All You Need	2017	Transformer architecture — self-attention replaces recurrence
2	BERT	2018	Bidirectional encoder with masked language modeling
3	GPT-2	2019	Decoder-only LM with zero-shot task transfer
5	T5	2019	Unified text-to-text framework with span corruption
6	XLNet	2019	Permutation language modeling for bidirectional AR
34	RoPE/RoFormer	2021	Rotary Position Embedding — relative position via rotation

Training Recipes & Scaling¶

#	Paper	Year	Key Contribution
4	GPT-3	2020	175B params, in-context learning, scaling laws
7	RoBERTa	2019	Recipe matters — dynamic masking, no NSP, more data
8	ELECTRA	2020	Replaced-token detection — 100% token utilization
10	PaLM	2022	540B dense model, Pathways infrastructure, emergence
11	Chinchilla	2022	Compute-optimal training — balance params and data
12	LLaMA	2023	Open weights, RMSNorm + SwiGLU + RoPE standard

Alignment & Instruction Following¶

#	Paper	Year	Key Contribution
9	InstructGPT	2022	RLHF pipeline — SFT → Reward Model → PPO
17	FLAN	2022	Instruction tuning on 1,800+ tasks
21	Constitutional AI	2022	RLAIF — AI feedback guided by principles

Efficient Architecture & Serving¶

#	Paper	Year	Key Contribution
13	LoRA	2021	Low-rank adaptation — freeze base, train tiny updates
14	FlashAttention	2022	IO-aware exact attention — tiling in SRAM
15	Mistral 7B	2023	GQA + sliding window attention for efficient 7B
16	Mixtral	2024	Sparse MoE — 8 experts, top-2 routing
24	Mamba	2023	Linear-time SSM with selective state spaces
35	Attention Residuals	2026	Replace fixed residuals with attention over layers

Reasoning, Tools & Agents¶

#	Paper	Year	Key Contribution
18	Chain-of-Thought	2022	Reasoning via intermediate steps in prompts
19	ReAct	2023	Thought → Action → Observation agent loop
20	Toolformer	2023	Self-supervised API calling via loss comparison

Multimodal & Code¶

#	Paper	Year	Key Contribution
22	CLIP	2021	Contrastive image-text learning, zero-shot classification
23	Codex	2021	Code generation, HumanEval benchmark, pass@k metric
25	Gemini	2023	Native multimodal — text, images, audio, video

Chinese Lab Contributions (2024–2026)¶

The Chinese research ecosystem produced several landmark papers in 2024–2026 that fundamentally changed how the field thinks about reasoning, KV cache efficiency, MoE training, agentic systems, and multi-agent coordination. These are essential reading for any LLM interview in 2025–2026.

#	Paper	Lab	Year	Key Contribution
26	DeepSeek-V2	DeepSeek	2024	Multi-Head Latent Attention (MLA) — 93% KV cache reduction
27	DeepSeek-V3	DeepSeek	2024	671B MoE, auxiliary-loss-free load balancing, FP8 training, MTP
28	DeepSeek-R1	DeepSeek	2025	RL-only reasoning (R1-Zero), GRPO, cold-start distillation
29	ChatGLM / GLM-4	Zhipu AI / THU	2024	GLM pretraining unifies MLM+CLM; bilingual at scale
30	Kimi k1.5	Moonshot AI	2025	Long-context RL scaling; partial rollouts; online mirror descent
31	Qwen2.5	Alibaba	2024	18T tokens, 72B matches 405B Llama; Qwen2.5-Math/Coder family
32	GLM-5	Zhipu AI / THU	2026	744B MoE, Slime async RL, SotA open-weight agentic coding
33	Kimi K2.5	Moonshot AI	2026	1T MoE, Agent Swarm (PARL), native multimodality with MoonViT

Paper Summaries — Context, Motivation, and Key Ideas¶

Architecture Foundations¶

Attention Is All You Need (Vaswani et al., 2017) — Before this paper, sequence models relied on recurrence (RNNs, LSTMs), which processed tokens sequentially and created a fundamental bottleneck: training could not be parallelized across time steps, and long-range dependencies faded through the chain. The Transformer replaced recurrence entirely with self-attention — every token attends to every other token in a single operation, enabling full parallelization on GPUs. Multi-head attention lets the model capture different relationship types (syntax, coreference, semantics) simultaneously. This architecture became the backbone of every modern LLM.

BERT (Devlin et al., 2018) — GPT-1 showed that Transformers could be pre-trained, but its left-to-right autoregressive objective meant each token could only attend to previous tokens — missing context from the right. Google's BERT introduced masked language modeling (MLM): randomly mask 15% of tokens and predict them using bidirectional context. This enabled pre-training deep bidirectional representations, then fine-tuning on downstream tasks (classification, NER, QA) with minimal task-specific architecture. BERT demonstrated that pre-training + fine-tuning was the dominant paradigm, achieving state-of-the-art on 11 NLP benchmarks simultaneously.

GPT-2 (Radford et al., 2019) — OpenAI scaled the GPT architecture to 1.5B parameters and trained on 40GB of web text (WebText). The key finding was zero-shot task transfer: without any fine-tuning, GPT-2 could perform summarization, translation, and question answering by simply framing tasks as text completion. This challenged the prevailing "pre-train then fine-tune" paradigm and suggested that language models might be implicit multi-task learners. The controversial decision to initially withhold the full model due to misuse concerns also sparked the first AI safety debate around open-weight releases.

T5 (Raffel et al., 2019) — Google's Text-to-Text Transfer Transformer reframed every NLP task as text generation: classification becomes "sentiment positive" or "sentiment negative" as output text, translation becomes "translate English to German: [text]", etc. This unified framework enabled systematic comparison of pre-training objectives, model sizes, and data strategies. T5 introduced span corruption (masking contiguous spans rather than individual tokens) and demonstrated that the same architecture, objective, and fine-tuning procedure works for any text task — simplifying the ML pipeline from many task-specific models to one.

XLNet (Yang et al., 2019) — XLNet identified a fundamental conflict: BERT's masked LM objective sees all context but introduces artificial [MASK] tokens that never appear in real text (pretrain-finetune discrepancy), while autoregressive models see no [MASK] but can only attend left-to-right. XLNet resolved this with permutation language modeling — train an autoregressive model on all possible orderings of the input, so it learns bidirectional context without masking. This required a novel two-stream attention mechanism. XLNet outperformed BERT on 18 benchmarks, though its complexity limited widespread adoption.

RoPE/RoFormer (Su et al., 2021) — RoPE (Rotary Position Embedding) replaced additive positional encodings with multiplicative rotations of query and key vectors. By rotating Q and K in 2D subspaces using position-dependent angles, RoPE naturally encodes relative position in the dot product without any learned parameters. The geometric frequency progression (\(\theta_i = 10000^{-2i/d}\)) creates a natural locality bias and excellent length extrapolation. RoPE became the industry standard for positional encoding, adopted by LLaMA, Mistral, Qwen, PaLM, and virtually every modern open-weight LLM. Extensions like NTK-aware scaling and YaRN enable 4-32× length extrapolation beyond training context.

Training Recipes & Scaling¶

GPT-3 (Brown et al., 2020) — OpenAI scaled to 175B parameters and discovered that sufficiently large language models can perform tasks through in-context learning: provide a few input-output examples in the prompt, and the model generalizes to new inputs without any gradient updates. This was a paradigm shift — no fine-tuning, no task-specific heads, just prompt engineering. GPT-3 also established the scaling laws suggesting that performance improves predictably with model size, data, and compute, motivating the race to ever-larger models.

RoBERTa (Liu et al., 2019) — Facebook showed that BERT was significantly undertrained. By removing the Next Sentence Prediction objective (which hurt performance), using dynamic masking (different masks each epoch instead of static), training on 10× more data with larger batches, and training for longer, RoBERTa matched or exceeded XLNet without any architectural changes. The lesson: training recipes matter as much as architecture. This paper influenced how every subsequent model was trained.

ELECTRA (Clark et al., 2020) — BERT's MLM only learns from the 15% of tokens that are masked — 85% of tokens provide no training signal. ELECTRA replaced MLM with "replaced token detection": a small generator model corrupts tokens, and the main model predicts which tokens are original vs. replaced. This uses 100% of tokens for training, making ELECTRA 4× more compute-efficient than BERT. The approach demonstrated that the pre-training objective design can dramatically affect efficiency without changing the Transformer architecture itself.

PaLM (Chowdhery et al., 2022) — Google trained a 540B-parameter dense Transformer on 6,144 TPU v4 chips using the Pathways infrastructure, which enabled training across multiple TPU pods. PaLM demonstrated "emergence": capabilities that appear suddenly at scale rather than gradually. At 540B, PaLM could solve multi-step reasoning problems, explain jokes, and generate code — tasks that smaller models failed at completely. The paper also introduced the concept of "discontinuous" scaling improvements that challenged smooth power-law predictions.

Chinchilla (Hoffmann et al., 2022) — DeepMind proved that most large models were undertrained. The Chinchilla scaling law showed that for a fixed compute budget, model parameters and training tokens should be scaled equally. GPT-3 (175B params, 300B tokens) was over-parameterized and under-trained; Chinchilla (70B params, 1.4T tokens) matched GPT-3's performance at 4× lower inference cost. This paper redirected the field from "bigger models" to "more data" and directly influenced LLaMA, Mistral, and every subsequent training budget decision.

LLaMA (Touvron et al., 2023) — Meta trained a family of models (7B to 65B) using only publicly available data and open-sourced the weights. LLaMA-13B outperformed GPT-3 (175B) by applying Chinchilla-optimal training (1.4T tokens for 13B params). It also standardized the modern architectural recipe: RMSNorm (simpler than LayerNorm), SwiGLU activations (better than ReLU), and RoPE positional encoding (better extrapolation than sinusoidal). LLaMA democratized LLM research — every open-weight model since (Mistral, Qwen, Llama 2/3) builds on this foundation.

Alignment & Instruction Following¶

InstructGPT (Ouyang et al., 2022) — OpenAI's InstructGPT addressed the fundamental misalignment problem: language models trained on internet text learn to predict the next token, not to follow human instructions or be helpful. The paper introduced the three-stage RLHF pipeline — (1) supervised fine-tuning on human demonstrations, (2) training a reward model on human preference comparisons, (3) optimizing the policy via PPO against the reward model. A 1.3B InstructGPT was preferred over the 175B GPT-3 by human evaluators, proving that alignment is more important than raw scale.

FLAN (Wei et al., 2022) — Google showed that instruction tuning — fine-tuning on a diverse mixture of 1,800+ tasks with natural language instructions — dramatically improves zero-shot performance on unseen tasks. Unlike InstructGPT's RLHF approach, FLAN uses only supervised fine-tuning, making it simpler and cheaper. The paper demonstrated that the diversity and formatting of instruction data matters more than its volume, and that scaling the number of tasks improves generalization to held-out tasks.

Constitutional AI (Bai et al., 2022) — Anthropic addressed the scalability bottleneck of RLHF: collecting human preference labels is expensive and slow. Constitutional AI replaces human annotators with a written set of principles (a "constitution") that the model uses to critique and revise its own outputs. This RLAIF (RL from AI Feedback) approach enables explicit, auditable safety rules that can be updated without retraining, while maintaining output quality comparable to RLHF-trained models.

Efficient Architecture & Serving¶

LoRA (Hu et al., 2021) — Full fine-tuning of a 175B model requires storing and updating all parameters — impractical for most organizations. Hu et al. showed that the weight updates during fine-tuning have low intrinsic rank: instead of updating the full weight matrix W, decompose the update into two small matrices (A and B) where ΔW = BA. For a 175B model, LoRA reduces trainable parameters from 175B to ~4M (0.002%) while matching full fine-tuning performance. This made fine-tuning accessible on consumer GPUs and enabled serving multiple LoRA adapters on a single base model.

FlashAttention (Dao et al., 2022) — Standard attention computes the full N×N attention matrix in GPU HBM (slow memory), creating both a memory bottleneck (O(N²) space) and an IO bottleneck (moving data between HBM and SRAM). FlashAttention uses tiling to compute attention block by block in fast SRAM without ever materializing the full attention matrix. The result: exact attention (not approximate) that is 2-4× faster and uses O(N) memory instead of O(N²). This made training with longer sequences (4K → 64K → 128K) practically feasible.

Mistral 7B (Jiang et al., 2023) — Mistral demonstrated that a 7B model with the right architecture can outperform the 13B LLaMA 2. The key innovations were Grouped-Query Attention (GQA, sharing KV heads to reduce cache size), Sliding Window Attention (limiting attention to a fixed window for efficient long sequences), and aggressive training on high-quality data. Mistral proved that model efficiency — not just scale — determines practical deployment viability for edge and cost-sensitive applications.

Mixtral (Jiang et al., 2024) — Mixtral is a Sparse Mixture of Experts model with 8 expert FFN modules per layer, routing each token to the top-2 experts. Total parameters: 47B, but active parameters per token: ~13B. This means Mixtral has the knowledge capacity of a 47B model with the inference cost of a 13B model. The paper demonstrated that sparse MoE with simple top-k routing can match dense models 6× their active size, making MoE the dominant architecture for cost-effective large-scale deployment.

Mamba (Gu & Dao, 2023) — Transformers have O(N²) attention complexity, making long sequences expensive. Mamba introduced Selective State Space Models that process sequences in O(N) time by making the state transition parameters input-dependent (selective). Unlike fixed-parameter linear RNNs, Mamba can decide what information to remember or forget at each step. It matched Transformer quality on language modeling while enabling much longer effective context windows. Mamba opened the exploration of non-attention architectures for LLMs.

Attention Residuals (Kimi Team, 2026) — For nearly a decade, Transformer residual connections used fixed unit-weight addition: \(x_{\ell+1} = x_\ell + f_\ell(x_\ell)\). Attention Residuals replaced this with learned, input-dependent attention over all preceding layer outputs: \(x_L = \sum_{\ell=0}^{L-1} \alpha_\ell h_\ell\). This addresses PreNorm's uncontrolled hidden-state growth and signal dilution in deep networks. Block AttnRes partitions layers into blocks for efficiency: \(O(B^2 + L^2/B)\) vs \(O(L^2)\). Integrated into Kimi Linear (48B MoE) with 1.4T token pretraining, demonstrating consistent scaling gains and more uniform gradient distributions.

Reasoning, Tools & Agents¶

Chain-of-Thought (Wei et al., 2022) — Large language models failed at multi-step reasoning when asked to produce answers directly. Wei et al. showed that simply adding "Let's think step by step" or providing few-shot examples with intermediate reasoning steps dramatically improved performance on math, logic, and commonsense tasks. Chain-of-Thought prompting requires no fine-tuning — it works by giving the model "thinking space" to decompose problems. This technique is now a standard component of every LLM system and inspired the entire reasoning model paradigm (o1, R1).

ReAct (Yao et al., 2023) — Before ReAct, LLMs either reasoned without grounding (chain-of-thought, prone to hallucination) or used tools without reasoning (no ability to plan or recover from errors). ReAct interleaves reasoning traces ("I need to search for X because...") with grounded actions (search, lookup, calculate), then observes the results before deciding the next step. This Thought → Action → Observation loop became the foundational architecture for LLM agents because it combines the planning ability of CoT with the factual grounding of tool use.

Toolformer (Schick et al., 2023) — Previous tool-use approaches required either fine-tuning on human-annotated tool calls or complex prompting. Toolformer taught a model to use tools through self-supervision: the model generates candidate API calls, inserts them into training text, and keeps only those calls that reduce the language modeling loss (i.e., calls that actually help prediction). This created a model that naturally interleaves text generation with calculator, search, and translation API calls without any human annotation of tool usage.

Multimodal & Code¶

CLIP (Radford et al., 2021) — Before CLIP, vision models were trained on fixed label sets (ImageNet's 1,000 classes), making them brittle and narrow. CLIP trained a vision encoder and text encoder jointly on 400M image-text pairs from the internet using contrastive learning — matching images with their captions. This produced a model that understands visual concepts through natural language, enabling zero-shot classification on any category ("a photo of a dog", "a photo of a cat") without any labeled training data. CLIP's visual encoder became the foundation for multimodal models (LLaVA, Gemini, GPT-4V).

Codex (Chen et al., 2021) — OpenAI fine-tuned GPT-3 on publicly available code from GitHub, creating a model that translates natural language specifications into working programs. Codex introduced the HumanEval benchmark (164 hand-written programming problems) and the pass@k metric (probability that at least one of k samples passes all test cases). Codex powered GitHub Copilot and demonstrated that code generation is a natural extension of language modeling — code has precise semantics that can be validated automatically, making it the ideal domain for LLM evaluation.

Gemini (Google DeepMind, 2023) — Unlike models that bolt vision encoders onto language models (LLaVA-style), Gemini was pre-trained natively on interleaved text, images, audio, and video from the start. This native multimodal training means the model doesn't have an "adapter gap" between modalities — it reasons across them fluently. Gemini demonstrated state-of-the-art performance across 30+ benchmarks spanning language, code, math, vision, and audio, establishing that a single foundation model can excel across all modalities when they are trained jointly.

Timeline¶

Year	Papers
2017	Transformer
2018	BERT
2019	GPT-2, XLNet, RoBERTa, T5
2020	GPT-3, ELECTRA
2021	CLIP, Codex, LoRA, RoPE
2022	InstructGPT, PaLM, Chinchilla, FlashAttention, FLAN, Chain-of-Thought, Constitutional AI
2023	LLaMA, Mistral 7B, ReAct, Toolformer, Mamba, Gemini
2024	Mixtral, DeepSeek-V2, DeepSeek-V3, ChatGLM/GLM-4, Qwen2.5
2025	DeepSeek-R1, Kimi k1.5
2026	GLM-5, Kimi K2.5, Attention Residuals

Paper Interconnections¶

flowchart LR
    T[Transformer] --> BERT & GPT2 & T5
    GPT2 --> GPT3
    GPT3 --> Inst[InstructGPT] & PaLM & CoT & Codex
    Inst --> CAI
    PaLM --> Chinchilla --> LLaMA --> Mistral --> Mixtral
    T5 --> FLAN
    CoT --> ReAct --> Toolformer
    CLIP --> Gemini
    PaLM --> Gemini
    T --> FlashAttention & Mamba
    LLaMA --> DSV2[DeepSeek-V2] --> DSV3[DeepSeek-V3] --> DSR1[DeepSeek-R1]
    Mixtral --> DSV3
    DSR1 --> KimiK15[Kimi k1.5]
    KimiK15 --> KimiK25[Kimi K2.5]
    BERT --> GLM4[GLM-4]
    GLM4 --> GLM5[GLM-5]
    DSR1 --> GLM5
    GPT3 --> Qwen25[Qwen2.5]
    Inst --> DSR1
    CoT --> DSR1
    RoPE[RoPE] --> LLaMA
    RoPE --> Mistral & Qwen25 & PaLM
    T --> RoPE
    DSV3 --> AttnRes[Attention Residuals]
    KimiK25 --> AttnRes

Threads:

Architecture: Transformer → BERT/GPT/T5 → FlashAttention / Mistral / Mamba
Positional Encoding: Transformer → RoPE → LLaMA/Mistral/Qwen (modern standard)
Scaling: GPT-3 → PaLM → Chinchilla → LLaMA
Alignment: InstructGPT → Constitutional AI / FLAN
Tools & Agents: Chain-of-Thought → ReAct → Toolformer
Multimodal: CLIP → Gemini
Chinese lab thread: LLaMA → DeepSeek-V2 (MLA) → DeepSeek-V3 (MoE+FP8) → DeepSeek-R1 (GRPO) → Kimi k1.5 (long-context RL) → Kimi K2.5 (Agent Swarm)
Bilingual/GLM thread: BERT → GLM-4 → GLM-5 (async RL, agentic coding)
Residual connections: Transformer → Attention Residuals (adaptive depth aggregation)

Interview Priority Guide¶

If you have limited time, focus on these papers first (highest interview frequency):

Transformer — asked in every LLM interview
InstructGPT/RLHF — mandatory for alignment questions
LoRA — essential for fine-tuning and serving questions
FlashAttention — key for systems/infrastructure roles
LLaMA — modern architecture standard (RMSNorm, SwiGLU, RoPE)
RoPE — essential architecture knowledge; positional encoding standard
Chinchilla — scaling and training budget decisions
Chain-of-Thought — reasoning and prompt engineering
BERT — encoder vs. decoder understanding
GPT-3 — in-context learning and scaling laws
ReAct — agent and tool-use systems
DeepSeek-R1 — now essential; GRPO, RL-only reasoning, distillation
DeepSeek-V2/V3 — MLA for KV cache, MoE load balancing
Kimi k1.5 — long-context RL as an alternative to MCTS
GLM-5 — async RL for agentic coding, MoE serving
Kimi K2.5 — multi-agent RL (PARL), native multimodality
Attention Residuals — emerging 2026 topic; transformer depth redesign

Cross-Paper Interview Scenarios¶

These scenarios draw on multiple papers and test your ability to connect ideas across the field. Each links to the relevant deep dives.

Scenario 1: MoE Serving Under Load¶

"You're serving a 671B MoE model. Walk me through memory vs. compute trade-offs and when expert parallelism breaks down."

Draws on: Mixtral (top-2 routing basics), DeepSeek-V3 (auxiliary-loss-free balancing, FP8), GLM-5 (744B MoE serving), Kimi K2.5 (384 experts, 8 active).

Key points: total vs. active parameters determine memory vs. compute; expert parallelism requires all-to-all communication; load imbalance wastes GPU cycles; bias-based routing (V3) vs. auxiliary loss (Mixtral) have different stability profiles.

Scenario 2: Async RL for Tool-Heavy Tasks¶

"Your RL training loop is bottlenecked by slow compiler/test-runner calls. How do you keep GPUs utilized?"

Draws on: DeepSeek-R1 (synchronous GRPO), GLM-5 (Slime async RL, V-trace correction).

Key points: synchronous RL idles GPUs waiting for slowest rollout; Slime decouples rollout workers from policy updates; V-trace importance weighting corrects for stale policies; trade-off is bias from staleness vs. throughput gain.

Scenario 3: KV Cache Efficiency Stack¶

"Design the inference stack for a 128K-context model. Which optimizations compose and which conflict?"

Draws on: DeepSeek-V2 (MLA, 93% cache reduction), FlashAttention (IO-aware tiling), Inference: KV Cache, Inference: Continuous Batching.

Key points: MLA compresses KV into low-rank latent; FlashAttention tiles computation in SRAM; continuous batching maximizes throughput; PagedAttention manages fragmented memory; all three compose — MLA reduces what's cached, FlashAttention speeds the compute, batching fills GPU utilization.

Scenario 4: Single Agent vs. Multi-Agent Systems¶

"When should you use a single powerful agent with tools vs. a multi-agent swarm?"

Draws on: GLM-5 (single agent, SWE-bench), Kimi K2.5 (Agent Swarm, PARL), ReAct (thought-action loop), Toolformer (self-supervised tool use).

Key points: single agents are simpler and have full context; multi-agent parallelizes independent subtasks; PARL rewards both quality and speedup; coordination overhead can negate parallelism gains for tightly coupled tasks; DAG dependency analysis determines decomposability.

Scenario 5: Distillation vs. On-Student RL¶

"You have a strong reasoning teacher. When do you distill vs. run RL directly on the student?"

Draws on: DeepSeek-R1 (R1-Distill vs. GRPO), GLM-5 (distillation for coding agents), Chinchilla (compute-optimal trade-offs).

Key points: distillation (SFT on teacher traces) is cheaper and transfers style; RL on student handles distribution shift and new domains; distillation may not transfer robustness without diverse/hard-negative data; compute budget determines which is feasible.

Scenario 6: Data Quality as a Scaling Multiplier¶

"Qwen2.5-72B matches a 405B model. What does this say about compute-optimal training?"

Draws on: Qwen2.5 (18T tokens, data quality), Chinchilla (compute-optimal formula), LLaMA (open training recipe).

Key points: Chinchilla formula assumes fixed data quality; effective data \(D_{\text{eff}} = \text{quality}(D) \cdot |D|\); curated 18T tokens > raw 100T tokens; specialized post-training (Math, Coder) further leverages the base investment.

Scenario 7: Evaluation Hygiene and Benchmark Leakage¶

"How do you ensure your model's benchmark scores are trustworthy?"

Draws on: Qwen2.5 (evaluation methodology), GLM-5 (SWE-bench Verified), Kimi K2.5 (multi-benchmark evaluation).

Key points: API vs. open-weight reproducibility; version-pinned benchmarks; contamination detection; held-out private test sets; human evaluation as ground truth calibration; SWE-bench Verified uses real GitHub PRs with test suites.

Deep Dive: Chinese Lab Papers¶

DeepSeek-V2 (2024) — Multi-Head Latent Attention¶

Why it matters: DeepSeek-V2's MLA mechanism reduced KV cache by 93% compared to standard MHA — enabling 128K context windows at practical inference cost. This is now a standard technique interviewers probe for in systems roles.

Core idea: instead of caching full K and V matrices (which grow as \(O(T \cdot d_{\text{model}})\)), compress them into a low-rank latent vector \(c_t\):

\[ c_t = W^{DKV} h_t \in \mathbb{R}^{d_c}, \quad d_c \ll d_{\text{model}} \cdot n_h \]

At attention time, decompress:

\[ K_t = W^{UK} c_t, \quad V_t = W^{UV} c_t \]

In Plain English

Instead of storing a 1024-dimensional K and V for each of 64 heads at every token position, store a single ~512-dimensional compressed vector \(c_t\). At inference, project it back to K and V on the fly. You trade a small amount of compute for a huge reduction in memory bandwidth.

Key results: 236B total / 21B active MoE model; 93.3% KV cache reduction; 5.76× higher generation throughput vs. DeepSeek 67B.

Interview questions:

How does MLA differ from MQA (Multi-Query Attention) and GQA (Grouped-Query Attention)? Which provides the largest KV cache savings?
Why does the KV cache size matter for serving throughput — what is the memory bandwidth bottleneck during decode?
MLA stores a compressed \(c_t\) instead of full K/V — what is the quality trade-off, and how is it mitigated?

DeepSeek-V3 (2024) — MoE at 671B Scale¶

Why it matters: DeepSeek-V3 trained a 671B-parameter MoE model (37B active) in 2.788M GPU-hours on H800s — remarkably efficient at this scale. Three innovations drove this: auxiliary-loss-free load balancing, FP8 mixed-precision training, and multi-token prediction (MTP) for denser training signals.

Auxiliary-loss-free load balancing: traditional MoE uses an auxiliary loss to prevent expert collapse (all tokens routing to one expert). DeepSeek-V3 instead adjusts per-expert bias terms dynamically during training:

\[ g_{i,t} = \text{Softmax}\bigl(e_{i,t} + b_i\bigr) \]

where \(b_i\) is updated by \(\pm\gamma\) depending on whether expert \(i\) is over- or under-utilized. No auxiliary loss term pollutes the primary objective.

In Plain English

Instead of penalizing imbalanced routing through a separate loss (which fights the primary task loss), add a learnable bias to each expert's routing score. Over-utilized experts get a lower bias (making them less likely to be chosen), under-utilized ones get a higher bias. The primary loss stays clean.

Multi-Token Prediction (MTP): alongside the standard next-token head, auxiliary heads predict tokens \(t+2, t+3, \ldots, t+k\) during training. These heads are discarded at inference but provide denser gradient signal during training.

FP8 training: weights, activations, and gradients computed in FP8 with block-wise scaling. Enables ~2× FLOP throughput and halved memory bandwidth vs. BF16.

Interview questions:

What is expert collapse in MoE training, and how do traditional auxiliary losses address it? What is the disadvantage of auxiliary losses?
How does bias-adjustment load balancing avoid the auxiliary loss problem while still achieving balanced routing?
What does multi-token prediction during training provide that standard next-token prediction does not?
Why is FP8 training harder than BF16 training, and what calibration strategies make it stable?

DeepSeek-R1 (2025) — RL-Only Reasoning¶

Why it matters: R1-Zero proved that reasoning capability is latent in pre-trained weights and can be elicited purely through RL — no labeled CoT data required. This changed how the field thinks about the relationship between pretraining, SFT, and alignment.

See the DeepSeek-R1 deep dive for the full GRPO derivation and worked example, or the Recent Advances entry for a frontier summary.

Interview questions:

R1-Zero trains with RL but no SFT. What does this imply about what is "in" a pretrained model before alignment?
Compare the GRPO group-relative advantage to the PPO critic — when would you prefer each?
Why is distillation from R1 (SFT on R1 traces) often more compute-efficient than running GRPO on the student directly?

ChatGLM / GLM-4 (Zhipu AI / Tsinghua, 2022–2024)¶

Why it matters: the GLM series demonstrated a third pretraining paradigm alongside MLM (BERT) and CLM (GPT): autoregressive blank infilling. GLM-4 and GLM-4.6 became competitive frontier models with particularly strong Chinese-English bilingual quality and long-context generation.

GLM pretraining objective: randomly sample text spans, replace them with mask tokens, then autoregressively predict each span given the masked context. Unlike BERT, spans are generated left-to-right. Unlike GPT, the context is bidirectional:

\[ \mathcal{L}_{\text{GLM}} = -\mathbb{E}\left[\sum_{s \in \mathcal{S}} \sum_{i=1}^{|s|} \log P_\theta(s_i \mid \mathbf{x}_{\text{corrupt}},\, s_1,\ldots,s_{i-1})\right] \]

where \(\mathcal{S}\) is a set of sampled spans and \(\mathbf{x}_{\text{corrupt}}\) has those spans masked.

In Plain English

The model sees the whole document (with some spans blanked out), and must generate each blanked span token by token. This trains bidirectional understanding (it sees future context around the blank) AND autoregressive generation (it must generate the blank causally). One objective, both capabilities.

Key results (GLM-4.6): 357B total / 32B active; 200K input context; 128K output tokens; MIT license.

Interview questions:

How does the GLM blank-infilling objective differ from BERT's MLM — specifically in what attention mask pattern each uses?
What downstream tasks is the GLM objective better suited for compared to pure CLM, and why?
Why is a 200K output token limit practically significant for agentic and long-form generation tasks?

Kimi k1.5 (Moonshot AI, 2025) — Long-Context RL Scaling¶

Why it matters: Kimi k1.5 matched o1 on AIME and MATH 500 without MCTS, value functions, or process reward models — a simpler RL recipe than most expected to reach frontier reasoning quality. The key insight: a 128K-token context window lets the model reason, reflect, and self-correct within a single trajectory.

See the Kimi k1.5 deep dive for the mirror descent formulation, or the Recent Advances entry for a frontier summary.

Interview questions:

How does partial rollout reuse reduce the compute cost of training with 128K-token sequences?
Why does long context itself improve reasoning quality in RL training, even without a process reward model?
Kimi k1.5 outperforms GPT-4o on short-CoT tasks. What does this suggest about the relationship between reasoning training and general capability?

Qwen2.5 (Alibaba, 2024)¶

Why it matters: Qwen2.5-72B matches Llama-3-405B-Instruct — 5× larger — on many benchmarks, making it a landmark data-efficiency result. The key: scaling pre-training data from 7T to 18T high-quality tokens and investing in post-training alignment via 1M+ SFT examples and multi-stage RL.

Chinchilla revisited: Qwen2.5 demonstrates that the compute-optimal frontier has shifted — with sufficiently curated data, smaller models trained on more tokens substantially outperform larger models trained on less. The Chinchilla formula applies, but data quality is a multiplier on the token count:

\[ L \approx A N^{-\alpha} + B D_{\text{eff}}^{-\beta} \]

where \(D_{\text{eff}} = \text{quality}(D) \cdot |D|\) — effective data depends on quality, not just volume.

Specialized variants: Qwen2.5-Math, Qwen2.5-Coder, QwQ (reasoning), and multimodal Qwen-VL demonstrate that a single pre-trained base can be efficiently specialized via targeted post-training.

Interview questions:

Qwen2.5-72B matches a 405B model. What does this say about the Chinchilla compute-optimal hypothesis, and how does data quality modify it?
How does the strategy of training specialized variants (Math, Coder) from a single base compare to training separate models from scratch?
What does Alibaba's model portfolio (open-weight base + API MoE variants + specialized) tell you about the economics of LLM deployment?

GLM-5 (Zhipu AI / Tsinghua, 2026) — Agentic Engineering with Async RL¶

Why it matters: GLM-5 (744B MoE, 40B active) achieves 77.8 on SWE-bench Verified — the strongest open-weight result for autonomous code repair, approaching Claude Opus 4.5. The key innovation is Slime, an asynchronous RL framework that decouples rollout generation from policy updates, keeping GPUs utilized even when agentic rollouts involve slow tool execution (compilers, test runners).

Slime async RL: synchronous RL wastes GPU cycles waiting for the slowest tool-interaction rollout. Slime uses rollout workers with slightly stale policy copies, buffering completed episodes. V-trace importance weighting corrects for the staleness gap:

\[ \rho_t = \min\left(\bar{\rho},\, \frac{\pi_\theta(a_t \mid s_t)}{\mu(a_t \mid s_t)}\right) \]

In Plain English

Slime is like a kitchen where chefs work independently instead of waiting for the slowest dish. A quality controller (importance weights) ensures slightly stale preparations are still usable for the next course.

Interview questions:

How does Slime's asynchronous RL differ from synchronous PPO/GRPO, and when is async preferable?
With 744B total but only 40B active parameters, what determines serving cost — total or active parameters?
Compare the GLM pretraining objective (blank infilling) with standard CLM for agentic code repair tasks — why might bidirectional context help?

Kimi K2.5 (Moonshot AI, 2026) — Multi-Agent Coordination with PARL¶

Why it matters: Kimi K2.5 (1T MoE, 32B active, 384 experts) introduces Agent Swarm — a system that orchestrates up to 100 parallel sub-agents with 1,500 tool calls per workflow. The coordinator is trained with Parallel-Agent Reinforcement Learning (PARL), which rewards both task-completion quality and parallelism speedup. Native multimodality via MoonViT 400M enables visual-to-code generation from UI screenshots.

PARL reward:

\[ R(D, y) = R_{\text{quality}}(y, q) + \beta \cdot \frac{T_{\text{sequential}}}{T_{\text{parallel}}(D)} \]

The speedup ratio rewards decompositions that achieve genuine parallelism — the coordinator learns which tasks can be parallelized vs which have sequential dependencies.

In Plain English

PARL is MapReduce for LLM tasks — the coordinator learns to map work to parallel sub-agents and reduce results into a final output. The RL reward incentivizes both getting the right answer AND doing it faster via parallelism.

Interview questions:

How does PARL differ from prompt-based multi-agent frameworks (AutoGen, CrewAI)?
What are the trade-offs between native multimodal pretraining (K2.5) vs bolt-on vision encoders (LLaVA)?
With 384 experts and 8 active per token, what is the memory footprint and what are the load balancing challenges?