LLaMA: Open and Efficient Foundation Language Models¶
Authors: Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, et al.
Year: 2023 | Venue: arXiv
Link: arXiv:2302.13971
TL;DR¶
LLaMA trains openly released foundation models (7B–65B parameters) on 1T–1.4T tokens, at or beyond the Chinchilla-optimal budget — significantly more training tokens relative to model size than comparable earlier LLMs. Architecture tweaks include RMSNorm (replacing LayerNorm), SwiGLU activations, Rotary Position Embeddings (RoPE), and an efficient attention implementation. The goal: maximum performance per parameter for research and local deployment.
Why This Paper Matters¶
LLaMA triggered an explosion of open-source LLM development:
- Open weights enabled an entire ecosystem: Alpaca, Vicuna, Llama 2, Llama 3, Llama 4
- Proved that smaller, well-trained models can match much larger ones
- Established RMSNorm + SwiGLU + RoPE as the standard modern architecture
- Made LLM research accessible without massive compute budgets
- 7B/13B models run on consumer GPUs, enabling local AI deployment
Key Concepts Explained Simply¶
RMSNorm (Root Mean Square Normalization)¶
LayerNorm normalizes by subtracting the mean and dividing by the standard deviation. RMSNorm is simpler — it skips the mean subtraction and only divides by the root mean square:
- LayerNorm: Subtract mean, divide by std, scale and shift
- RMSNorm: Just divide by RMS, then scale
Why? Removing the mean-centering step reduces computation with minimal impact on performance. At scale, this saving adds up.
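A quick numerical sanity check (my illustration, not from the paper): on an input that already has zero mean, RMSNorm and LayerNorm (with unit scale and zero shift) coincide, since mean subtraction is the only difference between them.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(8)
x = x - x.mean()  # force zero mean so the two norms should agree

rms_out = x / np.sqrt(np.mean(x ** 2))        # RMSNorm (gamma = 1)
ln_out = (x - x.mean()) / np.sqrt(np.var(x))  # LayerNorm (gamma = 1, beta = 0)

print(np.allclose(rms_out, ln_out))  # True
```

On real (non-centered) activations the two differ, but in practice the difference rarely hurts model quality, which is why dropping the centering step is a net win.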
SwiGLU Activation¶
The feed-forward network in a Transformer typically uses ReLU or GELU. SwiGLU uses a gated structure:
- Standard FFN: \(\text{FFN}(x) = W_2 \cdot \text{ReLU}(W_1 x)\)
- SwiGLU: \(\text{FFN}(x) = (\text{Swish}(W_1 x) \odot W_3 x) W_2\)
The gate \(\text{Swish}(W_1 x)\) controls what information flows through, giving the network more expressivity.
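A tiny numeric illustration (mine, not the paper's) of why gating adds expressivity: Swish is near zero for strongly negative pre-activations and near identity for large positive ones, so each value channel can be smoothly switched off or passed through.

```python
import numpy as np

def swish(x):
    """Swish/SiLU: x * sigmoid(x)."""
    return x * (1.0 / (1.0 + np.exp(-x)))

pre = np.array([-6.0, 0.0, 6.0])   # gate pre-activations
value = np.array([2.0, 2.0, 2.0])  # value branch

gate = swish(pre)
print(gate)          # ~[-0.015, 0.0, 5.985]: off, zero, nearly identity
print(gate * value)  # gated values, as in SwiGLU's two parallel projections
```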
Rotary Position Embeddings (RoPE)¶
Instead of adding position information to the input (like sinusoidal encodings), RoPE rotates the query and key vectors based on position. The rotation angle is proportional to position, so the dot product between two rotated vectors depends on their relative position.
Key advantage: RoPE naturally encodes relative positions, and its extrapolation properties allow processing longer sequences than seen during training (with some quality loss).
The Math — Explained Step by Step¶
RMSNorm¶
\[
\text{RMSNorm}(\mathbf{x}) = \frac{\mathbf{x}}{\sqrt{\frac{1}{d}\|\mathbf{x}\|^2 + \epsilon}} \odot \gamma
\]
Breaking it down:
- \(\|\mathbf{x}\|^2\): Sum of squared elements of the vector
- \(\frac{1}{d}\|\mathbf{x}\|^2\): Mean of squared elements (the "root mean square" before the sqrt)
- Divide by \(\sqrt{\text{mean of squares}}\): Normalizes the scale
- \(\odot \gamma\): Learnable element-wise scaling (no bias term)
SwiGLU¶
\[
\text{SwiGLU}(x) = (\text{Swish}_\beta(W_1 x) \odot W_3 x)\, W_2
\]
where \(\text{Swish}_\beta(x) = x \cdot \sigma(\beta x)\) and \(\sigma\) is the sigmoid function (LLaMA uses \(\beta = 1\), i.e. SiLU).
The \(\odot\) is element-wise multiplication — the Swish output gates the linear transformation.
RoPE¶
For position \(m\) and dimension pair \((2i, 2i+1)\):
\[
\begin{pmatrix} q'_{2i} \\ q'_{2i+1} \end{pmatrix}
=
\begin{pmatrix} \cos m\theta_i & -\sin m\theta_i \\ \sin m\theta_i & \cos m\theta_i \end{pmatrix}
\begin{pmatrix} q_{2i} \\ q_{2i+1} \end{pmatrix}
\]
where \(\theta_i = 10000^{-2i/d}\).
Key property: \(q_m^\top k_n = f(q, k, m-n)\) — the dot product depends on the relative distance \(m - n\), not absolute positions.
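This property can be checked numerically for a single 2-D rotation pair (a sketch under my own toy setup, not the paper's code): two position pairs with the same offset give the same score.

```python
import numpy as np

def rotate(v, m, theta=0.1):
    """Rotate a 2-D query/key pair by angle m * theta for position m."""
    c, s = np.cos(m * theta), np.sin(m * theta)
    return np.array([[c, -s], [s, c]]) @ v

q = np.array([1.0, 2.0])
k = np.array([0.5, -1.0])

# Two position pairs with the same relative offset m - n = 3:
d1 = rotate(q, 5) @ rotate(k, 2)
d2 = rotate(q, 10) @ rotate(k, 7)
print(np.isclose(d1, d2))  # True: the score depends only on m - n
```

This follows because \(R_m^\top R_n = R_{n-m}\) for rotation matrices, so the absolute positions cancel.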
Python Implementation¶
import numpy as np
def rmsnorm(x, gamma, eps=1e-6):
"""
Root Mean Square Layer Normalization.
x: [seq_len, d_model]
gamma: [d_model] — learnable scale
"""
rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
return (x / rms) * gamma
def layernorm(x, gamma, beta, eps=1e-6):
"""Standard LayerNorm for comparison."""
mean = np.mean(x, axis=-1, keepdims=True)
var = np.var(x, axis=-1, keepdims=True)
return gamma * (x - mean) / np.sqrt(var + eps) + beta
def swish(x, beta=1.0):
"""Swish activation: x * sigmoid(beta * x)."""
return x * (1 / (1 + np.exp(-beta * x)))
def swiglu(x, W1, W3, W2):
"""
SwiGLU feed-forward network.
x: [seq_len, d_model]
W1: [d_model, d_ff] — gate projection
W3: [d_model, d_ff] — value projection
W2: [d_ff, d_model] — output projection
"""
gate = swish(x @ W1)
value = x @ W3
return (gate * value) @ W2
def rope_freqs(d_model, seq_len, base=10000):
"""Compute RoPE frequency pairs."""
freqs = 1.0 / (base ** (np.arange(0, d_model, 2).astype(float) / d_model))
positions = np.arange(seq_len)
angles = np.outer(positions, freqs)
return np.cos(angles), np.sin(angles)
def apply_rope(x, cos, sin):
"""
Apply Rotary Position Embeddings to query/key vectors.
x: [seq_len, d_model]
cos, sin: [seq_len, d_model//2]
"""
d = x.shape[-1]
x1 = x[:, :d // 2]
x2 = x[:, d // 2:]
x_rotated = np.concatenate([
x1 * cos - x2 * sin,
x1 * sin + x2 * cos
], axis=-1)
return x_rotated
def rope_attention(Q, K, V, cos, sin):
"""Self-attention with RoPE applied to Q and K."""
Q_rot = apply_rope(Q, cos, sin)
K_rot = apply_rope(K, cos, sin)
d_k = Q.shape[-1]
scores = (Q_rot @ K_rot.T) / np.sqrt(d_k)
# Causal mask
seq_len = Q.shape[0]
mask = np.triu(np.full((seq_len, seq_len), -1e9), k=1)
scores = scores + mask
attn = np.exp(scores - np.max(scores, axis=-1, keepdims=True))
attn = attn / np.sum(attn, axis=-1, keepdims=True)
return attn @ V
class LLaMABlock:
"""Simplified LLaMA Transformer block."""
def __init__(self, d_model, d_ff, n_heads):
self.d_model = d_model
self.d_ff = d_ff
self.n_heads = n_heads
self.W_q = np.random.randn(d_model, d_model) * 0.02
self.W_k = np.random.randn(d_model, d_model) * 0.02
self.W_v = np.random.randn(d_model, d_model) * 0.02
self.W_o = np.random.randn(d_model, d_model) * 0.02
self.W1 = np.random.randn(d_model, d_ff) * 0.02
self.W3 = np.random.randn(d_model, d_ff) * 0.02
self.W2 = np.random.randn(d_ff, d_model) * 0.02
self.gamma_attn = np.ones(d_model)
self.gamma_ffn = np.ones(d_model)
def forward(self, x, cos, sin):
# Pre-RMSNorm + Attention + Residual
h = rmsnorm(x, self.gamma_attn)
Q, K, V = h @ self.W_q, h @ self.W_k, h @ self.W_v
attn_out = rope_attention(Q, K, V, cos, sin)
attn_out = attn_out @ self.W_o
x = x + attn_out
# Pre-RMSNorm + SwiGLU FFN + Residual
h = rmsnorm(x, self.gamma_ffn)
ffn_out = swiglu(h, self.W1, self.W3, self.W2)
x = x + ffn_out
return x
# --- Demo ---
if __name__ == "__main__":
np.random.seed(42)
d_model, d_ff, n_heads, seq_len = 64, 128, 4, 10
# RMSNorm vs LayerNorm
x = np.random.randn(seq_len, d_model)
gamma = np.ones(d_model)
beta = np.zeros(d_model)
rms_out = rmsnorm(x, gamma)
ln_out = layernorm(x, gamma, beta)
print("RMSNorm output stats: mean={:.4f}, std={:.4f}".format(
rms_out.mean(), rms_out.std()))
print("LayerNorm output stats: mean={:.4f}, std={:.4f}".format(
ln_out.mean(), ln_out.std()))
# SwiGLU demo
W1 = np.random.randn(d_model, d_ff) * 0.02
W3 = np.random.randn(d_model, d_ff) * 0.02
W2 = np.random.randn(d_ff, d_model) * 0.02
ffn_out = swiglu(x, W1, W3, W2)
print(f"\nSwiGLU: input {x.shape} → output {ffn_out.shape}")
# RoPE demo
cos, sin = rope_freqs(d_model, seq_len)
Q = np.random.randn(seq_len, d_model)
Q_rot = apply_rope(Q, cos, sin)
print(f"RoPE: Q shape {Q.shape} → rotated {Q_rot.shape}")
# Full block
block = LLaMABlock(d_model, d_ff, n_heads)
out = block.forward(x, cos, sin)
print(f"\nLLaMA block: input {x.shape} → output {out.shape}")
# Model size comparison
print("\n--- LLaMA Model Sizes ---")
sizes = [
("LLaMA-7B", 7e9, 1.0e12, "1x A100 (with quantization)"),
("LLaMA-13B", 13e9, 1.0e12, "2x A100 or 1x A100 (quantized)"),
("LLaMA-33B", 33e9, 1.4e12, "4x A100"),
("LLaMA-65B", 65e9, 1.4e12, "8x A100"),
]
for name, params, tokens, hw in sizes:
ratio = tokens / params
print(f" {name}: {params/1e9:.0f}B params, {tokens/1e12:.1f}T tokens, "
f"{ratio:.0f} tok/param, {hw}")
Interview Importance¶
LLaMA is a top-5 most important model to know. Its architecture (RMSNorm + SwiGLU + RoPE) is the modern standard, and understanding why these choices were made demonstrates practical LLM knowledge.
Difficulty Level: ⭐⭐⭐ (Medium)¶
Interview Questions & Answers¶
Q1: Name three architectural changes in LLaMA relative to the original GPT-2/3 design.¶
Answer:
1. RMSNorm instead of LayerNorm: removes mean-centering for efficiency; same stabilization effect with fewer operations.
2. SwiGLU instead of GELU: gated activation in the FFN provides more expressivity through element-wise gating.
3. RoPE instead of learned/sinusoidal position embeddings: encodes relative positions through rotation, enabling better length generalization.
Additional changes: Pre-normalization (normalize before sublayer, not after), no bias terms in linear layers, and grouped-query attention in Llama 2.
Q2: Connect LLaMA training to Chinchilla-optimal thinking.¶
Answer: LLaMA directly applied Chinchilla's insight that smaller models trained on more data outperform larger undertrained models:
- LLaMA-7B: 1T tokens (~143 tokens/param) — intentionally overtrained beyond Chinchilla-optimal (~20 tokens/param), because smaller models are cheaper to serve
- LLaMA-65B: 1.4T tokens (~21.5 tokens/param) — close to Chinchilla-optimal
- Result: LLaMA-13B matched GPT-3 (175B) on many benchmarks, validating the Chinchilla thesis for open models
- Key insight: for inference-time efficiency, it's worth "overtraining" small models beyond the compute-optimal point
Q3: What deployment constraints push teams toward 7B/13B rather than 70B?¶
Answer:
1. GPU memory: 70B needs ~140GB in float16 (multiple high-end GPUs); 7B needs ~14GB and fits a single consumer GPU with quantization
2. Latency: larger models have higher per-token latency due to more layers and wider matrices
3. Cost: serving 70B at scale requires roughly 8–10× more GPU resources than 7B
4. On-device deployment: mobile/edge devices can run a quantized 7B but not 70B
5. Fine-tuning cost: LoRA on 7B is fast and cheap; full fine-tuning on 70B requires a cluster
6. Quality is "good enough": for many tasks (classification, retrieval, simple generation), a well-fine-tuned 7B matches 70B
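The memory figures above come from a back-of-envelope calculation (illustrative only; `weight_memory_gb` is a hypothetical helper, and real deployments also need KV-cache and activation memory on top of weights):

```python
def weight_memory_gb(n_params: float, bytes_per_param: float) -> float:
    """Weight-only memory footprint in GB: params x bytes per parameter."""
    return n_params * bytes_per_param / 1e9

# Rough weight memory at different precisions (fp16 = 2 bytes, int4 = 0.5 bytes).
for name, params in [("7B", 7e9), ("13B", 13e9), ("70B", 70e9)]:
    print(f"{name}: fp16 ~{weight_memory_gb(params, 2):.0f} GB, "
          f"int4 ~{weight_memory_gb(params, 0.5):.1f} GB")
```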
Q4: Explain RoPE and why it's better than learned positional embeddings.¶
Answer: RoPE encodes positions by rotating query and key vectors. The dot product between a rotated query at position \(m\) and a rotated key at position \(n\) depends only on their relative distance \(m-n\).
Advantages over learned embeddings:
1. Relative positions: naturally captures relative distance, not just absolute position
2. Length generalization: can extrapolate to positions beyond the training length (though with quality degradation)
3. No extra parameters: position information is encoded through rotation, not learned vectors
4. Theoretical grounding: based on a Fourier-style analysis of position-dependent signals
Connections to Other Papers¶
- Chinchilla → LLaMA uses Chinchilla-optimal data ratios
- Transformer → Base architecture with modern improvements
- GPT-2/3 → LLaMA modernizes the GPT decoder-only design
- LoRA → Commonly fine-tuned with LoRA (parameter-efficient)
- Mistral → Further innovations (GQA, sliding window) on LLaMA-style architecture
Key Takeaways for Quick Review¶
| Concept | Remember |
|---|---|
| Architecture | Decoder-only with RMSNorm, SwiGLU, RoPE |
| Sizes | 7B, 13B, 33B, 65B |
| Training tokens | 1T–1.4T (Chinchilla-optimal or beyond) |
| Key result | 13B matches GPT-3 (175B) on many benchmarks |
| RMSNorm | LayerNorm without mean subtraction |
| SwiGLU | Gated FFN activation for expressivity |
| RoPE | Rotary embeddings for relative position encoding |
| Impact | Triggered open-source LLM ecosystem |