RoPE: Rotary Position Embedding (RoFormer)¶
Original Authors: Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, Yunfeng Liu
Timeline: RoFormer (2021) → RoPE becomes standard in LLaMA, Mistral, Qwen, and most modern LLMs
Links: arXiv:2104.09864
TL;DR¶
RoPE (Rotary Position Embedding) encodes positional information by rotating query and key vectors in 2D subspaces using position-dependent angles. Unlike additive positional encodings (sinusoidal, learned), RoPE applies position multiplicatively to the attention inputs, so attention scores depend on relative position without any learned positional parameters. This yields a tendency for attention to decay with distance and a clean basis for context-length extension (through scaling methods such as PI, NTK-aware scaling, and YaRN), and it became the default positional encoding for modern LLMs (LLaMA, Mistral, Qwen, PaLM, and many more).
Why This Paper Matters¶
RoPE is now the industry standard for positional encoding in autoregressive language models:
- Relative position by design: Dot product of rotated Q and K naturally encodes relative distance \(m - n\)
- No learned parameters: Fixed trigonometric functions, so there are no positional weights to train
- Extensible context: Can be pushed beyond the training sequence length, but only with scaling tricks (PI, NTK-aware, YaRN)
- Adoption: Used by LLaMA, Mistral, Qwen, PaLM, GLM, and most open-weight models since 2023
- Extensions: NTK-aware scaling, YaRN, and position interpolation all build on RoPE's foundation
- Interview essential: Commonly expected knowledge for LLM architecture or systems roles
Key Concepts Explained Simply¶
The Position Problem in Transformers¶
Self-attention is permutation-equivariant — shuffling input tokens and running attention produces the same result as running attention then shuffling outputs. Without position, "the cat sat on the mat" is indistinguishable from "mat the on sat cat the."
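The equivariance claim can be checked numerically. The snippet below is a minimal sketch, assuming a single unmasked attention head with random weights; the helper names `softmax` and `self_attention` are illustrative and not taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head self-attention with no positional information and no mask."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    return softmax(scores) @ V

rng = np.random.default_rng(0)
n_tokens, d_model = 6, 8
X = rng.normal(size=(n_tokens, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
perm = rng.permutation(n_tokens)

# Attention then shuffle vs. shuffle then attention
out_then_shuffle = self_attention(X, Wq, Wk, Wv)[perm]
shuffle_then_out = self_attention(X[perm], Wq, Wk, Wv)
equivariant = np.allclose(out_then_shuffle, shuffle_then_out)
print("permutation-equivariant:", equivariant)
```

Each token's output is the same regardless of where it sits in the sequence, which is exactly why some positional signal has to be injected.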
Early solutions added position vectors to token embeddings (sinusoidal, learned). RoPE takes a different approach: rotate queries and keys based on their position.
RoPE's Insight: Rotation Encodes Relative Position¶
Imagine two 2D vectors. If you rotate both by the same angle, their dot product stays the same. But if you rotate them by different angles (proportional to their positions), the dot product depends on the difference in rotation — which corresponds to relative position.
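The 2D intuition can be tried directly. This is a small illustrative sketch; the vectors, the step `theta`, and the helper `rotate_2d` are arbitrary choices made for the example.

```python
import numpy as np

def rotate_2d(v, angle):
    """Rotate a 2D vector counter-clockwise by `angle` radians."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([c * v[0] - s * v[1], s * v[0] + c * v[1]])

q = np.array([1.0, 2.0])
k = np.array([0.5, -1.5])
theta = 0.3  # assumed rotation step per position, for illustration only

# Same angle on both vectors: dot product unchanged
same_angle = rotate_2d(q, 1.0) @ rotate_2d(k, 1.0)

# Angles proportional to position: only the position difference matters
pos_7_4 = rotate_2d(q, 7 * theta) @ rotate_2d(k, 4 * theta)      # m - n = 3
pos_13_10 = rotate_2d(q, 13 * theta) @ rotate_2d(k, 10 * theta)  # m - n = 3
pos_7_2 = rotate_2d(q, 7 * theta) @ rotate_2d(k, 2 * theta)      # m - n = 5

print(same_angle, q @ k)
print(pos_7_4, pos_13_10, pos_7_2)
```

The pairs (7, 4) and (13, 10) share the relative offset 3 and give the same score, while (7, 2) gives a different one.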
RoPE extends this to high dimensions by splitting each head's \(d\) dimensions into \(d/2\) two-dimensional planes, each rotating at a different frequency.
Why Only Q and K, Not V?¶
Attention scores come from \(QK^\top\) — only queries and keys determine which tokens to attend to. Values (V) determine what information to extract once attention weights are computed. Position affects where to attend, not what to attend to, so rotating only Q and K is sufficient.
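One consequence can be illustrated with a quick experiment (a sketch for intuition, not something taken from the paper): with only Q and K rotated, shifting every position by the same offset leaves the attention output unchanged, whereas naively rotating V as well makes the output depend on absolute position. The helpers `rope` and `attend` and the `rotate_v` flag are hypothetical names used only here.

```python
import numpy as np

def rope(x, positions, base=10000.0):
    """Rotate interleaved pairs of x (N, d) by positions[:, None] * theta_i."""
    d = x.shape[-1]
    theta = base ** (-2 * np.arange(d // 2) / d)
    ang = positions[:, None] * theta[None, :]
    cos, sin = np.cos(ang), np.sin(ang)
    x0, x1 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x0 * cos - x1 * sin
    out[:, 1::2] = x0 * sin + x1 * cos
    return out

def attend(Q, K, V, positions, rotate_v=False):
    """Unmasked attention with RoPE on Q and K, optionally (naively) on V too."""
    scores = rope(Q, positions) @ rope(K, positions).T / np.sqrt(Q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    values = rope(V, positions) if rotate_v else V
    return w @ values

rng = np.random.default_rng(0)
N, d = 5, 8
Q, K, V = (rng.normal(size=(N, d)) for _ in range(3))
pos = np.arange(N, dtype=float)
shift = 100.0  # move the whole window 100 tokens to the right

qk_only_invariant = np.allclose(attend(Q, K, V, pos), attend(Q, K, V, pos + shift))
with_v_invariant = np.allclose(
    attend(Q, K, V, pos, rotate_v=True), attend(Q, K, V, pos + shift, rotate_v=True)
)
print("Q/K rotated only, shift-invariant:", qk_only_invariant)
print("V rotated too, shift-invariant:   ", with_v_invariant)
```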
The Math — Explained Step by Step¶
RoPE Formulation¶
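For a token at position \(m\) with query vector \(q_m \in \mathbb{R}^d\) (where \(d\) is even), RoPE applies a rotation in each 2D subspace.
For frequency index \(i \in \{0, 1, \ldots, d/2 - 1\}\), define the base frequency:
\[\theta_i = 10000^{-2i/d}\]
The rotation angle for position \(m\) in subspace \(i\) is \(m \theta_i\). For each pair of dimensions \((2i, 2i+1)\):
\[\begin{pmatrix} \tilde{q}_m^{(2i)} \\ \tilde{q}_m^{(2i+1)} \end{pmatrix} = \begin{pmatrix} \cos m\theta_i & -\sin m\theta_i \\ \sin m\theta_i & \cos m\theta_i \end{pmatrix} \begin{pmatrix} q_m^{(2i)} \\ q_m^{(2i+1)} \end{pmatrix}\]
Similarly for keys \(k_n\), using the angle \(n \theta_i\).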
Dot Product Reveals Relative Position¶
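After rotation, the dot product between the rotated query \(\tilde{q}_m\) at position \(m\) and the rotated key \(\tilde{k}_n\) at position \(n\) is:
\[\tilde{q}_m^\top \tilde{k}_n = \sum_{i=0}^{d/2-1} \Big[ \big(q_m^{(2i)} k_n^{(2i)} + q_m^{(2i+1)} k_n^{(2i+1)}\big) \cos\big((m-n)\theta_i\big) + \big(q_m^{(2i)} k_n^{(2i+1)} - q_m^{(2i+1)} k_n^{(2i)}\big) \sin\big((m-n)\theta_i\big) \Big]\]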
The key insight: the angle depends on \(m - n\), the relative position difference, even though we applied absolute position rotations. This naturally encodes relative position without computing it explicitly.
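The reason the absolute positions cancel can be seen in a single 2D subspace. As a short worked step, write \(R(\varphi)\) for the 2D rotation matrix by angle \(\varphi\) and let \(q, k \in \mathbb{R}^2\) be one query/key pair with frequency \(\theta\):

\[R(\varphi) = \begin{pmatrix} \cos\varphi & -\sin\varphi \\ \sin\varphi & \cos\varphi \end{pmatrix}, \qquad R(\varphi)^\top = R(-\varphi), \qquad R(a)\,R(b) = R(a + b)\]

\[\big(R(m\theta)\,q\big)^\top \big(R(n\theta)\,k\big) = q^\top R(m\theta)^\top R(n\theta)\,k = q^\top R(-m\theta)\,R(n\theta)\,k = q^\top R\big((n - m)\theta\big)\,k\]

Only the difference \(n - m\) survives. Summing this identity over all \(d/2\) subspaces, each with its own \(\theta_i\), gives the full dot product.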
Frequency Decay and Locality Bias¶
The geometric progression \(\theta_i = 10000^{-2i/d}\) means:
- Low indices (\(i\) small): large \(\theta_i\), fast rotation → capture short-range dependencies
- High indices (\(i\) large): small \(\theta_i\), slow rotation → capture long-range dependencies
This creates a natural decay: tokens far apart have larger angular differences, typically reducing cosine similarity and thus attention weight. This gives RoPE an implicit locality bias without any hard constraints.
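To make the frequency spread and the decay concrete, the sketch below computes each pair's wavelength (tokens per full rotation) and the RoPE score for the special case \(q = k = \mathbf{1}\), averaged over a near and a far range of distances. The choice of all-ones vectors, `d = 64`, and the distance ranges are assumptions made purely for illustration; real queries and keys will not decay this cleanly.

```python
import numpy as np

d, base = 64, 10000.0
theta = base ** (-2 * np.arange(d // 2) / d)   # theta_i, fastest pair first
wavelength = 2 * np.pi / theta                 # tokens per full rotation of pair i

def all_ones_score(distance: float) -> float:
    """Unscaled RoPE score for q = k = all-ones at relative distance m - n.

    Each 2D pair contributes 2 * cos(distance * theta_i).
    """
    return float(2.0 * np.cos(distance * theta).sum())

near = np.mean([all_ones_score(r) for r in range(1, 11)])
far = np.mean([all_ones_score(r) for r in range(500, 1001)])

print(f"shortest wavelength: {wavelength[0]:.2f} tokens")
print(f"longest wavelength:  {wavelength[-1]:.0f} tokens")
print(f"mean score, distance 1-10:     {near:.2f}")
print(f"mean score, distance 500-1000: {far:.2f}")
```

The score is not monotone in distance (it oscillates), but its average level drops as more pairs rotate out of phase.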
Python Implementation¶
```python
import numpy as np
import torch
import torch.nn as nn


def build_rope_frequencies(d: int, base: int = 10000) -> np.ndarray:
    """
    Build frequency array for RoPE.

    theta_i = base^(-2i/d) for i = 0, 1, ..., d/2-1

    Args:
        d: head dimension (must be even)
        base: base of geometric progression (default 10000)
    Returns:
        freqs: array of shape (d//2,) with frequencies theta_i
    """
    assert d % 2 == 0, "RoPE requires even head dimension"
    i = np.arange(d // 2)
    freqs = base ** (-2 * i / d)
    return freqs
```
```python
def apply_rope_2d(x: np.ndarray, angles: np.ndarray) -> np.ndarray:
    """
    Rotate each interleaved pair (x[2i], x[2i+1]) of x by angles[i].

    The caller supplies the final rotation angles, i.e. position * theta_i
    (for example `freqs * m` for a token at position m).

    Args:
        x: array of shape (..., d) where d is even
        angles: array of shape (d//2,) with one rotation angle per pair
    Returns:
        x_rotated: array of shape (..., d) with rotations applied
    """
    # Reshape into pairs: (..., d//2, 2)
    x_reshaped = x.reshape(*x.shape[:-1], -1, 2)

    # One angle per 2D subspace, broadcast over the leading dimensions
    angles_expanded = angles[np.newaxis, :]  # (1, d//2)
    cos_vals = np.cos(angles_expanded)       # (1, d//2)
    sin_vals = np.sin(angles_expanded)       # (1, d//2)

    # Apply rotation in each 2D subspace
    # For pair (x_0, x_1): rotate by angle theta
    #   x_0' = x_0 * cos(theta) - x_1 * sin(theta)
    #   x_1' = x_0 * sin(theta) + x_1 * cos(theta)
    x0 = x_reshaped[..., 0]  # (..., d//2)
    x1 = x_reshaped[..., 1]  # (..., d//2)
    x0_rot = x0 * cos_vals - x1 * sin_vals
    x1_rot = x0 * sin_vals + x1 * cos_vals

    # Stack back: (..., d//2, 2) -> (..., d)
    x_rotated = np.stack([x0_rot, x1_rot], axis=-1).reshape(x.shape)
    return x_rotated
```
```python
def rope_attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray,
                   base: int = 10000) -> np.ndarray:
    """
    Full causal attention with RoPE applied to Q and K.

    Args:
        Q: query array (B, H, N, d_k)
        K: key array (B, H, N, d_k)
        V: value array (B, H, N, d_k)
        base: base for frequency computation
    Returns:
        O: attention output (B, H, N, d_k)
    """
    B, H, N, d_k = Q.shape

    # Build frequencies and per-position rotation angles
    freqs = build_rope_frequencies(d_k, base)
    positions = np.arange(N)
    angles = np.outer(positions, freqs)  # (N, d_k//2)
    cos_vals = np.cos(angles)            # (N, d_k//2)
    sin_vals = np.sin(angles)            # (N, d_k//2)

    # Reshape Q, K into pairs: (B, H, N, d_k//2, 2)
    Q_reshaped = Q.reshape(B, H, N, -1, 2)
    K_reshaped = K.reshape(B, H, N, -1, 2)

    # Broadcast cos/sin over batch and heads
    cos_B = cos_vals[np.newaxis, np.newaxis, :, :]  # (1, 1, N, d_k//2)
    sin_B = sin_vals[np.newaxis, np.newaxis, :, :]

    # Apply rotation to Q
    Q0, Q1 = Q_reshaped[..., 0], Q_reshaped[..., 1]
    Q0_rot = Q0 * cos_B - Q1 * sin_B
    Q1_rot = Q0 * sin_B + Q1 * cos_B
    Q_rot = np.stack([Q0_rot, Q1_rot], axis=-1).reshape(B, H, N, d_k)

    # Apply rotation to K
    K0, K1 = K_reshaped[..., 0], K_reshaped[..., 1]
    K0_rot = K0 * cos_B - K1 * sin_B
    K1_rot = K0 * sin_B + K1 * cos_B
    K_rot = np.stack([K0_rot, K1_rot], axis=-1).reshape(B, H, N, d_k)

    # Compute attention with rotated Q and K
    d_k_sqrt = np.sqrt(d_k)
    scores = (Q_rot @ K_rot.transpose(0, 1, 3, 2)) / d_k_sqrt

    # Causal mask: -inf strictly above the diagonal
    mask = np.triu(np.full((N, N), -np.inf), k=1)
    scores = scores + mask[np.newaxis, np.newaxis, :, :]

    # Softmax (numerically stable)
    scores_max = np.max(scores, axis=-1, keepdims=True)
    scores_exp = np.exp(scores - scores_max)
    attn = scores_exp / np.sum(scores_exp, axis=-1, keepdims=True)

    # Output: V is not rotated
    O = attn @ V
    return O
```
```python
class RotaryPE(nn.Module):
    """PyTorch implementation of RoPE for queries and keys."""

    def __init__(self, d_k: int, max_len: int = 8192, base: int = 10000):
        super().__init__()
        if d_k % 2 != 0:
            raise ValueError("RoPE requires even head dimension")
        self.d_k = d_k
        self.max_len = max_len

        # Precompute frequencies: theta_i = base^(-2i/d)
        i = torch.arange(d_k // 2, dtype=torch.float32)
        freqs = base ** (-2 * i / d_k)

        # Precompute angles for all positions: (max_len, d_k//2)
        positions = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)  # (max_len, 1)
        angles = positions * freqs.unsqueeze(0)  # (max_len, d_k//2)

        # Register as buffers (not parameters, but move with model)
        self.register_buffer("cos", torch.cos(angles))
        self.register_buffer("sin", torch.sin(angles))

    def forward(self, q: torch.Tensor, k: torch.Tensor,
                offset: int = 0) -> tuple[torch.Tensor, torch.Tensor]:
        """
        Apply RoPE to Q and K.

        Args:
            q: queries (B, H, T, d_k)
            k: keys (B, H, T, d_k)
            offset: position offset for incremental decoding
        Returns:
            q_rot, k_rot: rotated queries and keys
        """
        T = q.shape[2]
        if offset + T > self.max_len:
            raise ValueError(
                f"positions {offset}..{offset + T - 1} exceed max_len={self.max_len}"
            )

        # Get cos/sin for this position range
        cos = self.cos[offset:offset + T, :]  # (T, d_k//2)
        sin = self.sin[offset:offset + T, :]

        # Reshape for broadcasting
        cos = cos.unsqueeze(0).unsqueeze(0)  # (1, 1, T, d_k//2)
        sin = sin.unsqueeze(0).unsqueeze(0)

        # Apply rotation using interleaved pairs
        q_rot = self._rotate_pairs(q, cos, sin)
        k_rot = self._rotate_pairs(k, cos, sin)
        return q_rot, k_rot

    def _rotate_pairs(self, x: torch.Tensor, cos: torch.Tensor,
                      sin: torch.Tensor) -> torch.Tensor:
        """Apply 2D rotation to interleaved pairs in x."""
        # Split into even and odd dimensions
        x_even = x[..., ::2]   # (B, H, T, d_k//2)
        x_odd = x[..., 1::2]   # (B, H, T, d_k//2)

        # Rotate
        x_even_rot = x_even * cos - x_odd * sin
        x_odd_rot = x_even * sin + x_odd * cos

        # Interleave back
        x_rot = torch.stack([x_even_rot, x_odd_rot], dim=-1)
        x_rot = x_rot.reshape(x.shape)
        return x_rot
```
```python
def verify_relative_position_property(d_k: int = 64, base: int = 10000):
    """
    Verify that RoPE dot products depend on relative position (m - n).
    """
    freqs = build_rope_frequencies(d_k, base)

    # Create random Q and K vectors
    np.random.seed(42)
    q_vec = np.random.randn(d_k)
    k_vec = np.random.randn(d_k)

    # Test: dot product at positions (m, n) should equal dot product at (m+delta, n+delta)
    m, n = 10, 5
    delta = 7

    # Rotate at (m, n)
    q_m = apply_rope_2d(q_vec[np.newaxis, :], freqs * m)[0]
    k_n = apply_rope_2d(k_vec[np.newaxis, :], freqs * n)[0]
    dot_mn = np.dot(q_m, k_n)

    # Rotate at (m+delta, n+delta)
    q_m_delta = apply_rope_2d(q_vec[np.newaxis, :], freqs * (m + delta))[0]
    k_n_delta = apply_rope_2d(k_vec[np.newaxis, :], freqs * (n + delta))[0]
    dot_m_delta_n_delta = np.dot(q_m_delta, k_n_delta)

    print(f"Dot product at positions ({m}, {n}): {dot_mn:.6f}")
    print(f"Dot product at positions ({m+delta}, {n+delta}): {dot_m_delta_n_delta:.6f}")
    print(f"Difference (should be ~0): {abs(dot_mn - dot_m_delta_n_delta):.2e}")

    # A second absolute pair with the same relative distance: m2 - n2 = 5 = m - n
    m2, n2 = 15, 10
    q_m2 = apply_rope_2d(q_vec[np.newaxis, :], freqs * m2)[0]
    k_n2 = apply_rope_2d(k_vec[np.newaxis, :], freqs * n2)[0]
    dot_m2_n2 = np.dot(q_m2, k_n2)

    print(f"\nDot product at positions ({m2}, {n2}): {dot_m2_n2:.6f}")
    print(f"Relative position (m-n): {m-n}, (m2-n2): {m2-n2}")
    print(f"Difference (should be ~0): {abs(dot_mn - dot_m2_n2):.2e}")
```
```python
def compare_extrapolation_methods():
    """
    Demonstrate different RoPE extrapolation techniques.
    """
    d_k = 64
    base = 10000
    train_len = 2048
    test_len = 8192
    freqs = build_rope_frequencies(d_k, base)

    print("=" * 70)
    print("RoPE Extrapolation Methods")
    print("=" * 70)

    # 1. Original RoPE (no modification); freqs[0] is the fastest-rotating pair
    print("\n1. Original RoPE:")
    print(f" Train length: {train_len}, Test length: {test_len}")
    print(f" Max train angle (pos {train_len}): {train_len * freqs[0]:.2f} rad")
    print(f" Max test angle (pos {test_len}): {test_len * freqs[0]:.2f} rad")
    print(" Issue: unwrapped test angles reach 4x the largest training angle")

    # 2. Position Interpolation (PI)
    scale = train_len / test_len
    print("\n2. Position Interpolation:")
    print(f" Scale positions by {scale:.3f}")
    print(f" Effective test angles: {test_len * scale * freqs[0]:.2f} rad")
    print(" Stays in training distribution, but compresses all frequencies")

    # 3. NTK-aware Scaling: raise the base instead of rescaling positions
    ntk_factor = (test_len / train_len) ** (d_k / (d_k - 2))
    effective_base = base * ntk_factor
    print("\n3. NTK-aware Scaling:")
    print(f" Effective base: {effective_base:.0f} (original: {base})")
    print(" Slows low-frequency pairs the most, leaves the fastest pair unchanged")
    print(" Less aggressive than PI on high frequencies")

    # 4. YaRN (Yet another RoPE extensioN)
    print("\n4. YaRN:")
    print(" Combines PI with selective frequency targeting")
    print(" Some frequency bands interpolated more aggressively")
    print(" Preserves short-context behavior while enabling long context")
```
```python
# --- Demo ---
if __name__ == "__main__":
    print("=" * 70)
    print("RoPE: Rotary Position Embedding")
    print("=" * 70)

    # 1. Verify relative position property
    print("\n--- Verifying Relative Position Property ---")
    verify_relative_position_property(d_k=64)

    # 2. Compare extrapolation methods
    print("\n" + "=" * 70)
    compare_extrapolation_methods()

    # 3. PyTorch RoPE test
    print("\n--- PyTorch RoPE Shape Test ---")
    rope = RotaryPE(d_k=64, max_len=4096)
    B, H, T = 2, 8, 128
    q = torch.randn(B, H, T, 64)
    k = torch.randn(B, H, T, 64)
    q_rot, k_rot = rope(q, k)
    print(f"Q shape: {q.shape} -> Q_rot shape: {q_rot.shape}")
    print(f"K shape: {k.shape} -> K_rot shape: {k_rot.shape}")
    print(f"Q and Q_rot have same shape: {q.shape == q_rot.shape}")

    # 4. Incremental decoding test
    print("\n--- Incremental Decoding Test ---")
    # First pass: positions 0-127
    q1, k1 = rope(q, k, offset=0)
    # Second pass: positions 128-255 (continuation)
    q2, k2 = rope(q, k, offset=128)
    print("First pass uses positions 0-127, second pass uses 128-255")
    # Rotating only the tail with an offset must match the tail of the full pass
    q_tail, _ = rope(q[:, :, 64:], k[:, :, 64:], offset=64)
    print(f"Offset pass matches full pass: {torch.allclose(q_tail, q1[:, :, 64:])}")
```
Length Extrapolation: The RoPE Scaling Problem¶
The Core Issue¶
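A RoPE model is trained on sequences up to length \(L_{\text{train}}\). At inference, if you evaluate at position \(L_{\text{test}} \gg L_{\text{train}}\):
- Angles \(m \theta_i\) exceed the range seen during training
- Low-frequency pairs (large \(i\)) that complete less than one full turn within \(L_{\text{train}}\) reach rotation angles the model never saw, creating out-of-distribution attention patterns; high-frequency pairs have already wrapped around many times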
- Model quality degrades
Solution 1: Position Interpolation (PI)¶
Scale all positions down to fit in the training range:
\[m' = m \cdot \frac{L_{\text{train}}}{L_{\text{test}}}\]
Pros: Simple, keeps angles in-distribution.
Cons: Compresses all frequencies equally, losing resolution on short-range patterns the model learned.
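As a rough numerical check of position interpolation, the sketch below rescales positions by \(L_{\text{train}} / L_{\text{test}}\) and compares the resulting angles with the training range. The lengths 2048 and 8192 and the helper `rope_angles` are assumptions chosen for the example.

```python
import numpy as np

def rope_angles(positions, d, base=10000.0):
    """Angles position * theta_i, shape (len(positions), d // 2)."""
    theta = base ** (-2 * np.arange(d // 2) / d)
    return np.outer(positions, theta)

L_train, L_test, d = 2048, 8192, 64
theta = base_theta = 10000.0 ** (-2 * np.arange(d // 2) / d)
positions = np.arange(L_test)
scale = L_train / L_test

plain = rope_angles(positions, d)                # no scaling
interpolated = rope_angles(positions * scale, d) # PI: fractional positions
train_bound = L_train * theta                    # angle range covered in training

print("plain max angle / train bound:       ", (plain.max(axis=0) / train_bound)[0])
print("interpolated max angle / train bound:", (interpolated.max(axis=0) / train_bound)[0])
# Cost: neighbouring tokens are now only `scale` of a step apart in every pair
print("adjacent-token angle step shrinks by:", scale)
```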
Solution 2: NTK-Aware Scaling¶
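Adjust the base \(10000\) to \(10000 \cdot \alpha\) where \(\alpha\) depends on the extrapolation ratio:
\[\alpha = \left(\frac{L_{\text{test}}}{L_{\text{train}}}\right)^{d/(d-2)}, \qquad \theta_i' = (10000 \cdot \alpha)^{-2i/d}\]
The name comes from Neural Tangent Kernel (NTK) theory, which suggests that networks struggle to learn high-frequency detail when their input encoding lacks high-frequency components. Instead of compressing every frequency equally as PI does, raising the base leaves the fastest-rotating pairs almost unchanged and slows the slowest pair by the full extension ratio.
Pros: Preserves high-frequency behavior better than PI.
Cons: Still a heuristic; requires tuning the scaling factor.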
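The effect of changing the base can be inspected per frequency. This sketch uses the same exponent \(d/(d-2)\) as the `compare_extrapolation_methods` code earlier in this document; `rope_freqs` and the 2048 → 8192 ratio are illustrative assumptions.

```python
import numpy as np

def rope_freqs(d, base):
    """theta_i = base^(-2i/d) for i = 0 .. d/2 - 1."""
    return base ** (-2 * np.arange(d // 2) / d)

d, base = 64, 10000.0
s = 8192 / 2048                      # extension ratio L_test / L_train
alpha = s ** (d / (d - 2))           # NTK-aware base multiplier

orig = rope_freqs(d, base)
ntk = rope_freqs(d, base * alpha)
ratio = ntk / orig                   # per-pair slowdown; PI would be 1/s everywhere

print(f"fastest pair ratio: {ratio[0]:.3f}  (unchanged)")
print(f"slowest pair ratio: {ratio[-1]:.3f}  (equals 1/s = {1 / s:.3f})")
```

With this exponent the slowest pair is interpolated by exactly \(1/s\), matching PI, while the fastest pair is left untouched.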
Solution 3: YaRN (Yet another RoPE extensioN)¶
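YaRN combines position interpolation with frequency-aware blending. Schematically, with extension ratio \(s = L_{\text{test}} / L_{\text{train}}\) and a frequency threshold \(\tau\):
\[\theta_i' = \begin{cases} \theta_i & \text{if } \theta_i > \tau \text{ (high frequency)} \\ \theta_i / s & \text{otherwise (low frequency)} \end{cases}\]
In practice the transition between the two regimes is blended smoothly rather than cut off sharply.
Low frequencies (long-range) are scaled aggressively; high frequencies (short-range) are left unchanged to preserve local patterns.
Pros: Strong reported results for large extension ratios (4K → 128K).
Cons: More complex; requires choosing threshold \(\tau\).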
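The frequency-dependent blending can be sketched as a linear ramp over the number of full rotations each pair completes within the training window. This is a simplified illustration rather than a reference implementation: the function name `yarn_like_freqs`, the ramp bounds `alpha=1.0` and `beta=32.0`, and the lengths are assumptions for the example, and the attention-temperature adjustment that YaRN also applies is left out.

```python
import numpy as np

def yarn_like_freqs(d, L_train, s, base=10000.0, alpha=1.0, beta=32.0):
    """Blend interpolated and original frequencies per pair (simplified sketch).

    gamma = 0 -> pair is fully interpolated (theta / s), used for slow pairs
    gamma = 1 -> pair is left unchanged, used for fast pairs
    """
    theta = base ** (-2 * np.arange(d // 2) / d)
    rotations = L_train * theta / (2 * np.pi)  # full turns within the training window
    gamma = np.clip((rotations - alpha) / (beta - alpha), 0.0, 1.0)
    new_theta = (1 - gamma) * theta / s + gamma * theta
    return new_theta, theta, gamma

d, L_train, s = 128, 4096, 32.0   # assumed: extend 4K -> 128K
new_theta, theta, gamma = yarn_like_freqs(d, L_train, s)

print("pairs left unchanged:     ", int((gamma == 1.0).sum()))
print("pairs fully interpolated: ", int((gamma == 0.0).sum()))
print("pairs partially blended:  ", int(((gamma > 0) & (gamma < 1)).sum()))
```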
Interview Importance¶
RoPE is a common architecture topic in LLM interviews. Roles involving model architecture, training, or serving typically expect you to understand positional encoding choices.
Difficulty Level: ⭐⭐⭐⭐ (Hard)¶
Interview Questions & Answers¶
Q1: Why did RoPE replace sinusoidal and learned positional encodings?¶
Answer: Three reasons:
1. Relative position by construction: Sinusoidal encodings are added to the embeddings, so relative position is only implicitly available to attention; with RoPE the dot product \(q_m^\top k_n\) depends on position only through \(m - n\).
2. Zero parameters: Learned positions add \(L_{\max} \times d\) parameters and are undefined beyond \(L_{\max}\). RoPE uses fixed trig functions, so there is nothing positional to train or overfit.
3. Better extrapolation: With scaling tricks (PI, NTK-aware, YaRN), often combined with a short fine-tune, RoPE models have been extended to roughly 4-32× their training length. Learned positions have no principled behavior beyond \(L_{\max}\).
Q2: Explain why RoPE only rotates Q and K, not V.¶
Answer: Attention computes \(\text{softmax}(QK^\top / \sqrt{d_k}) V\). The \(QK^\top\) term determines where to attend (which positions are relevant) — this is where position matters. The \(V\) term determines what information to extract once attention weights are computed. Position affects the "where" (relative distance between tokens), not the "what" (token content). Rotating only Q and K makes attention scores position-aware while keeping values as pure content representations.
Q3: What happens if you train on 4K tokens and evaluate on 32K without any RoPE scaling?¶
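Answer: The unwrapped rotation angles at position 32K are 8× larger than the largest angle seen during training (position 4K). The pairs most affected are the slow-rotating ones (\(\theta_i\) for large \(i\)): they complete less than one full turn within 4K tokens, so beyond 4K they reach angles the model never saw. Fast-rotating pairs have already wrapped around many times and stay in-distribution. As a result:
- Attention logits between distant tokens fall outside the trained range and behave erratically
- The attention distribution becomes unpredictable because the model never saw this geometry
- Perplexity degrades significantly, and generated text quality drops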
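To put numbers on the 4K versus 32K gap, the sketch below computes the largest angle each pair reaches at both lengths and counts the pairs that never complete a full turn during training. The head dimension `d = 128` and base 10000 are assumptions chosen for illustration.

```python
import numpy as np

d, base = 128, 10000.0            # assumed head dimension and base
L_train, L_test = 4096, 32768

theta = base ** (-2 * np.arange(d // 2) / d)
max_train_angle = (L_train - 1) * theta
max_test_angle = (L_test - 1) * theta

# Pairs whose angle never wraps past 2*pi within the training window
partial = max_train_angle < 2 * np.pi

print(f"angle growth factor: {max_test_angle[0] / max_train_angle[0]:.2f}")
print(f"pairs with < 1 full turn in training: {int(partial.sum())} of {d // 2}")
print(f"slowest pair: {max_train_angle[-1]:.3f} rad in training, "
      f"{max_test_angle[-1]:.3f} rad at test length")
```

The fastest pair has covered every phase many times over by position 4K, so the newly reached angles for the slow pairs are where the unseen geometry comes from.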
Q4: How does position interpolation enable longer context?¶
Answer: Position interpolation rescales positions: instead of feeding position \(m\) into RoPE, feed \(m \cdot (L_{\text{train}} / L_{\text{test}})\). This compresses the 32K positions into the 4K angle range the model was trained on. The angles stay in-distribution, so the model's learned attention patterns still apply. The trade-off is reduced resolution — adjacent positions are closer in angle space, potentially blurring fine-grained position distinctions.
Q5: What is the relationship between RoPE and relative position biases like T5's?¶
Answer: Both encode relative position, but differently:
- T5 relative bias: Adds a learned scalar bias \(b_{m-n}\) to attention logits based on relative distance. Requires a lookup table of biases.
- RoPE: Multiplies Q and K by rotation matrices. The relative position emerges from the geometry of rotated vectors, with no learned table needed.
RoPE is more parameter-efficient and integrates directly into the attention computation rather than adding a post-hoc bias.
Connections to Other Papers¶
- Transformer → RoPE replaces the original sinusoidal positional encoding
- LLaMA → Adopted RoPE as the standard positional encoding for open-weight models
- Mistral → Uses RoPE with sliding window attention
- DeepSeek-V2 → Decoupled RoPE pathway separates content and position in MLA
- Qwen2.5 → Uses RoPE as part of the modern decoder stack
- PaLM → Also adopted RoPE for position encoding
- YaRN / NTK scaling → Extensions enabling RoPE extrapolation to 128K+ context
Key Takeaways for Quick Review¶
| Concept | Remember |
|---|---|
| Core idea | Rotate Q and K by position-dependent angles in 2D subspaces |
| Key property | Dot product depends on relative position \(m - n\) |
| Frequency | \(\theta_i = 10000^{-2i/d}\); geometric progression |
| Parameters | None — fixed trigonometric functions |
| Rotation scope | Only Q and K, not V |
| Extrapolation | Works well with scaling: PI, NTK-aware, YaRN |
| Adoption | LLaMA, Mistral, Qwen, PaLM, GLM, virtually all modern LLMs |
| Advantage over learned PE | No parameters, better extrapolation, no fixed table of position indices to run out of |
| Advantage over sinusoidal | Exact relative position, not approximate |
| Interview angle | Essential architecture knowledge; expect "why RoPE?" questions |