Chain-of-Thought Prompting Elicits Reasoning in Large Language Models¶
Authors: Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le
Year: 2022 | Venue: NeurIPS
Link: arXiv:2201.11903
TL;DR¶
Chain-of-thought (CoT) prompting appends intermediate reasoning steps before the final answer — either via few-shot examples with rationales or the zero-shot trigger "Let's think step by step." Large models show dramatic gains on math, commonsense, and symbolic reasoning tasks where direct answers fail. The effect is emergent — it helps large models significantly but barely affects small ones.
Why This Paper Matters¶
CoT is the default approach for reasoning tasks with LLMs:
- Simple yet powerful: Just adding "Let's think step by step" can improve accuracy by 20-40% on math tasks
- Foundation for agents: CoT traces are the "Thought" component in ReAct agents
- Self-consistency: Sampling multiple CoT paths and voting on the answer further improves reliability
- Verifiers and RL: CoT traces can be scored and used for reward-based training
- Universal adoption: Every major LLM provider uses CoT internally and in prompts
Key Concepts Explained Simply¶
Why Direct Answers Fail¶
Consider: "Roger has 5 tennis balls. He buys 2 cans with 3 balls each. How many does he have now?"
- Direct answer: The model might say "11" (wrong — it's 5 + 2×3 = 11, actually correct, but for harder problems, models fail without intermediate steps)
- With CoT: "Roger starts with 5 balls. He buys 2 cans with 3 balls each, so 2 × 3 = 6 new balls. Total: 5 + 6 = 11."
For complex problems, the model needs to "show its work" — decompose the problem into manageable steps.
Few-Shot CoT¶
Provide examples that include reasoning steps:
Q: Tom has 3 apples. He gives away 1 and buys 5 more. How many?
A: Tom starts with 3 apples. He gives away 1, leaving 3-1=2. He buys 5 more: 2+5=7. The answer is 7.
Q: [Your question]
A:
Zero-Shot CoT¶
Just append "Let's think step by step." to the prompt — no examples needed. Surprisingly effective.
Self-Consistency¶
Instead of generating one CoT path, generate multiple (e.g., 40) with temperature > 0, extract the final answer from each, and take the majority vote. Different reasoning paths may make different errors, but the correct answer appears most often.
The Math — Explained Step by Step¶
CoT as Latent Variable¶
A chain-of-thought \(z\) is a latent intermediate computation between question \(x\) and answer \(a\):
Breaking it down:
- \(P_\theta(z \mid x)\): Probability of generating reasoning trace \(z\) given the question
- \(P_\theta(a \mid x, z)\): Probability of the final answer given both the question and the reasoning
- Marginalizing over all possible \(z\) is intractable — we sample instead
Self-Consistency¶
Sample \(K\) reasoning chains \(z_1, \ldots, z_K\) and extract answers \(a_1, \ldots, a_K\):
Majority vote over sampled answers. This approximates marginalizing over reasoning paths.
Why CoT Helps (Computational Perspective)¶
From a complexity standpoint, Transformer forward passes compute a fixed-depth circuit. CoT effectively increases the computational depth:
- Without CoT: \(O(L)\) serial computation (fixed number of layers)
- With CoT of length \(T\): \(O(L \times T)\) effective computation — each generated token adds another full forward pass
This lets the model "think for longer" on harder problems.
Python Implementation¶
import numpy as np
from collections import Counter
def zero_shot_cot(question):
"""Zero-shot CoT: append the magic phrase."""
return f"Q: {question}\nA: Let's think step by step.\n"
def few_shot_cot(question, examples):
"""
Few-shot CoT: provide examples with reasoning traces.
examples: list of (question, reasoning, answer) tuples
"""
prompt = ""
for q, reasoning, ans in examples:
prompt += f"Q: {q}\n"
prompt += f"A: {reasoning} The answer is {ans}.\n\n"
prompt += f"Q: {question}\nA:"
return prompt
def extract_answer(cot_response):
"""Extract the final answer from a CoT response."""
markers = ["the answer is", "therefore", "so the answer is", "= "]
response_lower = cot_response.lower()
for marker in markers:
if marker in response_lower:
idx = response_lower.rfind(marker) + len(marker)
answer = cot_response[idx:].strip().rstrip(".")
try:
return float(answer)
except ValueError:
return answer
return cot_response.strip().split()[-1]
def self_consistency(question, generate_fn, n_samples=10, temperature=0.7):
"""
Self-consistency: sample multiple CoT paths, majority vote on answer.
generate_fn: function that takes a prompt and returns a response
"""
answers = []
traces = []
for _ in range(n_samples):
prompt = zero_shot_cot(question)
response = generate_fn(prompt, temperature=temperature)
answer = extract_answer(response)
answers.append(answer)
traces.append(response)
# Majority vote
counter = Counter(answers)
best_answer, best_count = counter.most_common(1)[0]
confidence = best_count / n_samples
return {
"answer": best_answer,
"confidence": confidence,
"vote_distribution": dict(counter),
"n_samples": n_samples,
}
def simulate_cot_accuracy(model_size_B, use_cot=False, task="math"):
"""
Simulate the emergence of CoT: large models benefit, small ones don't.
"""
base_accuracies = {
"math": {0.3: 0.05, 1: 0.08, 7: 0.15, 13: 0.25, 70: 0.40, 175: 0.50, 540: 0.58},
"commonsense": {0.3: 0.30, 1: 0.40, 7: 0.55, 13: 0.62, 70: 0.72, 175: 0.78, 540: 0.82},
}
cot_boosts = {
"math": {0.3: 0.00, 1: 0.01, 7: 0.05, 13: 0.12, 70: 0.25, 175: 0.30, 540: 0.35},
"commonsense": {0.3: 0.00, 1: 0.01, 7: 0.03, 13: 0.08, 70: 0.12, 175: 0.14, 540: 0.15},
}
sizes = list(base_accuracies[task].keys())
closest = min(sizes, key=lambda s: abs(s - model_size_B))
base = base_accuracies[task][closest]
if use_cot:
return base + cot_boosts[task][closest]
return base
def cot_cost_analysis(prompt_tokens, answer_tokens_direct, answer_tokens_cot,
cost_per_1k_tokens=0.002):
"""Compare token costs of direct vs CoT responses."""
direct_total = prompt_tokens + answer_tokens_direct
cot_total = prompt_tokens + answer_tokens_cot
return {
"direct_tokens": direct_total,
"cot_tokens": cot_total,
"overhead_ratio": cot_total / direct_total,
"direct_cost": direct_total / 1000 * cost_per_1k_tokens,
"cot_cost": cot_total / 1000 * cost_per_1k_tokens,
}
# --- Demo ---
if __name__ == "__main__":
np.random.seed(42)
# Zero-shot CoT
print("--- Zero-Shot CoT ---")
print(zero_shot_cot("If a shirt costs $25 and is on 20% sale, what do you pay?"))
# Few-shot CoT
print("\n--- Few-Shot CoT ---")
examples = [
("A book costs $15. Tax is 10%. What's the total?",
"The book is $15. Tax is 10% of 15 = $1.50. Total = 15 + 1.50 = $16.50.",
"$16.50"),
("You have 3 bags with 4 marbles each, plus 2 loose marbles. How many total?",
"3 bags × 4 marbles = 12 marbles. Plus 2 loose = 12 + 2 = 14.",
"14"),
]
print(few_shot_cot(
"A restaurant bill is $80. You tip 15%. What's the total?",
examples
))
# Emergence across model sizes
print("\n--- CoT Emergence (Math Accuracy by Model Size) ---")
print(f"{'Model Size':>12} {'Direct':>10} {'With CoT':>10} {'Boost':>10}")
print("-" * 45)
for size in [0.3, 1, 7, 13, 70, 175, 540]:
direct = simulate_cot_accuracy(size, use_cot=False, task="math")
cot = simulate_cot_accuracy(size, use_cot=True, task="math")
boost = cot - direct
print(f"{size:>10.1f}B {direct:>10.0%} {cot:>10.0%} {'+' + f'{boost:.0%}':>10}")
# Cost analysis
print("\n--- CoT Cost Overhead ---")
cost = cot_cost_analysis(
prompt_tokens=200,
answer_tokens_direct=10,
answer_tokens_cot=150,
cost_per_1k_tokens=0.002
)
print(f"Direct: {cost['direct_tokens']} tokens (${cost['direct_cost']:.4f})")
print(f"CoT: {cost['cot_tokens']} tokens (${cost['cot_cost']:.4f})")
print(f"Overhead: {cost['overhead_ratio']:.1f}× more tokens")
# Self-consistency simulation
print("\n--- Self-Consistency Simulation ---")
def mock_generate(prompt, temperature=0.7):
correct_prob = 0.6
if np.random.random() < correct_prob:
return "Step 1: ... Step 2: ... The answer is 42."
else:
return f"Step 1: ... The answer is {np.random.choice([40, 41, 43, 44])}."
result = self_consistency("What is 6×7?", mock_generate, n_samples=20)
print(f"Answer: {result['answer']}")
print(f"Confidence: {result['confidence']:.0%}")
print(f"Vote distribution: {result['vote_distribution']}")
Interview Importance¶
CoT is a must-know for any role involving LLM applications. It's the standard approach for reasoning tasks and the foundation for agent architectures.
Difficulty Level: ⭐⭐ (Medium)¶
Interview Questions & Answers¶
Q1: When does CoT hurt?¶
Answer: 1. Simple tasks: CoT adds verbosity without accuracy gain on easy questions (e.g., sentiment classification) 2. Hallucinated reasoning: The model may generate plausible-sounding but incorrect reasoning steps, leading to wrong answers with false confidence 3. Latency and cost: CoT generates 5-20× more tokens, increasing response time and API costs 4. Small models: Models below ~10B parameters often don't benefit from CoT and may even perform worse 5. Unfaithful chains: The reasoning may not reflect the model's actual computation — it might arrive at the answer through different internal mechanisms
Q2: Explain self-consistency decoding at a high level.¶
Answer: Self-consistency samples multiple diverse reasoning paths (e.g., 40 chains) using temperature sampling, extracts the final answer from each path, and takes the majority vote. The intuition: there may be many different valid reasoning paths to the correct answer, but fewer paths to any specific wrong answer. Sampling diversity ensures different errors cancel out while correct answers reinforce each other. It typically improves accuracy by 5-15% over single-sample CoT.
Q3: How would you evaluate reasoning besides final-answer accuracy?¶
Answer: 1. Step-by-step verification: Check each reasoning step for correctness (process reward models) 2. Faithfulness: Compare the reasoning trace to the model's attention patterns — does the trace reflect actual computation? 3. Counterfactual testing: If you change a number in the problem, does the reasoning correctly propagate the change? 4. Error categorization: Classify errors as arithmetic, logical, misunderstanding, or hallucination 5. Human judgment: Experts rate reasoning quality independent of final answer correctness
Q4: How does CoT relate to verifiers and reward models?¶
Answer: CoT traces provide inspectable intermediate work that can be scored: 1. Process Reward Models (PRMs): Score each step individually, rewarding correct reasoning steps and penalizing incorrect ones 2. Outcome Reward Models (ORMs): Score only the final answer — simpler but less informative 3. RL on reasoning: Use step-level rewards to train models that produce better reasoning traces 4. Best-of-N: Generate N CoT traces, score each with a verifier, return the highest-scoring one (better than majority vote when the verifier is good)
Connections to Other Papers¶
- GPT-3 → CoT enhances GPT-3's few-shot capabilities with reasoning
- ReAct → CoT + tool use = agent reasoning
- FLAN → CoT data included in instruction tuning mixes
- InstructGPT → RLHF can be applied to CoT traces
- Toolformer → CoT reasoning decides when to call tools
Key Takeaways for Quick Review¶
| Concept | Remember |
|---|---|
| Core idea | Add reasoning steps before the final answer |
| Zero-shot | "Let's think step by step." |
| Few-shot | Provide examples with reasoning traces |
| Self-consistency | Sample multiple paths, majority vote on answer |
| Emergence | Works well only for large models (>10B) |
| Cost | 5-20× more tokens than direct answers |
| Limitation | Can hallucinate plausible but wrong reasoning |
| Foundation for | Agents (ReAct), verifiers (PRM), reasoning RL |