Consensus Algorithms (Raft & Paxos)¶

What Is Consensus?¶

Consensus means getting several independent nodes to agree on a single value — and to stay agreed even when some nodes crash, the network drops or delays messages, and no node can tell the difference between "that node is slow" and "that node is dead."

In Designing Data-Intensive Applications (Chapter 9), Martin Kleppmann describes the goal informally as: "get several nodes to agree on something." That sounds trivial — until you remember that in a distributed system there is no shared clock and no reliable way to know whether another node is alive. A consensus algorithm is a protocol that guarantees the cluster reaches one decision that all surviving nodes honor, so the system behaves as if it had a single, reliable brain.

A correct consensus algorithm must satisfy four properties (DDIA, Ch. 9):

Property	Definition
Uniform agreement	No two nodes decide differently (this is the core safety guarantee)
Integrity	No node decides twice
Validity	The decided value was actually proposed by some node
Termination	Every node that does not crash eventually decides (the liveness guarantee)

The first three say nothing bad happens — the cluster never disagrees with itself. The last says something good eventually happens — the cluster doesn't get stuck. Termination requires a majority (quorum) of nodes to be alive: a cluster of 2f + 1 nodes can tolerate f failures.

Why Do We Need Consensus?¶

Many of the hardest problems in distributed systems turn out to be the same problem wearing different clothes. Kleppmann shows the following are all reducible to consensus — solve one and you've solved them all:

Leader election. A replicated database needs exactly one leader to accept writes. If a network partition makes two nodes both believe they are leader, you get split brain — conflicting writes and corrupted data. Consensus ensures the cluster agrees on a single leader.
Linearizable / atomic operations. A distributed lock, a unique-username registration, or a compare-and-set all require that, among concurrent attempts, exactly one succeeds and everyone agrees on which one.
Atomic commit. When a transaction spans nodes, all of them must either commit or abort — they must agree on the outcome (see Distributed Transactions).
Replicated state machines (total order broadcast). If every replica applies the same operations in the same order, they all end up in the same state. Agreeing on that single order is, again, consensus.

The reason this is genuinely hard — and not just "take a vote" — is the FLP impossibility result:

FLP Impossibility (Fischer, Lynch, Paterson, 1985)

In a purely asynchronous system (no bounds on message delay) where even one node may crash, no deterministic algorithm can guarantee consensus will always terminate. There is no way to reliably distinguish a crashed node from a slow one, so an algorithm can be forced to wait forever.

Real systems escape this not by breaking the theorem but by weakening the assumptions: they use timeouts and randomization to make progress in practice, preserving safety always and liveness as long as the network is "well-behaved enough." This is exactly what Raft's randomized election timeouts and Paxos's leader-based optimization do.

This is why consensus is the theoretical bedrock of every strongly consistent system — Spanner, CockroachDB, etcd, ZooKeeper, Kafka's controller. They don't avoid the hard problem; they pay for a correct, off-the-shelf solution to it.

Why Consensus Matters in Interviews¶

At L6, interviewers expect you to articulate when and why you need consensus, not just recite the algorithm.

Note

You will rarely implement Raft or Paxos from scratch. But you must understand their properties to make architectural decisions: "Should this service use a Raft-based store (etcd) or an eventually consistent one (Cassandra)?"

Paxos¶

Overview¶

Paxos (Lamport, 1989) is the foundational consensus protocol. It solves single-value consensus (single-decree Paxos) and can be extended to a replicated log (Multi-Paxos).

Roles¶

Role	Responsibility
Proposer	Proposes a value; drives the protocol
Acceptor	Votes on proposals; stores accepted values durably
Learner	Learns the decided value (often the same nodes)

Single-Decree Paxos (Two Phases)¶

Phase 1: Prepare

Proposer picks a unique, monotonically increasing proposal number n
Proposer sends Prepare(n) to a majority of acceptors
Each acceptor: if n is the highest it has seen, replies Promise(n, last_accepted_value) and promises not to accept proposals < n

Phase 2: Accept

If the proposer receives promises from a majority:
If any acceptor already accepted a value, the proposer must propose that value (not its own)
Otherwise, it proposes its own value
Proposer sends Accept(n, value) to the majority
Acceptors accept if they have not promised a higher number
Once a majority accepts, the value is chosen

sequenceDiagram
  participant P as Proposer
  participant A1 as Acceptor 1
  participant A2 as Acceptor 2
  participant A3 as Acceptor 3

  P->>A1: "Prepare(n=1)"
  P->>A2: "Prepare(n=1)"
  P->>A3: "Prepare(n=1)"
  A1-->>P: "Promise(n=1, none)"
  A2-->>P: "Promise(n=1, none)"
  Note over P: Majority promised
  P->>A1: "Accept(n=1, value=X)"
  P->>A2: "Accept(n=1, value=X)"
  A1-->>P: Accepted
  A2-->>P: Accepted
  Note over P: Value X is chosen

Why Paxos Is Hard in Practice¶

Challenge	Description
Dueling proposers	Two proposers keep preempting each other with higher proposal numbers; livelock
Multi-Paxos complexity	Extending to a log requires leader election, log compaction, and membership changes
Implementation subtlety	Edge cases in crash recovery, disk flushing, and message reordering

Tip

Staff-level insight: Paxos is theoretically foundational but notoriously difficult to implement correctly. This is precisely why Raft was invented—to provide an equivalent algorithm that is easier to understand, implement, and teach.

Raft¶

Overview¶

Raft (Ongaro & Ousterhout, 2014) solves the same problem as Multi-Paxos but is designed for understandability. It decomposes consensus into three subproblems: leader election, log replication, and safety.

Roles¶

Role	Description
Leader	Handles all client requests; replicates log entries to followers
Follower	Passive; responds to leader's append entries and vote requests
Candidate	A follower that has timed out and is seeking election

Leader Election¶

Each node starts as a follower with a randomized election timeout (e.g., 150–300ms)
If a follower receives no heartbeat before its timeout, it becomes a candidate
The candidate increments its term, votes for itself, and requests votes from all other nodes
A node grants its vote if: (a) it has not voted in this term, and (b) the candidate's log is at least as up-to-date
A candidate that receives votes from a majority becomes leader
The leader sends periodic heartbeats to maintain authority

stateDiagram-v2
  [*] --> Follower
  Follower --> Candidate: Election timeout
  Candidate --> Leader: Receives majority votes
  Candidate --> Follower: Discovers higher term
  Leader --> Follower: Discovers higher term
  Candidate --> Candidate: Election timeout (split vote)

Note

Randomized election timeouts prevent perpetual split votes. This is Raft's practical solution to the FLP impossibility result.

Log Replication¶

Client sends a command to the leader
Leader appends the command to its local log
Leader sends AppendEntries RPCs to all followers
Once a majority acknowledges, the leader commits the entry
Leader applies the committed entry to its state machine and responds to the client
Followers apply committed entries to their state machines

sequenceDiagram
  participant C as Client
  participant L as Leader
  participant F1 as Follower 1
  participant F2 as Follower 2

  C->>L: Command X
  L->>L: "Append to log (index 5, term 3)"
  L->>F1: "AppendEntries(index 5, term 3, X)"
  L->>F2: "AppendEntries(index 5, term 3, X)"
  F1-->>L: Success
  F2-->>L: Success
  Note over L: Majority replicated; commit index 5
  L->>L: Apply X to state machine
  L-->>C: Result

Safety Properties¶

Property	Guarantee
Election safety	At most one leader per term
Leader append-only	Leader never overwrites or deletes its own log entries
Log matching	If two logs contain an entry with the same index and term, all preceding entries are identical
Leader completeness	If a log entry is committed in a given term, it will be present in the logs of leaders of all higher terms
State machine safety	If a server applies a log entry at a given index, no other server will apply a different entry at that index

Paxos vs Raft: Comparison¶

Dimension	Paxos	Raft
Understandability	Difficult; many variants	Designed for clarity
Leader	Optional (Multi-Paxos uses one)	Mandatory; all writes go through leader
Log structure	Flexible; gaps allowed	Strict; no gaps; entries committed in order
Membership change	Complex	Joint consensus or single-server changes
Production systems	Google Chubby, Spanner (internally)	etcd, CockroachDB, TiKV, Consul
Performance	Comparable	Comparable (both require majority ack)

Practical Applications¶

System	Consensus Use
etcd	Raft-based key-value store; Kubernetes control plane
ZooKeeper	ZAB (Paxos variant) for configuration and coordination
Google Spanner	Paxos groups per split; TrueTime for external consistency
CockroachDB	Raft per range; multi-range transactions via 2PC
Kafka (KRaft)	Raft for metadata quorum (replacing ZooKeeper)

When to Use (and When Not to Use) Consensus¶

Use Consensus When	Avoid Consensus When
Leader election for a single-writer service	High-throughput writes where eventual consistency suffices
Distributed lock service (Chubby, etcd)	Session or presence data (ephemeral, tolerates staleness)
Replicated state machine (databases)	Metrics or analytics (small errors acceptable)
Configuration management	Caching layers (stale reads are by design)

Warning

Staff-level insight: Consensus is expensive (majority round-trip per write). Use it only for the coordination plane, not the data plane. For example, use Raft for leader election in a sharded database, but the data replication within a shard can use simpler async replication if the workload tolerates it.

Interview Tips¶

Signal	What to Say
When to use it	"We need consensus here because this component requires a single source of truth for leader assignment. I'd use an etcd lease for leader election."
When to avoid it	"For the user session store, consensus is overkill. We can use async replication with LWW; a stale session read just triggers a re-login."
Trade-off	"Consensus gives us linearizable reads at the cost of write latency bounded by the slowest majority node."

Resource	Topic	Why This Matters
Designing Data-Intensive Applications — Martin Kleppmann, Chapter 9 ("Consistency and Consensus")	The conceptual home of everything on this page	Chapter 9 builds the argument step by step: it first defines linearizability and total order broadcast, then proves that these, atomic commit, and leader election are all equivalent to consensus. It explains the FLP result, why a majority quorum is required, and how tools like ZooKeeper and etcd package consensus as a service. This is the single best source for why consensus matters rather than just how Raft works.
In Search of an Understandable Consensus Algorithm — Ongaro & Ousterhout (the Raft paper)	Raft, designed for clarity	The original Raft paper was explicitly written to be teachable. It decomposes consensus into leader election, log replication, and safety — the exact structure used in the Raft section above. The companion site raft.github.io has an interactive visualization that makes leader election and log replication click.
Paxos Made Simple — Leslie Lamport	The foundational protocol	Lamport's (famously not-that-simple) explanation of Paxos. Worth reading after Raft to appreciate why Raft was created — and to recognize Paxos when you meet it inside Chubby, Spanner, and Cassandra's lightweight transactions.

Consensus Algorithms (Raft & Paxos)¶

What Is Consensus?¶

Why Do We Need Consensus?¶

Why Consensus Matters in Interviews¶

Paxos¶

Overview¶

Roles¶

Single-Decree Paxos (Two Phases)¶

Why Paxos Is Hard in Practice¶

Raft¶

Overview¶

Roles¶

Leader Election¶

Log Replication¶

Safety Properties¶

Paxos vs Raft: Comparison¶

Practical Applications¶

When to Use (and When Not to Use) Consensus¶

Interview Tips¶

Further Reading¶