
Distributed Consensus: Paxos, Raft, and Beyond


Distributed consensus is the problem of getting multiple nodes in a distributed system to agree on a single value, even in the presence of failures. It is one of the most fundamental and challenging problems in distributed computing. Consensus protocols underpin critical infrastructure like distributed locks, leader election, configuration management, and replicated state machines. Understanding consensus is essential for anyone working with distributed systems.

The Consensus Problem

In a distributed system, nodes communicate by sending messages over an unreliable network. The consensus problem requires that a group of nodes agree on a single value, satisfying three properties:

Property      Description
-----------   ------------------------------------------------------
Agreement     All non-faulty nodes decide on the same value
Validity      The decided value must have been proposed by some node
Termination   All non-faulty nodes eventually decide on a value

These seemingly simple requirements become incredibly difficult to satisfy when nodes can crash, messages can be lost or delayed, and the network can partition. This difficulty is formalized by the FLP impossibility result.

FLP Impossibility

The Fischer-Lynch-Paterson (FLP) impossibility result (1985) proves that no deterministic consensus protocol can guarantee all three properties in an asynchronous system where even a single node may crash. This is a foundational result in distributed computing.

The key insight is that in an asynchronous system, you cannot distinguish between a crashed node and a very slow one. Any protocol that waits for a potentially crashed node will violate termination, but any protocol that proceeds without waiting may violate agreement.

Practical consensus protocols work around FLP by using one or more of these approaches:

  • Timeouts and failure detectors: Assume partially synchronous networks where messages eventually arrive within a bounded time
  • Randomization: Use random choices to break symmetry and guarantee termination with probability 1
  • Leader-based approaches: Elect a leader to drive consensus, falling back to a new leader on timeout
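As a rough illustration of the first approach, here is a hypothetical timeout-based failure detector (all names are illustrative, not from any real system): a silent peer is only suspected, never known to be dead, and a false suspicion widens the timeout so a slow-but-alive peer is eventually trusted again.

```python
class FailureDetector:
    """Sketch of a timeout-based failure detector under partial synchrony."""

    def __init__(self, initial_timeout: float = 0.15):
        self.timeout = initial_timeout
        self.last_heard: dict[str, float] = {}  # node -> last heartbeat time

    def heartbeat(self, node: str, now: float) -> None:
        # If this heartbeat arrives after we already suspected the node,
        # the timeout was too short: back off so we stop falsely suspecting it.
        if node in self.last_heard and now - self.last_heard[node] >= self.timeout:
            self.timeout *= 2
        self.last_heard[node] = now

    def suspected(self, node: str, now: float) -> bool:
        # A node is only *suspected*: FLP tells us we cannot distinguish
        # a crashed node from a slow one in a purely asynchronous system.
        return now - self.last_heard.get(node, now) >= self.timeout
```

The doubling on false suspicion is what makes the detector eventually accurate once the network starts delivering messages within a bound.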

Paxos

Paxos, developed by Leslie Lamport in the late 1980s (his paper "The Part-Time Parliament" circulated in 1989 and was finally published in 1998), is the foundational consensus protocol. It is proven correct and has influenced virtually every subsequent consensus algorithm. However, Paxos is notoriously difficult to understand and implement correctly.

Roles in Paxos

Paxos defines three roles (a single node may play multiple roles):

Role       Responsibility
--------   -------------------------------------------------
Proposer   Proposes values and drives the protocol
Acceptor   Votes on proposals and remembers accepted values
Learner    Learns the decided value once a majority agrees

Basic Paxos Protocol

Basic Paxos decides a single value through two phases:

Phase 1: Prepare
  Proposer                    Acceptors
     |                           |
     |--- Prepare(n) ----------> |  (n = proposal number)
     |                           |
     |<-- Promise(n, prev) ----- |  Acceptor promises not to
     |                           |  accept proposals < n.
     |                           |  Returns previously accepted
     |                           |  value (if any).

Phase 2: Accept
  Proposer                    Acceptors
     |                           |
     |--- Accept(n, v) ------->  |  v = value (own or highest
     |                           |  previously accepted value)
     |                           |
     |<-- Accepted(n, v) ------  |  Acceptor accepts if it
     |                           |  has not promised a higher n.

Decision:
  When a majority of acceptors have accepted the same
  proposal (n, v), the value v is decided.
  Learners are notified of the decision.

The key insight is the proposal number n. Each proposer must use a unique, monotonically increasing proposal number. If a proposer learns during Phase 1 that an acceptor has already accepted a value, it must propose that value (not its own). This ensures that once a value is chosen, all future proposals will also choose that value.
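The acceptor's side of these rules fits in a few lines. This is an illustrative sketch, not a production implementation; in particular, a real acceptor must persist its state to stable storage before replying, or a crash-and-restart could break its promises.

```python
class Acceptor:
    """Sketch of Basic Paxos acceptor state for the two phases above."""

    def __init__(self):
        self.promised_n = -1        # highest proposal number promised
        self.accepted_n = -1        # proposal number of the accepted value
        self.accepted_value = None  # previously accepted value, if any

    def prepare(self, n):
        """Phase 1: promise to ignore proposals numbered below n."""
        if n > self.promised_n:
            self.promised_n = n
            # Return any previously accepted (n, value) so the proposer
            # must re-propose that value instead of its own.
            return ("promise", self.accepted_n, self.accepted_value)
        return ("reject",)

    def accept(self, n, value):
        """Phase 2: accept unless a higher-numbered prepare was promised."""
        if n >= self.promised_n:
            self.promised_n = n
            self.accepted_n = n
            self.accepted_value = value
            return ("accepted", n, value)
        return ("reject",)
```

Note how `prepare` returning the previously accepted value is exactly what forces later proposers to converge on an already-chosen value.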

Multi-Paxos

Basic Paxos decides a single value. To build a replicated log (a sequence of commands), you run one instance of Paxos per log entry. Multi-Paxos optimizes this by electing a stable leader that skips Phase 1 for subsequent proposals, cutting the cost per entry from two round trips (four message delays) to one round trip (two).

Multi-Paxos optimization:

Basic Paxos (per entry): Prepare -> Promise -> Accept -> Accepted
                         (4 messages, 2 round trips)

Multi-Paxos (with stable leader):
  First entry:  Prepare -> Promise -> Accept -> Accepted
  Next entries: Accept -> Accepted  (skip Phase 1)
                (2 messages, 1 round trip)

The leader keeps its proposal number and skips
Phase 1 as long as it remains leader.

Raft

Raft was designed by Diego Ongaro and John Ousterhout in 2014 specifically to be more understandable than Paxos while providing equivalent guarantees. Raft decomposes consensus into three independent subproblems: leader election, log replication, and safety.

Raft Node States

Every node in a Raft cluster is in one of three states:

          (election timeout)      (wins election)
  +----------+         +----------+         +---------+
  | Follower | ----->  | Candidate| ----->  | Leader  |
  +----------+         +----------+         +---------+
       ^                    |                    |
       |                    |                    |
       +--------------------+--------------------+
               (higher term discovered)

  • Follower: Passive state. Responds to requests from leaders and candidates
  • Candidate: Actively seeking votes to become leader
  • Leader: Handles all client requests and replicates log entries

Leader Election

Raft uses a term-based system. Time is divided into terms of arbitrary length, each beginning with an election:

Leader Election Process:

1. Follower times out (no heartbeat from leader)
2. Follower becomes Candidate:
   - Increments current term
   - Votes for itself
   - Sends RequestVote RPCs to all other nodes

3. Three possible outcomes:
   a) Wins election (majority votes) -> becomes Leader
   b) Another node wins -> becomes Follower
   c) No winner (split vote) -> new election with higher term

Election safety: At most one leader per term
(each node votes for at most one candidate per term)

Raft uses randomized election timeouts (e.g., 150-300ms) to reduce the probability of split votes. This simple mechanism works remarkably well in practice.
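The per-term voting rule and the randomized timeout can be sketched as follows. Names are illustrative rather than the actual Raft RPC schema, and the real RequestVote handler additionally rejects candidates whose log is less up-to-date than the voter's (omitted here).

```python
import random

def election_timeout(rng=random.random, low_ms=150, high_ms=300):
    """Randomized timeout in [low, high): staggered timeouts mean one node
    usually becomes a candidate first, reducing the chance of split votes."""
    return low_ms + rng() * (high_ms - low_ms)

class VoterState:
    """Sketch of the 'at most one vote per term' rule behind election safety."""

    def __init__(self):
        self.current_term = 0
        self.voted_for = None  # candidate voted for in current_term, if any

    def request_vote(self, candidate_id, candidate_term) -> bool:
        if candidate_term > self.current_term:
            # A newer term resets the vote (and would demote a leader).
            self.current_term = candidate_term
            self.voted_for = None
        if candidate_term == self.current_term and self.voted_for in (None, candidate_id):
            self.voted_for = candidate_id
            return True
        return False
```

Because each node grants at most one vote per term, two candidates can never both collect a majority in the same term.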

Log Replication

Once a leader is elected, it handles all client requests by appending them to its log and replicating to followers:

Log Replication:

Client         Leader        Followers
  |               |               |
  |-- Request --> |               |
  |               |-- AppendEntries --> |
  |               |               |
  |               |<-- Success -- |
  |               |               |
  |     (majority responded)      |
  |               |               |
  |  Entry committed, apply to    |
  |  state machine                |
  |               |               |
  |<-- Response --|               |
  |               |               |
  |               |-- Commit notification --> |
  |               |    (in next AppendEntries)|

Log structure:
  Index:  [1]   [2]   [3]   [4]   [5]
  Term:    1     1     2     2     2
  Cmd:    x=1   y=2   x=3   y=4   z=5
          ^^^^^^^^^^^^^^^   ^^^^^^^^^
             committed      uncommitted

An entry is committed once a majority of nodes have stored it. Committed entries are guaranteed to be durable and will eventually be applied to all state machines in the same order.
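One way a leader can derive the commit point is to take the highest log index stored on a majority of nodes, counting itself. This is a simplified sketch with illustrative names; it omits Raft's extra rule that a leader only commits entries from its own current term by counting replicas.

```python
def commit_index(leader_last_index: int, match_index: dict) -> int:
    """Return the highest index replicated on a majority of the cluster.

    match_index maps each follower to the highest log index the leader
    knows is stored on that follower.
    """
    # The leader's own log counts toward the majority.
    indexes = sorted([leader_last_index] + list(match_index.values()), reverse=True)
    majority = len(indexes) // 2 + 1
    # The majority-th highest index is, by construction, stored on
    # at least `majority` nodes.
    return indexes[majority - 1]
```

For example, with a leader at index 5 and followers at 5, 3, 3, and 2, index 3 is on three of five nodes and is therefore committed.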

Safety Guarantees

Raft provides several important safety properties:

Property               Guarantee
--------------------   -----------------------------------------------------------
Election Safety        At most one leader per term
Leader Append-Only     A leader never overwrites or deletes its own log entries
Log Matching           If two logs have an entry with the same index and term,
                       all preceding entries are identical
Leader Completeness    If an entry is committed in a given term, it is present
                       in the logs of all leaders for higher terms
State Machine Safety   If a node applies a log entry at a given index, no other
                       node will apply a different entry at that index
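The Log Matching property is enforced by a consistency check on every AppendEntries call: the leader sends the index and term of the entry preceding the new ones, and the follower rejects the call unless its own log matches at that point. A simplified sketch (illustrative names; a real follower truncates only at the first actually conflicting entry, to avoid discarding entries on a stale, reordered RPC):

```python
def append_entries(log, prev_index, prev_term, entries) -> bool:
    """Follower-side consistency check. `log` is a list of (term, command)
    tuples using 1-based indexing; mutated in place on success."""
    if prev_index > 0:
        # Reject if we are missing the preceding entry or its term differs;
        # the leader then retries with an earlier prev_index until logs match.
        if len(log) < prev_index or log[prev_index - 1][0] != prev_term:
            return False
    # Matching prefix found: replace everything after it with the
    # leader's entries (simplified conflict resolution).
    del log[prev_index:]
    log.extend(entries)
    return True
```

Induction on this check is what guarantees that matching (index, term) at one position implies matching logs everywhere before it.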

Paxos vs Raft Comparison

Aspect                      Paxos                                Raft
-------------------------   ----------------------------------   ----------------------------------------
Understandability           Notoriously difficult                Designed for understandability
Leader requirement          Optional (Multi-Paxos uses one)      Required (strong leader)
Log structure               Gaps allowed                         No gaps (contiguous)
Leader election             Implicit (highest proposal number)   Explicit (RequestVote RPC)
Correctness proof           Well-established (decades)           Formal proof in TLA+
Membership changes          Complex                              Joint consensus or single-server changes
Performance                 Similar (1-2 round trips)            Similar (1-2 round trips)
Implementation complexity   Very high                            Moderate
Industry adoption           Google Chubby, Google Spanner        etcd, Consul, CockroachDB

Practical Systems Using Consensus

etcd (Raft)

etcd is a distributed key-value store that uses Raft for consensus. It is the backbone of Kubernetes, storing all cluster state. etcd provides strong consistency guarantees: every read returns the most recent write. It typically runs as a 3 or 5-node cluster, tolerating 1 or 2 failures respectively.

Apache ZooKeeper (ZAB)

ZooKeeper uses ZAB (ZooKeeper Atomic Broadcast), a protocol similar to Paxos. ZooKeeper provides distributed coordination primitives including distributed locks, leader election, and service discovery. It guarantees linearizable writes and sequentially consistent reads.

HashiCorp Consul (Raft)

Consul uses Raft for its consensus layer, providing service discovery, health checking, and KV storage. Consul servers form a Raft cluster, while Consul clients forward requests to the servers. This architecture separates the consensus participants from the service mesh.

Google Spanner (Paxos)

Google Spanner uses Paxos for replication across data centers. Uniquely, Spanner uses synchronized clocks (TrueTime API) to provide externally consistent transactions across globally distributed data. Each Paxos group replicates a partition of the data.

Consensus in practice - typical cluster sizes:

Nodes    Tolerated Failures    Use Case
  3            1               Development, small systems
  5            2               Production, most workloads
  7            3               High availability requirements

Formula: Can tolerate F failures with 2F+1 nodes
Quorum size: F+1 (majority)
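The sizing rules above reduce to two one-line helpers (illustrative names):

```python
def tolerated_failures(n: int) -> int:
    """A cluster of n nodes tolerates F = (n - 1) // 2 crash failures."""
    return (n - 1) // 2

def quorum(n: int) -> int:
    """Majority quorum size: F + 1, i.e. floor(n / 2) + 1."""
    return n // 2 + 1
```

Note that a 4-node cluster tolerates the same single failure as a 3-node cluster, which is why odd sizes are preferred.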

Byzantine Fault Tolerance

Paxos and Raft assume that nodes may crash but will never behave maliciously (crash-fault tolerance). Byzantine fault tolerance (BFT) handles nodes that may behave arbitrarily: sending conflicting messages, lying, or colluding.

Aspect               Crash Fault Tolerance         Byzantine Fault Tolerance
------------------   ---------------------------   ------------------------------------------
Failure model        Nodes stop (crash)            Nodes can behave arbitrarily (maliciously)
Nodes needed         2F+1 to tolerate F failures   3F+1 to tolerate F failures
Message complexity   O(N) per decision             O(N^2) or more per decision
Use cases            Internal infrastructure       Blockchain, multi-party systems
Examples             Paxos, Raft, ZAB              PBFT, HotStuff, Tendermint

Practical BFT (PBFT)

PBFT, proposed by Castro and Liskov in 1999, is a practical BFT protocol. It works in three phases:

PBFT Protocol:

Client       Primary       Replicas (3F+1 total)
  |             |               |
  |-- Request ->|               |
  |             |               |
  |             |-- Pre-prepare ->|  (Phase 1)
  |             |               |
  |             |<- Prepare ---->|  (Phase 2: all-to-all)
  |             |               |
  |             |<- Commit ---->|  (Phase 3: all-to-all)
  |             |               |
  |<----------- Reply ----------|  (client waits for F+1
  |             |               |   matching replies)

Requires 3F+1 nodes to tolerate F Byzantine faults.
Example: 4 nodes tolerate 1 Byzantine fault.
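The Byzantine sizing rule, sketched the same way (illustrative names; the F+1 reply threshold is PBFT's rule for when a client may trust a result):

```python
def bft_cluster_size(f: int) -> int:
    """Tolerating f Byzantine faults requires 3f + 1 nodes."""
    return 3 * f + 1

def bft_tolerated(n: int) -> int:
    """Largest f such that 3f + 1 <= n."""
    return (n - 1) // 3

def replies_needed(f: int) -> int:
    """A PBFT client accepts a result after f + 1 matching replies:
    at least one of them must come from a non-faulty replica."""
    return f + 1
```

As with crash tolerance, intermediate sizes buy nothing: 5 or 6 nodes tolerate the same single Byzantine fault as 4.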

Choosing a Consensus Protocol

When choosing a consensus protocol for your system, consider:

  • Failure model: If you trust all nodes (internal infrastructure), use Raft or Paxos. If nodes may be malicious (blockchain, multi-organization), use BFT
  • Understandability: Raft is much easier to understand and implement correctly. If you are building from scratch, prefer Raft
  • Existing systems: Often you do not need to implement consensus yourself. Use etcd, ZooKeeper, or Consul as a consensus service
  • Performance requirements: Consensus adds latency (at least one round trip to a majority). For write-heavy workloads, consider partitioning data across multiple consensus groups
  • Geographic distribution: Cross-datacenter consensus has higher latency. Consider protocols optimized for WANs, such as EPaxos or Flexible Paxos

Consensus protocols work hand-in-hand with partial failure handling, quorum systems, and distributed transaction patterns. Together, they form the backbone of reliable distributed systems.

Frequently Asked Questions

Q: Why do consensus protocols need an odd number of nodes?

They do not strictly require an odd number, but even numbers offer no advantage. A 4-node cluster tolerates 1 failure (same as 3 nodes) because both require a majority of at least 3. Adding the fourth node increases cost without improving fault tolerance. Odd numbers (3, 5, 7) are optimal for the ratio of nodes to fault tolerance.

Q: What happens during a network partition?

During a network partition, only the partition containing a majority of nodes can make progress. The minority partition cannot form a quorum and stops accepting writes. This ensures consistency (no split-brain) at the cost of availability for the minority side. This is a direct consequence of the CAP theorem - consensus protocols choose consistency over availability during partitions.

Q: How fast is consensus in practice?

Within a data center, Raft and Paxos typically achieve consensus in 1-5 milliseconds. Across data centers, latency depends on network distance: same region might be 5-20ms, cross-continent can be 100-300ms. Systems like Google Spanner achieve externally consistent transactions at about 10ms within a region using hardware-synchronized clocks.

Q: Can consensus scale to hundreds of nodes?

Standard Raft and Paxos do not scale well beyond 5-7 nodes because every write requires a majority acknowledgment. For larger systems, use a small consensus group (3-5 nodes) for metadata and coordination, then use other protocols like gossip for data propagation to the wider cluster. This is exactly what systems like Consul and Cassandra do.
