
Distributed Databases: Architecture and Trade-offs for Global Scale

Distributed databases store data across multiple servers, data centers, or geographic regions. They solve the fundamental limitations of single-server databases — capacity, availability, and latency — but introduce complex challenges around consistency, coordination, and partition tolerance. This guide covers the leading distributed databases, consensus protocols, and the architecture decisions that matter for system design.

Why Distributed Databases?

Single-server databases hit hard limits: a single machine has finite CPU, memory, disk, and network bandwidth. When your application needs to handle millions of requests per second, store petabytes of data, or serve users across the globe with low latency, you need a distributed database.

Distributed databases provide: horizontal scalability (add more nodes to increase capacity), high availability (survive node and data center failures), geographic distribution (place data close to users), and fault tolerance (no single point of failure).

Google Spanner

Google Spanner is a globally distributed, strongly consistent database. It was the first system to achieve external consistency (the strongest form of consistency) at global scale. Spanner is available as a managed service through Google Cloud Spanner.

TrueTime and External Consistency

Spanner's breakthrough is TrueTime — a globally synchronized clock API that uses GPS receivers and atomic clocks in every data center. TrueTime provides a bounded uncertainty interval for the current time, allowing Spanner to order transactions globally without communication between data centers.

When a transaction commits, Spanner waits out the clock uncertainty (typically a few milliseconds) before making the commit visible. This wait ensures that any subsequent transaction, on any node worldwide, will see the committed data. The result is linearizability — the gold standard of consistency.
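The commit-wait rule can be sketched in a few lines of Python. This is an illustrative toy, assuming a hypothetical `tt_now()` that mimics the TrueTime API by returning an uncertainty interval `(earliest, latest)` around the current time:

```python
import time

def tt_now(epsilon=0.005):
    """Hypothetical TrueTime stand-in: the true time lies somewhere
    in the returned (earliest, latest) interval."""
    t = time.time()
    return (t - epsilon, t + epsilon)

def commit(apply_writes):
    # Choose the commit timestamp at the latest edge of the uncertainty
    # window, so it is guaranteed not to be in the past anywhere.
    _, commit_ts = tt_now()
    apply_writes(commit_ts)
    # Commit wait: block until commit_ts is definitely in the past
    # everywhere. Any transaction that starts afterwards, on any node,
    # will be assigned a strictly larger timestamp.
    while tt_now()[0] <= commit_ts:
        time.sleep(0.001)
    return commit_ts
```

The wait costs roughly twice the clock uncertainty, which is why Spanner invests in GPS and atomic clocks to keep that uncertainty small.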

-- Google Spanner: SQL with global distribution
-- Create an interleaved table hierarchy for data locality
CREATE TABLE Users (
    UserId INT64 NOT NULL,
    Name STRING(100),
    Email STRING(255),
) PRIMARY KEY (UserId);

CREATE TABLE Orders (
    UserId INT64 NOT NULL,
    OrderId INT64 NOT NULL,
    Total NUMERIC,
    CreatedAt TIMESTAMP NOT NULL OPTIONS (allow_commit_timestamp=true),
) PRIMARY KEY (UserId, OrderId),
  INTERLEAVE IN PARENT Users ON DELETE CASCADE;

-- Spanner distributes data by primary key ranges (splits)
-- Interleaved tables are co-located with their parent for fast joins

-- Strong read (the default): guaranteed to see all previously
-- committed data, on any node worldwide
SELECT Total FROM Orders
WHERE UserId = 42 AND OrderId = 1001;

-- Stale reads trade freshness for latency. Staleness bounds (for
-- example, "at most 15 seconds old") are requested through the client
-- library's timestamp-bound options rather than in the SQL text.
SELECT * FROM Users
WHERE UserId = 42;

CockroachDB

CockroachDB is an open-source, distributed SQL database inspired by Spanner. It provides serializable transactions, automatic sharding, and multi-region deployment. Unlike Spanner, it does not require specialized hardware — it uses NTP for clock synchronization with a configurable maximum clock offset.

CockroachDB uses the Raft consensus protocol to replicate data across nodes. Each range (chunk of data, default 512MB) has a Raft group of 3 or more replicas. Reads and writes go through the Raft leader, ensuring strong consistency.

-- CockroachDB: Standard SQL with distribution
CREATE TABLE users (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    name STRING NOT NULL,
    email STRING UNIQUE NOT NULL,
    region STRING NOT NULL
);

-- Pin data to specific regions for compliance and latency
ALTER TABLE users SET LOCALITY REGIONAL BY ROW;

-- Multi-region configuration
ALTER DATABASE mydb SET PRIMARY REGION "us-east1";
ALTER DATABASE mydb ADD REGION "eu-west1";
ALTER DATABASE mydb ADD REGION "ap-southeast1";

-- Survive entire region failures
ALTER DATABASE mydb SURVIVE REGION FAILURE;

Apache Cassandra

Cassandra is a wide-column distributed database designed for massive write throughput and high availability. It has no single point of failure — every node is equal (peer-to-peer architecture). Cassandra uses a gossip protocol for cluster membership and consistent hashing for data distribution.
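The idea behind consistent hashing can be sketched with a toy ring (illustrative only, not Cassandra's actual Murmur3 partitioner): nodes and keys hash onto the same circular space, and each key is owned by the first node clockwise from its position. Adding or removing a node only moves the keys adjacent to it.

```python
import bisect
import hashlib

class Ring:
    """Toy consistent-hashing ring with virtual nodes."""

    def __init__(self, nodes, vnodes=8):
        # Each physical node gets several positions (virtual nodes)
        # on the ring, which smooths out the key distribution.
        self.ring = sorted(
            (self._h(f"{node}:{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )

    @staticmethod
    def _h(s):
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def owner(self, key):
        # First ring position at or after the key's hash,
        # wrapping around at the end of the ring.
        pos = bisect.bisect(self.ring, (self._h(key),))
        return self.ring[pos % len(self.ring)][1]
```

Replication follows the same walk: the key's replicas are placed on the next N distinct nodes clockwise from the owner.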

Tunable Consistency

Cassandra lets you choose the consistency level per query. This flexibility allows you to balance consistency against latency and availability for each operation independently.

Consistency Level | Behavior                             | Trade-off
ONE               | Any one replica responds             | Fastest, may return stale data
QUORUM            | Majority of replicas respond         | Balanced consistency and speed
ALL               | All replicas respond                 | Strongest consistency, slowest
LOCAL_QUORUM      | Quorum within the local data center  | Low latency in multi-DC setups

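The guarantee behind QUORUM reduces to overlap arithmetic: if a write is acknowledged by W replicas and a read contacts R replicas out of N, the read and write sets must intersect whenever R + W > N, so the read sees at least one replica holding the latest write. A minimal sketch:

```python
# Dynamo-style quorum overlap: a read of R replicas and a write of W
# replicas out of N total must share at least one replica when R + W > N.
def overlaps(n, r, w):
    return r + w > n

def quorum(n):
    # Cassandra's QUORUM is a majority of the N replicas
    return n // 2 + 1
```

With a replication factor of 3, QUORUM reads plus QUORUM writes (2 + 2 > 3) give read-your-writes behavior, while ONE + ONE (1 + 1 = 2) may miss the latest update.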
-- Cassandra: Design tables around query patterns
CREATE TABLE user_activity (
    user_id UUID,
    activity_time TIMESTAMP,
    activity_type TEXT,
    details TEXT,
    PRIMARY KEY (user_id, activity_time)
) WITH CLUSTERING ORDER BY (activity_time DESC);

-- Consistency level is not part of the CQL statement itself: in cqlsh
-- it is set with the CONSISTENCY command, and drivers set it
-- per-statement or per-session.
CONSISTENCY QUORUM;

INSERT INTO user_activity (user_id, activity_time, activity_type, details)
VALUES (uuid(), toTimestamp(now()), 'login', 'Web browser');

-- Switch to LOCAL_QUORUM for fast multi-DC reads
CONSISTENCY LOCAL_QUORUM;

SELECT * FROM user_activity
WHERE user_id = 123e4567-e89b-12d3-a456-426614174000
  AND activity_time > '2024-01-01';

Amazon DynamoDB

DynamoDB is a fully managed key-value and document database by AWS. It provides single-digit millisecond performance at any scale with automatic scaling, built-in security, and backup. DynamoDB uses a proprietary distributed architecture with consistent hashing for partitioning.

# DynamoDB: Table design with partition and sort keys
import boto3

dynamodb = boto3.resource('dynamodb')

table = dynamodb.create_table(
    TableName='UserOrders',
    KeySchema=[
        {'AttributeName': 'user_id', 'KeyType': 'HASH'},   # Partition key
        {'AttributeName': 'order_date', 'KeyType': 'RANGE'} # Sort key
    ],
    AttributeDefinitions=[
        {'AttributeName': 'user_id', 'AttributeType': 'S'},
        {'AttributeName': 'order_date', 'AttributeType': 'S'}
    ],
    BillingMode='PAY_PER_REQUEST'  # On-demand scaling
)

# Wait for the table to become active before using it
table.wait_until_exists()

# Strongly consistent read (always latest data, twice the read cost)
response = table.get_item(
    Key={'user_id': 'user-42', 'order_date': '2024-03-15'},
    ConsistentRead=True
)

# Eventually consistent read (default, may be stale by ~1 second)
response = table.get_item(
    Key={'user_id': 'user-42', 'order_date': '2024-03-15'}
)

Consensus Protocols

Raft

Raft is a consensus algorithm designed for understandability. A Raft group elects a leader, and all writes go through the leader. The leader replicates writes to followers, and a write is committed when a majority acknowledges it. Used by CockroachDB, TiDB, and etcd.
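The leader's commit rule can be captured in a few lines (a toy model, not a full Raft implementation): given the highest log index each node has stored, the committed index is the largest index present on a majority.

```python
# Toy model of Raft's commit rule: the leader advances the commit index
# to the highest log index stored on a majority of the group.
def committed_index(match_index):
    """match_index: highest replicated log index on each node,
    including the leader's own log."""
    majority = len(match_index) // 2 + 1
    # The majority-th largest value is held by at least `majority` nodes.
    return sorted(match_index, reverse=True)[majority - 1]
```

For a 3-node group where the leader has index 5 but only one follower has caught up to index 3, the commit index is 3: entries 4 and 5 are not yet durable on a majority.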

Paxos

Paxos is the original consensus algorithm, proven to be correct but notoriously difficult to implement. Google Spanner uses a variant called Multi-Paxos. Paxos and Raft provide equivalent guarantees but differ in implementation complexity.

Gossip Protocol

Gossip is not a consensus protocol but a communication protocol for cluster membership and failure detection. Each node periodically exchanges state information with random peers. Used by Cassandra for cluster coordination. It is eventually consistent and highly scalable.
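The spreading behavior is easy to simulate (a toy model; real implementations such as Cassandra's exchange versioned per-node state rather than raw sets):

```python
import random

# Toy gossip simulation: each round, every node merges its known state
# with one random peer. Information reaches all N nodes in O(log N)
# rounds with high probability.
def gossip_round(states):
    nodes = list(states)
    for n in nodes:
        peer = random.choice([p for p in nodes if p != n])
        merged = states[n] | states[peer]
        states[n] = merged
        states[peer] = set(merged)

# Start with each of 8 nodes knowing only its own id
states = {i: {i} for i in range(8)}
rounds = 0
while any(len(s) < len(states) for s in states.values()) and rounds < 100:
    gossip_round(states)
    rounds += 1
```

Because each exchange is pairwise and random, there is no coordinator and no single point of failure, which is exactly the property a peer-to-peer system like Cassandra needs for membership.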

Comparison of Distributed Databases

Feature      | Spanner                  | CockroachDB               | Cassandra             | DynamoDB
Model        | Relational (SQL)         | Relational (SQL)          | Wide-column           | Key-value/Document
Consistency  | External (strongest)     | Serializable              | Tunable               | Strong or eventual
Consensus    | Multi-Paxos              | Raft                      | None (gossip)         | Proprietary
CAP Position | CP                       | CP                        | AP (default)          | AP or CP per query
Best For     | Global ACID transactions | Distributed SQL workloads | High write throughput | Serverless, key-value access

CAP Theorem in Practice

The CAP theorem states that during a network partition, a distributed system must choose between consistency (all nodes see the same data) and availability (all nodes respond to requests). This is not a permanent choice — modern databases make this trade-off dynamically and per-operation.

Spanner and CockroachDB choose consistency (CP): during a partition, they may become unavailable rather than return stale data. Cassandra chooses availability (AP): during a partition, it continues serving requests but may return stale data. DynamoDB gives you both options per query.

For more on consistency trade-offs, see our guides on ACID vs BASE and database replication. For SQL-compatible distributed options, see NewSQL databases.

Frequently Asked Questions

Should I use a distributed database for my application?

Most applications do not need a distributed database. A single PostgreSQL server with read replicas handles surprising amounts of traffic. Consider distributed databases when you need: global data distribution, write scalability beyond a single server, or multi-region active-active deployments. The operational complexity of distributed databases is significant.

How does CockroachDB compare to PostgreSQL?

CockroachDB is wire-compatible with PostgreSQL and supports most PostgreSQL SQL syntax. However, it has higher latency per query (consensus overhead), does not support all PostgreSQL features (stored procedures, triggers), and has different performance characteristics. Use CockroachDB when you need horizontal scalability; use PostgreSQL when a single server suffices.

When should I choose Cassandra over DynamoDB?

Choose Cassandra when you need on-premise deployment, multi-cloud portability, or fine-grained control over cluster configuration. Choose DynamoDB when you want fully managed operations, serverless scaling, and deep AWS integration. DynamoDB has lower operational overhead; Cassandra offers more flexibility.

What is the difference between Raft and Paxos?

Both achieve the same goal — consensus among distributed nodes — with equivalent safety guarantees. Raft was designed to be understandable: it decomposes consensus into leader election, log replication, and safety. Paxos is mathematically elegant but difficult to implement correctly. Most modern systems choose Raft for its clarity.
