Multi-Region Systems: Architecture Patterns and Data Consistency

Multi-region systems deploy application infrastructure across two or more geographic regions to provide low latency, high availability, and disaster recovery. The concept is straightforward, but the implementation involves trade-offs between data consistency, operational complexity, and cost. This guide covers the major architecture patterns, failover mechanisms, and real-world approaches from Netflix, Spotify, and CockroachDB.

Architecture Patterns

1. Active-Passive (Warm Standby)

One region handles all traffic. A secondary region receives replicated data and is ready to take over if the primary fails.

# Active-Passive architecture
# Primary (us-east-1): Handles all reads and writes
# Secondary (eu-west-1): Receives async replication, on standby

# Failover process:
# 1. Health check detects primary failure
# 2. Promote secondary database to primary
# 3. Update DNS to point to secondary region
# 4. Secondary becomes the new primary

# DNS failover configuration
resource "aws_route53_health_check" "primary" {
  fqdn              = "primary-api.example.com"
  port              = 443
  type              = "HTTPS"
  resource_path     = "/health"
  failure_threshold = 3
  request_interval  = 10
}

resource "aws_route53_record" "api" {
  zone_id = aws_route53_zone.main.zone_id
  name    = "api.example.com"
  type    = "A"

  failover_routing_policy {
    type = "PRIMARY"
  }
  set_identifier  = "primary"
  health_check_id = aws_route53_health_check.primary.id

  alias {
    name                   = aws_lb.primary.dns_name
    zone_id                = aws_lb.primary.zone_id
    evaluate_target_health = true
  }
}

2. Active-Active

All regions handle both reads and writes simultaneously. This provides the lowest latency but requires conflict resolution for concurrent writes.

class ActiveActiveRouter:
    def __init__(self, regions):
        self.regions = regions

    def route_request(self, request):
        # Route to nearest region based on user location
        user_region = self.get_nearest_region(request.source_ip)

        if request.method in ["GET", "HEAD"]:
            # Read from local region
            return user_region.handle(request)
        else:
            # Write to local region
            result = user_region.handle(request)
            # Async replication handles propagation
            return result

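    def get_nearest_region(self, ip):
        # geoip.lookup() and haversine() are assumed helpers: an IP-to-coordinates
        # lookup and a great-circle distance function, respectively
        user_location = geoip.lookup(ip)
        return min(self.regions,
                   key=lambda r: haversine(user_location, r.location))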
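
Concurrent writes in an active-active setup need a deterministic merge rule so that every region ends up with the same value. The following is a minimal sketch of last-write-wins (LWW), assuming each write carries a timestamp and the name of the region that accepted it. The `VersionedValue` type and `lww_merge` function are hypothetical names used for illustration, not part of any library. Note that LWW silently discards the losing write and depends on reasonably well synchronized clocks.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VersionedValue:
    """A value tagged with when and where it was written (hypothetical type)."""
    value: str
    timestamp_ms: int   # wall-clock or hybrid logical clock, in milliseconds
    region: str         # originating region, used only to break ties


def lww_merge(a: VersionedValue, b: VersionedValue) -> VersionedValue:
    """Return the winning write: highest timestamp, then highest region name."""
    return max(a, b, key=lambda v: (v.timestamp_ms, v.region))


# Two regions update the same key before replication catches up.
from_us = VersionedValue("blue", timestamp_ms=1_000, region="us-east-1")
from_eu = VersionedValue("green", timestamp_ms=1_005, region="eu-west-1")

# Merge order does not matter, so both regions converge on the same value.
winner = lww_merge(from_us, from_eu)
assert winner == lww_merge(from_eu, from_us)
print(winner.value)  # prints "green"
```

The tie-break on region name is arbitrary but deterministic, which is what matters: two regions that see the same pair of writes must pick the same winner.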

3. Read-Local, Write-Global

Reads are served from local replicas, but writes are always routed to a designated primary region. This avoids write conflicts while keeping read latency low.

class ReadLocalWriteGlobal:
    def __init__(self, local_region, primary_region):
        self.local = local_region
        self.primary = primary_region

    def read(self, key):
        # Read from local replica (fast, potentially stale)
        return self.local.read(key)

    def read_consistent(self, key):
        # Read from primary for strong consistency
        return self.primary.read(key)

    def write(self, key, value):
        # Write always goes to primary region
        result = self.primary.write(key, value)
        # Replication propagates to local region async
        return result
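
A common refinement of this pattern is read-your-writes consistency: a client reads from the local replica only once that replica has applied the client's own most recent write. The sketch below assumes every write gets a monotonically increasing version number and that each replica exposes the highest version it has applied. The `Store` and `Session` classes are hypothetical in-memory stand-ins, not a real database client.

```python
class Store:
    """In-memory stand-in for a regional database (hypothetical)."""

    def __init__(self):
        self.data = {}
        self.applied_version = 0  # highest write version applied here

    def apply(self, key, value, version):
        self.data[key] = value
        self.applied_version = max(self.applied_version, version)


class Session:
    """Routes reads so a client always sees its own writes."""

    def __init__(self, local, primary):
        self.local = local
        self.primary = primary
        self.last_write_version = 0

    def write(self, key, value):
        # Writes always go to the primary, which assigns the next version.
        version = self.primary.applied_version + 1
        self.primary.apply(key, value, version)
        self.last_write_version = version
        return version

    def read(self, key):
        # Use the local replica only if it has caught up to our last write.
        if self.local.applied_version >= self.last_write_version:
            return self.local.data.get(key)
        return self.primary.data.get(key)


primary, replica = Store(), Store()
session = Session(local=replica, primary=primary)

version = session.write("theme", "dark")
assert session.read("theme") == "dark"   # replica is behind: served by primary

replica.apply("theme", "dark", version)  # replication catches up
assert session.read("theme") == "dark"   # now served by the local replica
```

The cost is one cross-region read in the short window after a write; all other reads stay local.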

Data Consistency Challenges

Challenge	Description	Mitigations
Replication Lag	Data written in one region is not immediately visible in others	Read-your-writes consistency, sticky sessions
Write Conflicts	Same data modified in two regions simultaneously	Last-write-wins (LWW), CRDTs, region ownership
Split Brain	Regions cannot communicate, both accept writes	Quorum systems, fencing
Cross-Region Transactions	ACID transactions spanning regions	Saga pattern, avoid when possible
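
To make the CRDT entry in the table concrete, here is a minimal sketch of a grow-only counter (G-Counter), one of the simplest CRDTs. Each region increments only its own slot, and merging takes the per-region maximum, so replicas converge regardless of the order or number of merges. The class and the region names are illustrative assumptions, not taken from a specific library.

```python
class GCounter:
    """Grow-only counter: one slot per region, merged by element-wise max."""

    def __init__(self, region):
        self.region = region
        self.counts = {}  # region name -> count contributed by that region

    def increment(self, amount=1):
        if amount < 0:
            raise ValueError("a G-Counter can only grow")
        self.counts[self.region] = self.counts.get(self.region, 0) + amount

    def merge(self, other):
        # Taking the max per slot makes merge commutative and idempotent.
        for region, count in other.counts.items():
            self.counts[region] = max(self.counts.get(region, 0), count)

    def value(self):
        return sum(self.counts.values())


us = GCounter("us-east-1")
eu = GCounter("eu-west-1")
us.increment(3)  # three events counted in the US region
eu.increment(2)  # two events counted concurrently in the EU region

us.merge(eu)
eu.merge(us)
us.merge(eu)  # merging again changes nothing
assert us.value() == eu.value() == 5
```

No write is lost and no coordination is needed, but the data type is restricted: a plain G-Counter cannot decrement.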

Failover Mechanisms

import requests


class FailoverManager:
    def __init__(self, regions, health_check_interval=10):
        self.regions = regions
        self.primary = regions[0]
        self.health_check_interval = health_check_interval

    def health_check(self):
        for region in self.regions:
            try:
                response = requests.get(
                    f"https://{region.endpoint}/health",
                    timeout=5
                )
                region.healthy = response.status_code == 200
            except requests.RequestException:
                # Timeouts and connection errors count as unhealthy
                region.healthy = False

            # Count non-200 responses as failures too, and reset on success
            if region.healthy:
                region.consecutive_failures = 0
            else:
                region.consecutive_failures += 1

    def failover(self):
        if not self.primary.healthy:
            # Find the best healthy region
            candidates = [r for r in self.regions
                         if r.healthy and r != self.primary]
            if candidates:
                new_primary = min(candidates,
                                 key=lambda r: r.replication_lag)

                # Promote new primary
                new_primary.promote_to_primary()

                # Update DNS
                self.update_dns(new_primary.endpoint)

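                # Re-point the remaining healthy regions at the new primary.
                # The failed primary is skipped: it must be fenced and resynced
                # before it rejoins as a replica.
                for region in candidates:
                    if region != new_primary:
                        region.replicate_from(new_primary)

                self.primary = new_primary
                # alert() is assumed to be an existing notification helper
                alert("Failover completed to " + new_primary.name)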
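
A single health checker can mistake its own network problem for a primary failure, which is one way split brain starts. The sketch below shows two common safeguards under simplifying assumptions: promotion requires a strict majority of independent observers to report the primary as down, and each promotion issues a higher fencing epoch so that writes from a stale primary can be rejected. The function, the class, and the observer region names are all hypothetical.

```python
def has_quorum(votes_down, total_observers):
    """True when a strict majority of observers report the primary as down."""
    return votes_down > total_observers // 2


class FencingTokens:
    """Issues increasing epochs; storage rejects writes from older epochs."""

    def __init__(self):
        self.current_epoch = 0

    def next_epoch(self):
        self.current_epoch += 1
        return self.current_epoch

    def accept_write(self, epoch):
        return epoch >= self.current_epoch


# Three observers in different regions probe the primary (True = healthy).
observations = {"us-west-2": False, "eu-west-1": False, "ap-south-1": True}
votes_down = sum(1 for healthy in observations.values() if not healthy)

tokens = FencingTokens()
old_primary_epoch = tokens.next_epoch()      # epoch 1: original primary

if has_quorum(votes_down, len(observations)):
    new_primary_epoch = tokens.next_epoch()  # epoch 2: promoted region
    assert tokens.accept_write(new_primary_epoch)
    assert not tokens.accept_write(old_primary_epoch)  # stale primary is fenced
```

In a real deployment the epoch would live in a strongly consistent store and be checked by the storage layer on every write, rather than in process memory.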

Cost Considerations

Cost Category	Single Region (baseline)	Multi-Region (Active-Passive)	Multi-Region (Active-Active)
Compute	1x	1.5-2x	2-3x
Storage	1x	2x	2-3x
Data Transfer	Minimal	Moderate (replication)	High (bidirectional)
Operational Overhead	1x	2x	3-5x

Real-World Examples

Netflix: Active-active across 3 AWS regions (us-east, us-west, eu-west). Uses EVCache for multi-region caching and custom replication for their data stores. Runs Chaos Monkey to terminate instances at random and Chaos Kong to rehearse full region evacuations.
Spotify: Uses Google Cloud with multi-region deployment. Data is partitioned by user home region. Cross-region reads use eventual consistency for non-critical data.
CockroachDB: Provides built-in multi-region with configurable survival goals (zone failure, region failure) and data domiciling for compliance.

Multi-region design builds on several related topics: geo-distribution for deployment strategies, leader election for primary selection, and data synchronization for conflict resolution.

Frequently Asked Questions

Q: How do I test multi-region failover?

Run regular failover drills (Netflix calls these "region evacuations"). Simulate region failure by blocking traffic to a region or shutting down its services. Automate failover testing in your CI/CD pipeline. Game days where the team practices responding to region failures build operational confidence.

Q: What is the RTO and RPO for multi-region systems?

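Active-passive typically achieves an RTO (Recovery Time Objective) of 5-30 minutes, driven by failure detection, database promotion, and DNS propagation, and an RPO (Recovery Point Objective) of seconds to minutes, equal to the replication lag at the moment of failure. Active-active achieves near-zero RTO because traffic is routed automatically to the remaining healthy regions. Its RPO is near zero only when writes are replicated synchronously; with asynchronous replication, writes accepted by the failed region but not yet replicated can still be lost, so RPO is again bounded by replication lag.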
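
As a rough worked example, the automated part of an active-passive RTO can be estimated by adding up its components. The health-check interval and failure threshold below match the Route 53 example earlier in this guide; the promotion time, DNS cache time, and replication lag are assumed values chosen only for illustration. The estimate excludes human decision time, which often dominates in practice.

```python
def estimate_rto_seconds(request_interval, failure_threshold,
                         promotion_seconds, dns_cache_seconds):
    """Rough bound: detect the failure, promote, then wait out DNS caches."""
    detection = request_interval * failure_threshold
    return detection + promotion_seconds + dns_cache_seconds


def estimate_rpo_seconds(replication_lag_seconds):
    """With async replication, writes not yet replicated at failure are lost."""
    return replication_lag_seconds


rto = estimate_rto_seconds(
    request_interval=10,    # seconds between health checks
    failure_threshold=3,    # consecutive failures before marking unhealthy
    promotion_seconds=120,  # assumed time to promote the standby database
    dns_cache_seconds=60,   # assumed time for resolvers to pick up the change
)
rpo = estimate_rpo_seconds(replication_lag_seconds=5)  # assumed lag

print(f"RTO ~ {rto} s, RPO ~ {rpo} s")  # prints "RTO ~ 210 s, RPO ~ 5 s"
```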

Q: How do I handle sessions in a multi-region system?

Use region-affine sessions: route each user to the same region via cookies or geo-routing, and store sessions in a region-local cache (Redis) with async replication to other regions. Alternatively, use stateless JWTs that any region can verify without shared session state; only the signing key (or public key) has to be available in every region.

Q: What databases support multi-region natively?

CockroachDB, Google Spanner, Azure Cosmos DB, DynamoDB Global Tables, and YugabyteDB all support multi-region deployment with varying consistency guarantees. For traditional single-primary relational databases such as MySQL and PostgreSQL, use managed cross-region replication (Aurora Global Database, Cloud SQL cross-region replicas).
