High Availability: Architecture Patterns and Strategies
High availability (HA) means designing systems that remain operational and accessible for the maximum possible time, typically targeting 99.9% uptime or higher. In a world where users expect 24/7 access and even minutes of downtime can cost millions in lost revenue, understanding HA patterns is essential for every systems engineer.
This guide covers HA architecture patterns, failover strategies, health checks, multi-region deployment, and real-world examples of how companies achieve high availability.
What Makes a System Highly Available?
A highly available system minimizes downtime by eliminating single points of failure and automating recovery. The three pillars of HA are:
- Redundancy: Multiple instances of every critical component
- Detection: Fast identification of failures (health checks, monitoring)
- Recovery: Automatic failover to healthy components (low MTTR)
As covered in our availability vs reliability guide, availability is measured using the formula: Availability = MTBF / (MTBF + MTTR). HA focuses heavily on reducing MTTR through automation.
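The formula is easy to sanity-check numerically. The figures below are illustrative, not benchmarks:

```python
# Availability = MTBF / (MTBF + MTTR)
# Illustrative figures: a component fails on average every 30 days (MTBF)
# and takes 30 minutes to recover (MTTR).
mtbf_hours = 30 * 24   # 720 hours between failures
mttr_hours = 0.5       # 30 minutes to recover

availability = mtbf_hours / (mtbf_hours + mttr_hours)
print(f"{availability:.5f}")       # ≈ 0.99931 — roughly "three nines"

# Halving MTTR through automated failover raises availability
# without touching MTBF at all.
availability_fast = mtbf_hours / (mtbf_hours + mttr_hours / 2)
print(f"{availability_fast:.5f}")  # ≈ 0.99965
```

Note that the automated-recovery case improves availability purely by shrinking MTTR — exactly the lever HA automation targets.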
Active-Active vs Active-Passive
The two fundamental HA patterns determine how redundant components share the workload.
Active-Active
All instances actively handle traffic simultaneously. Load is distributed across all nodes. If one node fails, the remaining nodes absorb its traffic with zero switchover time.
```
Active-Active Architecture:

              Load Balancer
             /      |      \
      Server A  Server B  Server C
      (active)  (active)  (active)

Normal:   Each server handles ~33% of traffic
Failure:  Server B dies → A and C each handle ~50%
Recovery: Zero downtime, automatic rebalancing
```
Active-Passive
One instance (primary) handles all traffic while one or more standbys wait idle. If the primary fails, a standby is promoted to primary. There is a brief failover period during the switch.
```
Active-Passive Architecture:

   Primary Server  ←→  Standby Server
   (handles all        (idle, receives
    traffic)           data replication)

Normal:   Primary handles 100% of traffic
Failure:  Primary dies → Standby promoted (30-120 sec failover)
Recovery: Brief downtime during switchover
```
Comparison
| Aspect | Active-Active | Active-Passive |
|---|---|---|
| Failover Time | Near zero | Seconds to minutes |
| Resource Utilization | All resources active (efficient) | Standby is idle (wasteful) |
| Complexity | Higher (state sync, conflict resolution) | Lower (clear primary/secondary roles) |
| Consistency | May have conflicts (concurrent writes) | Simpler (single writer) |
| Cost | More efficient use of resources | Paying for idle standby |
| Best For | Stateless services, read-heavy workloads | Databases, stateful services |
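The active-passive promotion rule reduces to a single controller decision. Here is a minimal sketch — `next_active` is a hypothetical helper, and a real controller would also need fencing or quorum to avoid split-brain:

```python
def next_active(active: str, standby: str,
                active_ok: bool, standby_ok: bool) -> tuple[str, str]:
    """One step of an active-passive controller: return the (active, standby)
    pair after applying the failover rule. Promote the standby only when the
    active node is unhealthy AND the standby is known-good."""
    if not active_ok and standby_ok:
        return standby, active  # promotion: roles swap
    return active, standby      # no change (includes the "both down" case)

# Primary fails while the standby is healthy → standby is promoted:
print(next_active("primary", "standby", active_ok=False, standby_ok=True))
# → ('standby', 'primary')
```

The "both unhealthy" branch deliberately does nothing: promoting a node you cannot verify is how split-brain incidents start.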
Health Checks
Health checks are the eyes of your HA system. They detect failures so the system can react. Without effective health checks, a failed component might continue receiving traffic.
Types of Health Checks
| Type | What It Checks | Example |
|---|---|---|
| Liveness | Is the process running? | TCP connection to port 8080 |
| Readiness | Can it handle requests? | HTTP GET /health returns 200 |
| Deep Health | Are all dependencies healthy? | Check DB connection, cache, external APIs |
```python
# Example health check endpoints (Flask).
# `db`, `check_redis`, and `check_message_queue` are application-specific
# and assumed to exist elsewhere in the codebase.
from flask import Flask, jsonify

app = Flask(__name__)

@app.route('/health')
def health_check():
    """Basic liveness check: the process is up and serving HTTP."""
    return jsonify({"status": "healthy"}), 200

@app.route('/health/ready')
def readiness_check():
    """Deep health check - verify all dependencies."""
    checks = {
        "database": check_database(),
        "cache": check_redis(),
        "queue": check_message_queue(),
    }
    all_healthy = all(checks.values())
    status_code = 200 if all_healthy else 503
    return jsonify({
        "status": "ready" if all_healthy else "not_ready",
        "checks": checks,
    }), status_code

def check_database():
    try:
        db.execute("SELECT 1")  # any cheap query proves the connection works
        return True
    except Exception:
        return False
```
Health Check Configuration
```
Load Balancer Health Check Settings:

  Interval:            10 seconds  (how often to check)
  Timeout:             5 seconds   (max wait for response)
  Healthy threshold:   3           (consecutive successes to mark healthy)
  Unhealthy threshold: 2           (consecutive failures to mark unhealthy)

  Time to detect failure: ~20 seconds (2 missed checks × 10 second interval)
  Time to restore:        ~30 seconds (3 successful checks × 10 second interval)
```
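The detection and restore times follow mechanically from the settings, which makes them easy to tune:

```python
def detection_time(interval_s: float, unhealthy_threshold: int) -> float:
    """Worst-case time for a load balancer to mark a target unhealthy:
    it must miss `unhealthy_threshold` consecutive checks."""
    return interval_s * unhealthy_threshold

def restore_time(interval_s: float, healthy_threshold: int) -> float:
    """Time for a recovered target to be marked healthy again:
    it must pass `healthy_threshold` consecutive checks."""
    return interval_s * healthy_threshold

print(detection_time(10, 2))  # 20 — seconds of traffic sent to a dead node
print(restore_time(10, 3))    # 30 — seconds before a recovered node gets traffic
```

Lowering the interval or the unhealthy threshold shrinks detection time, at the cost of more probe traffic and a higher chance of flapping on transient blips.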
Failover Strategies
DNS-Based Failover
Route 53, Cloudflare, or other DNS providers can route traffic away from unhealthy endpoints. DNS failover is the coarsest level — it routes entire domains to different IP addresses. The limitation is DNS TTL: clients cache DNS records, so failover is not instantaneous.
Load Balancer Failover
Application load balancers (ALB, NLB) perform health checks on backend servers and automatically stop routing to unhealthy ones. This is faster than DNS (seconds vs minutes) and is the most common failover mechanism for web applications.
Database Failover
Database failover is the most critical and complex. Options include automatic promotion of a read replica to primary (AWS RDS Multi-AZ), consensus-based leader election (etcd, Zookeeper), and multi-master replication (CockroachDB, Spanner).
```
Database Failover Timeline (AWS RDS Multi-AZ):

  0s:      Primary database fails
  0-10s:   Failure detected by health monitoring
  10-30s:  Standby promoted to primary
  30-60s:  DNS updated to point to new primary
  60-90s:  Application connections drain and reconnect

  Total failover: 60-120 seconds
  Data loss: Zero (synchronous replication to standby)
```
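The consensus-based option mentioned above (etcd, ZooKeeper) ultimately rests on one rule: a node may become primary only with a strict majority of votes. A minimal sketch of that quorum check:

```python
def can_become_primary(votes_for: int, cluster_size: int) -> bool:
    """A node may promote itself only with a strict majority (quorum).
    Two network partitions can never both hold a majority, so at most
    one side of a split can elect a primary."""
    return votes_for > cluster_size // 2

# In a 5-node cluster, the minority side of a 2/3 partition cannot elect:
print(can_become_primary(2, 5))  # False
print(can_become_primary(3, 5))  # True
```

This is also why clusters are sized with odd node counts: a 4-node cluster tolerates the same single-node loss as a 3-node cluster but costs more.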
Multi-Region High Availability
For the highest levels of availability (99.99%+), you need multi-region deployment. This protects against entire region failures — natural disasters, power grid outages, or cloud provider issues.
Multi-Region Patterns
| Pattern | Description | Complexity | RTO |
|---|---|---|---|
| Backup/Restore | Backups in another region, restore on failure | Low | Hours |
| Pilot Light | Minimal infra running in DR region, scale up on failure | Medium | Minutes to an hour |
| Warm Standby | Scaled-down version running in DR region | Medium-High | Minutes |
| Active-Active Multi-Region | Full capacity in all regions, traffic served locally | Very High | Near zero |
Active-active multi-region is the gold standard but introduces complex consistency challenges. How do you handle a user who updates their profile in US-East while simultaneously being read in EU-West? The CAP theorem applies: during cross-region network issues, you must choose between consistency and availability.
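One common (if lossy) answer to the profile-update question is last-write-wins conflict resolution: each replica tags every record with an update timestamp, and on merge the later write prevails. A minimal sketch, assuming wall clocks can be trusted — in practice they often cannot, which is why Spanner relies on TrueTime and other systems use vector clocks:

```python
def merge(record_a: dict, record_b: dict) -> dict:
    """Last-write-wins: keep whichever version carries the later timestamp.
    Simple and convergent, but the losing write is silently discarded."""
    if record_a["updated_at"] >= record_b["updated_at"]:
        return record_a
    return record_b

us_east = {"name": "Alice",  "updated_at": 1700000005}
eu_west = {"name": "Alicia", "updated_at": 1700000003}
print(merge(us_east, eu_west)["name"])  # "Alice" — the later US-East write wins
```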
SLAs and Availability Targets
High availability requirements are formalized in Service Level Agreements (SLAs). Common targets:
```
Availability Targets by System Type:

  Internal tools:             99.0%    (3.65 days downtime/year)
  B2B SaaS:                   99.9%    (8.77 hours downtime/year)
  Consumer web applications:  99.95%   (4.38 hours downtime/year)
  E-commerce / payments:      99.99%   (52.6 minutes downtime/year)
  Infrastructure services:    99.999%  (5.26 minutes downtime/year)
```
Each additional nine roughly doubles engineering cost.
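The downtime budgets above fall out of simple arithmetic on the availability target:

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # ≈ 525,960

def downtime_minutes_per_year(availability: float) -> float:
    """Annual downtime budget implied by an availability target."""
    return (1 - availability) * MINUTES_PER_YEAR

for target in (0.999, 0.9995, 0.9999, 0.99999):
    print(f"{target}: {downtime_minutes_per_year(target):.1f} min/year")
```

At 99.99%, the entire annual budget (~52.6 minutes) can be consumed by a single slow manual failover — which is why four nines effectively requires automated recovery.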
HA Anti-Patterns
- Manual failover: If failover requires human intervention (paging an on-call engineer, running scripts), your MTTR will be measured in minutes to hours, not seconds.
- Untested failover: Failover mechanisms that are never tested often fail when needed. Run regular failover drills.
- Single region: Running everything in one region means a region-level outage takes down your entire service. Even within a region, use multiple availability zones.
- Ignoring dependencies: Your system's availability is limited by its least available dependency. A 99.99% service that depends on a 99.9% database is really a 99.9% system.
- Missing graceful degradation: When a non-critical service fails, the whole system should not fail. Show cached data, disable non-essential features, or serve a simplified experience.
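The dependency math behind the "ignoring dependencies" anti-pattern is just multiplication of serial availabilities:

```python
def serial_availability(*components: float) -> float:
    """Availability of a chain of hard dependencies: the product of the
    individual availabilities, since any one failing fails the request."""
    result = 1.0
    for a in components:
        result *= a
    return result

# A 99.99% service with a hard dependency on a 99.9% database:
print(f"{serial_availability(0.9999, 0.999):.4f}")  # 0.9989 — below three nines
```

This is also the argument for graceful degradation: a soft (optional) dependency drops out of the product entirely when the system can serve requests without it.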
Real-World HA Architecture: E-Commerce Platform
```
HA Architecture for an E-Commerce Platform:

DNS Layer:
    Route 53 with health checks → failover between regions

CDN Layer:
    CloudFront / Akamai → cache static assets globally

Load Balancer Layer:
    ALB in each AZ → health checks every 10 seconds
    Cross-zone load balancing enabled

Application Layer:
    Auto-scaling group: min=4, max=20
    Deployed across 3 availability zones
    Stateless design → any server handles any request

Cache Layer:
    Redis cluster with 3 nodes (1 primary + 2 replicas)
    ElastiCache with Multi-AZ enabled

Database Layer:
    RDS PostgreSQL Multi-AZ (synchronous replication)
    Read replicas in 2 additional AZs
    Automated backups with point-in-time recovery

Queue Layer:
    SQS for async processing (managed, highly available)
    Dead letter queue for failed messages

Monitoring:
    CloudWatch alarms → SNS → PagerDuty
    Application-level metrics (request count, error rate, latency)
```
For related concepts, see Fault Tolerance, Scalability, and Trade-offs in System Design.
Frequently Asked Questions
What is the difference between high availability and disaster recovery?
High availability handles component-level failures with automatic failover (server crash, AZ outage). Disaster recovery handles region-level or catastrophic failures (natural disaster, entire cloud region down). HA has near-zero RTO (recovery time). DR typically has RTO measured in minutes to hours. Most production systems need both.
How do I achieve high availability for databases?
Use Multi-AZ deployments with synchronous replication (AWS RDS Multi-AZ, Azure SQL HA). For the highest availability, use globally distributed databases like CockroachDB or Google Spanner. For read-heavy workloads, add read replicas across availability zones. Always have automated backups and test your restore process regularly.
Is active-active always better than active-passive?
Not always. Active-active is better for stateless services and provides better resource utilization. But for stateful services (databases), active-active introduces consistency challenges — concurrent writes to different nodes can conflict. Active-passive is simpler and ensures strong consistency for databases. Choose based on your consistency and complexity tolerance.
How do I handle split-brain in active-passive failover?
Split-brain occurs when both the primary and standby think they are the active node. Prevent this with: fencing (STONITH — Shoot The Other Node In The Head), quorum-based decisions (require a majority vote to become primary), and shared storage with locks. Cloud-managed databases (RDS, Cloud SQL) handle this automatically.
What is the minimum availability I should target?
It depends on your business. Calculate the cost of downtime per hour (lost revenue + customer trust + SLA penalties). If downtime costs less than the engineering investment to prevent it, your current availability is sufficient. Most B2B SaaS should target at least 99.9% (three nines). Consumer-facing applications often need 99.95% or higher.