High Availability: Architecture Patterns and Strategies
High availability (HA) means designing systems that remain operational and accessible for the maximum possible time, typically targeting 99.9% uptime or higher. In a world where users expect 24/7 access and even minutes of downtime can cost millions in lost revenue, understanding HA patterns is essential for every systems engineer.
This guide covers HA architecture patterns, failover strategies, health checks, multi-region deployment, and real-world examples of how companies achieve high availability.
What Makes a System Highly Available?
A highly available system minimizes downtime by eliminating single points of failure and automating recovery. The three pillars of HA are:
- Redundancy: Multiple instances of every critical component
- Detection: Fast identification of failures (health checks, monitoring)
- Recovery: Automatic failover to healthy components (low MTTR)
As covered in our availability vs reliability guide, availability is measured using the formula: Availability = MTBF / (MTBF + MTTR). HA focuses heavily on reducing MTTR through automation.
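The formula is easy to sanity-check numerically. The figures below are illustrative, not benchmarks:

```python
# Availability = MTBF / (MTBF + MTTR)
# Illustrative figures: a component fails on average every 30 days (MTBF)
# and takes 30 minutes to recover (MTTR).
mtbf_hours = 30 * 24   # 720 hours between failures
mttr_hours = 0.5       # 30 minutes to recover

availability = mtbf_hours / (mtbf_hours + mttr_hours)
print(f"{availability:.5f}")       # ≈ 0.99931 — roughly "three nines"

# Halving MTTR through automated failover raises availability
# without touching MTBF at all.
availability_fast = mtbf_hours / (mtbf_hours + mttr_hours / 2)
print(f"{availability_fast:.5f}")  # ≈ 0.99965
```

Note that the automated-recovery case improves availability purely by shrinking MTTR — exactly the lever HA automation targets.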
Active-Active vs Active-Passive
The two fundamental HA patterns determine how redundant components share the workload.
Active-Active
All instances actively handle traffic simultaneously. Load is distributed across all nodes. If one node fails, the remaining nodes absorb its traffic with zero switchover time.
```
Active-Active Architecture:

              Load Balancer
             /      |      \
      Server A  Server B  Server C
      (active)  (active)  (active)

Normal:   Each server handles ~33% of traffic
Failure:  Server B dies → A and C each handle ~50%
Recovery: Zero downtime, automatic rebalancing
```
Active-Passive
One instance (primary) handles all traffic while one or more standbys wait idle. If the primary fails, a standby is promoted to primary. There is a brief failover period during the switch.
```
Active-Passive Architecture:

   Primary Server  ←→  Standby Server
   (handles all        (idle, receives
    traffic)           data replication)

Normal:   Primary handles 100% of traffic
Failure:  Primary dies → Standby promoted (30-120 sec failover)
Recovery: Brief downtime during switchover
```
Comparison
| Aspect | Active-Active | Active-Passive |
|---|---|---|
| Failover Time | Near zero | Seconds to minutes |
| Resource Utilization | All resources active (efficient) | Standby is idle (wasteful) |
| Complexity | Higher (state sync, conflict resolution) | Lower (clear primary/secondary roles) |
| Consistency | May have conflicts (concurrent writes) | Simpler (single writer) |
| Cost | More efficient use of resources | Paying for idle standby |
| Best For | Stateless services, read-heavy workloads | Databases, stateful services |
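The active-passive promotion rule reduces to a single controller decision. Here is a minimal sketch — `next_active` is a hypothetical helper, and a real controller would also need fencing or quorum to avoid split-brain:

```python
def next_active(active: str, standby: str,
                active_ok: bool, standby_ok: bool) -> tuple[str, str]:
    """One step of an active-passive controller: return the (active, standby)
    pair after applying the failover rule. Promote the standby only when the
    active node is unhealthy AND the standby is known-good."""
    if not active_ok and standby_ok:
        return standby, active  # promotion: roles swap
    return active, standby      # no change (includes the "both down" case)

# Primary fails while the standby is healthy → standby is promoted:
print(next_active("primary", "standby", active_ok=False, standby_ok=True))
# → ('standby', 'primary')
```

The "both unhealthy" branch deliberately does nothing: promoting a node you cannot verify is how split-brain incidents start.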
Health Checks
Health checks are the eyes of your HA system. They detect failures so the system can react. Without effective health checks, a failed component might continue receiving traffic.
Types of Health Checks
| Type | What It Checks | Example |
|---|---|---|
| Liveness | Is the process running? | TCP connection to port 8080 |
| Readiness | Can it handle requests? | HTTP GET /health returns 200 |
| Deep Health | Are all dependencies healthy? | Check DB connection, cache, external APIs |
```python
# Example health check endpoints (Flask).
# `db`, `check_redis`, and `check_message_queue` are application-specific
# and assumed to exist elsewhere in the codebase.
from flask import Flask, jsonify

app = Flask(__name__)

@app.route('/health')
def health_check():
    """Basic liveness check: the process is up and serving HTTP."""
    return jsonify({"status": "healthy"}), 200

@app.route('/health/ready')
def readiness_check():
    """Deep health check - verify all dependencies."""
    checks = {
        "database": check_database(),
        "cache": check_redis(),
        "queue": check_message_queue(),
    }
    all_healthy = all(checks.values())
    status_code = 200 if all_healthy else 503
    return jsonify({
        "status": "ready" if all_healthy else "not_ready",
        "checks": checks,
    }), status_code

def check_database():
    try:
        db.execute("SELECT 1")  # any cheap query proves the connection works
        return True
    except Exception:
        return False
```
Health Check Configuration
```
Load Balancer Health Check Settings:

  Interval:            10 seconds  (how often to check)
  Timeout:             5 seconds   (max wait for response)
  Healthy threshold:   3           (consecutive successes to mark healthy)
  Unhealthy threshold: 2           (consecutive failures to mark unhealthy)

  Time to detect failure: ~20 seconds (2 missed checks × 10 second interval)
  Time to restore:        ~30 seconds (3 successful checks × 10 second interval)
```
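The detection and restore times follow mechanically from the settings, which makes them easy to tune:

```python
def detection_time(interval_s: float, unhealthy_threshold: int) -> float:
    """Worst-case time for a load balancer to mark a target unhealthy:
    it must miss `unhealthy_threshold` consecutive checks."""
    return interval_s * unhealthy_threshold

def restore_time(interval_s: float, healthy_threshold: int) -> float:
    """Time for a recovered target to be marked healthy again:
    it must pass `healthy_threshold` consecutive checks."""
    return interval_s * healthy_threshold

print(detection_time(10, 2))  # 20 — seconds of traffic sent to a dead node
print(restore_time(10, 3))    # 30 — seconds before a recovered node gets traffic
```

Lowering the interval or the unhealthy threshold shrinks detection time, at the cost of more probe traffic and a higher chance of flapping on transient blips.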
Failover Strategies
DNS-Based Failover
Route 53, Cloudflare, or other DNS providers can route traffic away from unhealthy endpoints. DNS failover is the coarsest level — it routes entire domains to different IP addresses. The limitation is DNS TTL: clients cache DNS records, so failover is not instantaneous.
Load Balancer Failover
Application load balancers (ALB, NLB) perform health checks on backend servers and automatically stop routing to unhealthy ones. This is faster than DNS (seconds vs minutes) and is the most common failover mechanism for web applications.
Database Failover
Database failover is the most critical and complex. Options include automatic promotion of a read replica to primary (AWS RDS Multi-AZ), consensus-based leader election (etcd, Zookeeper), and multi-master replication (CockroachDB, Spanner).
```
Database Failover Timeline (AWS RDS Multi-AZ):

  0s:      Primary database fails
  0-10s:   Failure detected by health monitoring
  10-30s:  Standby promoted to primary
  30-60s:  DNS updated to point to new primary
  60-90s:  Application connections drain and reconnect

  Total failover: 60-120 seconds
  Data loss: Zero (synchronous replication to standby)
```
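The consensus-based option mentioned above (etcd, ZooKeeper) ultimately rests on one rule: a node may become primary only with a strict majority of votes. A minimal sketch of that quorum check:

```python
def can_become_primary(votes_for: int, cluster_size: int) -> bool:
    """A node may promote itself only with a strict majority (quorum).
    Two network partitions can never both hold a majority, so at most
    one side of a split can elect a primary."""
    return votes_for > cluster_size // 2

# In a 5-node cluster, the minority side of a 2/3 partition cannot elect:
print(can_become_primary(2, 5))  # False
print(can_become_primary(3, 5))  # True
```

This is also why clusters are sized with odd node counts: a 4-node cluster tolerates the same single-node loss as a 3-node cluster but costs more.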
Multi-Region High Availability
For the highest levels of availability (99.99%+), you need multi-region deployment. This protects against entire region failures — natural disasters, power grid outages, or cloud provider issues.
Multi-Region Patterns
| Pattern | Description | Complexity | RTO |
|---|---|---|---|
| Backup/Restore | Backups in another region, restore on failure | Low | Hours |
| Pilot Light | Minimal infra running in DR region, scale up on failure | Medium | Minutes to an hour |
| Warm Standby | Scaled-down version running in DR region | Medium-High | Minutes |
| Active-Active Multi-Region | Full capacity in all regions, traffic served locally | Very High | Near zero |
Active-active multi-region is the gold standard but introduces complex consistency challenges. How do you handle a user who updates their profile in US-East while simultaneously being read in EU-West? The CAP theorem applies: during cross-region network issues, you must choose between consistency and availability.
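One common (if lossy) answer to the profile-update question is last-write-wins conflict resolution: each replica tags every record with an update timestamp, and on merge the later write prevails. A minimal sketch, assuming wall clocks can be trusted — in practice they often cannot, which is why Spanner relies on TrueTime and other systems use vector clocks:

```python
def merge(record_a: dict, record_b: dict) -> dict:
    """Last-write-wins: keep whichever version carries the later timestamp.
    Simple and convergent, but the losing write is silently discarded."""
    if record_a["updated_at"] >= record_b["updated_at"]:
        return record_a
    return record_b

us_east = {"name": "Alice",  "updated_at": 1700000005}
eu_west = {"name": "Alicia", "updated_at": 1700000003}
print(merge(us_east, eu_west)["name"])  # "Alice" — the later US-East write wins
```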
SLAs and Availability Targets
High availability requirements are formalized in Service Level Agreements (SLAs). Common targets:
```
Availability Targets by System Type:

  Internal tools:             99.0%    (3.65 days downtime/year)
  B2B SaaS:                   99.9%    (8.77 hours downtime/year)
  Consumer web applications:  99.95%   (4.38 hours downtime/year)
  E-commerce / payments:      99.99%   (52.6 minutes downtime/year)
  Infrastructure services:    99.999%  (5.26 minutes downtime/year)
```
Each additional nine roughly doubles engineering cost.
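The downtime budgets above fall out of simple arithmetic on the availability target:

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # ≈ 525,960

def downtime_minutes_per_year(availability: float) -> float:
    """Annual downtime budget implied by an availability target."""
    return (1 - availability) * MINUTES_PER_YEAR

for target in (0.999, 0.9995, 0.9999, 0.99999):
    print(f"{target}: {downtime_minutes_per_year(target):.1f} min/year")
```

At 99.99%, the entire annual budget (~52.6 minutes) can be consumed by a single slow manual failover — which is why four nines effectively requires automated recovery.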
HA Anti-Patterns
- Manual failover: If failover requires human intervention (paging an on-call engineer, running scripts), your MTTR will be measured in minutes to hours, not seconds.
- Untested failover: Failover mechanisms that are never tested often fail when needed. Run regular failover drills.
- Single region: Running everything in one region means a region-level outage takes down your entire service. Even within a region, use multiple availability zones.
- Ignoring dependencies: Your system's availability is limited by its least available dependency. A 99.99% service that depends on a 99.9% database is really a 99.9% system.
- Missing graceful degradation: When a non-critical service fails, the whole system should not fail. Show cached data, disable non-essential features, or serve a simplified experience.
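The dependency math behind the "ignoring dependencies" anti-pattern is just multiplication of serial availabilities:

```python
def serial_availability(*components: float) -> float:
    """Availability of a chain of hard dependencies: the product of the
    individual availabilities, since any one failing fails the request."""
    result = 1.0
    for a in components:
        result *= a
    return result

# A 99.99% service with a hard dependency on a 99.9% database:
print(f"{serial_availability(0.9999, 0.999):.4f}")  # 0.9989 — below three nines
```

This is also the argument for graceful degradation: a soft (optional) dependency drops out of the product entirely when the system can serve requests without it.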
Real-World HA Architecture: E-Commerce Platform
```
HA Architecture for an E-Commerce Platform:

DNS Layer:
    Route 53 with health checks → failover between regions

CDN Layer:
    CloudFront / Akamai → cache static assets globally

Load Balancer Layer:
    ALB in each AZ → health checks every 10 seconds
    Cross-zone load balancing enabled

Application Layer:
    Auto-scaling group: min=4, max=20
    Deployed across 3 availability zones
    Stateless design → any server handles any request

Cache Layer:
    Redis cluster with 3 nodes (1 primary + 2 replicas)
    ElastiCache with Multi-AZ enabled

Database Layer:
    RDS PostgreSQL Multi-AZ (synchronous replication)
    Read replicas in 2 additional AZs
    Automated backups with point-in-time recovery

Queue Layer:
    SQS for async processing (managed, highly available)
    Dead letter queue for failed messages

Monitoring:
    CloudWatch alarms → SNS → PagerDuty
    Application-level metrics (request count, error rate, latency)
```
For related concepts, see Fault Tolerance, Scalability, and Trade-offs in System Design.
Frequently Asked Questions
What is the difference between high availability and disaster recovery?
High availability handles component-level failures with automatic failover (server crash, AZ outage). Disaster recovery handles region-level or catastrophic failures (natural disaster, entire cloud region down). HA has near-zero RTO (recovery time). DR typically has RTO measured in minutes to hours. Most production systems need both.
How do I achieve high availability for databases?
Use Multi-AZ deployments with synchronous replication (AWS RDS Multi-AZ, Azure SQL HA). For the highest availability, use globally distributed databases like CockroachDB or Google Spanner. For read-heavy workloads, add read replicas across availability zones. Always have automated backups and test your restore process regularly.
Is active-active always better than active-passive?
Not always. Active-active is better for stateless services and provides better resource utilization. But for stateful services (databases), active-active introduces consistency challenges — concurrent writes to different nodes can conflict. Active-passive is simpler and ensures strong consistency for databases. Choose based on your consistency and complexity tolerance.
How do I handle split-brain in active-passive failover?
Split-brain occurs when both the primary and standby think they are the active node. Prevent this with: fencing (STONITH — Shoot The Other Node In The Head), quorum-based decisions (require a majority vote to become primary), and shared storage with locks. Cloud-managed databases (RDS, Cloud SQL) handle this automatically.
What is the minimum availability I should target?
It depends on your business. Calculate the cost of downtime per hour (lost revenue + customer trust + SLA penalties). If downtime costs less than the engineering investment to prevent it, your current availability is sufficient. Most B2B SaaS should target at least 99.9% (three nines). Consumer-facing applications often need 99.95% or higher.