# Horizontal Scaling: Building Systems That Grow Outward
Horizontal scaling (scaling out) adds more machines to handle increased load, as opposed to vertical scaling (scaling up) which adds more power to a single machine. Horizontal scaling is the foundation of modern distributed systems because it offers near-unlimited growth potential, better fault tolerance, and cost efficiency. This guide covers the key strategies for designing horizontally scalable systems.
## Horizontal vs Vertical Scaling
| Aspect | Horizontal (Scale Out) | Vertical (Scale Up) |
|---|---|---|
| Approach | Add more machines | Upgrade existing machine |
| Upper Limit | Practically unlimited | Hardware ceiling |
| Fault Tolerance | High — one node failure is survivable | Single point of failure |
| Complexity | Higher (distributed system challenges) | Lower (single machine) |
| Cost | Roughly linear — many commodity machines | Superlinear — high-end hardware carries a steep price premium |
| Downtime | Zero — add nodes while running | May require restart |
## Designing Stateless Services
The first requirement for horizontal scaling is statelessness. A stateless service does not store any client-specific data between requests. Any instance can handle any request, making it trivial to add or remove instances.
```python
# Stateful (BAD for horizontal scaling)
class StatefulServer:
    def __init__(self):
        self.sessions = {}  # stored in this instance's memory

    def handle_request(self, request):
        session = self.sessions.get(request.session_id)
        # If this request hits a different instance, the session is lost!
        ...

# Stateless (GOOD for horizontal scaling)
class StatelessServer:
    def __init__(self, redis_client):
        self.redis = redis_client  # external session store

    def handle_request(self, request):
        session = self.redis.get(f"session:{request.session_id}")
        # Any instance can serve this request
        ...
```
## Session Management Strategies
| Strategy | How It Works | Pros | Cons |
|---|---|---|---|
| Sticky Sessions | Load balancer routes user to same server | Simple, no shared state | Uneven load, data loss on failure |
| Distributed Sessions | Sessions stored in Redis/Memcached | Even load distribution | External dependency, latency |
| JWT Tokens | Session data encoded in token | No server-side state | Token size, cannot invalidate easily |
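Sticky sessions are normally configured at the load balancer (cookie- or IP-based affinity). The core idea is just a stable mapping from client to backend, sketched below with an IP hash, similar in spirit to nginx's `ip_hash`; the server names are illustrative:

```python
import zlib

servers = ["app-1", "app-2", "app-3"]

def sticky_route(client_ip):
    # Same client IP always lands on the same backend.
    # crc32 is stable across processes, unlike Python's built-in hash().
    return servers[zlib.crc32(client_ip.encode()) % len(servers)]

backend = sticky_route("203.0.113.7")  # stable across requests
```

Note the table's caveat: if that backend dies, its in-memory sessions die with it, and reassigning its clients skews load on the survivors.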
With JWTs, the session data lives in the token itself, so any server can verify a request without shared state:

```python
from datetime import datetime, timedelta, timezone

import jwt  # PyJWT

SECRET_KEY = "change-me"  # load from a secrets manager in production

class AuthenticationError(Exception):
    pass

# JWT-based stateless authentication
def create_jwt_token(user_id, roles):
    now = datetime.now(timezone.utc)  # timezone-aware; utcnow() is deprecated
    payload = {
        "sub": user_id,
        "roles": roles,
        "exp": now + timedelta(hours=1),
        "iat": now,
    }
    return jwt.encode(payload, SECRET_KEY, algorithm="HS256")

def verify_request(request):
    token = request.headers.get("Authorization", "").replace("Bearer ", "")
    try:
        payload = jwt.decode(token, SECRET_KEY, algorithms=["HS256"])
        return payload  # Any server can verify this
    except jwt.InvalidTokenError:
        raise AuthenticationError("Invalid token")
```
## Database Scaling

### Read Replicas

Read replicas scale read throughput horizontally: all writes go to the primary, and reads are spread across replicas.
```python
class DatabaseRouter:
    def __init__(self, primary, replicas):
        self.primary = primary
        self.replicas = replicas
        self.replica_index = 0

    def get_connection(self, operation="read"):
        if operation == "write":
            return self.primary
        # Round-robin across read replicas
        # (not thread-safe; guard with a lock if shared across threads)
        replica = self.replicas[self.replica_index % len(self.replicas)]
        self.replica_index += 1
        return replica

# Usage
router = DatabaseRouter(
    primary="postgres://primary:5432/app",
    replicas=[
        "postgres://replica1:5432/app",
        "postgres://replica2:5432/app",
        "postgres://replica3:5432/app",
    ],
)
```
### Database Sharding

Sharding partitions data across multiple databases by a key, here the user ID, so each shard holds only a subset of rows.
```python
import zlib

class ShardRouter:
    def __init__(self, shard_configs):
        self.shards = shard_configs
        self.num_shards = len(shard_configs)

    def get_shard(self, user_id):
        # Use a stable hash: Python's built-in hash() is randomized
        # per process, which would route the same user to different
        # shards on different instances.
        shard_index = zlib.crc32(str(user_id).encode()) % self.num_shards
        return self.shards[shard_index]

    def query(self, user_id, sql, params):
        shard = self.get_shard(user_id)
        return shard.execute(sql, params)

    def query_all_shards(self, sql, params):
        """For queries that span all shards (expensive)."""
        results = []
        for shard in self.shards:
            results.extend(shard.execute(sql, params))
        return results
```
## Load Balancing Strategies
| Algorithm | Description | Best For |
|---|---|---|
| Round Robin | Distribute requests evenly in order | Homogeneous servers, equal request cost |
| Weighted Round Robin | More traffic to more powerful servers | Heterogeneous server sizes |
| Least Connections | Route to server with fewest active connections | Variable request processing times |
| Consistent Hashing | Hash-based routing to specific servers | Caching, stateful services |
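Consistent hashing, the last strategy in the table, maps keys onto a hash ring so that adding or removing a server only remaps the keys that belonged to it. Below is a minimal sketch with virtual nodes (the node names and MD5-based hash are arbitrary illustrative choices, not a production balancer):

```python
import bisect
import hashlib

class HashRing:
    """Minimal consistent-hash ring with virtual nodes."""

    def __init__(self, nodes, vnodes=100):
        self.ring = []  # sorted list of (hash, node)
        for node in nodes:
            for i in range(vnodes):
                self.ring.append((self._hash(f"{node}#{i}"), node))
        self.ring.sort()

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def get_node(self, key):
        # Walk clockwise to the first virtual node at or after the key's hash
        idx = bisect.bisect(self.ring, (self._hash(key), "")) % len(self.ring)
        return self.ring[idx][1]

ring = HashRing(["cache-a", "cache-b", "cache-c"])
node = ring.get_node("user:42")  # same key always maps to the same node
```

Virtual nodes smooth out the load: with only one ring position per server, the arc sizes (and hence key shares) would be very uneven.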
## Auto-Scaling Groups
```hcl
# AWS Auto Scaling Group configuration (Terraform)
resource "aws_autoscaling_group" "web_servers" {
  name                = "web-asg"
  min_size            = 2
  max_size            = 20
  desired_capacity    = 4
  vpc_zone_identifier = var.private_subnet_ids

  launch_template {
    id      = aws_launch_template.web.id
    version = "$Latest"
  }

  target_group_arns = [aws_lb_target_group.web.arn]

  tag {
    key                 = "Name"
    value               = "web-server"
    propagate_at_launch = true
  }
}

resource "aws_autoscaling_policy" "scale_up" {
  name                   = "scale-up"
  autoscaling_group_name = aws_autoscaling_group.web_servers.name
  policy_type            = "TargetTrackingScaling"

  target_tracking_configuration {
    predefined_metric_specification {
      predefined_metric_type = "ASGAverageCPUUtilization"
    }
    target_value = 70.0
  }
}
```
Horizontal scaling connects to several related topics: auto-scaling for automatic capacity management, consistent hashing for data distribution, and load testing for validating scaling behavior. Use the System Design Calculator to estimate your scaling requirements.
## Frequently Asked Questions
Q: When should I scale vertically instead of horizontally?
Scale vertically when your application is not easily parallelizable (single-threaded workloads, in-memory computation that needs large RAM), when the operational complexity of distributed systems is not justified, or as a quick fix while planning a horizontal scaling architecture.
Q: How do I handle database scaling with horizontal application scaling?
Start with read replicas for read-heavy workloads. Add connection pooling (PgBouncer, ProxySQL) to handle more application instances. For write-heavy workloads, consider sharding by a partition key. For maximum scalability, use a distributed database like CockroachDB, Vitess, or DynamoDB.
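The problem that poolers like PgBouncer solve — every new app instance multiplying the database's connection count — can be illustrated with a minimal in-process pool. This is a sketch, not a real pooler; `make_conn` stands in for a driver's connection factory:

```python
import queue

class ConnectionPool:
    """Minimal fixed-size pool: many workers share max_size connections."""

    def __init__(self, make_conn, max_size=5):
        self._pool = queue.Queue(maxsize=max_size)
        for _ in range(max_size):
            self._pool.put(make_conn())

    def acquire(self, timeout=5.0):
        # Blocks when all connections are checked out,
        # capping concurrent load on the database
        return self._pool.get(timeout=timeout)

    def release(self, conn):
        self._pool.put(conn)

pool = ConnectionPool(make_conn=lambda: object(), max_size=2)
c1 = pool.acquire()
c2 = pool.acquire()
pool.release(c1)  # a third acquire() would have blocked until this release
```

A standalone pooler applies the same cap across all application instances, which per-process pools cannot do on their own.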
Q: What is the maximum number of instances I should run?
There is no universal limit, but consider: load balancer connection limits, database connection limits, deployment time (rolling updates take longer with more instances), and cost. Most applications scale well to 50-100 instances. Beyond that, consider microservice decomposition or architectural changes.
Q: How do I handle file storage in a horizontally scaled system?
Never store files on the local disk of application servers. Use object storage (S3, Azure Blob, GCS) for file persistence. For temporary processing, use shared file systems (EFS, Azure Files) or process files in memory. This ensures any instance can serve any request, regardless of which instance received the upload.
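As a sketch of the pattern — the `ObjectStore` class below is an illustrative in-memory stand-in, not a real S3/GCS/Azure client — the upload handler writes to shared storage rather than the local filesystem, so any instance can later serve the file:

```python
class ObjectStore:
    """Illustrative stand-in for an S3/GCS/Azure Blob client."""

    def __init__(self):
        self._blobs = {}  # in-memory fake; a real impl calls the cloud API

    def put(self, key, data):
        self._blobs[key] = data

    def get(self, key):
        return self._blobs[key]

store = ObjectStore()  # shared by ALL app instances

def handle_upload(user_id, filename, data):
    key = f"uploads/{user_id}/{filename}"
    store.put(key, data)  # never the local filesystem
    return key

def handle_download(key):
    return store.get(key)  # works on any instance
```

The key convention (`uploads/{user_id}/{filename}`) is an assumption for the example; what matters is that the key, not a server-local path, is the durable handle to the file.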