Auto Scaling: Dynamic Capacity Management for Cloud Systems
Auto scaling automatically adjusts the number of compute resources based on current demand. Instead of provisioning for peak traffic and wasting money during off-peak hours, auto scaling adds capacity when needed and removes it when demand drops. This guide covers the different scaling strategies, cloud provider implementations, and best practices for configuring auto scaling policies.
Types of Auto Scaling
| Type | Trigger | Best For | Latency |
|---|---|---|---|
| Reactive (Metrics-Based) | CPU, memory, request rate crosses threshold | Gradual traffic changes | Minutes (detection + provisioning) |
| Predictive | ML-based traffic prediction | Predictable patterns (daily, weekly) | Proactive — scales before traffic arrives |
| Scheduled | Time-based rules | Known events (launches, sales) | Zero — pre-provisioned |
Metrics-Based Scaling
The most common approach: monitor key metrics and scale when thresholds are crossed.
# AWS Auto Scaling - Target Tracking Policy
# Maintains CPU at 70% by adding/removing instances
aws autoscaling put-scaling-policy \
  --auto-scaling-group-name my-asg \
  --policy-name cpu-target-tracking \
  --policy-type TargetTrackingScaling \
  --target-tracking-configuration '{
    "PredefinedMetricSpecification": {
      "PredefinedMetricType": "ASGAverageCPUUtilization"
    },
    "TargetValue": 70.0,
    "ScaleInCooldown": 300,
    "ScaleOutCooldown": 60
  }'
Custom Metrics Scaling
import boto3

cloudwatch = boto3.client("cloudwatch")

# Publish custom metric: queue depth per instance
def publish_queue_metric(queue_depth, instance_count):
    per_instance = queue_depth / max(instance_count, 1)
    cloudwatch.put_metric_data(
        Namespace="MyApp",
        MetricData=[{
            "MetricName": "QueueDepthPerInstance",
            "Value": per_instance,
            "Unit": "Count"
        }]
    )
# Scale based on queue depth per instance
# Target: 10 messages per instance
# If queue has 100 messages and 5 instances -> 20/instance -> scale up
# If queue has 30 messages and 5 instances -> 6/instance -> scale down
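The decision logic in the comments above can be sketched as a small function (a simplified model of the scaler, not the actual Auto Scaling service; the target and size bounds are illustrative):

```python
import math

def desired_capacity(queue_depth, target_per_instance=10, min_size=1, max_size=20):
    # Instances needed to keep queue depth per instance at the target,
    # clamped to the group's min/max size
    needed = math.ceil(queue_depth / target_per_instance)
    return max(min_size, min(needed, max_size))
```

With the numbers from the comments: 100 messages yields a desired capacity of 10 (scale up from 5), while 30 messages yields 3 (scale down from 5).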
Scaling Policies
Target Tracking
The simplest approach, and the one AWS recommends for most workloads. You set a target value for a metric, and AWS adjusts capacity to keep the metric near that target.
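Under the hood, target tracking behaves roughly like proportional control: desired capacity scales with the ratio of the current metric to the target. A minimal sketch of that rule (the real service also applies cooldowns, instance warm-up, and its own rounding logic):

```python
import math

def target_tracking_capacity(current_capacity, current_metric, target_metric):
    # Proportional rule: if average CPU is 90% with a 70% target,
    # capacity grows by the ratio 90/70 (rounded up, never under-provisioning)
    return math.ceil(current_capacity * (current_metric / target_metric))
```

For example, 10 instances at 90% CPU with a 70% target yields a desired capacity of 13; at 35% CPU the same rule shrinks the group to 5.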
Step Scaling
Defines different scaling actions based on the magnitude of the alarm breach.
# Step Scaling Policy
{
  "PolicyName": "step-scaling",
  "PolicyType": "StepScaling",
  "StepAdjustments": [
    {
      "MetricIntervalLowerBound": 0,
      "MetricIntervalUpperBound": 20,
      "ScalingAdjustment": 1
    },
    {
      "MetricIntervalLowerBound": 20,
      "MetricIntervalUpperBound": 40,
      "ScalingAdjustment": 2
    },
    {
      "MetricIntervalLowerBound": 40,
      "ScalingAdjustment": 4
    }
  ]
}
# Bounds are offsets above the alarm threshold (e.g., alarm at 50% CPU):
# 0-20 above threshold (CPU 50-70%): add 1 instance
# 20-40 above threshold (CPU 70-90%): add 2 instances
# >40 above threshold (CPU >90%): add 4 instances
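The step lookup can be modeled as a simple function (a sketch; the tuples mirror the StepAdjustments above, and `breach` is the metric value minus the alarm threshold):

```python
def step_adjustment(breach):
    # (lower, upper, instances_to_add), mirroring the StepAdjustments above;
    # upper=None means unbounded
    steps = [(0, 20, 1), (20, 40, 2), (40, None, 4)]
    for lower, upper, adjustment in steps:
        if breach >= lower and (upper is None or breach < upper):
            return adjustment
    return 0  # alarm not breached: no action
```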
Scheduled Scaling
# Scale up before business hours (Mon-Fri 8 AM)
aws autoscaling put-scheduled-update-group-action \
  --auto-scaling-group-name my-asg \
  --scheduled-action-name business-hours-up \
  --recurrence "0 8 * * MON-FRI" \
  --min-size 10 \
  --desired-capacity 15

# Scale down after business hours (Mon-Fri 8 PM)
aws autoscaling put-scheduled-update-group-action \
  --auto-scaling-group-name my-asg \
  --scheduled-action-name business-hours-down \
  --recurrence "0 20 * * MON-FRI" \
  --min-size 2 \
  --desired-capacity 4
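The same schedule, expressed as a capacity function for illustration (times and floor sizes match the two scheduled actions above):

```python
from datetime import datetime

def scheduled_min_size(now: datetime) -> int:
    # Mon-Fri (weekday 0-4), 08:00-19:59 -> business-hours floor of 10;
    # all other times -> off-hours floor of 2
    if now.weekday() < 5 and 8 <= now.hour < 20:
        return 10
    return 2
```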
Kubernetes Horizontal Pod Autoscaler
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-server
  minReplicas: 3
  maxReplicas: 50
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: "1000"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
      - type: Percent
        value: 50
        periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 10
        periodSeconds: 60
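The behavior stanza rate-limits each scaling step. A simplified model of a Percent policy (the actual HPA controller layers stabilization windows and tolerance on top of this):

```python
import math

def apply_percent_policy(current, desired, percent, scaling_up=True):
    # A Percent policy caps one step's change at percent% of current replicas
    # (at least 1 replica, so small deployments can still move)
    max_change = max(1, math.floor(current * percent / 100))
    if scaling_up:
        return min(desired, current + max_change)
    return max(desired, current - max_change)
```

With the scaleDown policy above (10% per 60s), a deployment at 10 replicas can shed at most 1 replica per period even if the metric calls for 2.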
Cooldown Periods
Cooldown periods prevent the auto scaler from launching or terminating instances too quickly, avoiding oscillation (rapid scale up/down cycles).
| Cooldown Type | Purpose | Typical Value |
|---|---|---|
| Scale-Out Cooldown | Wait after adding instances before adding more | 60-120 seconds |
| Scale-In Cooldown | Wait after removing instances before removing more | 300-600 seconds |
| Warm-Up Period | Time for new instance to start serving traffic | 60-300 seconds |
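A cooldown gate can be sketched like this (illustrative; timestamps are passed explicitly to keep the example deterministic):

```python
class CooldownGate:
    """Reject scaling actions that arrive before the cooldown has elapsed."""

    def __init__(self, scale_out_cooldown=60, scale_in_cooldown=300):
        self.cooldowns = {"out": scale_out_cooldown, "in": scale_in_cooldown}
        self.last_action = float("-inf")

    def allow(self, direction, now):
        # Permit the action only if the relevant cooldown has fully elapsed
        # since the last action in either direction
        if now - self.last_action >= self.cooldowns[direction]:
            self.last_action = now
            return True
        return False
```

The asymmetric defaults (60s out, 300s in) implement the "scale out aggressively, scale in conservatively" rule from the table.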
Azure Virtual Machine Scale Sets
# Azure VMSS auto scale rule (ARM template snippet)
{
  "type": "Microsoft.Insights/autoscaleSettings",
  "properties": {
    "profiles": [{
      "name": "default",
      "capacity": {
        "minimum": "2",
        "maximum": "20",
        "default": "4"
      },
      "rules": [{
        "metricTrigger": {
          "metricName": "Percentage CPU",
          "operator": "GreaterThan",
          "threshold": 75,
          "timeAggregation": "Average",
          "timeWindow": "PT5M"
        },
        "scaleAction": {
          "direction": "Increase",
          "type": "ChangeCount",
          "value": "2",
          "cooldown": "PT5M"
        }
      }]
    }]
  }
}
Best Practices
- Scale out aggressively, scale in conservatively: Short scale-out cooldown (60s), long scale-in cooldown (300s+)
- Use multiple metrics: Combine CPU, memory, and custom metrics (request queue depth, latency)
- Pre-warm instances: Ensure instances are healthy before receiving traffic (health checks)
- Set appropriate min/max: Min handles baseline traffic; max prevents runaway costs
- Test your scaling: Use load testing to verify scaling behavior
Auto scaling works hand in hand with horizontal scaling architecture, load testing (to validate scaling behavior), and high-traffic handling strategies for extreme scenarios.
Frequently Asked Questions
Q: How fast can auto scaling respond to traffic spikes?
AWS Auto Scaling typically takes 2-5 minutes to detect a metric breach and launch new instances, and instance boot time adds another 1-3 minutes. For faster response, use warm pools (pre-initialized instances) or Kubernetes, which can scale pods in seconds because no VM boot is required.
Q: Should I use target tracking or step scaling?
Start with target tracking — it is simpler and works well for most cases. Use step scaling when you need different scaling responses for different severity levels (e.g., add 1 instance for moderate load, add 5 for extreme load).
Q: How do I prevent auto scaling from scaling down too aggressively?
Use a longer scale-in cooldown (5-10 minutes), set a stabilization window in Kubernetes HPA, and consider protecting specific instances from scale-in. Also ensure your scale-in metric has a buffer — scale in only when utilization drops well below the threshold.
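The buffer idea amounts to hysteresis: separate scale-out and scale-in thresholds with a dead band between them (the thresholds here are illustrative):

```python
def scaling_decision(cpu, scale_out_above=70, scale_in_below=50):
    # The 50-70% dead band prevents flapping when utilization hovers
    # near the scale-out threshold
    if cpu > scale_out_above:
        return "scale_out"
    if cpu < scale_in_below:
        return "scale_in"
    return "hold"
```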