Auto Scaling: Dynamic Capacity Management for Cloud Systems

Auto scaling automatically adjusts the number of compute resources based on current demand. Instead of provisioning for peak traffic and wasting money during off-peak hours, auto scaling adds capacity when needed and removes it when demand drops. This guide covers the different scaling strategies, cloud provider implementations, and best practices for configuring auto scaling policies.

Types of Auto Scaling

| Type | Trigger | Best For | Latency |
| --- | --- | --- | --- |
| Reactive (metrics-based) | CPU, memory, or request rate crosses a threshold | Gradual traffic changes | Minutes (detection + provisioning) |
| Predictive | ML-based traffic forecasting | Predictable patterns (daily, weekly) | Proactive (scales before traffic arrives) |
| Scheduled | Time-based rules | Known events (launches, sales) | Zero (pre-provisioned) |

Metrics-Based Scaling

The most common approach. Monitor key metrics and scale when thresholds are crossed.

# AWS Auto Scaling - Target Tracking Policy
# Maintains CPU at 70% by adding/removing instances
aws autoscaling put-scaling-policy \
  --auto-scaling-group-name my-asg \
  --policy-name cpu-target-tracking \
  --policy-type TargetTrackingScaling \
  --target-tracking-configuration '{
    "PredefinedMetricSpecification": {
      "PredefinedMetricType": "ASGAverageCPUUtilization"
    },
    "TargetValue": 70.0,
    "ScaleInCooldown": 300,
    "ScaleOutCooldown": 60
  }'

Custom Metrics Scaling

import boto3

cloudwatch = boto3.client("cloudwatch")

# Publish custom metric: queue depth per instance
def publish_queue_metric(queue_depth, instance_count):
    per_instance = queue_depth / max(instance_count, 1)
    cloudwatch.put_metric_data(
        Namespace="MyApp",
        MetricData=[{
            "MetricName": "QueueDepthPerInstance",
            "Value": per_instance,
            "Unit": "Count"
        }]
    )

# Scale based on queue depth per instance
# Target: 10 messages per instance
# If queue has 100 messages and 5 instances -> 20/instance -> scale up
# If queue has 30 messages and 5 instances -> 6/instance -> scale down
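The queue-depth logic in the comments above can be expressed as a small helper. This is a sketch of the scaling decision only; `target_per_instance`, `min_size`, and `max_size` are assumed policy values, not AWS API parameters:

```python
import math

def desired_capacity(queue_depth, target_per_instance=10, min_size=1, max_size=50):
    """Instance count needed to hit the per-instance queue-depth target,
    clamped to the group's min/max size."""
    desired = math.ceil(queue_depth / target_per_instance)
    return max(min_size, min(max_size, desired))

# 100 messages / target 10 -> 10 instances (scale up from 5)
# 30 messages / target 10 -> 3 instances (scale down from 5)
```

Clamping to min/max mirrors what the Auto Scaling group does: the policy can request any capacity, but the group never leaves its configured bounds.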

Scaling Policies

Target Tracking

The simplest and most recommended approach. You set a target value for a metric, and AWS adjusts capacity to maintain that target.
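The arithmetic behind target tracking can be sketched roughly as proportional control: scale capacity by the ratio of the observed metric to the target. This is an approximation; the real service also applies cooldowns, multiple alarms, and optional disable-scale-in settings:

```python
import math

def target_tracking_desired(current_capacity, metric_value, target_value):
    """Approximate target-tracking math:
    desired = ceil(current * (metric / target))."""
    return math.ceil(current_capacity * (metric_value / target_value))

# 10 instances at 84% CPU with a 70% target -> 12 instances
# 10 instances at 35% CPU with a 70% target -> 5 instances
```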

Step Scaling

Defines different scaling actions based on the magnitude of the alarm breach.

# Step Scaling Policy
{
  "PolicyName": "step-scaling",
  "PolicyType": "StepScaling",
  "StepAdjustments": [
    {
      "MetricIntervalLowerBound": 0,
      "MetricIntervalUpperBound": 20,
      "ScalingAdjustment": 1
    },
    {
      "MetricIntervalLowerBound": 20,
      "MetricIntervalUpperBound": 40,
      "ScalingAdjustment": 2
    },
    {
      "MetricIntervalLowerBound": 40,
      "ScalingAdjustment": 4
    }
  ]
}
# Bounds are offsets from the alarm threshold (e.g. a 70% CPU alarm):
# breach 0-20 (CPU 70-90%): add 1 instance
# breach 20-40 (CPU 90-100%, capped for CPU): add 2 instances
# breach >40 (possible for unbounded metrics): add 4 instances
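The breach-to-adjustment mapping of a step policy can be sketched in Python; the 70% threshold is an assumed alarm setting, matching the example above:

```python
def step_adjustment(metric_value, threshold=70.0):
    """Map the alarm breach magnitude to an instance-count change,
    mirroring the StepAdjustments bounds (offsets from the threshold)."""
    breach = metric_value - threshold
    if breach < 0:
        return 0   # alarm not breached, no action
    if breach < 20:
        return 1   # small breach
    if breach < 40:
        return 2   # moderate breach
    return 4       # severe breach
```

Note that bounds are relative to the alarm threshold, not absolute metric values, which is a common source of misconfigured step policies.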

Scheduled Scaling

# Scale up before business hours (Mon-Fri 8 AM)
aws autoscaling put-scheduled-update-group-action \
  --auto-scaling-group-name my-asg \
  --scheduled-action-name business-hours-up \
  --recurrence "0 8 * * MON-FRI" \
  --min-size 10 \
  --desired-capacity 15

# Scale down after business hours (Mon-Fri 8 PM)
aws autoscaling put-scheduled-update-group-action \
  --auto-scaling-group-name my-asg \
  --scheduled-action-name business-hours-down \
  --recurrence "0 20 * * MON-FRI" \
  --min-size 2 \
  --desired-capacity 4

Kubernetes Horizontal Pod Autoscaler

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-server
  minReplicas: 3
  maxReplicas: 50
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "1000"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Percent
          value: 50
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 10
          periodSeconds: 60

Cooldown Periods

Cooldown periods prevent the auto scaler from launching or terminating instances too quickly, avoiding oscillation (rapid scale up/down cycles).

| Cooldown Type | Purpose | Typical Value |
| --- | --- | --- |
| Scale-out cooldown | Wait after adding instances before adding more | 60-120 seconds |
| Scale-in cooldown | Wait after removing instances before removing more | 300-600 seconds |
| Warm-up period | Time for a new instance to start serving traffic | 60-300 seconds |
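A minimal sketch of how per-direction cooldowns suppress oscillation (illustrative only, not an AWS API; the class name and timings are assumptions):

```python
import time

class CooldownGate:
    """Block repeated scaling actions inside per-direction cooldown windows."""

    def __init__(self, scale_out_cooldown=60, scale_in_cooldown=300):
        self.cooldowns = {"out": scale_out_cooldown, "in": scale_in_cooldown}
        self.last_action = {"out": float("-inf"), "in": float("-inf")}

    def allow(self, direction, now=None):
        """Return True and record the action if the cooldown has elapsed."""
        now = time.monotonic() if now is None else now
        if now - self.last_action[direction] < self.cooldowns[direction]:
            return False  # still cooling down, suppress this action
        self.last_action[direction] = now
        return True
```

The asymmetric defaults (60s out, 300s in) implement the "scale out aggressively, scale in conservatively" guidance from the best practices below.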

Azure Virtual Machine Scale Sets

# Azure VMSS auto scale rule (ARM template snippet)
{
  "type": "Microsoft.Insights/autoscaleSettings",
  "properties": {
    "profiles": [{
      "name": "default",
      "capacity": {
        "minimum": "2",
        "maximum": "20",
        "default": "4"
      },
      "rules": [{
        "metricTrigger": {
          "metricName": "Percentage CPU",
          "operator": "GreaterThan",
          "threshold": 75,
          "timeAggregation": "Average",
          "timeWindow": "PT5M"
        },
        "scaleAction": {
          "direction": "Increase",
          "type": "ChangeCount",
          "value": "2",
          "cooldown": "PT5M"
        }
      }]
    }]
  }
}

Best Practices

  • Scale out aggressively, scale in conservatively: Short scale-out cooldown (60s), long scale-in cooldown (300s+)
  • Use multiple metrics: Combine CPU, memory, and custom metrics (request queue depth, latency)
  • Pre-warm instances: Ensure instances are healthy before receiving traffic (health checks)
  • Set appropriate min/max: Min handles baseline traffic; max prevents runaway costs
  • Test your scaling: Use load testing to verify scaling behavior

Auto scaling connects to horizontal scaling architecture, load testing for validation, and high traffic handling for extreme scenarios.

Frequently Asked Questions

Q: How fast can auto scaling respond to traffic spikes?

AWS Auto Scaling typically takes 2-5 minutes to detect a metric breach and launch new instances, and instance boot time adds another 1-3 minutes. For faster response, use warm pools (pre-initialized instances) or Kubernetes, which can scale pods in seconds.

Q: Should I use target tracking or step scaling?

Start with target tracking — it is simpler and works well for most cases. Use step scaling when you need different scaling responses for different severity levels (e.g., add 1 instance for moderate load, add 5 for extreme load).

Q: How do I prevent auto scaling from scaling down too aggressively?

Use a longer scale-in cooldown (5-10 minutes), set a stabilization window in Kubernetes HPA, and consider protecting specific instances from scale-in. Also ensure your scale-in metric has a buffer — scale in only when utilization drops well below the threshold.
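The buffer idea can be sketched as a dead band between the scale-out and scale-in thresholds; the specific values here are illustrative, not recommendations:

```python
def scaling_decision(utilization, scale_out_above=70.0, scale_in_below=50.0):
    """Hysteresis: scale out above the target, but scale in only when
    utilization drops well below it, leaving a dead band in between."""
    if utilization > scale_out_above:
        return "scale_out"
    if utilization < scale_in_below:
        return "scale_in"
    return "hold"  # inside the dead band: do nothing, avoid oscillation
```

Without the dead band, utilization hovering near a single threshold would trigger alternating scale-out and scale-in actions.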
