SLA, SLO, and SLI: The Complete Guide to Service Level Metrics
In production systems, you need a precise language for defining, measuring, and communicating reliability. SLA, SLO, and SLI provide that language. They form a hierarchy: SLIs are the measurements, SLOs are the targets, and SLAs are the agreements. Understanding the difference is critical for building reliable systems and for system design interviews where you need to define what "good enough" looks like.
Definitions
SLI — Service Level Indicator
An SLI is a quantitative measurement of a specific aspect of service quality. It is the raw metric — what you actually measure. SLIs are typically expressed as a ratio of good events to total events.
SLI Examples:
Availability SLI:
(Successful requests / Total requests) × 100
Example: 99,950 successes / 100,000 total = 99.95%
Latency SLI:
(Requests completed within threshold / Total requests) × 100
Example: 98,500 requests under 200ms / 100,000 total = 98.5%
Error Rate SLI:
(Non-error responses / Total responses) × 100
Example: 99,900 non-5xx / 100,000 total = 99.9%
Throughput SLI:
Requests processed per second (measured over a window)
Example: Average 5,200 RPS over the last hour
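Each of the ratio-based SLIs above follows the same good-events-over-total pattern; a minimal Python sketch (the function name `ratio_sli` is illustrative, not a standard API):

```python
def ratio_sli(good_events: int, total_events: int) -> float:
    """Ratio SLI: percentage of good events out of all events."""
    if total_events == 0:
        return 100.0  # no traffic: conventionally counts as meeting the SLI
    return good_events / total_events * 100

# The worked examples above:
availability = ratio_sli(99_950, 100_000)  # ≈ 99.95
latency = ratio_sli(98_500, 100_000)       # ≈ 98.5
error_rate = ratio_sli(99_900, 100_000)    # ≈ 99.9
```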
SLO — Service Level Objective
An SLO is an internal target for an SLI. It defines what "good enough" means. SLOs are set by the engineering team based on user expectations and technical constraints. They are typically stricter than the external SLA, which provides a buffer.

SLO Examples:
"99.95% of requests will return a successful response"
"99% of requests will complete within 200ms"
"99.9% of database queries will complete within 50ms"
"The system will process at least 1,000 messages per second"
SLA — Service Level Agreement
An SLA is a formal contract between a service provider and a customer that specifies consequences (typically financial) for failing to meet agreed-upon service levels. SLAs are legal documents, not engineering targets. They are typically less strict than SLOs.
SLA Example:
"The service will maintain 99.9% monthly uptime.
If uptime falls below 99.9%, the customer receives:
- 99.0% - 99.9%: 10% service credit
- 95.0% - 99.0%: 25% service credit
- Below 95.0%: 50% service credit"
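The tiered credits in the example SLA above reduce to a simple lookup; a sketch with tiers copied from the example (not from any real contract):

```python
def service_credit_pct(monthly_uptime_pct: float) -> int:
    """Service credit owed under the example SLA above."""
    if monthly_uptime_pct >= 99.9:
        return 0   # SLA met, no credit
    if monthly_uptime_pct >= 99.0:
        return 10
    if monthly_uptime_pct >= 95.0:
        return 25
    return 50
```

For instance, a month at 98.2% uptime falls in the 95.0%-99.0% tier and owes a 25% credit.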
How They Relate
| Concept | What It Is | Who Sets It | Audience | Consequences |
|---|---|---|---|---|
| SLI | The measurement | Engineering team | Internal | None (it is just data) |
| SLO | The target | Engineering + Product | Internal | Trigger engineering action |
| SLA | The contract | Business + Legal | External (customers) | Financial penalties / credits |
The Hierarchy:
SLI (what you measure): "Our availability this month is 99.97%"
SLO (what you target): "We target 99.95% availability"
SLA (what you promise): "We guarantee 99.9% availability to customers"
Note: the SLO must always be stricter than the SLA.
The internal target (SLO) is stricter than the external promise (SLA).
This gives you a buffer — you can miss your SLO and still meet your SLA.
Error Budgets
An error budget is the amount of unreliability you are allowed, derived from your SLO. If your SLO is 99.95% availability, your error budget is 0.05% — that is how much downtime or errors you can tolerate before violating the SLO.
Error Budget Calculation:
SLO: 99.95% availability per month
Error budget = 100% - 99.95% = 0.05%
In a 30-day month (43,200 minutes):
Error budget = 43,200 × 0.0005 = 21.6 minutes of downtime
If you have used 15 minutes already:
Remaining budget = 21.6 - 15 = 6.6 minutes
Budget consumed = 15 / 21.6 = 69.4%
Actions based on error budget:
0-50% consumed → Ship new features aggressively
50-75% consumed → Ship features but increase testing
75-100% consumed → Slow down releases, focus on reliability
100%+ consumed → Freeze all non-reliability deployments
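The budget arithmetic and action tiers above fit in a short sketch (thresholds taken from the list above; function names are illustrative):

```python
def error_budget_minutes(slo_pct: float, window_minutes: float = 30 * 24 * 60) -> float:
    """Total allowed downtime in the window implied by the SLO."""
    return window_minutes * (1 - slo_pct / 100)

def budget_action(consumed_pct: float) -> str:
    """Release policy tier from the fraction of budget consumed."""
    if consumed_pct < 50:
        return "ship features aggressively"
    if consumed_pct < 75:
        return "ship features but increase testing"
    if consumed_pct < 100:
        return "slow down releases, focus on reliability"
    return "freeze all non-reliability deployments"

budget = error_budget_minutes(99.95)  # ≈ 21.6 minutes per 30-day month
consumed = 15 / budget * 100          # ≈ 69.4% of budget used
```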
Error budgets align engineering and product teams. Product wants to ship features (which risk introducing bugs). Engineering wants stability. The error budget gives both sides a quantitative framework: you can ship as fast as you want as long as you stay within budget.
Choosing the Right SLIs
Not all metrics make good SLIs. The best SLIs directly reflect what users care about.
| Service Type | Good SLIs | Poor SLIs |
|---|---|---|
| Web API | Availability, P99 latency, error rate | CPU utilization, memory usage |
| Data Pipeline | Throughput, freshness (data age), completeness | Queue depth, disk I/O |
| Storage System | Durability, availability, latency | Replication lag, IOPS |
| Streaming | Buffering ratio, start time, quality | Bandwidth, server count |
The key principle: CPU utilization going to 90% is not a problem if users are happy. Request errors going to 1% IS a problem even if CPU is at 20%. SLIs should measure the user experience, not internal server metrics.
Monitoring SLIs
# Example: Monitoring SLIs with Prometheus-style queries
# Availability SLI (ratio of non-5xx requests, 0.0-1.0;
# multiply by 100 for a percentage)
availability_sli = (
rate(http_requests_total{status!~"5.."}[5m])
/
rate(http_requests_total[5m])
)
# Latency SLI (fraction of requests under 200ms)
latency_sli = (
rate(http_request_duration_seconds_bucket{le="0.2"}[5m])
/
rate(http_request_duration_seconds_count[5m])
)
# Error budget remaining (30-day window, 99.95% SLO)
# availability_sli_30d is the same availability ratio computed over [30d]
error_budget_remaining = 1 - (
(1 - availability_sli_30d) / (1 - 0.9995)
)
# Result: 1.0 = full budget, 0.0 = exhausted, negative = exceeded
Real-World SLA Examples
AWS
| Service | SLA | Credit for Breach |
|---|---|---|
| EC2 | 99.99% (region-level) | 10-30% credits based on severity |
| S3 | 99.9% availability | 10-25% credits |
| RDS Multi-AZ | 99.95% | 10-25% credits |
| Lambda | 99.95% | 10-25% credits |
Google Cloud
| Service | SLA | Notes |
|---|---|---|
| Compute Engine | 99.99% (multi-zone) | Individual instances: 99.5% |
| Cloud Spanner | 99.999% (multi-region) | Highest SLA of any cloud DB |
| Cloud Storage | 99.95% (multi-region) | 11 nines durability |
| BigQuery | 99.99% | For query execution |
Setting Your Own SLOs
SLO Setting Framework:
1. Identify what users care about
→ Availability, latency, correctness, freshness
2. Measure current performance (establish baseline)
→ "Our P99 latency is currently 350ms"
3. Set a target slightly above the baseline
→ "SLO: P99 latency under 400ms" (leaves margin)
4. Set the SLA below the SLO (external contract buffer)
→ "SLA: P99 latency under 500ms" (buffer for bad months)
5. Calculate error budget
→ "We can have 400ms+ latency for 1% of requests" (P99 under 400ms means 99% must be faster)
6. Set up monitoring and alerting
→ Alert at 50% budget consumed (early warning)
→ Alert at 75% budget consumed (take action)
→ Alert at 100% budget consumed (freeze deploys)
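The alerting thresholds in step 6 can be expressed as a simple check (tier labels copied from the list above; names are illustrative):

```python
# Alert tiers from step 6, highest threshold first: (threshold %, label)
ALERT_TIERS = [
    (100, "freeze deploys"),
    (75, "take action"),
    (50, "early warning"),
]

def active_alerts(budget_consumed_pct: float) -> list[str]:
    """All alert labels whose threshold has been crossed."""
    return [label for pct, label in ALERT_TIERS if budget_consumed_pct >= pct]
```

At 60% budget consumed only the early warning fires; at 80% the take-action alert fires as well.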
Common Mistake: Setting SLOs at 100%.
100% is impossible and leaves zero error budget.
Even Google targets 99.999%, not 100%.
SLIs, SLOs, and SLAs in System Design Interviews
In interviews, demonstrate that you think about measurability. When discussing non-functional requirements, frame them as SLOs:
Instead of "the system should be fast," say "our SLO is P99 latency under 200ms for the search API." Instead of "the system should be reliable," say "we target 99.95% availability measured by successful request ratio."
This shows the interviewer you think in concrete, measurable terms — exactly what production engineering requires.
Frequently Asked Questions
What happens when you exceed your error budget?
The standard practice (popularized by Google SRE) is to freeze all non-reliability feature deployments until the error budget recovers. This creates a natural incentive: ship reliable code or lose the ability to ship at all. Teams can also adjust by reducing deployment frequency, increasing testing, or rolling back recent changes that may have caused the regression.
Should every service have an SLA?
Not necessarily. Internal services often only need SLOs (internal targets without legal consequences). SLAs are for external-facing services where customers need contractual guarantees. However, every production service should have SLIs and SLOs — you cannot improve what you do not measure.
How do SLOs interact with high availability targets?
The availability SLO IS your HA target. If your SLO is 99.95% availability, you need HA mechanisms (fault tolerance, redundancy, automated failover) sufficient to achieve that target. The SLO quantifies what "highly available" means for your specific service.
What is the difference between SLO and KPI?
SLOs are reliability targets for engineering. KPIs (Key Performance Indicators) are business metrics (revenue, user growth, engagement). They are related — poor reliability (missing SLOs) usually hurts KPIs — but they measure different things. SLOs focus on "is the system working?" while KPIs focus on "is the business succeeding?"
How often should SLOs be reviewed and updated?
Review SLOs quarterly. If you consistently exceed your SLO easily, you might tighten it (or invest less in reliability and more in features). If you consistently miss it, either the SLO is too ambitious or you need to invest in reliability. SLOs should evolve with your system and user expectations.