SLA, SLO, and SLI: The Complete Guide to Service Level Metrics
In production systems, you need a precise language for defining, measuring, and communicating reliability. SLA, SLO, and SLI provide that language. They form a hierarchy: SLIs are the measurements, SLOs are the targets, and SLAs are the agreements. Understanding the difference is critical for building reliable systems and for system design interviews where you need to define what "good enough" looks like.
Definitions
SLI — Service Level Indicator
An SLI is a quantitative measurement of a specific aspect of service quality. It is the raw metric — what you actually measure. SLIs are typically expressed as a ratio of good events to total events.
SLI Examples:
Availability SLI:
(Successful requests / Total requests) × 100
Example: 99,950 successes / 100,000 total = 99.95%
Latency SLI:
(Requests completed within threshold / Total requests) × 100
Example: 98,500 requests under 200ms / 100,000 total = 98.5%
Error Rate SLI:
(Non-error responses / Total responses) × 100
Example: 99,900 non-5xx / 100,000 total = 99.9%
Throughput SLI:
Requests processed per second (measured over a window)
Example: Average 5,200 RPS over the last hour
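Each of the ratio-based SLIs above follows the same good-events-over-total pattern; a minimal Python sketch (the function name `ratio_sli` is illustrative, not a standard API):

```python
def ratio_sli(good_events: int, total_events: int) -> float:
    """Ratio SLI: percentage of good events out of all events."""
    if total_events == 0:
        return 100.0  # no traffic: conventionally counts as meeting the SLI
    return good_events / total_events * 100

# The worked examples above:
availability = ratio_sli(99_950, 100_000)  # ≈ 99.95
latency = ratio_sli(98_500, 100_000)       # ≈ 98.5
error_rate = ratio_sli(99_900, 100_000)    # ≈ 99.9
```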
SLO — Service Level Objective
An SLO is an internal target for an SLI. It defines what "good enough" means. SLOs are set by the engineering team based on user expectations and technical constraints. They are typically stricter than the external SLA, which provides a buffer.

SLO Examples:
"99.95% of requests will return a successful response"
"99% of requests will complete within 200ms"
"99.9% of database queries will complete within 50ms"
"The system will process at least 1,000 messages per second"
SLA — Service Level Agreement
An SLA is a formal contract between a service provider and a customer that specifies consequences (typically financial) for failing to meet agreed-upon service levels. SLAs are legal documents, not engineering targets. They are typically less strict than SLOs.
SLA Example:
"The service will maintain 99.9% monthly uptime.
If uptime falls below 99.9%, the customer receives:
- 99.0% - 99.9%: 10% service credit
- 95.0% - 99.0%: 25% service credit
- Below 95.0%: 50% service credit"
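The tiered credits in the example SLA above reduce to a simple lookup; a sketch with tiers copied from the example (not from any real contract):

```python
def service_credit_pct(monthly_uptime_pct: float) -> int:
    """Service credit owed under the example SLA above."""
    if monthly_uptime_pct >= 99.9:
        return 0   # SLA met, no credit
    if monthly_uptime_pct >= 99.0:
        return 10
    if monthly_uptime_pct >= 95.0:
        return 25
    return 50
```

For instance, a month at 98.2% uptime falls in the 95.0%-99.0% tier and owes a 25% credit.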
How They Relate
| Concept | What It Is | Who Sets It | Audience | Consequences |
|---|---|---|---|---|
| SLI | The measurement | Engineering team | Internal | None (it is just data) |
| SLO | The target | Engineering + Product | Internal | Trigger engineering action |
| SLA | The contract | Business + Legal | External (customers) | Financial penalties / credits |
The Hierarchy:
SLI (what you measure): "Our availability this month is 99.97%"
SLO (what you target): "We target 99.95% availability"
SLA (what you promise): "We guarantee 99.9% availability to customers"
Note: the SLO must always be stricter than the SLA.
The internal target (SLO) is stricter than the external promise (SLA).
This gives you a buffer — you can miss your SLO and still meet your SLA.
Error Budgets
An error budget is the amount of unreliability you are allowed, derived from your SLO. If your SLO is 99.95% availability, your error budget is 0.05% — that is how much downtime or errors you can tolerate before violating the SLO.
Error Budget Calculation:
SLO: 99.95% availability per month
Error budget = 100% - 99.95% = 0.05%
In a 30-day month (43,200 minutes):
Error budget = 43,200 × 0.0005 = 21.6 minutes of downtime
If you have used 15 minutes already:
Remaining budget = 21.6 - 15 = 6.6 minutes
Budget consumed = 15 / 21.6 = 69.4%
Actions based on error budget:
0-50% consumed → Ship new features aggressively
50-75% consumed → Ship features but increase testing
75-100% consumed → Slow down releases, focus on reliability
100%+ consumed → Freeze all non-reliability deployments
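The budget arithmetic and action tiers above fit in a short sketch (thresholds taken from the list above; function names are illustrative):

```python
def error_budget_minutes(slo_pct: float, window_minutes: float = 30 * 24 * 60) -> float:
    """Total allowed downtime in the window implied by the SLO."""
    return window_minutes * (1 - slo_pct / 100)

def budget_action(consumed_pct: float) -> str:
    """Release policy tier from the fraction of budget consumed."""
    if consumed_pct < 50:
        return "ship features aggressively"
    if consumed_pct < 75:
        return "ship features but increase testing"
    if consumed_pct < 100:
        return "slow down releases, focus on reliability"
    return "freeze all non-reliability deployments"

budget = error_budget_minutes(99.95)  # ≈ 21.6 minutes per 30-day month
consumed = 15 / budget * 100          # ≈ 69.4% of budget used
```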
Error budgets align engineering and product teams. Product wants to ship features (which risk introducing bugs). Engineering wants stability. The error budget gives both sides a quantitative framework: you can ship as fast as you want as long as you stay within budget.
Choosing the Right SLIs
Not all metrics make good SLIs. The best SLIs directly reflect what users care about.
| Service Type | Good SLIs | Poor SLIs |
|---|---|---|
| Web API | Availability, P99 latency, error rate | CPU utilization, memory usage |
| Data Pipeline | Throughput, freshness (data age), completeness | Queue depth, disk I/O |
| Storage System | Durability, availability, latency | Replication lag, IOPS |
| Streaming | Buffering ratio, start time, quality | Bandwidth, server count |
The key principle: CPU utilization going to 90% is not a problem if users are happy. Request errors going to 1% IS a problem even if CPU is at 20%. SLIs should measure the user experience, not internal server metrics.
Monitoring SLIs
# Example: Monitoring SLIs with Prometheus-style queries
# Availability SLI (ratio of non-5xx requests, 0.0-1.0;
# multiply by 100 for a percentage)
availability_sli = (
rate(http_requests_total{status!~"5.."}[5m])
/
rate(http_requests_total[5m])
)
# Latency SLI (fraction of requests under 200ms)
latency_sli = (
rate(http_request_duration_seconds_bucket{le="0.2"}[5m])
/
rate(http_request_duration_seconds_count[5m])
)
# Error budget remaining (30-day window, 99.95% SLO)
# availability_sli_30d is the same availability ratio computed over [30d]
error_budget_remaining = 1 - (
(1 - availability_sli_30d) / (1 - 0.9995)
)
# Result: 1.0 = full budget, 0.0 = exhausted, negative = exceeded
Real-World SLA Examples
AWS
| Service | SLA | Credit for Breach |
|---|---|---|
| EC2 | 99.99% (region-level) | 10-30% credits based on severity |
| S3 | 99.9% availability | 10-25% credits |
| RDS Multi-AZ | 99.95% | 10-25% credits |
| Lambda | 99.95% | 10-25% credits |
Google Cloud
| Service | SLA | Notes |
|---|---|---|
| Compute Engine | 99.99% (multi-zone) | Individual instances: 99.5% |
| Cloud Spanner | 99.999% (multi-region) | Highest SLA of any cloud DB |
| Cloud Storage | 99.95% (multi-region) | 11 nines durability |
| BigQuery | 99.99% | For query execution |
Setting Your Own SLOs
SLO Setting Framework:
1. Identify what users care about
→ Availability, latency, correctness, freshness
2. Measure current performance (establish baseline)
→ "Our P99 latency is currently 350ms"
3. Set a target slightly above the baseline
→ "SLO: P99 latency under 400ms" (leaves margin)
4. Set the SLA below the SLO (external contract buffer)
→ "SLA: P99 latency under 500ms" (buffer for bad months)
5. Calculate error budget
→ "We can have 400ms+ latency for 1% of requests" (P99 under 400ms means 99% must be faster)
6. Set up monitoring and alerting
→ Alert at 50% budget consumed (early warning)
→ Alert at 75% budget consumed (take action)
→ Alert at 100% budget consumed (freeze deploys)
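The alerting thresholds in step 6 can be expressed as a simple check (tier labels copied from the list above; names are illustrative):

```python
# Alert tiers from step 6, highest threshold first: (threshold %, label)
ALERT_TIERS = [
    (100, "freeze deploys"),
    (75, "take action"),
    (50, "early warning"),
]

def active_alerts(budget_consumed_pct: float) -> list[str]:
    """All alert labels whose threshold has been crossed."""
    return [label for pct, label in ALERT_TIERS if budget_consumed_pct >= pct]
```

At 60% budget consumed only the early warning fires; at 80% the take-action alert fires as well.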
Common Mistake: Setting SLOs at 100%.
100% is impossible and leaves zero error budget.
Even Google targets 99.999%, not 100%.
SLIs, SLOs, and SLAs in System Design Interviews
In interviews, demonstrate that you think about measurability. When discussing non-functional requirements, frame them as SLOs:
Instead of "the system should be fast," say "our SLO is P99 latency under 200ms for the search API." Instead of "the system should be reliable," say "we target 99.95% availability measured by successful request ratio."
This shows the interviewer you think in concrete, measurable terms — exactly what production engineering requires.
Frequently Asked Questions
What happens when you exceed your error budget?
The standard practice (popularized by Google SRE) is to freeze all non-reliability feature deployments until the error budget recovers. This creates a natural incentive: ship reliable code or lose the ability to ship at all. Teams can also adjust by reducing deployment frequency, increasing testing, or rolling back recent changes that may have caused the regression.
Should every service have an SLA?
Not necessarily. Internal services often only need SLOs (internal targets without legal consequences). SLAs are for external-facing services where customers need contractual guarantees. However, every production service should have SLIs and SLOs — you cannot improve what you do not measure.
How do SLOs interact with high availability targets?
The availability SLO IS your HA target. If your SLO is 99.95% availability, you need HA mechanisms (fault tolerance, redundancy, automated failover) sufficient to achieve that target. The SLO quantifies what "highly available" means for your specific service.
What is the difference between SLO and KPI?
SLOs are reliability targets for engineering. KPIs (Key Performance Indicators) are business metrics (revenue, user growth, engagement). They are related — poor reliability (missing SLOs) usually hurts KPIs — but they measure different things. SLOs focus on "is the system working?" while KPIs focus on "is the business succeeding?"
How often should SLOs be reviewed and updated?
Review SLOs quarterly. If you consistently exceed your SLO easily, you might tighten it (or invest less in reliability and more in features). If you consistently miss it, either the SLO is too ambitious or you need to invest in reliability. SLOs should evolve with your system and user expectations.