Observability: Metrics, Logs, and Traces

Observability is the ability to understand the internal state of a system by examining its external outputs. In modern distributed systems, where failures can be subtle, intermittent, and emerge from complex interactions between services, observability is not a luxury — it is a requirement for operating reliable systems at scale. Unlike traditional monitoring, which tells you when something is wrong, observability helps you understand why it is wrong, even for failure modes you did not anticipate.

Observability connects to virtually every aspect of system reliability: fault tolerance depends on detecting failures quickly, auto-scaling relies on metrics to trigger scaling decisions, latency reduction requires tracing to identify bottlenecks, and load testing requires observability to interpret results.

Observability vs Monitoring

Monitoring and observability are related but distinct concepts:

| Aspect | Monitoring | Observability |
| --- | --- | --- |
| Approach | Predefined checks and dashboards | Exploratory, ad-hoc querying |
| Question type | "Is the system healthy?" | "Why is this request slow?" |
| Failure modes | Known, anticipated failures | Unknown, novel failures |
| Data model | Metrics and threshold-based alerts | High-cardinality, high-dimensionality data |
| Scope | "Is CPU above 90%?" | "Why are requests from region X to service Y slow on Tuesdays?" |

Monitoring answers questions you know to ask. Observability gives you the tools to answer questions you have not yet thought of. A truly observable system lets you ask arbitrary questions about system behavior and get answers without deploying new code or instrumentation.

The Three Pillars of Observability

Observability rests on three complementary signal types: metrics, logs, and traces. Each provides a different lens into system behavior, and together they provide comprehensive visibility.

Metrics

Metrics are numeric measurements collected over time. They are aggregated, low-cardinality data that provide a quantitative view of system health. Metrics are ideal for dashboards, alerting, and trend analysis.

Metric Types

| Type | Description | Example | Use Case |
| --- | --- | --- | --- |
| Counter | Monotonically increasing value | http_requests_total | Request counts, error counts, bytes transferred |
| Gauge | Value that can go up or down | active_connections | Queue depth, memory usage, temperature |
| Histogram | Distribution of values in buckets | request_duration_seconds | Latency percentiles, response sizes |
| Summary | Pre-calculated quantiles | request_duration_quantile | Client-side latency percentiles |

```
# Prometheus metric examples

# Counter: total HTTP requests by method and status
http_requests_total{method="GET", status="200"} 150432
http_requests_total{method="GET", status="500"} 23
http_requests_total{method="POST", status="201"} 8921

# Gauge: current active database connections
db_connections_active{pool="primary"} 45
db_connections_active{pool="replica"} 12

# Histogram: request duration distribution
http_request_duration_seconds_bucket{le="0.05"} 24054
http_request_duration_seconds_bucket{le="0.1"} 33444
http_request_duration_seconds_bucket{le="0.25"} 100392
http_request_duration_seconds_bucket{le="0.5"} 129389
http_request_duration_seconds_bucket{le="1.0"} 133988
http_request_duration_seconds_bucket{le="+Inf"} 144320
http_request_duration_seconds_sum 53423.5
http_request_duration_seconds_count 144320
```
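The cumulative bucket counts above can be turned into latency percentiles the same way PromQL's histogram_quantile() does: find the bucket that contains the target rank, then interpolate linearly within it. A minimal Python sketch, using the example counts (the function name is illustrative, not a real API):

```python
# Estimate a quantile from cumulative Prometheus-style histogram buckets.
# Mirrors PromQL histogram_quantile(): locate the bucket containing the
# target rank, then interpolate linearly inside that bucket.

def histogram_quantile(q, buckets):
    """buckets: sorted list of (upper_bound, cumulative_count)."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf"):
                return prev_bound  # cannot interpolate into +Inf
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count

# The bucket values from the example above
buckets = [
    (0.05, 24054),
    (0.1, 33444),
    (0.25, 100392),
    (0.5, 129389),
    (1.0, 133988),
    (float("inf"), 144320),
]

p90 = histogram_quantile(0.90, buckets)  # ~0.554 seconds
```

Because buckets only bound the true values, the result is an estimate; bucket boundaries should be chosen near the latencies you care about.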

The RED and USE Methods

Two popular frameworks help teams choose which metrics to collect:

RED Method (for request-driven services):

  • Rate: Requests per second
  • Errors: Failed requests per second
  • Duration: Latency distribution (p50, p95, p99)

USE Method (for infrastructure resources):

  • Utilization: Percentage of resource busy (CPU at 75%)
  • Saturation: Amount of queued work (disk I/O queue depth)
  • Errors: Error events (network packet drops)
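Both sets of signals can be computed from raw measurements. A stdlib-only sketch of the RED side, where the record layout and window length are assumptions for illustration, not a real metrics API:

```python
# Illustrative RED calculation over a batch of request records.
# Each record is (status_code, duration_seconds).

def red_metrics(records, window_seconds):
    n = len(records)
    errors = sum(1 for status, _ in records if status >= 500)
    durations = sorted(d for _, d in records)
    p95 = durations[int(0.95 * (n - 1))] if durations else None
    return {
        "rate": n / window_seconds,             # Rate: requests per second
        "error_rate": errors / window_seconds,  # Errors: failures per second
        "p95_latency": p95,                     # Duration: tail latency
    }

# Synthetic sample: 90 fast successes, 5 slow errors, 5 slower successes
records = [(200, 0.05)] * 90 + [(500, 1.2)] * 5 + [(200, 0.3)] * 5
m = red_metrics(records, window_seconds=10)
```

In production these aggregations happen in the metrics pipeline (e.g. Prometheus rate() and histogram_quantile()); the sketch just shows what the three numbers mean.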

Structured Logging

Logs capture discrete events with context. While traditional plain-text logs are human-readable, they are difficult to query and analyze at scale. Structured logging outputs log entries as machine-parseable data (typically JSON), making them searchable and correlatable:

```
// Unstructured log (hard to query)
[2024-01-15 10:30:00] ERROR: Payment failed for order 42, user 1001

// Structured log (queryable)
{
  "timestamp": "2024-01-15T10:30:00.123Z",
  "level": "error",
  "service": "payment-service",
  "traceId": "abc123def456",
  "spanId": "span789",
  "message": "Payment processing failed",
  "context": {
    "orderId": 42,
    "userId": 1001,
    "amount": 99.99,
    "currency": "USD",
    "paymentMethod": "credit_card",
    "errorCode": "GATEWAY_TIMEOUT",
    "retryCount": 2,
    "duration_ms": 5023
  }
}
```

Structured logging best practices include: always include a trace ID for correlation with distributed traces, use consistent field names across services, include enough context to debug without looking at code, and avoid logging sensitive data (PII, credentials, tokens).
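A structured logger can be as simple as emitting one JSON object per event with a consistent schema. The sketch below follows the field names from the example above; the emit() helper is illustrative, not a real library API:

```python
# Minimal structured-logging sketch: one JSON object per event.
import json
from datetime import datetime, timezone

def emit(level, service, message, trace_id, span_id=None, **context):
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "level": level,
        "service": service,
        "traceId": trace_id,   # always included, for trace correlation
        "spanId": span_id,
        "message": message,
        "context": context,    # structured, queryable fields
    }
    line = json.dumps(entry)
    print(line)  # in production this goes to stdout for a log shipper
    return line

line = emit("error", "payment-service", "Payment processing failed",
            trace_id="abc123def456", orderId=42,
            errorCode="GATEWAY_TIMEOUT")
```

Real services would typically use an established structured-logging library rather than hand-rolling this, but the shape of the output is the same.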

Distributed Traces

Traces track the journey of a request through multiple services. Each trace consists of spans that represent individual operations, forming a tree that shows the complete request lifecycle with timing information. Traces answer questions like "why was this specific request slow?" and "which service in the call chain caused the error?" For a deep dive into tracing, see distributed tracing.
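A toy model makes the span structure concrete: spans share a trace ID and link to a parent, forming a tree, and the slowest leaf span usually points at the bottleneck. All names and timings below are invented; real tracing would use an SDK such as OpenTelemetry:

```python
# Toy trace model: spans with a shared trace ID and parent links.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Span:
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    name: str
    start_ms: float
    end_ms: float

    @property
    def duration_ms(self) -> float:
        return self.end_ms - self.start_ms

trace = [
    Span("t1", "s1", None, "GET /checkout", 0, 520),
    Span("t1", "s2", "s1", "order-service.create", 10, 120),
    Span("t1", "s3", "s1", "payment-service.charge", 130, 510),
    Span("t1", "s4", "s3", "payment-gateway HTTP", 140, 500),
]

# Leaf spans (no children) show where time was actually spent
parents = {s.parent_id for s in trace}
leaves = [s for s in trace if s.span_id not in parents]
slowest = max(leaves, key=lambda s: s.duration_ms)
```

Here the 520 ms root span is mostly explained by the 360 ms payment-gateway call, which is exactly the kind of attribution a trace view gives you at a glance.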

Correlating Signals

The real power of observability comes from correlating metrics, logs, and traces together. A typical debugging workflow might look like:

```
Investigation Flow:

1. ALERT: Error rate exceeded 1% on order-service
   (Source: Metrics / Prometheus alert)
   |
   v
2. DASHBOARD: Error spike started at 10:25 AM,
   correlates with latency spike on payment-service
   (Source: Metrics / Grafana dashboard)
   |
   v
3. TRACES: Find example error traces from 10:25 AM
   -> Payment service spans show 5s timeout
   -> Downstream call to payment gateway timing out
   (Source: Traces / Jaeger)
   |
   v
4. LOGS: Filter logs by trace ID from error trace
   -> "Connection pool exhausted, 50/50 connections busy"
   -> "Payment gateway DNS resolution failed"
   (Source: Logs / Elasticsearch)
   |
   v
5. ROOT CAUSE: DNS issue causing payment gateway
   connection failures, exhausting connection pool,
   causing timeouts that propagate as order errors
```

The key to correlation is shared context. Trace IDs link logs to traces. Service names and timestamps link metrics to logs and traces. Labels and attributes enable cross-signal querying.
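Step 4 of the flow above is just a join on trace ID: take the ID from an error trace and filter structured logs down to that one request's events. In miniature, with invented sample records:

```python
# Correlating logs with a trace: filter structured log entries by the
# trace ID taken from an error trace. Sample records are illustrative.

logs = [
    {"traceId": "abc123", "service": "order-service",
     "message": "Order created"},
    {"traceId": "def456", "service": "payment-service",
     "message": "Connection pool exhausted, 50/50 connections busy"},
    {"traceId": "def456", "service": "payment-service",
     "message": "Payment gateway DNS resolution failed"},
]

def logs_for_trace(entries, trace_id):
    return [e for e in entries if e["traceId"] == trace_id]

related = logs_for_trace(logs, "def456")
```

Log platforms do this at scale with an index on the trace-ID field, which is why emitting the ID in every log line matters.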

Alerting

Effective alerting is the bridge between observability data and human action. Poor alerting leads to alert fatigue — too many alerts that are not actionable erode trust and cause teams to ignore real problems.

Alerting Best Practices

```yaml
# Good alert: based on user-facing SLO
# "Are users experiencing problems?"
- alert: HighErrorRate
  expr: |
    sum(rate(http_requests_total{status=~"5.."}[5m]))
    / sum(rate(http_requests_total[5m])) > 0.01
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Error rate exceeds 1% SLO"
    runbook: "https://wiki.example.com/runbooks/high-error-rate"

# Bad alert: based on cause, not symptom
# "Is a specific component unhealthy?"
- alert: HighCPU
  expr: node_cpu_usage > 0.90
  for: 1m
  # Problem: high CPU does not always mean user impact
```

Alert on symptoms (error rate, latency), not causes (CPU, memory). Use SLO-based alerting: define service level objectives and alert when error budgets are being consumed too quickly. Include runbook links in every alert so the on-call engineer knows exactly what to do.

Dashboards

Dashboards provide visual representations of system health. Effective dashboards follow a hierarchy:

| Level | Audience | Content |
| --- | --- | --- |
| Executive | Leadership, stakeholders | SLO status, error budgets, availability percentage |
| Service | Service team | RED metrics per endpoint, dependency health |
| Infrastructure | Platform team | USE metrics, node health, capacity planning |
| Debug | On-call engineers | Detailed breakdowns, trace exemplars, log panels |

SRE Practices for Observability

SLIs, SLOs, and Error Budgets

Site Reliability Engineering (SRE) formalizes observability through service level indicators (SLIs), service level objectives (SLOs), and error budgets:

```
SLI (Service Level Indicator):
  "The proportion of successful HTTP requests"
  = successful_requests / total_requests

SLO (Service Level Objective):
  "99.9% of requests will succeed over a 30-day window"

Error Budget:
  = 1 - SLO = 0.1%
  = 0.001 * 30 days * 24 hours * 60 minutes
  = 43.2 minutes of allowed downtime per month

Burn Rate:
  If you consume 10% of your monthly error budget
  in 1 hour, that is a burn rate of 72x
  (10% / (1 hour / 720 hours) = 72)
```

SLO-based alerting triggers when the error budget burn rate exceeds a threshold, providing meaningful alerts that directly reflect user experience rather than infrastructure state.
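The error-budget arithmetic above can be checked directly; the window and SLO match the example (99.9% over 30 days):

```python
# Error budget and burn rate for a 99.9% SLO over a 30-day window.

slo = 0.999
window_hours = 30 * 24          # 720 hours in the window
error_budget = 1 - slo          # 0.1% of requests may fail

# Allowed "full outage" time if the service fails completely:
budget_minutes = error_budget * window_hours * 60   # 43.2 minutes

# Burn rate: how fast the budget is being consumed relative to the
# sustainable pace. Consuming 10% of the budget in 1 hour:
budget_fraction_consumed = 0.10
hours_elapsed = 1
burn_rate = budget_fraction_consumed / (hours_elapsed / window_hours)  # 72x
```

A burn rate of 1x means the budget lasts exactly the window; multi-window burn-rate alerts typically page at high rates (e.g. tens of x) sustained over short windows, and ticket at low rates over long windows.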

Observability Tools

Prometheus and Grafana

Prometheus is the standard open-source metrics collection and storage system. It uses a pull-based model, scraping metrics endpoints at regular intervals. Grafana provides visualization, dashboarding, and alerting on top of Prometheus and many other data sources. Together they form the backbone of most open-source observability stacks.

```yaml
# prometheus.yml configuration
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: "api-server"
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
```

ELK Stack (Elasticsearch, Logstash, Kibana)

The ELK stack is the most widely used open-source logging solution. Elasticsearch stores and indexes log data, Logstash (or Fluentd/Fluent Bit) collects and transforms logs, and Kibana provides search, visualization, and dashboarding. For high-volume systems, consider using Fluent Bit as a lightweight log forwarder that sends data to Elasticsearch.

Datadog

Datadog is a commercial observability platform that provides unified metrics, logs, and traces in a single platform. Its strength is correlation — clicking from a metric spike to related traces to associated logs is seamless. Datadog also offers APM, infrastructure monitoring, synthetic monitoring, and real user monitoring (RUM).

Tools Comparison

| Tool | Signal | Type | Strengths |
| --- | --- | --- | --- |
| Prometheus | Metrics | Open Source | Industry standard, PromQL, ecosystem |
| Grafana | Visualization | Open Source | Multi-source dashboards, alerting |
| Elasticsearch | Logs | Open Source | Full-text search, scalable storage |
| Jaeger | Traces | Open Source | Trace visualization, service maps |
| Datadog | All three | Commercial | Unified platform, correlation, AI |
| New Relic | All three | Commercial | APM, error tracking, NRQL queries |
| Grafana Cloud | All three | Commercial | Managed open-source stack (Mimir, Loki, Tempo) |

Implementing Observability: A Practical Guide

Instrumentation Checklist

For every service, instrument:

1. HTTP/gRPC endpoints
   - Request rate, error rate, latency (RED)
   - Status code distribution
   - Request/response size

2. Database operations
   - Query duration, error rate
   - Connection pool utilization
   - Slow query logging

3. External dependencies
   - Call rate, error rate, latency per dependency
   - Circuit breaker state changes
   - Retry counts

4. Business metrics
   - Orders processed per minute
   - Payment success/failure rate
   - User signups, conversions

5. Infrastructure
   - CPU, memory, disk, network (USE)
   - Container restarts, OOM kills
   - Pod scheduling latency
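Item 1 of the checklist can be sketched as a decorator that records request count, status, and latency per endpoint. In practice you would use a metrics client such as prometheus_client; this stdlib-only version (all names invented) just shows the shape of the instrumentation:

```python
# Sketch of endpoint instrumentation: record RED data per endpoint in an
# in-process registry. Illustrative only; a real service would export
# these through a metrics client library.
import time
from collections import defaultdict

metrics = {
    "requests": defaultdict(int),    # (endpoint, status) -> count
    "durations": defaultdict(list),  # endpoint -> [seconds, ...]
}

def instrumented(endpoint):
    def wrap(handler):
        def inner(*args, **kwargs):
            start = time.perf_counter()
            status = 500  # assume failure unless the handler returns
            try:
                result, status = handler(*args, **kwargs)
                return result
            finally:
                metrics["requests"][(endpoint, status)] += 1
                metrics["durations"][endpoint].append(
                    time.perf_counter() - start)
        return inner
    return wrap

@instrumented("/orders")
def create_order(order_id):
    # hypothetical handler returning (body, status_code)
    return {"id": order_id}, 201

create_order(42)
create_order(43)
```

Wrapping instrumentation around handlers (via decorators or middleware) keeps it uniform across endpoints and out of business logic.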

Frequently Asked Questions

Q: How much does observability cost?

Costs depend on data volume. For open-source stacks (Prometheus + ELK + Jaeger), the cost is primarily infrastructure — compute and storage for running the tools. For commercial platforms, costs are typically based on ingestion volume (metrics per minute, log GB per day, trace spans per month). Use sampling, aggregation, and retention policies to control costs. A common strategy is to keep high-resolution data for 7-14 days and downsample older data.

Q: How do we avoid alert fatigue?

Alert on symptoms (user-facing error rates, SLO burn rates) rather than causes (CPU usage, memory). Every alert should be actionable — if the on-call engineer cannot do anything about it, it should not page them. Use severity levels: critical alerts page immediately, warning alerts create tickets. Review and prune alerts regularly. If an alert fires frequently but is always a false positive, fix the alert threshold or remove it.

Q: Should we use open-source or commercial observability tools?

Open-source tools (Prometheus, Grafana, Jaeger, ELK) offer full control and no per-unit costs, but require operational expertise to run and scale. Commercial platforms (Datadog, New Relic, Grafana Cloud) provide managed infrastructure, better correlation features, and faster time to value, but can become expensive at scale. Many teams use a hybrid approach: open-source for metrics (Prometheus is hard to beat) and commercial tools for traces and log management where operational complexity is higher. Evaluate based on your team size, operational maturity, and budget. The best observability tool is the one your team actually uses effectively.
