
Distributed Tracing: Observing Microservice Communication

Distributed tracing tracks requests as they flow through microservice architectures, providing visibility into latency, errors, and dependencies...


In a monolithic application, tracing a request from start to finish is straightforward — you follow a single call stack. In a microservices architecture, a single user request can traverse dozens of services, message queues, databases, and caches. When something goes wrong — a request is slow, an error bubbles up, or data appears inconsistent — pinpointing the root cause across these distributed boundaries becomes extremely challenging. Distributed tracing solves this by providing end-to-end visibility into request flows across service boundaries.

Distributed tracing is a critical pillar of observability, complementing metrics and logs to give teams a complete picture of system behavior. It connects closely to fault tolerance practices by revealing how failures propagate through service chains and to latency reduction efforts by showing where time is spent.

What Is Distributed Tracing?

Distributed tracing tracks the journey of a request as it flows through multiple services in a distributed system. Each request is assigned a unique trace ID at the entry point, and as the request moves through services, each operation creates a span — a unit of work with timing information, metadata, and parent-child relationships. Together, these spans form a trace that represents the complete request lifecycle.

Spans and Traces

A span represents a single operation within a trace. Each span contains:

| Field | Description | Example |
|-------|-------------|---------|
| Trace ID | Unique identifier for the entire request | abc123def456 |
| Span ID | Unique identifier for this operation | span-789 |
| Parent Span ID | ID of the calling span (empty for root) | span-456 |
| Operation Name | Descriptive name of the operation | GET /api/users |
| Start Time | When the operation began | 2024-01-15T10:30:00.123Z |
| Duration | How long the operation took | 45ms |
| Tags/Attributes | Key-value metadata | http.status_code=200 |
| Events/Logs | Timestamped annotations | cache miss at 10:30:00.125 |
| Status | OK, ERROR, or UNSET | OK |

A trace is a tree of spans. The root span represents the initial request (e.g., an HTTP request from a client), and child spans represent downstream operations (database queries, API calls to other services, cache lookups).

Trace: abc123def456
|
+-- [root] API Gateway: GET /api/orders/42       (120ms)
    |
    +-- [child] Order Service: fetchOrder          (85ms)
    |   |
    |   +-- [child] Database: SELECT * FROM orders (12ms)
    |   +-- [child] Cache: GET order:42            (2ms)
    |
    +-- [child] User Service: getUser              (30ms)
    |   |
    |   +-- [child] Database: SELECT * FROM users  (8ms)
    |
    +-- [child] Inventory Service: checkStock      (25ms)
        |
        +-- [child] Redis: GET stock:item-99       (3ms)
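A trace like the one above is assembled by the tracing backend from a flat list of spans, linked by parent span IDs. A minimal sketch (span data is illustrative, mirroring the fields in the table above):

```javascript
// Rebuild a trace tree from a flat list of spans by linking each span
// to its parent via parentSpanId. A span with no (known) parent is the root.
function buildTraceTree(spans) {
  const byId = new Map(spans.map((s) => [s.spanId, { ...s, children: [] }]));
  let root = null;
  for (const span of byId.values()) {
    if (span.parentSpanId && byId.has(span.parentSpanId)) {
      byId.get(span.parentSpanId).children.push(span);
    } else {
      root = span; // no parent => root span
    }
  }
  return root;
}

const spans = [
  { spanId: "span-1", parentSpanId: null, name: "GET /api/orders/42", durationMs: 120 },
  { spanId: "span-2", parentSpanId: "span-1", name: "fetchOrder", durationMs: 85 },
  { spanId: "span-3", parentSpanId: "span-2", name: "SELECT * FROM orders", durationMs: 12 },
];

const root = buildTraceTree(spans);
console.log(root.name, root.children[0].children[0].name);
// -> GET /api/orders/42 SELECT * FROM orders
```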

Context Propagation

For distributed tracing to work, trace context must be propagated across service boundaries. When Service A calls Service B, the trace ID and parent span ID must be passed along so Service B can create child spans linked to the same trace.

Propagation Mechanisms

Context propagation typically uses HTTP headers for synchronous communication:

# W3C Trace Context (the standard; IDs shortened here for readability,
# a real trace ID is 32 hex chars and a span ID is 16)
traceparent: 00-abc123def456-span789-01
tracestate: vendor=opaque-value

# B3 Propagation (Zipkin style)
X-B3-TraceId: abc123def456
X-B3-SpanId: span789
X-B3-ParentSpanId: span456
X-B3-Sampled: 1

# Jaeger Propagation
uber-trace-id: abc123def456:span789:span456:1

For asynchronous communication (message queues, event streams), trace context is embedded in message headers or metadata:

# Kafka message with trace context
Headers:
  traceparent: 00-abc123def456-span789-01
Key: order-42
Value: {"event": "order.created", "orderId": 42}

Correlation IDs

A correlation ID is a simpler concept that predates distributed tracing: a unique identifier attached to a request that is passed through all services and included in all log entries. While not as rich as full distributed tracing, correlation IDs provide basic request-level correlation across services:

// Middleware to extract or generate a correlation ID
const crypto = require("crypto");

function correlationMiddleware(req, res, next) {
  const correlationId = req.headers["x-correlation-id"]
    || crypto.randomUUID();
  req.correlationId = correlationId;
  res.setHeader("X-Correlation-Id", correlationId);

  // Attach to logger context
  req.logger = logger.child({ correlationId });
  next();
}

// All logs from this request include the correlation ID
// [2024-01-15 10:30:00] correlationId=abc-123 msg="Processing order"
// [2024-01-15 10:30:00] correlationId=abc-123 msg="Fetching user data"
// [2024-01-15 10:30:01] correlationId=abc-123 msg="Order complete"

OpenTelemetry

OpenTelemetry (OTel) is the CNCF project that provides a unified standard for collecting traces, metrics, and logs. It has become the de facto industry standard for instrumentation and distributed tracing.

OpenTelemetry Architecture

Application Code
    |
    v
OTel SDK (auto + manual instrumentation)
    |
    v
OTel Exporter (OTLP, Jaeger, Zipkin format)
    |
    v
OTel Collector (receive, process, export)
    |
    +--> Jaeger (trace visualization)
    +--> Prometheus (metrics storage)
    +--> Elasticsearch (log storage)
    +--> Datadog / New Relic / etc.

Instrumenting with OpenTelemetry

// Node.js OpenTelemetry setup
const { NodeSDK } = require("@opentelemetry/sdk-node");
const { OTLPTraceExporter }
  = require("@opentelemetry/exporter-trace-otlp-http");
const { getNodeAutoInstrumentations }
  = require("@opentelemetry/auto-instrumentations-node");

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({
    url: "http://otel-collector:4318/v1/traces",
  }),
  instrumentations: [getNodeAutoInstrumentations()],
  serviceName: "order-service",
});
sdk.start();

// Manual span creation for custom operations
const { trace, SpanStatusCode } = require("@opentelemetry/api");
const tracer = trace.getTracer("order-service");

async function processOrder(orderId) {
  return tracer.startActiveSpan("processOrder", async (span) => {
    span.setAttribute("order.id", orderId);
    try {
      const order = await fetchOrder(orderId);
      span.setAttribute("order.total", order.total);
      await validateInventory(order);
      await processPayment(order);
      span.setStatus({ code: SpanStatusCode.OK });
      return order;
    } catch (error) {
      span.setStatus({
        code: SpanStatusCode.ERROR,
        message: error.message,
      });
      span.recordException(error);
      throw error;
    } finally {
      span.end();
    }
  });
}

Jaeger and Zipkin

Jaeger

Jaeger, originally developed by Uber, is a popular open-source distributed tracing backend. It provides trace storage, querying, and a web UI for visualizing trace timelines and service dependency graphs. Jaeger supports multiple storage backends including Cassandra, Elasticsearch, and Kafka.

# Deploy Jaeger with Docker
docker run -d --name jaeger \
  -p 16686:16686 \
  -p 4317:4317 \
  -p 4318:4318 \
  jaegertracing/all-in-one:latest

# Access the Jaeger UI at http://localhost:16686
# Send traces via OTLP on port 4317 (gRPC) or 4318 (HTTP)

Zipkin

Zipkin, originally developed by Twitter, is another widely used distributed tracing system. It uses the B3 propagation format and provides a clean web UI for trace exploration. Zipkin is simpler to set up and is a good choice for smaller deployments:

# Deploy Zipkin with Docker
docker run -d -p 9411:9411 openzipkin/zipkin

# Access the Zipkin UI at http://localhost:9411
# Traces can be sent via HTTP POST to /api/v2/spans

Sampling Strategies

In high-throughput systems, tracing every single request is prohibitively expensive — both in terms of network overhead and storage costs. Sampling strategies determine which traces to collect:

| Strategy | Description | Use Case |
|----------|-------------|----------|
| Head-based sampling | Decision made at trace start (e.g., sample 10% of requests) | Simple, predictable cost |
| Tail-based sampling | Decision made after trace completes (e.g., keep all errors) | Captures important traces |
| Rate-limited sampling | Fixed number of traces per second | Predictable storage costs |
| Priority sampling | Higher sample rate for important operations | Focus on critical paths |
| Adaptive sampling | Adjusts rate based on traffic volume | Handles traffic spikes |

# OpenTelemetry Collector config with tail-based sampling
processors:
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: errors
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: slow-requests
        type: latency
        latency: {threshold_ms: 1000}
      - name: default
        type: probabilistic
        probabilistic: {sampling_percentage: 10}

Tail-based sampling is more powerful because it can make decisions based on the complete trace — for example, always keeping traces that contain errors or high latency, regardless of the overall sampling rate. However, it requires buffering traces until they complete, which adds complexity.
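Head-based sampling, by contrast, needs nothing but the trace ID. A common trick is to derive the keep/drop decision deterministically from the ID, so every service in the chain reaches the same decision without coordination. A minimal sketch (not any particular SDK's sampler):

```javascript
// Deterministic head-based sampling: map the trace ID to a bucket 0-99
// and keep the trace if the bucket falls under the sampling rate.
// Because the decision depends only on the trace ID, all services agree.
function shouldSample(traceId, ratePercent) {
  // Interpret the last 8 hex chars of the trace ID as an integer.
  const bucket = parseInt(traceId.slice(-8), 16) % 100;
  return bucket < ratePercent;
}

// Example: a 10% rate keeps roughly one trace in ten.
shouldSample("4bf92f3577b34da6a3ce929d0e0e4736", 10);
```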

Integration with Logging and Metrics

Distributed tracing is most powerful when correlated with logs and metrics. The three signals together provide complete observability:

// Structured log with trace context
{
  "timestamp": "2024-01-15T10:30:00.123Z",
  "level": "error",
  "service": "order-service",
  "traceId": "abc123def456",
  "spanId": "span789",
  "message": "Payment processing failed",
  "error": "timeout after 5000ms",
  "orderId": 42,
  "userId": 1001
}

// From the trace, jump to related logs:
//   "Show me all logs for trace abc123def456"
// From the logs, jump to the trace:
//   "Show me the trace for this error"
// From metrics (error rate spike), find example traces:
//   "Show me traces with errors in the last 5 minutes"
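The glue for these pivots is simply stamping the active trace and span IDs onto every log entry. A library-agnostic sketch, where `spanContext` stands in for whatever your tracing SDK exposes (e.g., `span.spanContext()` in OpenTelemetry):

```javascript
// Merge the active trace context into structured log fields so logs
// and traces can be joined on traceId in the observability backend.
function logWithTrace(spanContext, level, message, fields = {}) {
  const entry = {
    timestamp: new Date().toISOString(),
    level,
    message,
    ...fields,
  };
  if (spanContext) {
    entry.traceId = spanContext.traceId;
    entry.spanId = spanContext.spanId;
  }
  return entry;
}

const entry = logWithTrace(
  { traceId: "abc123def456", spanId: "span789" },
  "error",
  "Payment processing failed",
  { orderId: 42 }
);
```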

Best Practices for Distributed Tracing

Instrumentation Guidelines

Use auto-instrumentation for standard libraries (HTTP clients, database drivers, message queue clients) and add manual instrumentation for business-critical operations. Name spans descriptively using the format "ServiceName.OperationName" or "HTTP METHOD /path". Add meaningful attributes like user IDs, order IDs, and operation results — but avoid adding sensitive data like passwords or credit card numbers.

Performance Considerations

Tracing adds overhead. Keep span creation lightweight, use sampling to control volume, and use asynchronous exporters to avoid blocking application threads. Batch span exports to reduce network calls. Monitor the overhead of your tracing infrastructure itself — it should not become a source of latency.
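The batching idea can be sketched in a few lines. This is a toy, not a real SDK exporter (which would also flush on a timer and ship batches asynchronously):

```javascript
// Minimal batching exporter: buffer finished spans and hand them to a
// send function in batches, keeping per-span work off the request path.
class BatchSpanExporter {
  constructor(send, maxBatchSize = 512) {
    this.send = send;            // function receiving an array of spans
    this.maxBatchSize = maxBatchSize;
    this.buffer = [];
  }

  // Called once per finished span; flush when the batch is full.
  onEnd(span) {
    this.buffer.push(span);
    if (this.buffer.length >= this.maxBatchSize) this.flush();
  }

  flush() {
    if (this.buffer.length === 0) return;
    const batch = this.buffer;
    this.buffer = [];
    this.send(batch); // a real exporter would ship this asynchronously
  }
}
```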

Organizational Practices

Standardize on a single tracing framework (OpenTelemetry is recommended) across all services. Define naming conventions for spans and attributes. Create runbooks that include trace analysis steps. Train all engineers on how to read trace visualizations and use the tracing UI to debug issues. Integrate trace links into alerting — when an alert fires, include a link to an example trace showing the problem.

Frequently Asked Questions

Q: How much overhead does distributed tracing add?

With proper sampling (1-10% of requests), the overhead is typically less than 1-2% of request latency and minimal CPU/memory impact. Auto-instrumentation libraries are optimized for low overhead. The main cost is in storage and network bandwidth for exporting spans. Use tail-based sampling to keep important traces without tracing everything.

Q: How do we trace across asynchronous boundaries?

For message queues and event-driven architectures, embed trace context in message headers. When a consumer processes a message, extract the trace context and create a new span linked to the original trace. OpenTelemetry provides built-in support for popular message brokers like Kafka, RabbitMQ, and SQS. This ensures traces span across both synchronous and asynchronous communication boundaries.

Q: What is the difference between tracing and logging?

Logging captures discrete events with context ("payment failed for order 42"). Tracing captures the flow of a request across services with timing and causality ("this request went from A to B to C, B took 500ms, and C returned an error"). They are complementary: use logs for detailed event data and traces for understanding request flow and identifying bottlenecks. Connecting them via trace IDs gives the best of both worlds.
