Latency Reduction: Techniques for Faster Distributed Systems


Latency is the time between a request being sent and the response being received. In distributed systems, latency compounds across multiple hops — a 50ms database query plus a 30ms API call plus 20ms of processing quickly adds up. Reducing latency improves user experience, conversion rates, and system throughput. This guide covers practical techniques for minimizing latency at every layer of your stack.

Understanding Latency Sources

| Source | Typical Latency | Optimization |
|---|---|---|
| DNS Resolution | 20-120ms | DNS caching, prefetch |
| TCP Handshake | 1 RTT (10-200ms) | Connection reuse, keep-alive |
| TLS Handshake | 1-2 RTTs (20-400ms) | TLS session resumption, HTTP/2 |
| Network Transit | 1-200ms (distance-dependent) | CDN, edge computing, geo-routing |
| Server Processing | 5-500ms | Code optimization, caching |
| Database Query | 1-1000ms | Indexes, query optimization, caching |
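
To see how these sources compound, here is a quick back-of-the-envelope budget for a single cold request. The numbers are illustrative mid-range values from the table above, not measurements:

```python
# Hypothetical latency budget for one cold request (all values in ms),
# using illustrative mid-range numbers from the table above.
budget = {
    "dns": 50,             # cold DNS lookup
    "tcp_handshake": 40,   # 1 RTT
    "tls_handshake": 80,   # ~2 RTTs without session resumption
    "server": 60,          # application processing
    "db_query": 30,        # indexed query
}

total = sum(budget.values())
print(f"cold request: {total}ms")  # 260ms before any payload transfer

# Connection reuse (covered below) removes the handshakes on
# subsequent requests to the same host:
warm = total - budget["tcp_handshake"] - budget["tls_handshake"]
print(f"warm connection: {warm}ms")  # 140ms
```

Note that nearly half of this hypothetical cold-request budget is connection setup, which is why connection reuse and CDNs (which terminate TLS close to the user) pay off so quickly.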

CDN Strategies

Content Delivery Networks cache content at edge locations close to users, eliminating the round trip to your origin server.

# CloudFront distribution with caching rules
resource "aws_cloudfront_distribution" "web" {
  origin {
    domain_name = aws_lb.web.dns_name
    origin_id   = "web-origin"

    custom_origin_config {
      http_port              = 80
      https_port             = 443
      origin_protocol_policy = "https-only"
    }
  }

  default_cache_behavior {
    target_origin_id       = "web-origin"
    viewer_protocol_policy = "redirect-to-https"
    allowed_methods        = ["GET", "HEAD"]
    cached_methods         = ["GET", "HEAD"]

    forwarded_values {
      query_string = false
      cookies { forward = "none" }
    }

    min_ttl     = 0
    default_ttl = 86400   # 24 hours
    max_ttl     = 31536000 # 1 year
    compress    = true     # Enable gzip/brotli
  }

  # API routes bypass cache
  ordered_cache_behavior {
    path_pattern           = "/api/*"
    target_origin_id       = "web-origin"
    viewer_protocol_policy = "https-only"
    allowed_methods        = ["GET", "HEAD", "OPTIONS", "PUT",
                             "POST", "PATCH", "DELETE"]
    min_ttl = 0
    default_ttl = 0
    max_ttl = 0
  }
}

Connection Reuse

import requests
from urllib3.util.retry import Retry
from requests.adapters import HTTPAdapter

# Reuse connections with session
session = requests.Session()
adapter = HTTPAdapter(
    pool_connections=10,    # number of host pools to cache
    pool_maxsize=20,        # max connections kept per host pool
    max_retries=Retry(total=3, backoff_factor=0.1)
)
session.mount("https://", adapter)

# All requests reuse TCP connections (keep-alive)
response1 = session.get("https://api.example.com/users")   # New connection
response2 = session.get("https://api.example.com/orders")  # Reuses connection

Compression

# Nginx compression configuration
gzip on;
gzip_vary on;
gzip_proxied any;
gzip_comp_level 6;
gzip_types text/plain text/css application/json
           application/javascript text/xml application/xml
           application/xml+rss text/javascript;
gzip_min_length 256;

# Brotli (better compression than gzip; requires the ngx_brotli module)
brotli on;
brotli_comp_level 6;
brotli_types text/plain text/css application/json
             application/javascript text/xml;

| Algorithm | Compression Ratio | Speed | Support |
|---|---|---|---|
| gzip | Good (60-80%) | Fast | Universal |
| Brotli | Better (70-85%) | Moderate | Modern browsers |
| zstd | Best ratio/speed | Very fast | Growing (server-to-server) |
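
You can check what compression buys you for a given payload with the standard library. This sketch compresses a repetitive JSON-like body at the same level as the Nginx config above (`gzip_comp_level 6`); the payload is synthetic, so the exact ratio will differ for real responses:

```python
import gzip

# Synthetic, highly repetitive JSON-like payload; real API responses
# typically compress less well than this.
payload = b'{"user_id": 123, "status": "active"}' * 200

# Same compression level as gzip_comp_level 6 in the Nginx config.
compressed = gzip.compress(payload, compresslevel=6)
ratio = 1 - len(compressed) / len(payload)
print(f"original: {len(payload)} bytes, "
      f"compressed: {len(compressed)} bytes, "
      f"saved: {ratio:.0%}")
```

This is also why `gzip_min_length 256` exists: for tiny responses, the gzip header overhead and CPU cost can outweigh the byte savings.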

Database Optimization

-- Identify slow queries
SELECT query, calls, mean_exec_time, total_exec_time
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;

-- Add indexes for common queries
CREATE INDEX idx_orders_customer_date
ON orders (customer_id, created_at DESC);

-- Use covering indexes to avoid table lookups
CREATE INDEX idx_orders_covering
ON orders (customer_id, created_at DESC)
INCLUDE (total_amount, status);

-- Denormalize for read-heavy paths
ALTER TABLE orders ADD COLUMN customer_name VARCHAR(100);
-- Update on write, avoid JOIN on read

Prefetching and Preloading

# Server-side prefetching: fetch independent data concurrently
import asyncio

class ProductPageHandler:
    async def handle(self, request):
        product_id = request.params["id"]

        # Fetch all needed data in parallel
        product, reviews, related, inventory = await asyncio.gather(
            self.product_service.get(product_id),
            self.review_service.get_for_product(product_id),
            self.recommendation_service.get_related(product_id),
            self.inventory_service.check(product_id)
        )
        # Total latency = max(individual latencies) instead of sum

        return render_template("product.html",
            product=product, reviews=reviews,
            related=related, inventory=inventory)

Measuring Latency

Always measure percentiles, not averages. The p50 (median) shows the typical experience; p95 and p99 capture the tail, which disproportionately affects your heaviest (and often most engaged) users, since they issue the most requests and are therefore most likely to hit a slow one.

import time
from collections import defaultdict
import numpy as np

class LatencyTracker:
    def __init__(self):
        self.measurements = defaultdict(list)

    def record(self, endpoint, duration_ms):
        self.measurements[endpoint].append(duration_ms)

    def report(self, endpoint):
        data = self.measurements[endpoint]
        return {
            "p50": np.percentile(data, 50),
            "p95": np.percentile(data, 95),
            "p99": np.percentile(data, 99),
            "mean": np.mean(data),
            "count": len(data)
        }

Latency reduction connects to performance optimization, edge computing, and geo-distribution strategies.

Frequently Asked Questions

Q: What is the biggest single latency reduction I can make?

Caching. A cache hit takes 1-5ms versus 50-500ms for a database query or external API call. If your system is read-heavy, adding a Redis cache in front of your database can reduce p95 latency by 80% or more.
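
A minimal cache-aside sketch of that idea, with a TTL'd in-memory dict standing in for Redis so the example is self-contained (`slow_db_query` is a hypothetical stand-in for the real lookup):

```python
import time

# Cache-aside sketch. A dict with per-entry expiry stands in for Redis;
# slow_db_query simulates a 50ms database round trip.
cache: dict = {}
TTL_SECONDS = 60

def slow_db_query(user_id: int) -> dict:
    time.sleep(0.05)  # simulate a 50ms database round trip
    return {"id": user_id, "name": f"user-{user_id}"}

def get_user(user_id: int) -> dict:
    entry = cache.get(user_id)
    if entry is not None and time.monotonic() < entry[1]:
        return entry[0]                      # cache hit: microseconds
    user = slow_db_query(user_id)            # cache miss: pay the 50ms
    cache[user_id] = (user, time.monotonic() + TTL_SECONDS)
    return user

get_user(1)                                  # miss, ~50ms
start = time.monotonic()
get_user(1)                                  # hit, well under 1ms
print(f"cache hit took {(time.monotonic() - start) * 1000:.2f}ms")
```

With Redis the shape is the same (`GET`, then on miss query the database and `SETEX`), but the cache survives process restarts and is shared across instances.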

Q: How does HTTP/2 reduce latency?

HTTP/2 multiplexes many requests over a single TCP connection, eliminating HTTP-level head-of-line blocking, and adds header compression and server push. TCP-level head-of-line blocking remains, however: a lost packet stalls every stream on the connection. HTTP/3 (QUIC) eliminates that too by running over UDP with independent streams.

Q: What latency targets should I set?

For user-facing APIs: p50 < 100ms, p95 < 300ms, p99 < 1000ms. For internal service-to-service calls: p50 < 10ms, p99 < 100ms. Google's search experiments found that adding 400ms of delay reduced searches per user by roughly 0.6%, and Amazon famously estimated that every 100ms of added latency cost about 1% in revenue.
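
These targets can be checked directly against measured (here, simulated) latencies. The targets and the synthetic lognormal sample below are illustrative only:

```python
import random

# Illustrative user-facing targets from the text (percentile -> ms).
TARGETS_MS = {50: 100, 95: 300, 99: 1000}

# Synthetic latency sample; real data would come from your tracker.
random.seed(42)
samples = [random.lognormvariate(3.5, 0.6) for _ in range(10_000)]

def percentile(data, p):
    """Nearest-rank percentile, to avoid depending on numpy here."""
    s = sorted(data)
    k = max(0, min(len(s) - 1, round(p / 100 * len(s)) - 1))
    return s[k]

for p, limit in TARGETS_MS.items():
    value = percentile(samples, p)
    status = "OK" if value <= limit else "BREACH"
    print(f"p{p}: {value:.0f}ms (target {limit}ms) {status}")
```

Wiring a check like this into your alerting turns the targets into enforceable SLOs rather than aspirations.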
