Latency vs Throughput: Understanding the Key Performance Trade-off

Latency and throughput are the two most fundamental performance metrics in system design. Every engineer must understand what they mean, how they relate to each other, and how to optimize for each. In system design interviews, these concepts come up in nearly every discussion — from database selection to API design to caching strategies.

This guide covers definitions, real-world numbers, the relationship between latency and throughput, optimization strategies, and practical examples.

Definitions

Latency

Latency is the time it takes for a single operation to complete, measured from the moment a request is sent to the moment the response is received. It is typically measured in milliseconds (ms) or microseconds (μs).

Think of latency as how long one person waits in line at a coffee shop. Even if the shop serves hundreds of people per hour, your individual wait time is the latency.

Throughput

Throughput is the number of operations a system can handle per unit of time. It is measured in requests per second (RPS), transactions per second (TPS), or data transferred per second (MB/s).

Think of throughput as how many customers the coffee shop serves per hour. A shop might serve 200 customers per hour (high throughput) even if each customer waits 5 minutes (moderate latency).

Bandwidth vs Throughput

Bandwidth is the maximum theoretical capacity of a channel. Throughput is the actual amount of data successfully transferred. Think of bandwidth as a highway's lane capacity and throughput as the actual number of cars that pass through per hour (always less due to congestion, accidents, and other factors).

Jeff Dean's Latency Numbers Every Engineer Should Know

These numbers, originally published by Jeff Dean at Google, provide essential intuition for back-of-the-envelope calculations.

Operation                             Latency                  Scaled Analogy (1 ns = 1 s)
L1 cache reference                    0.5 ns                   0.5 seconds
Branch mispredict                     5 ns                     5 seconds
L2 cache reference                    7 ns                     7 seconds
Mutex lock/unlock                     25 ns                    25 seconds
Main memory reference                 100 ns                   1.7 minutes
Compress 1KB with Snappy              3,000 ns (3 μs)          50 minutes
Send 1KB over 1 Gbps network          10,000 ns (10 μs)        2.8 hours
Read 4KB randomly from SSD            150,000 ns (150 μs)      1.7 days
Read 1MB sequentially from memory     250,000 ns (250 μs)      2.9 days
Round trip within same data center    500,000 ns (0.5 ms)      5.8 days
Read 1MB sequentially from SSD        1,000,000 ns (1 ms)      11.6 days
Disk seek                             10,000,000 ns (10 ms)    116 days
Read 1MB sequentially from disk       20,000,000 ns (20 ms)    7.6 months
Send packet CA → Netherlands → CA     150,000,000 ns (150 ms)  4.75 years

Key takeaways from these numbers:

  • For 1MB sequential reads, memory (~250 μs) is roughly 4x faster than SSD (~1 ms), which is roughly 20x faster than disk (~20 ms); for random access the gaps are far larger
  • Network round trips within a data center (~0.5 ms) are cheap; cross-continent (~150 ms) are expensive
  • Sequential reads are dramatically faster than random reads for all storage media
  • Compression is cheap — Snappy compresses 1KB in 3 μs
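These numbers compose directly into back-of-the-envelope estimates. As a minimal sketch (the per-MB costs below are taken from the table; the function name is just for illustration):

```python
# Rough sequential-read cost per MB, from the latency table above
MS_PER_MB = {"memory": 0.25, "ssd": 1.0, "disk": 20.0}

def read_1gb_ms(medium: str) -> float:
    """Estimated time to read 1 GB (1024 MB) sequentially, in ms."""
    return 1024 * MS_PER_MB[medium]

for medium in ("memory", "ssd", "disk"):
    print(f"1 GB from {medium}: {read_1gb_ms(medium) / 1000:.1f} s")
# 1 GB from memory: 0.3 s
# 1 GB from ssd: 1.0 s
# 1 GB from disk: 20.5 s
```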

The Relationship Between Latency and Throughput

Latency and throughput are related but not inversely proportional. You can often improve one without hurting the other, but at the extremes, they do trade off.

Little's Law

Little's Law provides a mathematical relationship:

L = λ × W

Where:
  L = average number of items in the system (concurrency)
  λ = average arrival rate (throughput)
  W = average time an item spends in the system (latency)

Example:
  If your API handles 1000 requests/sec (λ)
  and each request takes 50ms (W = 0.05s)
  then you need: L = 1000 × 0.05 = 50 concurrent connections

  To handle 5000 requests/sec at the same latency:
  L = 5000 × 0.05 = 250 concurrent connections
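The arithmetic above is simple enough to encode directly. A minimal sketch (the function name is an illustration, not a standard API):

```python
def littles_law_concurrency(throughput_rps: float, latency_s: float) -> float:
    """Little's Law, L = lambda * W: average number of in-flight requests."""
    return throughput_rps * latency_s

# 1000 req/s at 50 ms each -> 50 concurrent connections
print(round(littles_law_concurrency(1000, 0.05)))  # 50

# Scaling to 5000 req/s at the same latency needs 5x the concurrency
print(round(littles_law_concurrency(5000, 0.05)))  # 250
```

The same formula can be rearranged to answer capacity questions in reverse: given a connection-pool limit L and a measured latency W, the maximum sustainable throughput is L / W.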

How They Trade Off

As you push throughput higher, latency tends to increase — this is the queueing effect. When a system is at 10% utilization, requests are processed almost immediately. At 80% utilization, requests start queueing and latency climbs steeply. As utilization approaches 100%, latency grows without bound.

Utilization vs Latency (approximate):

Utilization    Latency Multiplier
   10%              1.1x
   30%              1.4x
   50%              2.0x
   70%              3.3x
   80%              5.0x
   90%             10.0x
   95%             20.0x
   99%            100.0x

This is why production systems should run at 50-70% utilization, not 90%+. The latency spike at high utilization is severe.
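The multipliers in the table above match the classic single-server queue result, where mean time in the system scales as 1 / (1 - utilization). A minimal sketch, assuming that M/M/1 model:

```python
def latency_multiplier(utilization: float) -> float:
    """M/M/1 approximation: mean latency scales as 1 / (1 - rho)."""
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1)")
    return 1 / (1 - utilization)

for rho in (0.5, 0.8, 0.9, 0.99):
    print(f"{rho:.0%} utilization -> {latency_multiplier(rho):.1f}x baseline")
# 50% utilization -> 2.0x baseline
# 80% utilization -> 5.0x baseline
# 90% utilization -> 10.0x baseline
# 99% utilization -> 100.0x baseline
```

Real systems are not exactly M/M/1, but the shape of the curve — gentle at low utilization, vertical near saturation — holds broadly.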

Measuring Latency: Percentiles Matter

Average latency is misleading. If 99 requests take 10ms and one request takes 10,000ms, the average is about 110ms, yet most users experienced 10ms. Use percentiles instead:

Metric Meaning Use Case
P50 (Median) 50% of requests are faster Typical user experience
P90 90% of requests are faster Most users' experience
P95 95% of requests are faster Common SLO target
P99 99% of requests are faster Tail latency monitoring
P99.9 99.9% of requests are faster Worst-case bound
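The mean-vs-percentile gap is easy to demonstrate with the 100-request example above. A minimal sketch using a simple nearest-rank percentile (real monitoring systems typically use streaming estimators instead):

```python
import math
import statistics

def percentile(values, p):
    """Nearest-rank percentile: the smallest value >= p% of the sample."""
    ordered = sorted(values)
    k = max(0, math.ceil(p / 100 * len(ordered)) - 1)
    return ordered[k]

# 99 fast requests and one 10-second outlier, as in the example above
latencies_ms = [10] * 99 + [10_000]

print(statistics.mean(latencies_ms))   # 109.9 -- misleading
print(percentile(latencies_ms, 50))    # 10
print(percentile(latencies_ms, 99))    # 10
print(percentile(latencies_ms, 99.9))  # 10000
```

Note that even P99 hides a single outlier in 100 requests; only P99.9 exposes it, which is why high-traffic services track the deep tail.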

Amazon found that every 100ms of added latency costs them 1% in sales. Google found that an extra 500ms in search latency dropped traffic by 20%. Tail latency matters because high-traffic users (your most valuable customers) are most likely to hit P99 latencies.

Optimization Strategies

Reducing Latency

  • Caching: Store frequently accessed data in memory. Redis or Memcached can serve reads in under 1ms compared to 10-50ms from a database.
  • CDNs: Serve static content from edge locations close to users, reducing network round trips.
  • Connection pooling: Reuse database and HTTP connections instead of establishing new ones for each request.
  • Data locality: Colocate data and compute. If your users are in Europe, have servers and database replicas in Europe.
  • Async processing: Move slow operations (email sending, image processing) to background queues.
  • Protocol optimization: Use HTTP/2 or gRPC instead of HTTP/1.1 for multiplexed connections.

Increasing Throughput

  • Horizontal scaling: Add more servers behind a load balancer.
  • Batching: Group multiple operations into one. Batch database inserts, batch API calls, batch message sends.
  • Parallelism: Process independent requests concurrently across multiple threads or cores.
  • Database indexing: Proper indexes can improve query throughput by 100x or more.
  • Partitioning/Sharding: Split data across multiple database instances so each handles a fraction of the load.
  • Compression: Reduce payload sizes to move more data through the same bandwidth.
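As a sketch of the parallelism bullet, assuming I/O-bound work (the 10ms sleep in `handle` is a hypothetical stand-in for a network call):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def handle(request_id):
    time.sleep(0.01)          # stand-in for ~10 ms of I/O-bound work
    return request_id * 2

requests = list(range(100))

start = time.monotonic()
with ThreadPoolExecutor(max_workers=20) as pool:
    results = list(pool.map(handle, requests))
elapsed = time.monotonic() - start
# Serially this would take ~1 s (100 x 10 ms); with 20 workers it
# finishes in roughly 50 ms, because requests overlap while waiting on I/O
```

Note that per-request latency is unchanged (each still takes ~10ms); only throughput improves — the defining property of parallelism as an optimization.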

Batching: A Classic Latency-Throughput Trade-off

# Without batching: low latency per insert, lower overall throughput
def insert_records_one_by_one(records):
    for record in records:
        db.execute("INSERT INTO events VALUES (%s, %s)", record)
    # 1000 records × 5 ms per insert = 5000 ms total
    # Throughput: 200 inserts/sec

# With batching: higher latency per batch, much higher throughput
def insert_records_batched(records, batch_size=100):
    for i in range(0, len(records), batch_size):
        batch = records[i:i + batch_size]
        placeholders = ", ".join(["(%s, %s)"] * len(batch))
        # Flatten [(a, b), (c, d)] into [a, b, c, d] for the driver
        params = [value for record in batch for value in record]
        db.execute(f"INSERT INTO events VALUES {placeholders}", params)
    # 10 batches × 15 ms per batch = 150 ms total
    # Throughput: ~6,667 inserts/sec (a 33x improvement)

# Trade-off: when batching a live stream, each record waits for its
# batch to fill (or a flush interval to elapse) before being sent,
# which adds latency

Real-World Examples

Google Search

Google optimizes aggressively for latency. Search results must return in under 200ms. They achieve this through massive caching, data locality (serving results from the nearest data center), and aggressive timeouts — if a backend service does not respond in time, the result is excluded rather than delaying the response.

Apache Kafka

Kafka optimizes for throughput. By batching messages, using sequential I/O, zero-copy transfers, and page cache, Kafka can process millions of messages per second. Individual message latency is higher than with point-to-point messaging, but the aggregate throughput is orders of magnitude greater.

Gaming Servers

Online gaming prioritizes latency above all else. A 50ms delay in a first-person shooter is noticeable and frustrating. Game servers use UDP instead of TCP to avoid retransmission delays, keep servers geographically close to players, and minimize payload sizes.

Latency and Throughput in System Design Interviews

When discussing performance in interviews, always clarify whether the priority is latency or throughput, then design accordingly:

Interview Framework:

1. Identify the performance priority
   - Real-time features → Optimize for latency
   - Data pipelines → Optimize for throughput
   - Most systems → Balance both

2. Set concrete targets
   - "P99 latency under 100ms"
   - "Handle 50,000 requests per second"

3. Design with numbers
   - Use Jeff Dean's latency numbers
   - Calculate capacity with Little's Law
   - Estimate storage with back-of-envelope math

Understanding latency and throughput connects directly to many other system design concepts including scalability, SLAs and SLOs, and system design trade-offs.

Frequently Asked Questions

Can you improve both latency and throughput at the same time?

Yes, in many cases. Adding a cache improves both — reads are faster (lower latency) and the database handles fewer requests (increasing effective throughput). Connection pooling, better algorithms, and hardware upgrades can also improve both. The trade-off appears mainly when you use batching (increases throughput, adds latency) or when the system is near capacity.

What is tail latency and why does it matter?

Tail latency refers to the highest latency values (P99, P99.9). In systems that make multiple backend calls per user request (fan-out), the overall latency is determined by the slowest call. If you make 100 parallel calls and each has a P99 of 100ms, there is a 63% chance at least one call will exceed 100ms. Tail latency compounds in distributed systems, making it critical to monitor and optimize.
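The 63% figure follows from basic probability: each call stays under its P99 with probability 0.99, so all 100 do with probability 0.99^100. A minimal sketch:

```python
def p_any_slow(n_calls, p_fast=0.99):
    """Probability that at least one of n parallel calls exceeds its P99."""
    return 1 - p_fast ** n_calls

print(f"{p_any_slow(100):.0%}")  # 63%
print(f"{p_any_slow(10):.0%}")   # 10%
```

This is why fan-out architectures effectively turn backend P99s into user-facing medians, and why techniques like hedged requests target the tail specifically.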

What is a good target for API latency?

It depends on the use case. For user-facing web APIs, P50 under 100ms and P99 under 500ms is a common target. For real-time applications (gaming, trading), P99 under 10ms may be required. For background jobs and batch processing, latency in seconds is often acceptable. Always define your target based on user expectations and business impact.

How does network latency differ from application latency?

Network latency is the time data takes to travel between two points — governed by physics (the speed of light) and network infrastructure. Application latency is the time your code takes to process a request. Total latency = network latency + application latency + queueing delays. You can optimize application latency with better code, but network latency has a hard floor: light in optical fiber needs roughly 28ms one way between New York and London, so no round trip over that path can beat about 56ms.

Why do production systems target 50-70% utilization instead of 100%?

Due to queueing theory, latency grows without bound as utilization approaches 100%. At 90% utilization, latency is roughly 10x the baseline. At 99%, it is 100x. Running at 50-70% provides a comfortable buffer for traffic spikes while keeping latency reasonable. This is also why auto-scaling triggers well before 100% capacity — typically at 60-80% CPU utilization.
