Latency vs Throughput: Understanding the Key Performance Trade-off
Latency and throughput are the two most fundamental performance metrics in system design. Every engineer must understand what they mean, how they relate to each other, and how to optimize for each. In system design interviews, these concepts come up in nearly every discussion — from database selection to API design to caching strategies.
This guide covers definitions, real-world numbers, the relationship between latency and throughput, optimization strategies, and practical examples.
Definitions
Latency
Latency is the time it takes for a single operation to complete, measured from the moment a request is sent to the moment the response is received. It is typically measured in milliseconds (ms) or microseconds (μs).
Think of latency as how long one person waits in line at a coffee shop. Even if the shop serves hundreds of people per hour, your individual wait time is the latency.
Throughput
Throughput is the number of operations a system can handle per unit of time. It is measured in requests per second (RPS), transactions per second (TPS), or data transferred per second (MB/s).
Think of throughput as how many customers the coffee shop serves per hour. A shop might serve 200 customers per hour (high throughput) even if each customer waits 5 minutes (moderate latency).
Bandwidth vs Throughput
Bandwidth is the maximum theoretical capacity of a channel. Throughput is the actual amount of data successfully transferred. Think of bandwidth as a highway's lane capacity and throughput as the actual number of cars that pass through per hour (always less due to congestion, accidents, and other factors).
Jeff Dean's Latency Numbers Every Engineer Should Know
These numbers, originally published by Jeff Dean at Google, provide essential intuition for back-of-the-envelope calculations.
| Operation | Latency | Scaled Analogy (1 ns = 1 sec) |
|---|---|---|
| L1 cache reference | 0.5 ns | 0.5 seconds |
| Branch mispredict | 5 ns | 5 seconds |
| L2 cache reference | 7 ns | 7 seconds |
| Mutex lock/unlock | 25 ns | 25 seconds |
| Main memory reference | 100 ns | 1.7 minutes |
| Compress 1KB with Snappy | 3,000 ns (3 μs) | 50 minutes |
| Send 1KB over 1 Gbps network | 10,000 ns (10 μs) | 2.8 hours |
| Read 4KB randomly from SSD | 150,000 ns (150 μs) | 1.7 days |
| Read 1MB sequentially from memory | 250,000 ns (250 μs) | 2.9 days |
| Round trip within same data center | 500,000 ns (0.5 ms) | 5.8 days |
| Read 1MB sequentially from SSD | 1,000,000 ns (1 ms) | 11.6 days |
| Disk seek | 10,000,000 ns (10 ms) | 116 days |
| Read 1MB sequentially from disk | 20,000,000 ns (20 ms) | 7.6 months |
| Send packet CA → Netherlands → CA | 150,000,000 ns (150 ms) | 4.75 years |
Key takeaways from these numbers:
- Main memory (~100 ns) is about three orders of magnitude faster than a random SSD read (~150 μs), and a disk seek (~10 ms) is far slower still
- Network round trips within a data center (~0.5 ms) are cheap; cross-continent (~150 ms) are expensive
- Sequential reads are dramatically faster than random reads for all storage media
- Compression is cheap — Snappy compresses 1KB in 3 μs
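These numbers make quick estimates mechanical. As one illustration, a small sketch of how long a 1 GB sequential read takes from SSD versus spinning disk, using the per-MB figures from the table (sizes and variable names are my own):

```python
# Per-MB sequential read costs taken from the latency table above
SSD_MS_PER_MB = 1    # read 1MB sequentially from SSD: ~1 ms
DISK_MS_PER_MB = 20  # read 1MB sequentially from disk: ~20 ms

def sequential_read_seconds(size_mb, ms_per_mb):
    """Back-of-envelope time to stream size_mb megabytes."""
    return size_mb * ms_per_mb / 1000

ssd_seconds = sequential_read_seconds(1024, SSD_MS_PER_MB)    # ~1 s
disk_seconds = sequential_read_seconds(1024, DISK_MS_PER_MB)  # ~20 s
```

Reading a gigabyte is near-instant from SSD but a noticeable pause from disk, which is why streaming workloads care so much about sequential layout.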
The Relationship Between Latency and Throughput
Latency and throughput are related but not inversely proportional. You can often improve one without hurting the other, but at the extremes, they do trade off.
Little's Law
Little's Law provides a mathematical relationship:
L = λ × W
Where:
- L = average number of items in the system (concurrency)
- λ = average arrival rate (throughput)
- W = average time an item spends in the system (latency)
Example:
If your API handles 1000 requests/sec (λ)
and each request takes 50ms (W = 0.05s)
then you need: L = 1000 × 0.05 = 50 concurrent connections
To handle 5000 requests/sec at the same latency:
L = 5000 × 0.05 = 250 concurrent connections
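The law is simple enough to encode directly; a minimal sketch of the capacity math above:

```python
def required_concurrency(throughput_rps, latency_s):
    """Little's Law: L = lambda * W, the concurrency a workload sustains."""
    return throughput_rps * latency_s

# The examples above:
baseline = required_concurrency(1000, 0.05)  # ~50 concurrent connections
scaled = required_concurrency(5000, 0.05)    # ~250 concurrent connections
```

The same formula works in any direction: given two of concurrency, throughput, and latency, it pins down the third.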
How They Trade Off
As you push throughput higher, latency tends to increase — this is the queueing effect. When a system is at 10% utilization, requests are processed almost immediately. As utilization rises, requests start queueing and latency climbs sharply, growing roughly as 1/(1 − utilization) for a simple single-server queue. As utilization approaches 100%, latency grows without bound.
Utilization vs Latency (approximate):

| Utilization | Latency Multiplier |
|---|---|
| 10% | 1.1x |
| 30% | 1.4x |
| 50% | 2.0x |
| 70% | 3.3x |
| 80% | 5.0x |
| 90% | 10.0x |
| 95% | 20.0x |
| 99% | 100.0x |

This is why production systems should run at 50-70% utilization, not 90%+. The latency spike at high utilization is severe.
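These multipliers are roughly what a single-server (M/M/1) queueing model predicts, where average time in the system grows as 1/(1 − utilization); a minimal sketch:

```python
def latency_multiplier(utilization):
    """M/M/1 queue: average time in system scales as 1 / (1 - rho)."""
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be in [0, 1)")
    return 1.0 / (1.0 - utilization)

# Matches the table: modest until ~70%, then it takes off
for rho in (0.5, 0.8, 0.9, 0.99):
    print(f"{rho:.0%} utilized -> {latency_multiplier(rho):.1f}x baseline latency")
```

Real systems are messier than M/M/1, but the shape of the curve, and the danger zone above ~80%, carries over.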
Measuring Latency: Percentiles Matter
Average latency is misleading. If 99 requests take 10ms and one request takes 10,000ms, the average is about 110ms — but most users experienced 10ms. Use percentiles instead:
| Metric | Meaning | Use Case |
|---|---|---|
| P50 (Median) | 50% of requests are faster | Typical user experience |
| P90 | 90% of requests are faster | Most users' experience |
| P95 | 95% of requests are faster | Common SLO target |
| P99 | 99% of requests are faster | Tail latency monitoring |
| P99.9 | 99.9% of requests are faster | Worst-case bound |
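The skew is easy to see with the standard library; a quick sketch of the 99-fast, one-slow example:

```python
from statistics import mean, median

latencies = [10] * 99 + [10_000]  # 99 requests at 10ms, one at 10,000ms

print(mean(latencies))    # 109.9 -- the average, dragged up by one outlier
print(median(latencies))  # 10.0  -- the P50: what a typical user actually saw
print(max(latencies))     # 10000 -- the tail that the average smears away
```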
Amazon found that every 100ms of added latency costs them 1% in sales. Google found that an extra 500ms in search latency dropped traffic by 20%. Tail latency matters because high-traffic users (your most valuable customers) are most likely to hit P99 latencies.
Optimization Strategies
Reducing Latency
- Caching: Store frequently accessed data in memory. Redis or Memcached can serve reads in under 1ms compared to 10-50ms from a database.
- CDNs: Serve static content from edge locations close to users, reducing network round trips.
- Connection pooling: Reuse database and HTTP connections instead of establishing new ones for each request.
- Data locality: Colocate data and compute. If your users are in Europe, have servers and database replicas in Europe.
- Async processing: Move slow operations (email sending, image processing) to background queues.
- Protocol optimization: Use HTTP/2 or gRPC instead of HTTP/1.1 for multiplexed connections.
Increasing Throughput
- Horizontal scaling: Add more servers behind a load balancer.
- Batching: Group multiple operations into one. Batch database inserts, batch API calls, batch message sends.
- Parallelism: Process independent requests concurrently across multiple threads or cores.
- Database indexing: Proper indexes can improve query throughput by 100x or more.
- Partitioning/Sharding: Split data across multiple database instances so each handles a fraction of the load.
- Compression: Reduce payload sizes to move more data through the same bandwidth.
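Of these levers, parallelism is the simplest to sketch: fan independent work out across a pool so total time approaches the slowest single call rather than the sum of all calls. In this illustration, `fetch` is a hypothetical stand-in for an I/O-bound operation:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch(item_id):
    """Hypothetical I/O-bound call; doubling stands in for real work."""
    return item_id * 2

# pool.map preserves input order, so results line up with the requests
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(fetch, range(100)))
```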
Batching: A Classic Latency-Throughput Trade-off
```python
# Without batching: low latency per operation, lower overall throughput
def insert_records_one_by_one(db, records):
    for record in records:
        db.execute("INSERT INTO events VALUES (%s, %s)", record)
    # 1000 records × 5ms per insert = 5000ms total
    # Throughput: 200 inserts/sec

# With batching: higher latency per batch, much higher throughput
def insert_records_batched(db, records, batch_size=100):
    for i in range(0, len(records), batch_size):
        batch = records[i:i + batch_size]
        placeholders = ", ".join(["(%s, %s)"] * len(batch))
        # Flatten [(a, b), (c, d), ...] into [a, b, c, d, ...] for the driver
        params = [value for record in batch for value in record]
        db.execute(f"INSERT INTO events VALUES {placeholders}", params)
    # 10 batches × 15ms per batch = 150ms total
    # Throughput: 6,667 inserts/sec (33x improvement!)
    # Trade-off: each record waits for its batch to fill before being
    # sent, adding up to one batch interval of extra latency
```
Real-World Examples
Google Search
Google optimizes aggressively for latency. Search results must return in under 200ms. They achieve this through massive caching, data locality (serving results from the nearest data center), and aggressive timeouts — if a backend service does not respond in time, the result is excluded rather than delaying the response.
Apache Kafka
Kafka optimizes for throughput. By batching messages, using sequential I/O, zero-copy transfers, and page cache, Kafka can process millions of messages per second. Individual message latency is higher than with point-to-point messaging, but the aggregate throughput is orders of magnitude greater.
Gaming Servers
Online gaming prioritizes latency above all else. A 50ms delay in a first-person shooter is noticeable and frustrating. Game servers use UDP instead of TCP to avoid retransmission delays, keep servers geographically close to players, and minimize payload sizes.
Latency and Throughput in System Design Interviews
When discussing performance in interviews, always clarify whether the priority is latency or throughput, then design accordingly:
Interview Framework:
1. Identify the performance priority
- Real-time features → Optimize for latency
- Data pipelines → Optimize for throughput
- Most systems → Balance both
2. Set concrete targets
- "P99 latency under 100ms"
- "Handle 50,000 requests per second"
3. Design with numbers
- Use Jeff Dean's latency numbers
- Calculate capacity with Little's Law
- Estimate storage with back-of-envelope math
Understanding latency and throughput connects directly to many other system design concepts including scalability, SLAs and SLOs, and system design trade-offs.
Frequently Asked Questions
Can you improve both latency and throughput at the same time?
Yes, in many cases. Adding a cache improves both — reads are faster (lower latency) and the database handles fewer requests (increasing effective throughput). Connection pooling, better algorithms, and hardware upgrades can also improve both. The trade-off appears mainly when you use batching (increases throughput, adds latency) or when the system is near capacity.
What is tail latency and why does it matter?
Tail latency refers to the highest latency values (P99, P99.9). In systems that make multiple backend calls per user request (fan-out), the overall latency is determined by the slowest call. If you make 100 parallel calls and each has a P99 of 100ms, there is a 63% chance at least one call will exceed 100ms. Tail latency compounds in distributed systems, making it critical to monitor and optimize.
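The 63% figure follows from independence arithmetic: each call beats its P99 with probability 0.99, so the chance that all 100 do is 0.99 to the 100th power.

```python
p_all_fast = 0.99 ** 100              # ~0.366: every one of 100 calls under its P99
p_at_least_one_slow = 1 - p_all_fast  # ~0.634: the "63% chance" above
```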
What is a good target for API latency?
It depends on the use case. For user-facing web APIs, P50 under 100ms and P99 under 500ms is a common target. For real-time applications (gaming, trading), P99 under 10ms may be required. For background jobs and batch processing, latency in seconds is often acceptable. Always define your target based on user expectations and business impact.
How does network latency differ from application latency?
Network latency is the time data takes to travel between two points — governed by physics (the speed of light) and network infrastructure. Application latency is the time your code takes to process a request. Total latency = network latency + application latency + queueing delays. You can optimize application latency with better code, but network latency has a hard floor: light in optical fiber needs about 28 ms one way between New York and London, so no amount of optimization gets a transatlantic round trip below roughly 56 ms.
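A back-of-envelope for that floor, assuming roughly 5,600 km between New York and London and light in fiber traveling at about two-thirds of its vacuum speed (both figures are round approximations):

```python
DISTANCE_KM = 5_600        # approximate New York-London great-circle distance
FIBER_KM_PER_S = 200_000   # light in optical fiber: ~2/3 of c

one_way_ms = DISTANCE_KM * 1000 / FIBER_KM_PER_S  # 28.0 ms
round_trip_ms = 2 * one_way_ms                    # 56.0 ms propagation floor
```

Real routes are longer than the great circle and add switching and queueing delays, so observed round trips run well above this floor.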
Why do production systems target 50-70% utilization instead of 100%?
Due to queueing theory, latency increases exponentially as utilization approaches 100%. At 90% utilization, latency is roughly 10x the baseline. At 99%, it is 100x. Running at 50-70% provides a comfortable buffer for traffic spikes while keeping latency reasonable. This is also why auto-scaling triggers well before 100% capacity — typically at 60-80% CPU utilization.