Performance Optimization: Profiling and Tuning Distributed Systems
Performance optimization is the systematic process of identifying and eliminating bottlenecks in your system. Rather than guessing what is slow, effective optimization starts with measurement and profiling, then applies targeted fixes. This guide covers the essential techniques for profiling, solving common performance problems like N+1 queries, implementing connection pooling, leveraging async processing, and choosing the right caching strategy.
The Optimization Process
- Measure: Establish baseline performance metrics
- Profile: Identify where time is being spent
- Optimize: Fix the highest-impact bottleneck
- Validate: Measure again to confirm improvement
- Repeat: Target the next bottleneck
Profiling Techniques
import cProfile
import pstats
import time
from functools import wraps

# Simple timing decorator (assumes application-level `metrics` and `logger` objects)
def profile_endpoint(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        duration = time.perf_counter() - start
        metrics.histogram("endpoint_duration", duration,
                          tags={"endpoint": func.__name__})
        if duration > 1.0:
            logger.warning(f"Slow endpoint: {func.__name__} "
                           f"took {duration:.3f}s")
        return result
    return wrapper
# Detailed CPU profiling
def profile_function(func, *args, **kwargs):
    profiler = cProfile.Profile()
    profiler.enable()
    result = func(*args, **kwargs)
    profiler.disable()
    stats = pstats.Stats(profiler)
    stats.sort_stats("cumulative")
    stats.print_stats(20)  # Top 20 time consumers
    return result
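As a quick usage sketch, directing the stats to a string makes the report easy to log or attach to a trace; the quadratic `slow_sum` workload below is purely illustrative:

```python
import cProfile
import io
import pstats

def slow_sum(n):
    # Deliberately quadratic so it shows up clearly in the profile
    total = 0
    for i in range(n):
        total += sum(range(i))
    return total

profiler = cProfile.Profile()
profiler.enable()
result = slow_sum(500)
profiler.disable()

# Capture the report in a string instead of printing to stdout
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(3)
report = stream.getvalue()
```

Sorting by `cumulative` surfaces the functions responsible for the most total time, including time spent in their callees, which is usually what you want when hunting an endpoint-level bottleneck.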
The N+1 Query Problem
The N+1 problem occurs when code executes one query to fetch N records, then executes N additional queries to fetch related data for each record. This is one of the most common performance killers in web applications.
# N+1 Problem: 1 + N queries
def get_orders_bad():
    orders = db.query("SELECT * FROM orders LIMIT 100")  # 1 query
    for order in orders:
        # N queries per relation (one per order!)
        order.customer = db.query(
            "SELECT * FROM customers WHERE id = %s", [order.customer_id]
        )
        order.items = db.query(
            "SELECT * FROM order_items WHERE order_id = %s", [order.id]
        )
    return orders  # Total: 1 + 100 + 100 = 201 queries!
# Solution 1: JOIN (single query)
def get_orders_join():
    return db.query("""
        SELECT o.*, c.name AS customer_name, c.email
        FROM orders o
        JOIN customers c ON o.customer_id = c.id
        LIMIT 100
    """)  # 1 query
# Solution 2: Batch loading (2-3 queries)
from collections import defaultdict

def get_orders_batch():
    orders = db.query("SELECT * FROM orders LIMIT 100")
    customer_ids = [o.customer_id for o in orders]
    order_ids = [o.id for o in orders]
    customers = db.query(
        "SELECT * FROM customers WHERE id IN %s", [tuple(customer_ids)]
    )
    items = db.query(
        "SELECT * FROM order_items WHERE order_id IN %s", [tuple(order_ids)]
    )
    customer_map = {c.id: c for c in customers}
    items_map = defaultdict(list)
    for item in items:
        items_map[item.order_id].append(item)
    for order in orders:
        order.customer = customer_map[order.customer_id]
        order.items = items_map[order.id]
    return orders  # Total: 3 queries
Connection Pooling
Creating a new database connection for every request is expensive (TCP handshake, TLS negotiation, authentication). Connection pooling maintains a pool of reusable connections.
from psycopg2 import pool

# Database connection pool
db_pool = pool.ThreadedConnectionPool(
    minconn=5,
    maxconn=20,
    host="db.example.com",
    database="myapp",
    user="app_user",
    password="secret"
)

def execute_query(sql, params=None):
    conn = db_pool.getconn()
    try:
        with conn.cursor() as cur:
            cur.execute(sql, params)
            result = cur.fetchall()
        conn.commit()
        return result
    finally:
        db_pool.putconn(conn)  # Return to pool, not close
# HikariCP connection pool (Java - Spring Boot, application.yml)
spring:
  datasource:
    hikari:
      minimum-idle: 5
      maximum-pool-size: 20
      idle-timeout: 300000       # ms
      connection-timeout: 30000  # ms
      max-lifetime: 1800000      # ms
      pool-name: MyAppPool
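On the Python side, a context-manager wrapper guarantees the connection goes back to the pool even when a query raises. This sketch uses a stand-in pool object with the same `getconn`/`putconn` interface as the psycopg2 pool above:

```python
from contextlib import contextmanager

@contextmanager
def pooled_connection(pool):
    """Check out a connection and always return it, even on error."""
    conn = pool.getconn()
    try:
        yield conn
    finally:
        pool.putconn(conn)

class FakePool:
    """Stand-in with the psycopg2 pool interface, for demonstration."""
    def __init__(self):
        self.available = ["conn-1"]
    def getconn(self):
        return self.available.pop()
    def putconn(self, conn):
        self.available.append(conn)

pool = FakePool()
try:
    with pooled_connection(pool) as conn:
        raise RuntimeError("query failed")  # simulated query error
except RuntimeError:
    pass  # the connection was still returned to the pool
```

Leaked connections are the most common pool failure mode: each code path that forgets `putconn` permanently shrinks the pool until requests block on checkout.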
Async Processing Patterns
import asyncio
import aiohttp

# Sequential: 10 API calls * 200ms = 2000ms total
async def fetch_sequential(urls):
    results = []
    async with aiohttp.ClientSession() as session:
        for url in urls:
            async with session.get(url) as response:
                results.append(await response.json())
    return results

# Parallel: 10 API calls run concurrently = ~200ms total
async def fetch_parallel(urls):
    async def fetch_one(session, url):
        # Context manager releases the connection back to the session
        async with session.get(url) as response:
            return await response.json()
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch_one(session, u) for u in urls))
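Firing every request at once can overwhelm the upstream service; a semaphore caps in-flight concurrency. A self-contained sketch (the `asyncio.sleep` simulates the HTTP round trip; in real code it would be an `aiohttp` request):

```python
import asyncio

async def fetch_one(url, sem):
    async with sem:  # at most max_concurrency requests in flight
        await asyncio.sleep(0.01)  # stands in for an HTTP round trip
        return f"body:{url}"

async def fetch_bounded(urls, max_concurrency=5):
    sem = asyncio.Semaphore(max_concurrency)
    # gather preserves input order in its results
    return await asyncio.gather(*(fetch_one(u, sem) for u in urls))

urls = [f"https://api.example.com/items/{i}" for i in range(10)]
results = asyncio.run(fetch_bounded(urls))
```

With a limit of 5, ten 200ms calls complete in roughly two waves (~400ms): slower than unbounded fan-out but far kinder to the downstream service and your own connection pool.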
# Offload to background queue
from celery import Celery

app = Celery("tasks", broker="redis://localhost:6379")

@app.task
def send_welcome_email(user_id):
    user = get_user(user_id)
    email_service.send(user.email, "Welcome!", template="welcome")

# API endpoint responds immediately
def create_user(request):
    user = db.create_user(request.json)
    send_welcome_email.delay(user.id)  # Non-blocking
    return {"id": user.id, "status": "created"}
Caching Strategies
| Strategy | Read Path | Write Path | Best For |
|---|---|---|---|
| Cache-Aside (Lazy) | Check cache; miss loads from DB | Write to DB, invalidate cache | General purpose, read-heavy |
| Read-Through | Cache auto-loads on miss | Write to DB, invalidate cache | Simplified cache management |
| Write-Through | Always read from cache | Write to cache and DB synchronously | Data that must be in cache |
| Write-Behind | Always read from cache | Write to cache; async flush to DB | Write-heavy, tolerates some data loss |
import redis
import json

cache = redis.Redis()
CACHE_TTL = 3600  # 1 hour

# Cache-Aside pattern
def get_user(user_id):
    cache_key = f"user:{user_id}"
    cached = cache.get(cache_key)
    if cached:
        return json.loads(cached)
    # Assumes the query returns a JSON-serializable dict
    user = db.query("SELECT * FROM users WHERE id = %s", [user_id])
    cache.setex(cache_key, CACHE_TTL, json.dumps(user))
    return user

def update_user(user_id, data):
    db.execute("UPDATE users SET name = %s WHERE id = %s",
               [data["name"], user_id])
    cache.delete(f"user:{user_id}")  # Invalidate cache
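The table above also lists write-through; for comparison, a minimal in-memory sketch (plain dicts stand in for Redis and the database) showing both stores updated synchronously on every write:

```python
class WriteThroughCache:
    """Write-through: every write hits cache and store synchronously."""
    def __init__(self, store):
        self.store = store           # stands in for the database
        self.cache = {}              # stands in for Redis

    def get(self, key):
        if key in self.cache:
            return self.cache[key]   # cache hit
        value = self.store.get(key)  # miss: fall back to the store
        if value is not None:
            self.cache[key] = value
        return value

    def put(self, key, value):
        self.store[key] = value      # write the DB first
        self.cache[key] = value      # then the cache, so reads never go stale

db = {"user:1": {"name": "Ada"}}
cache = WriteThroughCache(db)
cache.put("user:2", {"name": "Grace"})
```

The trade-off versus cache-aside: reads after a write are never stale, but every write pays the latency of both systems, which is why write-through suits data that must always be servable from cache.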
Performance optimization ties directly into latency reduction and throughput optimization, and load testing validates that your changes hold up under realistic traffic. Use the System Design Calculator to model the impact of caching on your system's performance.
Frequently Asked Questions
Q: How do I find the biggest performance bottleneck?
Start with APM tools (Datadog, New Relic, Jaeger) that show end-to-end request traces. Look at the flame graph to see where time is spent. Usually the bottleneck is database queries (slow queries, N+1), external API calls, or blocking I/O. Fix the slowest component first for maximum impact.
Q: When should I use caching vs optimizing the query?
Always optimize the query first — add indexes, rewrite inefficient queries, use JOINs instead of N+1. Caching should be a second layer for data that is read frequently and changes infrequently. Caching masks problems; query optimization solves them.
Q: How do I handle cache invalidation?
There are only two hard things in computer science: cache invalidation and naming things. Use TTL-based expiration as a safety net. For active invalidation, use the outbox pattern or CDC to publish change events that trigger cache invalidation. For critical data, use write-through caching to keep cache and DB in sync.