# Handling Partial Failures in Distributed Systems
Partial failures are the defining challenge of distributed systems. Unlike single-machine programs where the system either works or crashes completely, distributed systems can experience failures where some components work while others fail, and you often cannot tell the difference between a slow response and a failed one. This guide covers the essential patterns for building systems that gracefully handle partial failures.
## What Are Partial Failures?
A partial failure occurs when part of a distributed system fails while the rest continues to operate. Examples include:
- A network request times out — did the server receive it? Did it process it? You cannot tell.
- One of five database replicas becomes unresponsive
- A downstream microservice returns errors while others work fine
- A single availability zone in your cloud provider goes down
- A garbage collection pause makes a node temporarily unresponsive
## Timeout Handling
Timeouts are the first line of defense against partial failures. Without timeouts, a client waiting for a response from a failed server will wait indefinitely, consuming resources and potentially causing cascading failures.
```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

class ResilientHTTPClient:
    def __init__(self):
        self.session = requests.Session()
        # Configure transport-level retries for idempotent methods only;
        # POST is deliberately excluded because retrying it can duplicate writes
        retry_strategy = Retry(
            total=3,
            backoff_factor=1,
            status_forcelist=[502, 503, 504],
            allowed_methods=["GET", "PUT", "DELETE"],
        )
        adapter = HTTPAdapter(max_retries=retry_strategy)
        self.session.mount("http://", adapter)
        self.session.mount("https://", adapter)

    def get(self, url, **kwargs):
        kwargs.setdefault("timeout", (3, 10))  # (connect, read) seconds
        return self.session.get(url, **kwargs)

    def post(self, url, **kwargs):
        kwargs.setdefault("timeout", (3, 30))  # longer read timeout for writes
        return self.session.post(url, **kwargs)
```
| Timeout Type | Description | Typical Value |
|---|---|---|
| Connection Timeout | Time to establish TCP connection | 1-5 seconds |
| Read Timeout | Time to wait for response data | 5-30 seconds |
| Overall Timeout | Total time for the entire operation | 30-60 seconds |
| Idle Timeout | Time before closing an idle connection | 60-300 seconds |
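Note that the `(connect, read)` tuple used above bounds the two phases separately; `requests` has no built-in cap on total elapsed time, so a response that trickles in slowly can run far past connect + read. One way to enforce the overall timeout from the table is to run the call in a worker thread and bound how long the caller waits. A minimal sketch (the function name and pool size are illustrative):

```python
import concurrent.futures

# Shared worker pool; sized arbitrarily for illustration
_pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)

def with_overall_timeout(func, seconds):
    # Bound how long the caller waits for func(); the worker thread
    # may keep running in the background after the timeout fires
    future = _pool.submit(func)
    return future.result(timeout=seconds)
```

For example, `with_overall_timeout(lambda: client.get(url), 30)` caps the caller's wait at 30 seconds even when individual socket reads keep arriving just under the read timeout.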
## Retry with Exponential Backoff and Jitter
When a request fails, retrying immediately often makes things worse (thundering herd). Exponential backoff with jitter spreads retries over time, reducing load on the failing system.
```python
import random
import time

def retry_with_backoff(func, max_retries=5, base_delay=1.0,
                       max_delay=60.0, jitter=True,
                       retryable=(ConnectionError, TimeoutError)):
    for attempt in range(max_retries):
        try:
            return func()
        except retryable:
            if attempt == max_retries - 1:
                raise  # out of attempts: let the caller handle the failure
            delay = min(base_delay * (2 ** attempt), max_delay)
            if jitter:
                # Full jitter: random value between 0 and the calculated delay
                delay = random.uniform(0, delay)
            print(f"Attempt {attempt + 1} failed. "
                  f"Retrying in {delay:.2f}s...")
            time.sleep(delay)

# Usage: requests raises its own exception hierarchy (not the builtin
# ConnectionError/TimeoutError), so pass the exceptions to retry explicitly
result = retry_with_backoff(
    lambda: requests.get("https://api.example.com/data", timeout=10),
    max_retries=3,
    base_delay=1.0,
    retryable=(requests.exceptions.RequestException,),
)
```
Jitter strategies:
- Full jitter: `random(0, base * 2^attempt)` (best for reducing contention)
- Equal jitter: `base * 2^attempt / 2 + random(0, base * 2^attempt / 2)` (guarantees a minimum delay)
- Decorrelated jitter: `min(max_delay, random(base, last_delay * 3))` (recommended by AWS)
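The decorrelated variant is stateful: each delay is drawn relative to the previous delay rather than the attempt number, which spreads clients out even further. A small sketch (the function name is illustrative):

```python
import random

def decorrelated_jitter_delays(base=1.0, max_delay=60.0, attempts=5):
    """Successive sleep durations using decorrelated jitter:
    each delay is drawn from [base, previous_delay * 3], capped at max_delay."""
    delays = []
    previous = base
    for _ in range(attempts):
        previous = min(max_delay, random.uniform(base, previous * 3))
        delays.append(previous)
    return delays
```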
## Circuit Breaker Integration
When retries fail repeatedly, the circuit breaker pattern stops trying and fails fast, giving the downstream service time to recover.
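The client below relies on a `CircuitBreaker` class and a `CircuitOpenError` exception that are not defined elsewhere in this guide. One minimal, illustrative implementation of the classic three-state breaker (closed, open, half-open):

```python
import time

class CircuitOpenError(Exception):
    """Raised when the breaker is open and calls are rejected fast."""

class CircuitBreaker:
    """Minimal three-state breaker: opens after failure_threshold
    consecutive failures; after recovery_timeout seconds it allows one
    trial call (half-open) and closes again on success."""

    def __init__(self, failure_threshold=5, recovery_timeout=30):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.recovery_timeout:
                raise CircuitOpenError("circuit is open, failing fast")
            # Recovery window elapsed: allow one trial call (half-open)
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise
        # Success: reset the breaker
        self.failures = 0
        self.opened_at = None
        return result
```

A production breaker would add thread safety and failure-rate windows, but this shape is enough to follow the integration below.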
```python
class ResilientServiceClient:
    def __init__(self, service_name, base_url):
        self.circuit_breaker = CircuitBreaker(
            failure_threshold=5,
            recovery_timeout=30,
        )
        self.base_url = base_url
        self.service_name = service_name
        self.cache = {}  # in-memory fallback cache; use Redis etc. in production

    def call(self, endpoint, method="GET", **kwargs):
        def make_request():
            response = requests.request(
                method,
                f"{self.base_url}{endpoint}",
                timeout=(3, 10),
                **kwargs,
            )
            response.raise_for_status()
            return response.json()

        try:
            return retry_with_backoff(
                lambda: self.circuit_breaker.call(make_request),
                max_retries=3,
                base_delay=0.5,
            )
        except CircuitOpenError:
            # Circuit is open: fail fast without touching the network
            return self.fallback(endpoint)
        except Exception:
            # Retries exhausted
            return self.fallback(endpoint)

    def fallback(self, endpoint):
        cached = self.cache.get(f"{self.service_name}:{endpoint}")
        if cached:
            return cached
        return {"error": "Service unavailable", "fallback": True}
```
## Bulkhead Pattern
The bulkhead pattern isolates different parts of your system so that a failure in one part does not bring down the entire system. Named after the watertight compartments in a ship's hull.
```python
from concurrent.futures import ThreadPoolExecutor
import threading

class BulkheadFullError(Exception):
    """Raised when a service's bulkhead has no capacity left."""

class BulkheadExecutor:
    def __init__(self):
        # Separate thread pools for different services
        self.pools = {
            "payment": ThreadPoolExecutor(max_workers=10,
                                          thread_name_prefix="payment"),
            "inventory": ThreadPoolExecutor(max_workers=15,
                                            thread_name_prefix="inventory"),
            "email": ThreadPoolExecutor(max_workers=5,
                                        thread_name_prefix="email"),
        }
        self.semaphores = {
            "payment": threading.Semaphore(10),
            "inventory": threading.Semaphore(15),
            "email": threading.Semaphore(5),
        }

    def execute(self, service_name, func, timeout=10):
        pool = self.pools.get(service_name)
        if not pool:
            raise ValueError(f"Unknown service: {service_name}")
        semaphore = self.semaphores[service_name]
        # Fail fast instead of queueing behind a saturated pool
        if not semaphore.acquire(timeout=1):
            raise BulkheadFullError(f"{service_name} bulkhead is full")
        try:
            future = pool.submit(func)
            return future.result(timeout=timeout)
        finally:
            semaphore.release()

# If the payment service is slow, only its 10 threads are blocked;
# the inventory and email pools continue operating normally
bulkhead = BulkheadExecutor()
payment_result = bulkhead.execute("payment",
                                  lambda: payment_service.charge(order))
```
## Graceful Degradation
Instead of failing completely when a dependency is unavailable, serve a reduced but functional experience:
| Failed Service | Full Experience | Degraded Experience |
|---|---|---|
| Recommendation Engine | Personalized recommendations | Show trending/popular items |
| Search Service | Full-text search with facets | Category browsing only |
| User Profile Service | Full profile with history | Basic info from JWT token |
| Pricing Service | Dynamic pricing | Last known cached prices |
| Image CDN | High-res product images | Placeholder images or text descriptions |
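As a sketch of the first row, a hypothetical recommendations endpoint that degrades to trending items when the personalization service fails (both client objects are stand-ins, not real APIs):

```python
def get_recommendations(user_id, rec_client, trending_client):
    """Try personalized recommendations; degrade to trending items.
    rec_client and trending_client are hypothetical service wrappers."""
    try:
        return {"items": rec_client.personalized(user_id), "degraded": False}
    except Exception:
        # Trending items are cheap and cacheable: a reduced but
        # functional experience instead of an error page
        return {"items": trending_client.popular(), "degraded": True}
```

Surfacing a `degraded` flag lets the frontend adjust copy ("Popular right now") and lets monitoring track how often users see the fallback.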
## Real-World Example: Netflix
Netflix is the gold standard for partial failure handling. Their system includes:
- Circuit breakers on every service call (via Hystrix/Resilience4j)
- Bulkheads isolating thread pools per dependency
- Fallbacks at every level (personalized → regional → global defaults)
- Chaos engineering (Chaos Monkey) to regularly test failure handling
- Adaptive concurrency limits that automatically detect and shed load
For more resilience patterns, see circuit breakers, rate limiting, and saga patterns for transactional resilience.
## Frequently Asked Questions
**Q: How do I decide which operations to retry?**
Only retry idempotent operations. Reads (GET) are generally safe to retry. Writes should carry idempotency keys so retries do not create duplicates. Never retry non-idempotent operations such as a plain POST without an idempotency mechanism.
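A sketch of the server side of an idempotency key, with an in-memory dict standing in for a durable store:

```python
processed = {}  # idempotency_key -> result; a real service uses a durable store

def handle_charge(idempotency_key, amount_cents):
    # A retried request carrying the same key returns the original
    # result instead of charging the card twice
    if idempotency_key in processed:
        return processed[idempotency_key]
    result = {"charged": amount_cents, "status": "ok"}  # stand-in for real work
    processed[idempotency_key] = result
    return result
```

The client generates the key once (for example, a UUID) and reuses it on every retry of the same logical operation; payment APIs such as Stripe accept such a key as a request header.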
**Q: What is the difference between a timeout and a deadline?**
A timeout is a duration (e.g., wait 5 seconds). A deadline is an absolute time (e.g., must complete by 14:30:05). Deadlines are better in multi-hop service calls because they account for time already spent. gRPC uses deadlines: if a request passes through 3 services, each service knows how much time remains.
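A sketch of deadline propagation: the caller fixes one absolute deadline and every hop converts it to a local timeout (the helper name is illustrative):

```python
import time

def remaining_budget(deadline):
    # Convert an absolute deadline into the timeout for the next hop;
    # the deadline itself is passed along the call chain unchanged
    remaining = deadline - time.monotonic()
    if remaining <= 0:
        raise TimeoutError("deadline already expired")
    return remaining

# Caller sets one deadline for the whole multi-hop operation:
#   deadline = time.monotonic() + 5.0
#   hop 1: requests.get(url_a, timeout=remaining_budget(deadline))
#   hop 2: requests.get(url_b, timeout=remaining_budget(deadline))
```

Each hop automatically gets a smaller timeout than the one before it, so a slow first hop cannot cause the overall operation to exceed its budget.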
**Q: How do I test partial failure handling?**
Use chaos engineering tools: Netflix Chaos Monkey (random instance termination), Gremlin (controlled failure injection), Toxiproxy (network condition simulation). In tests, inject failures using dependency injection: replace real clients with ones that randomly fail, timeout, or return errors.
**Q: Should I use sync or async communication to handle partial failures?**
Asynchronous communication (via message queues) naturally handles partial failures better because the sender does not wait for the receiver. If the receiver is down, messages queue up and are processed when it recovers. Use async for operations that do not need an immediate response. Use sync with circuit breakers and timeouts when immediate response is required.
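A toy in-process illustration of this buffering behavior, with `queue.Queue` standing in for a durable broker such as RabbitMQ or SQS:

```python
import queue

outbox = queue.Queue()  # stand-in for a durable message broker

def send_email(message):
    # The producer never waits on the receiver: it only enqueues,
    # so a down consumer does not block or fail the caller
    outbox.put(message)

def drain(handler):
    """Run when the consumer recovers: process everything that queued up."""
    processed = 0
    while not outbox.empty():
        handler(outbox.get())
        processed += 1
    return processed
```

With a real broker the queue survives process restarts, which is what makes this pattern robust against the receiver being down for minutes or hours.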