# Handling Partial Failures in Distributed Systems
Partial failures are the defining challenge of distributed systems. Unlike single-machine programs where the system either works or crashes completely, distributed systems can experience failures where some components work while others fail, and you often cannot tell the difference between a slow response and a failed one. This guide covers the essential patterns for building systems that gracefully handle partial failures.
## What Are Partial Failures?
A partial failure occurs when part of a distributed system fails while the rest continues to operate. Examples include:
- A network request times out — did the server receive it? Did it process it? You cannot tell.
- One of five database replicas becomes unresponsive
- A downstream microservice returns errors while others work fine
- A single availability zone in your cloud provider goes down
- A garbage collection pause makes a node temporarily unresponsive
## Timeout Handling
Timeouts are the first line of defense against partial failures. Without timeouts, a client waiting for a response from a failed server will wait indefinitely, consuming resources and potentially causing cascading failures.
```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

class ResilientHTTPClient:
    def __init__(self):
        self.session = requests.Session()
        # Configure transport-level retries for idempotent methods only;
        # POST is deliberately excluded because retrying it can duplicate writes
        retry_strategy = Retry(
            total=3,
            backoff_factor=1,
            status_forcelist=[502, 503, 504],
            allowed_methods=["GET", "PUT", "DELETE"],
        )
        adapter = HTTPAdapter(max_retries=retry_strategy)
        self.session.mount("http://", adapter)
        self.session.mount("https://", adapter)

    def get(self, url, **kwargs):
        kwargs.setdefault("timeout", (3, 10))  # (connect, read) seconds
        return self.session.get(url, **kwargs)

    def post(self, url, **kwargs):
        kwargs.setdefault("timeout", (3, 30))  # longer read timeout for writes
        return self.session.post(url, **kwargs)
```
| Timeout Type | Description | Typical Value |
|---|---|---|
| Connection Timeout | Time to establish TCP connection | 1-5 seconds |
| Read Timeout | Time to wait for response data | 5-30 seconds |
| Overall Timeout | Total time for the entire operation | 30-60 seconds |
| Idle Timeout | Time before closing an idle connection | 60-300 seconds |
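Note that the `(connect, read)` tuple used above bounds the two phases separately; `requests` has no built-in cap on total elapsed time, so a response that trickles in slowly can run far past connect + read. One way to enforce the overall timeout from the table is to run the call in a worker thread and bound how long the caller waits. A minimal sketch (the function name and pool size are illustrative):

```python
import concurrent.futures

# Shared worker pool; sized arbitrarily for illustration
_pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)

def with_overall_timeout(func, seconds):
    # Bound how long the caller waits for func(); the worker thread
    # may keep running in the background after the timeout fires
    future = _pool.submit(func)
    return future.result(timeout=seconds)
```

For example, `with_overall_timeout(lambda: client.get(url), 30)` caps the caller's wait at 30 seconds even when individual socket reads keep arriving just under the read timeout.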
## Retry with Exponential Backoff and Jitter
When a request fails, retrying immediately often makes things worse (thundering herd). Exponential backoff with jitter spreads retries over time, reducing load on the failing system.
```python
import random
import time

def retry_with_backoff(func, max_retries=5, base_delay=1.0,
                       max_delay=60.0, jitter=True,
                       retryable=(ConnectionError, TimeoutError)):
    for attempt in range(max_retries):
        try:
            return func()
        except retryable:
            if attempt == max_retries - 1:
                raise  # out of attempts: let the caller handle the failure
            delay = min(base_delay * (2 ** attempt), max_delay)
            if jitter:
                # Full jitter: random value between 0 and the calculated delay
                delay = random.uniform(0, delay)
            print(f"Attempt {attempt + 1} failed. "
                  f"Retrying in {delay:.2f}s...")
            time.sleep(delay)

# Usage: requests raises its own exception hierarchy (not the builtin
# ConnectionError/TimeoutError), so pass the exceptions to retry explicitly
result = retry_with_backoff(
    lambda: requests.get("https://api.example.com/data", timeout=10),
    max_retries=3,
    base_delay=1.0,
    retryable=(requests.exceptions.RequestException,),
)
```
Jitter strategies:
- Full jitter: `random(0, base * 2^attempt)` (best for reducing contention)
- Equal jitter: `base * 2^attempt / 2 + random(0, base * 2^attempt / 2)` (guarantees a minimum delay)
- Decorrelated jitter: `min(max_delay, random(base, last_delay * 3))` (recommended by AWS)
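The decorrelated variant is stateful: each delay is drawn relative to the previous delay rather than the attempt number, which spreads clients out even further. A small sketch (the function name is illustrative):

```python
import random

def decorrelated_jitter_delays(base=1.0, max_delay=60.0, attempts=5):
    """Successive sleep durations using decorrelated jitter:
    each delay is drawn from [base, previous_delay * 3], capped at max_delay."""
    delays = []
    previous = base
    for _ in range(attempts):
        previous = min(max_delay, random.uniform(base, previous * 3))
        delays.append(previous)
    return delays
```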
## Circuit Breaker Integration
When retries fail repeatedly, the circuit breaker pattern stops trying and fails fast, giving the downstream service time to recover.
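The client below relies on a `CircuitBreaker` class and a `CircuitOpenError` exception that are not defined elsewhere in this guide. One minimal, illustrative implementation of the classic three-state breaker (closed, open, half-open):

```python
import time

class CircuitOpenError(Exception):
    """Raised when the breaker is open and calls are rejected fast."""

class CircuitBreaker:
    """Minimal three-state breaker: opens after failure_threshold
    consecutive failures; after recovery_timeout seconds it allows one
    trial call (half-open) and closes again on success."""

    def __init__(self, failure_threshold=5, recovery_timeout=30):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.recovery_timeout:
                raise CircuitOpenError("circuit is open, failing fast")
            # Recovery window elapsed: allow one trial call (half-open)
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise
        # Success: reset the breaker
        self.failures = 0
        self.opened_at = None
        return result
```

A production breaker would add thread safety and failure-rate windows, but this shape is enough to follow the integration below.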
```python
class ResilientServiceClient:
    def __init__(self, service_name, base_url):
        self.circuit_breaker = CircuitBreaker(
            failure_threshold=5,
            recovery_timeout=30,
        )
        self.base_url = base_url
        self.service_name = service_name
        self.cache = {}  # in-memory fallback cache; use Redis etc. in production

    def call(self, endpoint, method="GET", **kwargs):
        def make_request():
            response = requests.request(
                method,
                f"{self.base_url}{endpoint}",
                timeout=(3, 10),
                **kwargs,
            )
            response.raise_for_status()
            return response.json()

        try:
            return retry_with_backoff(
                lambda: self.circuit_breaker.call(make_request),
                max_retries=3,
                base_delay=0.5,
            )
        except CircuitOpenError:
            # Circuit is open: fail fast without touching the network
            return self.fallback(endpoint)
        except Exception:
            # Retries exhausted
            return self.fallback(endpoint)

    def fallback(self, endpoint):
        cached = self.cache.get(f"{self.service_name}:{endpoint}")
        if cached:
            return cached
        return {"error": "Service unavailable", "fallback": True}
```
## Bulkhead Pattern
The bulkhead pattern isolates different parts of your system so that a failure in one part does not bring down the entire system. Named after the watertight compartments in a ship's hull.
```python
from concurrent.futures import ThreadPoolExecutor
import threading

class BulkheadFullError(Exception):
    """Raised when a service's bulkhead has no capacity left."""

class BulkheadExecutor:
    def __init__(self):
        # Separate thread pools for different services
        self.pools = {
            "payment": ThreadPoolExecutor(max_workers=10,
                                          thread_name_prefix="payment"),
            "inventory": ThreadPoolExecutor(max_workers=15,
                                            thread_name_prefix="inventory"),
            "email": ThreadPoolExecutor(max_workers=5,
                                        thread_name_prefix="email"),
        }
        self.semaphores = {
            "payment": threading.Semaphore(10),
            "inventory": threading.Semaphore(15),
            "email": threading.Semaphore(5),
        }

    def execute(self, service_name, func, timeout=10):
        pool = self.pools.get(service_name)
        if not pool:
            raise ValueError(f"Unknown service: {service_name}")
        semaphore = self.semaphores[service_name]
        # Fail fast instead of queueing behind a saturated pool
        if not semaphore.acquire(timeout=1):
            raise BulkheadFullError(f"{service_name} bulkhead is full")
        try:
            future = pool.submit(func)
            return future.result(timeout=timeout)
        finally:
            semaphore.release()

# If the payment service is slow, only its 10 threads are blocked;
# the inventory and email pools continue operating normally
bulkhead = BulkheadExecutor()
payment_result = bulkhead.execute("payment",
                                  lambda: payment_service.charge(order))
```
## Graceful Degradation
Instead of failing completely when a dependency is unavailable, serve a reduced but functional experience:
| Failed Service | Full Experience | Degraded Experience |
|---|---|---|
| Recommendation Engine | Personalized recommendations | Show trending/popular items |
| Search Service | Full-text search with facets | Category browsing only |
| User Profile Service | Full profile with history | Basic info from JWT token |
| Pricing Service | Dynamic pricing | Last known cached prices |
| Image CDN | High-res product images | Placeholder images or text descriptions |
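As a sketch of the first row, a hypothetical recommendations endpoint that degrades to trending items when the personalization service fails (both client objects are stand-ins, not real APIs):

```python
def get_recommendations(user_id, rec_client, trending_client):
    """Try personalized recommendations; degrade to trending items.
    rec_client and trending_client are hypothetical service wrappers."""
    try:
        return {"items": rec_client.personalized(user_id), "degraded": False}
    except Exception:
        # Trending items are cheap and cacheable: a reduced but
        # functional experience instead of an error page
        return {"items": trending_client.popular(), "degraded": True}
```

Surfacing a `degraded` flag lets the frontend adjust copy ("Popular right now") and lets monitoring track how often users see the fallback.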
## Real-World Example: Netflix
Netflix is the gold standard for partial failure handling. Their system includes:
- Circuit breakers on every service call (via Hystrix/Resilience4j)
- Bulkheads isolating thread pools per dependency
- Fallbacks at every level (personalized → regional → global defaults)
- Chaos engineering (Chaos Monkey) to regularly test failure handling
- Adaptive concurrency limits that automatically detect and shed load
For more resilience patterns, see circuit breakers, rate limiting, and saga patterns for transactional resilience.
## Frequently Asked Questions
**Q: How do I decide which operations to retry?**
Only retry idempotent operations. Reads (GET) are generally safe to retry. Writes should carry idempotency keys so retries do not create duplicates. Never retry non-idempotent operations such as a plain POST without an idempotency mechanism.
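A sketch of the server side of an idempotency key, with an in-memory dict standing in for a durable store:

```python
processed = {}  # idempotency_key -> result; a real service uses a durable store

def handle_charge(idempotency_key, amount_cents):
    # A retried request carrying the same key returns the original
    # result instead of charging the card twice
    if idempotency_key in processed:
        return processed[idempotency_key]
    result = {"charged": amount_cents, "status": "ok"}  # stand-in for real work
    processed[idempotency_key] = result
    return result
```

The client generates the key once (for example, a UUID) and reuses it on every retry of the same logical operation; payment APIs such as Stripe accept such a key as a request header.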
**Q: What is the difference between a timeout and a deadline?**
A timeout is a duration (e.g., wait 5 seconds). A deadline is an absolute time (e.g., must complete by 14:30:05). Deadlines are better in multi-hop service calls because they account for time already spent. gRPC uses deadlines: if a request passes through 3 services, each service knows how much time remains.
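A sketch of deadline propagation: the caller fixes one absolute deadline and every hop converts it to a local timeout (the helper name is illustrative):

```python
import time

def remaining_budget(deadline):
    # Convert an absolute deadline into the timeout for the next hop;
    # the deadline itself is passed along the call chain unchanged
    remaining = deadline - time.monotonic()
    if remaining <= 0:
        raise TimeoutError("deadline already expired")
    return remaining

# Caller sets one deadline for the whole multi-hop operation:
#   deadline = time.monotonic() + 5.0
#   hop 1: requests.get(url_a, timeout=remaining_budget(deadline))
#   hop 2: requests.get(url_b, timeout=remaining_budget(deadline))
```

Each hop automatically gets a smaller timeout than the one before it, so a slow first hop cannot cause the overall operation to exceed its budget.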
**Q: How do I test partial failure handling?**
Use chaos engineering tools: Netflix Chaos Monkey (random instance termination), Gremlin (controlled failure injection), Toxiproxy (network condition simulation). In tests, inject failures using dependency injection: replace real clients with ones that randomly fail, timeout, or return errors.
**Q: Should I use sync or async communication to handle partial failures?**
Asynchronous communication (via message queues) naturally handles partial failures better because the sender does not wait for the receiver. If the receiver is down, messages queue up and are processed when it recovers. Use async for operations that do not need an immediate response. Use sync with circuit breakers and timeouts when immediate response is required.
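A toy in-process illustration of this buffering behavior, with `queue.Queue` standing in for a durable broker such as RabbitMQ or SQS:

```python
import queue

outbox = queue.Queue()  # stand-in for a durable message broker

def send_email(message):
    # The producer never waits on the receiver: it only enqueues,
    # so a down consumer does not block or fail the caller
    outbox.put(message)

def drain(handler):
    """Run when the consumer recovers: process everything that queued up."""
    processed = 0
    while not outbox.empty():
        handler(outbox.get())
        processed += 1
    return processed
```

With a real broker the queue survives process restarts, which is what makes this pattern robust against the receiver being down for minutes or hours.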