Circuit Breaker Pattern: Building Resilient Distributed Systems

The circuit breaker pattern prevents cascading failures in distributed systems by stopping requests to a failing service. Like an electrical circuit breaker that trips to prevent damage, a software circuit breaker detects failures and short-circuits calls to give the failing service time to recover. This pattern is essential for building fault-tolerant microservices architectures.

Why Circuit Breakers Are Essential

In a microservices architecture, a single failing service can bring down the entire system. Without a circuit breaker:

  • Threads pile up waiting for timeouts from the failing service
  • Resource pools (connection pools, thread pools) become exhausted
  • Upstream services start failing due to resource starvation
  • A localized failure cascades into a system-wide outage

The circuit breaker works alongside rate limiting and partial failure handling to create resilient systems.

Circuit Breaker States

A circuit breaker has three states:

  • Closed: Normal operation. Requests pass through and failures are counted. Failure count exceeds the threshold → Open.
  • Open: All requests are immediately rejected with a fallback response. Recovery timeout expires → Half-Open.
  • Half-Open: A limited number of test requests are allowed through. Tests succeed → Closed; a test fails → Open.

Implementation in Python

import time
import threading
from enum import Enum

class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"

class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=30,
                 half_open_max_calls=3):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.half_open_max_calls = half_open_max_calls

        self.state = State.CLOSED
        self.failure_count = 0
        self.success_count = 0
        self.last_failure_time = None
        self.half_open_calls = 0
        self.lock = threading.Lock()

    def call(self, func, *args, **kwargs):
        with self.lock:
            if self.state == State.OPEN:
                if self._should_attempt_reset():
                    self.state = State.HALF_OPEN
                    self.half_open_calls = 0
                    self.success_count = 0  # reset so stale successes can't close the circuit
                else:
                    raise CircuitOpenError("Circuit is open")

            if self.state == State.HALF_OPEN:
                if self.half_open_calls >= self.half_open_max_calls:
                    raise CircuitOpenError("Half-open limit reached")
                self.half_open_calls += 1

        try:
            result = func(*args, **kwargs)
            self._on_success()
            return result
        except Exception:
            self._on_failure()
            raise

    def _on_success(self):
        with self.lock:
            if self.state == State.HALF_OPEN:
                self.success_count += 1
                if self.success_count >= self.half_open_max_calls:
                    self._reset()
            else:
                self.failure_count = 0

    def _on_failure(self):
        with self.lock:
            self.failure_count += 1
            self.last_failure_time = time.monotonic()
            if self.state == State.HALF_OPEN:
                self.state = State.OPEN
            elif self.failure_count >= self.failure_threshold:
                self.state = State.OPEN

    def _should_attempt_reset(self):
        return (time.monotonic() - self.last_failure_time
                >= self.recovery_timeout)

    def _reset(self):
        self.state = State.CLOSED
        self.failure_count = 0
        self.success_count = 0
        self.half_open_calls = 0

class CircuitOpenError(Exception):
    pass

Using the Circuit Breaker

import requests

cb = CircuitBreaker(failure_threshold=5, recovery_timeout=30)

def call_payment_service(order_id, amount):
    try:
        result = cb.call(
            requests.post,
            "https://payment-service/charge",
            json={"order_id": order_id, "amount": amount},
            timeout=5
        )
        return result.json()
    except CircuitOpenError:
        return {"status": "pending", "message": "Payment service unavailable"}
    except requests.RequestException:
        return {"status": "error", "message": "Payment failed"}
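
The requests example above needs a live endpoint. To watch the state machine work locally, here is a self-contained sketch using a condensed breaker and a stubbed failing service (the class name and thresholds are illustrative, not part of the implementation above):

```python
import time

class MiniBreaker:
    """Condensed demonstration breaker: closed -> open -> half-open -> closed."""
    def __init__(self, failure_threshold=3, recovery_timeout=0.1):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.state = "closed"
        self.failures = 0
        self.opened_at = None

    def call(self, func):
        if self.state == "open":
            if time.monotonic() - self.opened_at >= self.recovery_timeout:
                self.state = "half_open"       # timeout expired: allow a probe
            else:
                raise RuntimeError("circuit open")
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.state == "half_open" or self.failures >= self.failure_threshold:
                self.state = "open"
                self.opened_at = time.monotonic()
            raise
        else:
            self.state = "closed"              # success closes the circuit
            self.failures = 0
            return result

cb = MiniBreaker()

def failing():
    raise ConnectionError("service down")

# Three consecutive failures trip the breaker...
for _ in range(3):
    try:
        cb.call(failing)
    except ConnectionError:
        pass
print(cb.state)            # open

# ...further calls are rejected immediately, without touching the service.
try:
    cb.call(failing)
except RuntimeError as exc:
    print(exc)             # circuit open

# After the recovery timeout, a successful probe closes the circuit again.
time.sleep(0.15)
print(cb.call(lambda: "ok"), cb.state)   # ok closed
```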

Resilience4j (Java)

Resilience4j is the modern successor to Netflix Hystrix for Java applications. It provides a lightweight, functional circuit breaker implementation:

import io.github.resilience4j.circuitbreaker.CallNotPermittedException;
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import io.vavr.control.Try;

import java.time.Duration;
import java.util.function.Supplier;

CircuitBreakerConfig config = CircuitBreakerConfig.custom()
    .failureRateThreshold(50)              // 50% failure rate
    .waitDurationInOpenState(Duration.ofSeconds(30))
    .slidingWindowSize(10)                  // last 10 calls
    .permittedNumberOfCallsInHalfOpenState(3)
    .build();

CircuitBreaker circuitBreaker = CircuitBreaker.of("paymentService", config);

Supplier<String> decoratedSupplier = CircuitBreaker
    .decorateSupplier(circuitBreaker, () -> paymentService.charge(orderId));

Try<String> result = Try.ofSupplier(decoratedSupplier)
    .recover(CallNotPermittedException.class,
             e -> "Payment service temporarily unavailable");

Netflix Hystrix (Legacy)

While Hystrix is now in maintenance mode, understanding it is valuable because many systems still use it:

public class PaymentCommand extends HystrixCommand<PaymentResult> {
    private final String orderId;

    public PaymentCommand(String orderId) {
        super(HystrixCommandGroupKey.Factory.asKey("PaymentService"));
        this.orderId = orderId;
    }

    @Override
    protected PaymentResult run() {
        return paymentService.charge(orderId);
    }

    @Override
    protected PaymentResult getFallback() {
        return new PaymentResult("pending", "Service unavailable");
    }
}

Configuration Parameters

  • Failure Threshold: Number or percentage of failures before opening. Typical: 50% failure rate or 5 consecutive failures.
  • Recovery Timeout: Time to wait before moving to half-open. Typical: 30-60 seconds.
  • Sliding Window Size: Number of recent calls to evaluate. Typical: 10-100 calls.
  • Half-Open Calls: Test calls allowed in the half-open state. Typical: 3-5 calls.
  • Slow Call Threshold: Duration beyond which a call counts as slow. Typical: 2-5 seconds.

Real-World Example: Netflix

Netflix pioneered the circuit breaker pattern in microservices. With over 1,000 microservices, a failure in one service (e.g., the recommendation engine) could cascade to the entire platform. Netflix uses circuit breakers at every service boundary:

  • If the recommendation service fails, the UI shows a generic top-10 list instead
  • If the personalization service is slow, the circuit opens and serves cached data
  • Each microservice has its own circuit breaker configuration tuned to its SLA

This approach connects with service discovery for routing around failed instances and distributed transaction patterns for maintaining data consistency during partial outages.

Monitoring and Observability

A circuit breaker is only useful if you can observe its state. Key metrics to track:

  • Circuit state transitions: Alert when a circuit opens
  • Failure rate: Track the percentage of failed calls per service
  • Fallback invocations: Monitor how often fallbacks are triggered
  • Response time distribution: Detect slow calls before they trigger the breaker
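
As a sketch of what such instrumentation can look like (the class and callback names here are illustrative, not part of the implementation above), a breaker can accept a transition callback and expose its running failure rate:

```python
transitions = []

def record(name, old, new):
    # In production, emit to StatsD/Prometheus here and alert when new == "open".
    transitions.append((name, old, new))

class InstrumentedBreaker:
    """Sketch: a breaker that reports state changes and tracks failure rate."""
    def __init__(self, name, failure_threshold=2, on_transition=record):
        self.name = name
        self.failure_threshold = failure_threshold
        self.on_transition = on_transition
        self.state = "closed"
        self.calls = 0
        self.failures = 0

    def _transition(self, new_state):
        if new_state != self.state:
            self.on_transition(self.name, self.state, new_state)
            self.state = new_state

    def call(self, func):
        if self.state == "open":
            raise RuntimeError("circuit open")
        self.calls += 1
        try:
            return func()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self._transition("open")
            raise

    def failure_rate(self):
        return self.failures / self.calls if self.calls else 0.0

cb = InstrumentedBreaker("payments")
for _ in range(2):
    try:
        cb.call(lambda: 1 / 0)
    except ZeroDivisionError:
        pass

print(transitions)        # [('payments', 'closed', 'open')]
print(cb.failure_rate())  # 1.0
```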

Use the System Design Calculator to model the impact of circuit breaker settings on your system availability.

Frequently Asked Questions

Q: When should I use a circuit breaker vs a retry?

Use retries for transient failures (network blips, momentary timeouts). Use circuit breakers when a service is consistently failing. In practice, combine both: retry a few times, and if failures persist, the circuit breaker opens to prevent further attempts. This is part of a broader partial failure handling strategy.
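
One way to combine the two (a sketch; `SimpleBreaker` and `with_retries` are illustrative helpers, not a library API) is to route every retry attempt through the breaker, so persistent failures trip it and cut the retry loop short:

```python
import time

class CircuitOpen(Exception):
    pass

class SimpleBreaker:
    """Minimal consecutive-failure breaker used to illustrate layering retries."""
    def __init__(self, failure_threshold=2):
        self.failure_threshold = failure_threshold
        self.failures = 0
        self.open = False

    def call(self, func):
        if self.open:
            raise CircuitOpen("circuit open")
        try:
            result = func()
            self.failures = 0
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.open = True
            raise

def with_retries(breaker, func, attempts=3, base_delay=0.01):
    """Retry transient failures, but stop immediately once the circuit opens."""
    for attempt in range(attempts):
        try:
            return breaker.call(func)
        except CircuitOpen:
            raise                                    # service is down; stop hammering it
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)    # exponential backoff

cb = SimpleBreaker(failure_threshold=2)

def flaky():
    raise TimeoutError("slow service")

try:
    with_retries(cb, flaky)
except CircuitOpen:
    print("gave up: circuit opened after repeated failures")
```

Because the breaker sees each attempt, two failed retries open the circuit and the third attempt is rejected without a network call.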

Q: Should every service call have a circuit breaker?

Yes, any remote call that could fail should have a circuit breaker. This includes HTTP calls, database queries, cache lookups, and message queue operations. The configuration may differ — a database circuit breaker might have a lower failure threshold than an external API call.
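
One way to manage many per-dependency breakers with different tuning (a sketch; the class names and threshold values are illustrative) is a small registry that hands out one breaker per remote dependency:

```python
class DependencyBreaker:
    """Minimal consecutive-failure breaker, one instance per dependency."""
    def __init__(self, failure_threshold):
        self.failure_threshold = failure_threshold
        self.failures = 0
        self.open = False

    def record(self, success):
        if success:
            self.failures = 0
            self.open = False
        else:
            self.failures += 1
            self.open = self.failures >= self.failure_threshold

class BreakerRegistry:
    """Hands out one breaker per named dependency, with per-dependency tuning."""
    DEFAULT_THRESHOLD = 5

    def __init__(self, thresholds=None):
        self.thresholds = thresholds or {}
        self.breakers = {}

    def get(self, name):
        if name not in self.breakers:
            threshold = self.thresholds.get(name, self.DEFAULT_THRESHOLD)
            self.breakers[name] = DependencyBreaker(threshold)
        return self.breakers[name]

# The database is local and should trip faster than a slow third-party API.
registry = BreakerRegistry(thresholds={"database": 2, "external-api": 10})

db = registry.get("database")
db.record(success=False)
db.record(success=False)
print(db.open)          # True: tripped after 2 failures

api = registry.get("external-api")
api.record(success=False)
print(api.open)         # False: its threshold is 10
```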

Q: How do circuit breakers work with service meshes?

Service meshes like Istio and Linkerd provide built-in circuit breaking at the infrastructure level. This means you get circuit breaking without modifying application code. Istio uses Envoy proxy's outlier detection to automatically eject unhealthy endpoints from the load balancing pool.

Q: What is the relationship between circuit breakers and bulkheads?

The bulkhead pattern isolates resources so a failure in one area does not affect another. Circuit breakers detect and stop failures. Together, they provide comprehensive resilience: bulkheads prevent resource exhaustion while circuit breakers prevent cascading failures. Both are critical patterns described in our partial failures guide.
