Circuit Breaker Pattern: Building Resilient Distributed Systems
The circuit breaker pattern prevents cascading failures in distributed systems by stopping requests to a failing service. Like an electrical circuit breaker that trips to prevent damage, a software circuit breaker detects failures and short-circuits calls to give the failing service time to recover. This pattern is essential for building fault-tolerant microservices architectures.
Why Circuit Breakers Are Essential
In a microservices architecture, a single failing service can bring down the entire system. Without a circuit breaker:
- Threads pile up waiting for timeouts from the failing service
- Resource pools (connection pools, thread pools) become exhausted
- Upstream services start failing due to resource starvation
- A localized failure cascades into a system-wide outage
The circuit breaker works alongside rate limiting and partial failure handling to create resilient systems.
Circuit Breaker States
A circuit breaker has three states:
| State | Behavior | Transition Condition |
|---|---|---|
| Closed | Normal operation. Requests pass through. Failures are counted. | Failure count exceeds threshold → Open |
| Open | All requests are immediately rejected with a fallback response. | Timeout expires → Half-Open |
| Half-Open | A limited number of test requests are allowed through. | Test succeeds → Closed; Test fails → Open |
Implementation in Python
```python
import threading
import time
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class CircuitOpenError(Exception):
    pass


class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=30,
                 half_open_max_calls=3):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.half_open_max_calls = half_open_max_calls
        self.state = State.CLOSED
        self.failure_count = 0
        self.success_count = 0
        self.last_failure_time = None
        self.half_open_calls = 0
        self.lock = threading.Lock()

    def call(self, func, *args, **kwargs):
        with self.lock:
            if self.state == State.OPEN:
                if self._should_attempt_reset():
                    self.state = State.HALF_OPEN
                    self.half_open_calls = 0
                    self.success_count = 0
                else:
                    raise CircuitOpenError("Circuit is open")
            if self.state == State.HALF_OPEN:
                if self.half_open_calls >= self.half_open_max_calls:
                    raise CircuitOpenError("Half-open limit reached")
                self.half_open_calls += 1
        # The protected call runs outside the lock so a slow request
        # cannot block other callers.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self):
        with self.lock:
            if self.state == State.HALF_OPEN:
                self.success_count += 1
                if self.success_count >= self.half_open_max_calls:
                    self._reset()
            else:
                self.failure_count = 0

    def _on_failure(self):
        with self.lock:
            self.failure_count += 1
            self.last_failure_time = time.monotonic()
            if self.state == State.HALF_OPEN:
                # A failed probe sends the circuit straight back to open.
                self.state = State.OPEN
            elif self.failure_count >= self.failure_threshold:
                self.state = State.OPEN

    def _should_attempt_reset(self):
        return (time.monotonic() - self.last_failure_time
                >= self.recovery_timeout)

    def _reset(self):
        self.state = State.CLOSED
        self.failure_count = 0
        self.success_count = 0
        self.half_open_calls = 0
```
Using the Circuit Breaker
```python
import requests

cb = CircuitBreaker(failure_threshold=5, recovery_timeout=30)

def _charge(order_id, amount):
    response = requests.post(
        "https://payment-service/charge",
        json={"order_id": order_id, "amount": amount},
        timeout=5,
    )
    # Raise on 4xx/5xx so HTTP error responses count as failures too,
    # not just timeouts and connection errors.
    response.raise_for_status()
    return response

def call_payment_service(order_id, amount):
    try:
        return cb.call(_charge, order_id, amount).json()
    except CircuitOpenError:
        return {"status": "pending", "message": "Payment service unavailable"}
    except requests.RequestException:
        return {"status": "error", "message": "Payment failed"}
```
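To make the lifecycle concrete, the sketch below drives a breaker through closed → open → half-open → closed. It uses a condensed, single-threaded version of the class above (no locking, string states, a deliberately short recovery timeout) purely for demonstration:

```python
import time

class MiniBreaker:
    """Condensed circuit breaker for demonstration (single-threaded)."""
    def __init__(self, failure_threshold=3, recovery_timeout=0.1):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.state = "closed"
        self.failures = 0
        self.opened_at = None

    def call(self, func):
        if self.state == "open":
            if time.monotonic() - self.opened_at >= self.recovery_timeout:
                self.state = "half_open"   # allow a probe request through
            else:
                raise RuntimeError("circuit open")
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.state == "half_open" or self.failures >= self.failure_threshold:
                self.state = "open"
                self.opened_at = time.monotonic()
            raise
        self.state = "closed"              # probe (or normal call) succeeded
        self.failures = 0
        return result

cb = MiniBreaker()

def flaky(fail):
    if fail:
        raise ConnectionError("service down")
    return "ok"

# Three consecutive failures trip the breaker.
for _ in range(3):
    try:
        cb.call(lambda: flaky(True))
    except ConnectionError:
        pass
assert cb.state == "open"

# While open, calls are rejected without touching the service.
try:
    cb.call(lambda: flaky(False))
except RuntimeError:
    pass
assert cb.state == "open"

# After the recovery timeout, one successful probe closes the circuit.
time.sleep(0.15)
assert cb.call(lambda: flaky(False)) == "ok"
assert cb.state == "closed"
```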
Resilience4j (Java)
Resilience4j is the modern successor to Netflix Hystrix for Java applications. It provides a lightweight, functional circuit breaker implementation:
```java
import java.time.Duration;
import java.util.function.Supplier;

import io.github.resilience4j.circuitbreaker.CallNotPermittedException;
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import io.vavr.control.Try;

CircuitBreakerConfig config = CircuitBreakerConfig.custom()
    .failureRateThreshold(50)                        // open at a 50% failure rate
    .waitDurationInOpenState(Duration.ofSeconds(30))
    .slidingWindowSize(10)                           // evaluate the last 10 calls
    .permittedNumberOfCallsInHalfOpenState(3)
    .build();

CircuitBreaker circuitBreaker = CircuitBreaker.of("paymentService", config);

Supplier<String> decoratedSupplier = CircuitBreaker
    .decorateSupplier(circuitBreaker, () -> paymentService.charge(orderId));

Try<String> result = Try.ofSupplier(decoratedSupplier)
    .recover(CallNotPermittedException.class,
             e -> "Payment service temporarily unavailable");
```
Netflix Hystrix (Legacy)
While Hystrix is now in maintenance mode, understanding it is valuable because many systems still use it:
```java
public class PaymentCommand extends HystrixCommand<PaymentResult> {
    private final String orderId;

    public PaymentCommand(String orderId) {
        super(HystrixCommandGroupKey.Factory.asKey("PaymentService"));
        this.orderId = orderId;
    }

    @Override
    protected PaymentResult run() {
        return paymentService.charge(orderId);
    }

    @Override
    protected PaymentResult getFallback() {
        // Invoked when run() throws, times out, or the circuit is open.
        return new PaymentResult("pending", "Service unavailable");
    }
}
```
Configuration Parameters
| Parameter | Description | Typical Value |
|---|---|---|
| Failure Threshold | Number or percentage of failures before opening | 50% or 5 consecutive |
| Recovery Timeout | Time to wait before trying half-open | 30-60 seconds |
| Sliding Window Size | Number of calls to evaluate | 10-100 calls |
| Half-Open Calls | Test calls allowed in half-open state | 3-5 calls |
| Slow Call Threshold | Duration to consider a call slow | 2-5 seconds |
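The count-based sliding window from the table can be sketched with a deque of recent call outcomes: the breaker opens when the failure rate over the last N calls crosses the threshold. The class and field names below are illustrative, not from any particular library:

```python
from collections import deque

class SlidingWindow:
    """Track the outcomes of the last `size` calls and compute a failure rate."""
    def __init__(self, size=10):
        self.outcomes = deque(maxlen=size)  # True = failure; old entries fall off

    def record(self, failed):
        self.outcomes.append(failed)

    def failure_rate(self):
        if not self.outcomes:
            return 0.0
        return 100.0 * sum(self.outcomes) / len(self.outcomes)

window = SlidingWindow(size=10)
for failed in [False, False, True, True, True, False, True, True, False, True]:
    window.record(failed)

print(window.failure_rate())  # 60.0 — above a 50% threshold, so the circuit would open
```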
Real-World Example: Netflix
Netflix pioneered the circuit breaker pattern in microservices. With over 1,000 microservices, a failure in one service (e.g., the recommendation engine) could cascade to the entire platform. Netflix uses circuit breakers at every service boundary:
- If the recommendation service fails, the UI shows a generic top-10 list instead
- If the personalization service is slow, the circuit opens and serves cached data
- Each microservice has its own circuit breaker configuration tuned to its SLA
This approach connects with service discovery for routing around failed instances and distributed transaction patterns for maintaining data consistency during partial outages.
Monitoring and Observability
A circuit breaker is only useful if you can observe its state. Key metrics to track:
- Circuit state transitions: Alert when a circuit opens
- Failure rate: Track the percentage of failed calls per service
- Fallback invocations: Monitor how often fallbacks are triggered
- Response time distribution: Detect slow calls before they trigger the breaker
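One lightweight way to surface these metrics is to give the breaker hooks that fire on state changes and fallback invocations. The sketch below (hook names are illustrative) accumulates counters the way an instrumented breaker would during an outage; in production each hook would increment a Prometheus or StatsD counter instead:

```python
from collections import Counter

metrics = Counter()

def on_transition(old_state, new_state):
    # Alerting would typically key off "closed->open" transitions.
    metrics[f"transition.{old_state}->{new_state}"] += 1

def on_fallback():
    metrics["fallback.invocations"] += 1

# Simulate the sequence of events an instrumented breaker reports
# during one outage-and-recovery cycle.
on_transition("closed", "open")
for _ in range(4):
    on_fallback()          # rejected calls served from fallback
on_transition("open", "half_open")
on_transition("half_open", "closed")

print(dict(metrics))
```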
Use the System Design Calculator to model the impact of circuit breaker settings on your system availability.
Frequently Asked Questions
Q: When should I use a circuit breaker vs a retry?
Use retries for transient failures (network blips, momentary timeouts). Use circuit breakers when a service is consistently failing. In practice, combine both: retry a few times, and if failures persist, the circuit breaker opens to prevent further attempts. This is part of a broader partial failure handling strategy.
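The combination described above — bounded retries inside, circuit breaker outside — can be sketched as a helper that backs off between attempts but gives up immediately once the circuit is open, since retrying a short-circuited call is pointless. `CircuitOpenError` matches the Python implementation earlier; the rest is illustrative:

```python
import time

class CircuitOpenError(Exception):
    pass

def with_retries(func, attempts=3, backoff=0.05):
    """Retry transient failures; never retry a short-circuited call."""
    for attempt in range(attempts):
        try:
            return func()
        except CircuitOpenError:
            raise                                  # circuit open: fail fast
        except Exception:
            if attempt == attempts - 1:
                raise                              # retries exhausted
            time.sleep(backoff * 2 ** attempt)     # exponential backoff

calls = []

def transient():
    calls.append(1)
    if len(calls) < 3:
        raise TimeoutError("blip")
    return "ok"

print(with_retries(transient))  # "ok" on the third attempt
print(len(calls))               # 3
```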
Q: Should every service call have a circuit breaker?
Yes, any remote call that could fail should have a circuit breaker. This includes HTTP calls, database queries, cache lookups, and message queue operations. The configuration may differ — a database circuit breaker might have a lower failure threshold than an external API call.
Q: How do circuit breakers work with service meshes?
Service meshes like Istio and Linkerd provide built-in circuit breaking at the infrastructure level. This means you get circuit breaking without modifying application code. Istio uses Envoy proxy's outlier detection to automatically eject unhealthy endpoints from the load balancing pool.
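As an illustration of mesh-level configuration (field values here are examples, not recommendations), an Istio `DestinationRule` might enable outlier detection for the payment service like this:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: payment-service
spec:
  host: payment-service
  trafficPolicy:
    outlierDetection:
      consecutive5xxErrors: 5    # eject a host after 5 consecutive 5xx responses
      interval: 10s              # how often hosts are scanned
      baseEjectionTime: 30s      # minimum time an ejected host stays out
      maxEjectionPercent: 50     # never eject more than half the pool
```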
Q: What is the relationship between circuit breakers and bulkheads?
The bulkhead pattern isolates resources so a failure in one area does not affect another. Circuit breakers detect and stop failures. Together, they provide comprehensive resilience: bulkheads prevent resource exhaustion while circuit breakers prevent cascading failures. Both are critical patterns described in our partial failures guide.