Distributed Transactions: Saga Pattern, Outbox, and Event-Driven Approaches
Distributed transactions are one of the hardest problems in system design. When a business operation spans multiple services or databases, ensuring atomicity — either all parts succeed or all parts are rolled back — becomes extraordinarily complex. This guide covers the practical patterns used in production systems, including the Saga pattern, compensating transactions, the transactional outbox, and event-driven architectures.
Why Traditional Transactions Do Not Work Across Services
Two-Phase Commit (2PC) provides atomicity but at a high cost: blocking, reduced availability, tight coupling, and poor performance. In microservices, where services are independently deployed and may use different databases, 2PC is impractical. Instead, we embrace eventual consistency and use patterns that compensate for partial failures.
The Saga Pattern
A saga is a sequence of local transactions where each step publishes an event or triggers the next step. If any step fails, compensating transactions are executed to undo the preceding steps. There are two coordination approaches:
Choreography-Based Saga
Each service listens for events and decides independently what to do next. There is no central coordinator.
# Order Saga - Choreography
#
# Step 1: Order Service creates order
#   Publishes: OrderCreated event
# Step 2: Payment Service hears OrderCreated
#   Charges the customer
#   Publishes: PaymentCompleted or PaymentFailed
# Step 3: Inventory Service hears PaymentCompleted
#   Reserves inventory
#   Publishes: InventoryReserved or InventoryFailed
# Step 4: Shipping Service hears InventoryReserved
#   Schedules shipment
#   Publishes: ShipmentScheduled
#
# If PaymentFailed:
#   Order Service hears it -> cancels order (compensating)
# If InventoryFailed:
#   Payment Service hears it -> refunds payment (compensating)
#   Order Service hears it -> cancels order (compensating)
class OrderService:
    def create_order(self, order_data):
        order = self.db.create_order(order_data)
        # Note: the database write and the publish are not atomic here;
        # the Transactional Outbox Pattern section addresses this gap.
        self.event_bus.publish("OrderCreated", {
            "order_id": order.id,
            "customer_id": order_data["customer_id"],
            "amount": order_data["total"],
            "items": order_data["items"]
        })
        return order

    def on_payment_failed(self, event):
        # Compensating transaction: cancel the order
        self.db.update_order_status(event["order_id"], "CANCELLED")
        self.notify_customer(event["order_id"], "Payment failed")

    def on_inventory_failed(self, event):
        # Compensating transaction: cancel the order
        self.db.update_order_status(event["order_id"], "CANCELLED")
        self.notify_customer(event["order_id"], "Item out of stock")


class PaymentService:
    def on_order_created(self, event):
        try:
            charge = self.payment_gateway.charge(
                event["customer_id"], event["amount"]
            )
            self.db.record_payment(event["order_id"], charge.id)
            self.event_bus.publish("PaymentCompleted", {
                "order_id": event["order_id"],
                "payment_id": charge.id
            })
        except PaymentError:  # assumed to be raised by the gateway client on a declined charge
            self.event_bus.publish("PaymentFailed", {
                "order_id": event["order_id"],
                "reason": "Payment declined"
            })

    def on_inventory_failed(self, event):
        # Compensating transaction: refund. The event may be delivered more
        # than once, so the refund must be safe to repeat.
        payment = self.db.get_payment(event["order_id"])
        self.payment_gateway.refund(payment.charge_id)
        self.db.record_refund(event["order_id"])
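To make the event flow above easier to trace, here is a minimal runnable sketch of the same choreography using an in-memory bus. `InMemoryEventBus` and `run_order_saga` are hypothetical names introduced for illustration, and the synchronous dispatch is an assumption; a production system would use a message broker with asynchronous delivery.

```python
from collections import defaultdict


class InMemoryEventBus:
    """Hypothetical synchronous bus; stands in for a real message broker."""

    def __init__(self):
        self._handlers = defaultdict(list)
        self.history = []  # event types in publish order, useful for tracing

    def subscribe(self, event_type, handler):
        self._handlers[event_type].append(handler)

    def publish(self, event_type, payload):
        self.history.append(event_type)
        for handler in list(self._handlers[event_type]):
            handler(payload)


def run_order_saga(in_stock):
    """Run the order saga once and return (event history, final state)."""
    bus = InMemoryEventBus()
    state = {"order": None, "payment": None}

    def on_order_created(event):
        # Payment Service: charge, then announce the result
        state["payment"] = "CHARGED"
        bus.publish("PaymentCompleted", event)

    def on_payment_completed(event):
        # Inventory Service: reserve stock or report failure
        bus.publish("InventoryReserved" if in_stock else "InventoryFailed", event)

    def refund_payment(event):
        # Payment Service compensation
        state["payment"] = "REFUNDED"

    def cancel_order(event):
        # Order Service compensation
        state["order"] = "CANCELLED"

    bus.subscribe("OrderCreated", on_order_created)
    bus.subscribe("PaymentCompleted", on_payment_completed)
    bus.subscribe("InventoryFailed", refund_payment)
    bus.subscribe("InventoryFailed", cancel_order)

    state["order"] = "CREATED"
    bus.publish("OrderCreated", {"order_id": 1, "amount": 50})
    return bus.history, state
```

Calling `run_order_saga(in_stock=False)` ends with the order cancelled and the payment refunded, without any component knowing the full sequence. That is the appeal of choreography, and also why the flow is harder to trace in a real deployment.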
Orchestration-Based Saga
A central orchestrator (saga coordinator) directs the saga steps and handles compensations. This provides clearer control flow and easier error handling.
import logging

log = logging.getLogger(__name__)


class OrderSagaOrchestrator:
    def __init__(self, dead_letter_queue):
        # Receives compensations that fail and need manual or automated follow-up
        self.dead_letter_queue = dead_letter_queue
        self.steps = [
            SagaStep(
                action=self.create_order,
                compensation=self.cancel_order
            ),
            SagaStep(
                action=self.process_payment,
                compensation=self.refund_payment
            ),
            SagaStep(
                action=self.reserve_inventory,
                compensation=self.release_inventory
            ),
            SagaStep(
                action=self.schedule_shipping,
                compensation=self.cancel_shipping
            ),
        ]

    def execute(self, order_data):
        saga_id = generate_saga_id()
        completed_steps = []
        for step in self.steps:
            try:
                result = step.action(saga_id, order_data)
                completed_steps.append(step)
                # Tolerate steps that return nothing
                order_data.update(result or {})
            except Exception as e:
                # Compensate in reverse order
                for completed_step in reversed(completed_steps):
                    try:
                        completed_step.compensation(saga_id, order_data)
                    except Exception as comp_error:
                        log.error(f"Compensation failed: {comp_error}")
                        self.dead_letter_queue.add(saga_id, completed_step)
                # Raise once all compensations have been attempted,
                # keeping the original error as the cause
                raise SagaFailedError(
                    f"Saga failed at step: {step.action.__name__}"
                ) from e
        return {"saga_id": saga_id, "status": "COMPLETED"}
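The orchestrator above relies on `SagaStep`, `generate_saga_id`, and `SagaFailedError`, which the article does not define. The sketch below shows one plausible shape for them, plus a small run in which the third step fails. The `name` field, the `run_saga` function, and the stub steps are assumptions added for illustration, not part of the original design.

```python
import uuid
from dataclasses import dataclass
from typing import Callable


@dataclass
class SagaStep:
    name: str                                   # assumed field, used in error messages
    action: Callable[[str, dict], dict]
    compensation: Callable[[str, dict], None]


class SagaFailedError(Exception):
    """Raised after a failed saga has attempted its compensations."""


def generate_saga_id():
    return str(uuid.uuid4())


def run_saga(steps, order_data):
    saga_id = generate_saga_id()
    context = dict(order_data)  # copy so the caller's dict is not mutated
    completed = []
    for step in steps:
        try:
            context.update(step.action(saga_id, context) or {})
            completed.append(step)
        except Exception as exc:
            # Undo completed steps, most recent first
            for done in reversed(completed):
                done.compensation(saga_id, context)
            raise SagaFailedError(f"Saga failed at step: {step.name}") from exc
    return {"saga_id": saga_id, "status": "COMPLETED"}


calls = []  # records the order in which actions and compensations run


def ok(name):
    def action(saga_id, ctx):
        calls.append(name)
        return {name: "done"}
    return action


def undo(name):
    def compensation(saga_id, ctx):
        calls.append(f"undo:{name}")
    return compensation


def out_of_stock(saga_id, ctx):
    raise RuntimeError("out of stock")


steps = [
    SagaStep("create_order", ok("create_order"), undo("create_order")),
    SagaStep("process_payment", ok("process_payment"), undo("process_payment")),
    SagaStep("reserve_inventory", out_of_stock, undo("reserve_inventory")),
]

try:
    run_saga(steps, {"customer_id": 7})
except SagaFailedError as err:
    failure = str(err)
```

After this run, `calls` shows the payment being undone before the order. The failing step itself is not compensated, because it never completed.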
Choreography vs Orchestration
| Aspect | Choreography | Orchestration |
|---|---|---|
| Coupling | Loose — services only know events | Tighter — orchestrator knows all services |
| Complexity | Distributed (harder to trace) | Centralized (easier to understand) |
| Best For | Simple sagas (2-3 steps) | Complex sagas (4+ steps) |
| Single Point of Failure | No | Yes (orchestrator) |
| Testing | Harder — need to simulate events | Easier — test orchestrator logic |
Compensating Transactions
Compensating transactions are the undo operations for each saga step. They must be idempotent because they may be retried. Important: compensating transactions are not true rollbacks — they are semantic inverses that may leave artifacts.
| Action | Compensation | Note |
|---|---|---|
| Create Order | Cancel Order | Order record remains with CANCELLED status |
| Charge Payment | Refund Payment | Refund is a new transaction, not a rollback |
| Reserve Inventory | Release Inventory | Must handle concurrent reservations |
| Send Email | Send Correction Email | Cannot unsend; send an apology instead |
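Because a compensation may be retried, running it twice must have the same effect as running it once. The sketch below illustrates the idea for the refund row of the table. `RefundLedger` is a hypothetical in-memory stand-in for a payments table, and the counter stands in for a call to a real payment gateway.

```python
class RefundLedger:
    """Hypothetical in-memory record of which orders have been refunded."""

    def __init__(self):
        self.refunded_orders = set()
        self.gateway_calls = 0  # stands in for calls to payment_gateway.refund(...)

    def refund_once(self, order_id):
        """Issue the refund only if it has not been issued before.

        Returns True if this call performed the refund, False if it was a repeat.
        """
        if order_id in self.refunded_orders:
            return False
        self.gateway_calls += 1
        self.refunded_orders.add(order_id)
        return True


ledger = RefundLedger()
first = ledger.refund_once("order-42")   # performs the refund
second = ledger.refund_once("order-42")  # retry: no second refund
```

In a real service the check and the record would need to be atomic, for example through a unique constraint on the order ID, and many payment gateways accept an idempotency key for the same purpose. Both are assumptions about the surrounding system and are not shown here.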
Transactional Outbox Pattern
A common problem with sagas: how do you atomically update the database AND publish an event? If the database update succeeds but event publishing fails, the saga is stuck. The outbox pattern solves this.
import json
import logging

log = logging.getLogger(__name__)


class OrderServiceWithOutbox:
    def create_order(self, order_data):
        with self.db.transaction() as txn:
            # Step 1: Insert the order
            order = txn.execute(
                "INSERT INTO orders (customer_id, total, status) "
                "VALUES (%s, %s, 'CREATED') RETURNING id",
                [order_data["customer_id"], order_data["total"]]
            )
            # Step 2: Insert event into outbox table (same transaction).
            # Assumes the published column defaults to FALSE.
            txn.execute(
                "INSERT INTO outbox (event_type, payload, created_at) "
                "VALUES (%s, %s, NOW())",
                ["OrderCreated", json.dumps({
                    "order_id": order.id,
                    "customer_id": order_data["customer_id"],
                    "amount": order_data["total"]
                })]
            )
            # Both inserts commit together or roll back together
        return order


class OutboxPublisher:
    """Background process that reads outbox and publishes events."""

    def poll_and_publish(self):
        events = self.db.query(
            "SELECT id, event_type, payload FROM outbox "
            "WHERE published = FALSE ORDER BY created_at LIMIT 100"
        )
        for event in events:
            try:
                self.event_bus.publish(event.event_type,
                                       json.loads(event.payload))
                # If this update fails after a successful publish, the event
                # is sent again next cycle, so delivery is at-least-once.
                self.db.execute(
                    "UPDATE outbox SET published = TRUE WHERE id = %s",
                    [event.id]
                )
            except Exception:
                # Stop here so later events are not published ahead of this
                # one; unpublished rows are retried on the next poll cycle.
                log.exception("Failed to publish outbox event %s", event.id)
                break
Event-Driven Approach with CDC
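Change Data Capture (CDC) reads the database transaction log and converts changes into events. Tools like Debezium can watch the outbox table and publish an event for each new row, which removes the need for a polling publisher. The configuration below is abbreviated: a working connector also needs database credentials, and by default the EventRouter transform expects aggregatetype and aggregateid columns, which the example outbox table would have to add or remap.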
# Debezium connector configuration for outbox CDC
{
  "name": "outbox-connector",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "orders-db",
    "database.port": "5432",
    "database.dbname": "orders",
    "table.include.list": "public.outbox",
    "transforms": "outbox",
    "transforms.outbox.type": "io.debezium.transforms.outbox.EventRouter",
    "transforms.outbox.table.field.event.type": "event_type",
    "transforms.outbox.table.field.event.payload": "payload"
  }
}
The combination of sagas, the outbox pattern, and CDC provides a robust foundation for distributed transactions in microservices. Combined with idempotent handlers, these patterns make retries safe and keep partial failures recoverable.
Frequently Asked Questions
Q: Should I use Saga or 2PC?
Use 2PC when all participants are within a trusted, low-latency network (same data center) and you need strict atomicity. Use Sagas when services are independently deployed, use different databases, or span multiple data centers. In microservices, Sagas are almost always preferred.
Q: How do I handle failed compensating transactions?
Compensating transactions should be idempotent and retriable. If retries are exhausted, send the failed compensation to a dead letter queue for manual or automated resolution. Monitor compensation failures closely, because each one indicates a saga left in a partially undone state.
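As a rough sketch of that retry-then-dead-letter flow: the function below retries a compensation with exponential backoff and records it in a list if every attempt fails. The function name, the parameters, and the use of a plain list as the dead letter queue are assumptions for illustration.

```python
import time


def compensate_with_retry(compensation, saga_id, dead_letters,
                          max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Run a compensation, retrying with exponential backoff.

    Returns True on success. After max_attempts failures, appends an entry
    to dead_letters and returns False.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            compensation(saga_id)
            return True
        except Exception as exc:
            last_error = exc
            if attempt < max_attempts:
                # Delays grow as base_delay, 2 * base_delay, 4 * base_delay, ...
                sleep(base_delay * 2 ** (attempt - 1))
    dead_letters.append({
        "saga_id": saga_id,
        "error": repr(last_error),
        "attempts": max_attempts,
    })
    return False
```

The `sleep` parameter is injectable so the backoff can be tested without waiting. The compensation itself still has to be idempotent, since an attempt may have partially succeeded before raising.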
Q: What is the difference between the outbox pattern and dual writes?
Dual writes attempt to write to both the database and the message broker independently. If either fails, you get inconsistency. The outbox pattern writes the event to the database in the same transaction as the data change, guaranteeing atomicity. A separate process then reliably publishes the event.
Q: How do I handle saga timeouts?
Implement a saga timeout watchdog that monitors active sagas. If a saga step does not complete within a configurable timeout, trigger compensating transactions. Store the saga state (step, status, timestamps) in a database so the watchdog can detect stuck sagas even after service restarts.
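A minimal sketch of such a watchdog follows. It assumes saga state is available as a list of records; in practice these would be rows in the saga state table. `SagaRecord`, the status values, and the `start_compensation` callback are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class SagaRecord:
    saga_id: str
    step: str
    status: str        # "RUNNING", "COMPLETED", or "COMPENSATING"
    updated_at: float  # epoch seconds of the last state change


def find_stuck_sagas(records, now, timeout_seconds):
    """Return running sagas whose last update is older than the timeout."""
    return [
        r for r in records
        if r.status == "RUNNING" and now - r.updated_at > timeout_seconds
    ]


def watchdog_tick(records, now, timeout_seconds, start_compensation):
    """Mark stuck sagas as compensating and trigger their compensation.

    Returns the IDs of the sagas that were timed out on this tick.
    """
    stuck = find_stuck_sagas(records, now, timeout_seconds)
    for record in stuck:
        # Update the state first so a later tick does not pick it up again
        record.status = "COMPENSATING"
        record.updated_at = now
        start_compensation(record)
    return [r.saga_id for r in stuck]
```

Passing `now` in as an argument, instead of reading the clock inside the function, keeps the timeout logic deterministic and easy to test.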