Distributed Transactions: Maintaining Consistency Across Services

Distributed transactions coordinate data changes across multiple databases, services, or microservices. They are one of the hardest problems in distributed systems — ensuring that either all participants commit their changes or none do, despite network failures, crashes, and timeouts. This guide covers 2PC, 3PC, the saga pattern, and practical approaches to managing distributed data consistency.

Why Distributed Transactions Are Hard

In a single database, ACID transactions are straightforward — the database engine handles atomicity, consistency, isolation, and durability internally. When a transaction spans multiple databases or services, there is no single entity that controls all participants. Network partitions, process crashes, and message delays can cause participants to disagree on whether a transaction should commit or abort.

Consider an e-commerce order: the order service creates the order, the payment service charges the credit card, and the inventory service reduces stock. If the payment succeeds but the inventory update fails, the system is in an inconsistent state. Distributed transactions exist to prevent this.

Two-Phase Commit (2PC)

Two-Phase Commit is the classic protocol for distributed transactions. A designated coordinator manages the process in two phases.

Phase 1: Prepare (Voting)

The coordinator sends a PREPARE message to all participants. Each participant executes the transaction locally (without committing) and responds with either VOTE_COMMIT (ready to commit) or VOTE_ABORT (cannot commit).

Phase 2: Commit/Abort (Decision)

If ALL participants voted COMMIT, the coordinator sends a COMMIT message. If ANY participant voted ABORT, the coordinator sends an ABORT message to all. Participants execute the decision and acknowledge.

-- 2PC: Pseudocode for a distributed order transaction

-- COORDINATOR:
function process_order(order):
    # Phase 1: Prepare
    order_result = order_service.prepare(order)
    payment_result = payment_service.prepare(order.payment)
    inventory_result = inventory_service.prepare(order.items)

    # Phase 2: Decide
    if all_voted_commit(order_result, payment_result, inventory_result):
        order_service.commit(order.tx_id)
        payment_service.commit(order.tx_id)
        inventory_service.commit(order.tx_id)
        return SUCCESS
    else:
        order_service.abort(order.tx_id)
        payment_service.abort(order.tx_id)
        inventory_service.abort(order.tx_id)
        return FAILURE

-- PARTICIPANT (e.g., inventory_service):
function prepare(items):
    begin_transaction()
    for item in items:
        if stock[item.product_id] >= item.quantity:
            stock[item.product_id] -= item.quantity  -- tentatively
        else:
            return VOTE_ABORT
    write_to_wal()  -- persist the decision
    return VOTE_COMMIT

function commit(tx_id):
    commit_transaction(tx_id)

function abort(tx_id):
    rollback_transaction(tx_id)

2PC Problems

Problem	Description	Impact
Blocking	If the coordinator crashes after Phase 1, participants are stuck holding locks	Resources locked indefinitely until coordinator recovers
Single point of failure	The coordinator is critical — its failure blocks everyone	Availability depends on coordinator uptime
Latency	Requires multiple network round trips	Slower than local transactions
Network partition	Participants may not receive the commit/abort decision	Inconsistent state possible

Three-Phase Commit (3PC)

3PC adds a PRE-COMMIT phase between the prepare and commit phases to reduce the blocking problem. After all participants vote to commit, the coordinator sends a PRE-COMMIT message. Only after all acknowledge PRE-COMMIT does the coordinator send the final COMMIT.

The improvement: if the coordinator crashes after sending PRE-COMMIT, any participant can take over and complete the transaction (because all participants know the decision). However, 3PC is still vulnerable to network partitions and is rarely used in practice because it adds complexity without fully solving the problem.

The Saga Pattern

The saga pattern is the most practical approach for distributed transactions in microservices architectures. Instead of trying to make all operations atomic, a saga breaks the transaction into a sequence of local transactions, each with a compensating transaction that undoes its effect if a later step fails.

Choreography-Based Saga

In choreography, each service publishes events and listens for events from other services. There is no central coordinator — services react to events independently.

// Choreography Saga: E-commerce order flow

// Step 1: Order Service creates order
OrderService:
    createOrder(order) -> publish("OrderCreated", {orderId, items, userId})

// Step 2: Payment Service hears OrderCreated, charges payment
PaymentService:
    on("OrderCreated") -> chargeCard(order) -> publish("PaymentCompleted", {orderId})
    on charge failure -> publish("PaymentFailed", {orderId})

// Step 3: Inventory Service hears PaymentCompleted, reserves stock
InventoryService:
    on("PaymentCompleted") -> reserveStock(items) -> publish("StockReserved", {orderId})
    on reserve failure -> publish("StockReservationFailed", {orderId})

// Step 4: Shipping Service hears StockReserved, schedules shipment
ShippingService:
    on("StockReserved") -> scheduleShipment(order) -> publish("OrderShipped", {orderId})

// COMPENSATIONS (when something fails):
PaymentService:
    on("StockReservationFailed") -> refundPayment(orderId) -> publish("PaymentRefunded")

OrderService:
    on("PaymentFailed") -> cancelOrder(orderId)
    on("PaymentRefunded") -> cancelOrder(orderId)

Orchestration-Based Saga

In orchestration, a central saga orchestrator controls the flow. It tells each service what to do and handles failures by calling compensating transactions.

// Orchestration Saga: Central coordinator manages the flow

class OrderSagaOrchestrator:
    def execute(order):
        try:
            # Step 1: Create order
            order_id = order_service.create_order(order)

            # Step 2: Process payment
            payment_id = payment_service.charge(order.payment_info, order.total)

            # Step 3: Reserve inventory
            inventory_service.reserve(order.items)

            # Step 4: Schedule shipping
            shipping_service.schedule(order_id, order.shipping_address)

            return OrderResult(status="completed", order_id=order_id)

        except InventoryException:
            # Compensate: Refund payment, cancel order
            payment_service.refund(payment_id)
            order_service.cancel(order_id)
            return OrderResult(status="failed", reason="out_of_stock")

        except PaymentException:
            # Compensate: Cancel order
            order_service.cancel(order_id)
            return OrderResult(status="failed", reason="payment_declined")

        except ShippingException:
            # Compensate: Release inventory, refund, cancel
            inventory_service.release(order.items)
            payment_service.refund(payment_id)
            order_service.cancel(order_id)
            return OrderResult(status="failed", reason="shipping_unavailable")

Choreography vs Orchestration

Aspect	Choreography	Orchestration
Coordination	Decentralized (event-driven)	Centralized (orchestrator)
Coupling	Loose (services are independent)	Tighter (orchestrator knows all services)
Visibility	Hard to see full flow	Easy to see in orchestrator
Complexity	Simple for few services	Better for complex workflows
Failure handling	Each service handles its own	Orchestrator manages globally
Best for	2-4 services, simple flows	5+ services, complex business logic

Compensation Transactions

Compensating transactions undo the effects of a completed step. They are the key mechanism in sagas for maintaining consistency. Unlike a database ROLLBACK, compensations are business-level operations that may not perfectly reverse the original action.

// Compensation examples:
// Original: charge_credit_card($100) -> Compensation: refund_credit_card($100)
// Original: reserve_inventory(5 items) -> Compensation: release_inventory(5 items)
// Original: send_confirmation_email() -> Compensation: send_cancellation_email()
// Original: create_shipping_label() -> Compensation: cancel_shipping_label()

// Note: Some actions cannot be perfectly compensated
// Original: send_notification_to_user() -> Compensation: ??? (can't unsend)
// Solution: Design compensations to handle this (send correction notification)

Idempotency

In distributed systems, messages can be delivered more than once (network retries, duplicate events). Idempotency ensures that processing the same message multiple times produces the same result as processing it once. This is critical for reliable saga execution.

-- Idempotency: Use an idempotency key to prevent duplicate processing
CREATE TABLE processed_events (
    idempotency_key VARCHAR(255) PRIMARY KEY,
    result JSONB,
    processed_at TIMESTAMP DEFAULT NOW()
);

-- Before processing a payment:
function process_payment(idempotency_key, payment):
    # Check if already processed
    existing = SELECT result FROM processed_events
               WHERE idempotency_key = $idempotency_key

    if existing:
        return existing.result  # Return cached result

    # Process the payment
    result = charge_credit_card(payment)

    # Record the result
    INSERT INTO processed_events (idempotency_key, result)
    VALUES ($idempotency_key, $result)

    return result

TCC (Try-Confirm-Cancel) Pattern

TCC is a variation of the saga pattern with explicit reservation. Each service implements three operations: Try (reserve resources tentatively), Confirm (finalize the reservation), and Cancel (release the reservation).

// TCC Pattern for inventory
class InventoryTCC:
    def try_reserve(order_id, items):
        # Move items from available to reserved
        for item in items:
            available_stock[item.id] -= item.quantity
            reserved_stock[order_id][item.id] += item.quantity
        return "reserved"

    def confirm(order_id):
        # Remove from reserved (finalize)
        del reserved_stock[order_id]
        return "confirmed"

    def cancel(order_id):
        # Move items back from reserved to available
        for item_id, quantity in reserved_stock[order_id]:
            available_stock[item_id] += quantity
        del reserved_stock[order_id]
        return "cancelled"

// TCC handles the case where a user checks stock, but by the time
// payment processes, the stock is gone. The Try phase reserves it.

Real-World Challenges

Timeout Handling

What happens when a service does not respond? In 2PC, timeouts can cause blocking. In sagas, set timeout thresholds for each step and trigger compensation after the timeout expires. Always err on the side of compensation — it is better to refund a payment than to charge someone twice.

Partial Failures

What if the compensation itself fails? Implement retry logic with exponential backoff for compensations. Use a dead letter queue for compensations that repeatedly fail, and alert operators for manual intervention. This is one reason distributed transactions are hard — failure handling has many layers.

Ordering Guarantees

Events may arrive out of order. A "PaymentRefunded" event might arrive before "PaymentCompleted" due to network delays. Design your event handlers to be order-independent, or use sequence numbers and event sourcing to reconstruct the correct order.

Event Sourcing Connection

Event sourcing pairs naturally with sagas. Instead of storing the current state, you store the sequence of events that led to the current state. This provides a complete audit trail, enables replaying events to reconstruct state, and makes saga compensation easier — you can see exactly what happened and what needs to be undone.

For related topics, see our guides on ACID vs BASE properties, distributed databases, event-driven architecture, and microservices design patterns.

Frequently Asked Questions

Should I use 2PC or sagas?

Use sagas for microservices architectures. 2PC requires tight coupling between services (all must support the same transaction protocol) and has poor availability characteristics. Sagas are more loosely coupled, more resilient, and the standard approach in modern distributed systems. Use 2PC only within a single database or between closely coupled database systems.

How do I test distributed transactions?

Test failure scenarios explicitly: simulate network timeouts, service crashes at each step, duplicate messages, and out-of-order events. Use chaos engineering tools (Chaos Monkey, Toxiproxy) to inject failures. Integration tests should verify that compensations correctly undo effects. Contract tests ensure services agree on message formats.

What is the difference between a saga and eventual consistency?

A saga is a mechanism for coordinating distributed operations with compensation on failure. Eventual consistency is a property — the system will converge to a consistent state given enough time. Sagas achieve eventual consistency: during the saga execution, the system may be temporarily inconsistent (order created but not yet paid), but the saga ensures it reaches a consistent final state (either fully completed or fully compensated).

How do I handle distributed transactions with a mix of databases?

When crossing database boundaries (e.g., PostgreSQL and DynamoDB), 2PC is not possible because the databases do not share a transaction protocol. Use the saga pattern with application-level coordination. Each database handles its own local transaction, and the saga orchestrator manages the overall flow with compensations.

Can I avoid distributed transactions entirely?

Sometimes yes. Design your service boundaries so that related data lives in the same service and database. If an order and its payment are always processed together, keep them in the same service. This is the principle of "transaction boundary alignment" — align your microservice boundaries with your transaction boundaries to avoid distributed transactions where possible.

Distributed Transactions: Maintaining Consistency Across Services