Distributed Locks: Coordination in Distributed Systems
Distributed locks provide mutual exclusion across multiple processes or machines. Unlike local locks (mutexes) that work within a single process, distributed locks coordinate access to shared resources across a network. They are essential for preventing race conditions in distributed systems where multiple nodes may attempt to modify the same resource simultaneously.
Why Distributed Locks Are Needed
Consider these scenarios where distributed locks are critical:
- Preventing double-processing: Two workers picking up the same job from a queue
- Leader election: Only one node should act as the leader at a time
- Resource reservation: Only one user should be able to book the last seat on a flight
- Cron job coordination: Ensuring a scheduled task runs on exactly one node
- Inventory management: Preventing overselling when stock is low
Properties of a Good Distributed Lock
| Property | Description |
|---|---|
| Mutual Exclusion | At most one client holds the lock at any time |
| Deadlock-Free | Lock is eventually released even if the holder crashes (via TTL) |
| Fault Tolerant | Lock service remains available if some nodes fail |
| Owner Identity | Only the lock holder can release the lock |
Redis-Based Distributed Lock
The simplest distributed lock uses Redis SET with NX (set if not exists) and EX (expiration):
```python
import redis
import uuid
import time

class RedisLock:
    def __init__(self, redis_client, lock_name, ttl=10):
        self.redis = redis_client
        self.lock_name = f"lock:{lock_name}"
        self.ttl = ttl
        self.lock_value = str(uuid.uuid4())

    def acquire(self, timeout=30):
        end_time = time.monotonic() + timeout
        while time.monotonic() < end_time:
            if self.redis.set(self.lock_name, self.lock_value,
                              nx=True, ex=self.ttl):
                return True
            time.sleep(0.1)
        return False

    def release(self):
        # Lua script ensures atomic check-and-delete
        lua_script = """
        if redis.call("get", KEYS[1]) == ARGV[1] then
            return redis.call("del", KEYS[1])
        else
            return 0
        end
        """
        return self.redis.eval(lua_script, 1,
                               self.lock_name, self.lock_value)

    def __enter__(self):
        if not self.acquire():
            raise TimeoutError("Could not acquire lock")
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        self.release()

# Usage
r = redis.Redis()
with RedisLock(r, "order:12345") as lock:
    process_order("12345")
```
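To see why release must be an atomic check-and-delete rather than a plain DEL, here is a pure-Python model of the failure the Lua script prevents (a dict stands in for Redis; the names are illustrative):

```python
# Simulate the Redis key space with a plain dict (illustrative stand-in).
store = {}

def naive_release(lock_name):
    # Deletes unconditionally - dangerous if our lock already expired
    store.pop(lock_name, None)

def safe_release(lock_name, lock_value):
    # Mirrors the Lua script: delete only if we still own the lock
    if store.get(lock_name) == lock_value:
        del store[lock_name]
        return 1
    return 0

# Client A's lock expired and client B acquired it in the meantime.
store["lock:order"] = "client-b-uuid"

# A's naive release would delete B's lock; the safe release refuses.
assert safe_release("lock:order", "client-a-uuid") == 0
assert store["lock:order"] == "client-b-uuid"
```
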
Redis Redlock Algorithm
The single-instance Redis lock fails if the Redis node crashes. Redlock, designed by Salvatore Sanfilippo, uses multiple independent Redis instances for fault tolerance:
```python
import time
import uuid
import redis

class Redlock:
    def __init__(self, redis_instances, ttl=10):
        self.instances = redis_instances  # List of independent Redis clients
        self.ttl = ttl                    # TTL in seconds
        self.quorum = len(redis_instances) // 2 + 1
        self.clock_drift_factor = 0.01

    def acquire(self):
        lock_value = str(uuid.uuid4())
        start_time = time.monotonic()
        successful = 0
        for instance in self.instances:
            try:
                if instance.set("mylock", lock_value,
                                nx=True, px=int(self.ttl * 1000)):
                    successful += 1
            except redis.RedisError:
                pass
        elapsed = time.monotonic() - start_time
        drift = self.ttl * self.clock_drift_factor + 0.002
        validity_time = self.ttl - elapsed - drift
        if successful >= self.quorum and validity_time > 0:
            return lock_value, validity_time
        # Failed - release on all instances
        for instance in self.instances:
            try:
                self._release_instance(instance, lock_value)
            except redis.RedisError:
                pass
        return None, 0

    def _release_instance(self, instance, lock_value):
        # Same atomic check-and-delete as the single-instance lock
        lua_script = """
        if redis.call("get", KEYS[1]) == ARGV[1] then
            return redis.call("del", KEYS[1])
        else
            return 0
        end
        """
        instance.eval(lua_script, 1, "mylock", lock_value)
```
Redlock Steps:
- Get the current time in milliseconds
- Try to acquire the lock on N independent Redis instances sequentially
- Calculate the elapsed time. The lock is considered acquired if a majority of instances (N/2+1) granted it and the remaining validity time (TTL minus elapsed time minus a clock-drift allowance) is positive
- If the lock was not acquired, release it on all instances
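The quorum and validity-time arithmetic in the last two steps can be checked in isolation (a sketch; the numbers are illustrative):

```python
def redlock_validity(n_instances, n_acquired, ttl, elapsed,
                     clock_drift_factor=0.01):
    """Return the remaining validity time if quorum was reached and the
    lock is still valid, else None."""
    quorum = n_instances // 2 + 1
    drift = ttl * clock_drift_factor + 0.002
    validity = ttl - elapsed - drift
    if n_acquired >= quorum and validity > 0:
        return validity
    return None

# 5 instances, 3 grants, TTL 10s, 0.05s spent acquiring -> lock acquired
assert redlock_validity(5, 3, ttl=10, elapsed=0.05) is not None
# Only 2 of 5 grants -> no quorum, lock not acquired
assert redlock_validity(5, 2, ttl=10, elapsed=0.05) is None
# Quorum reached, but acquisition took longer than the TTL -> invalid
assert redlock_validity(5, 3, ttl=10, elapsed=11.0) is None
```
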
ZooKeeper Distributed Locks
Apache ZooKeeper provides distributed locks through its ephemeral sequential znodes:
```python
from kazoo.client import KazooClient
from kazoo.recipe.lock import Lock

zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")
zk.start()

lock = Lock(zk, "/locks/order-processing")

# Blocking acquire with timeout
if lock.acquire(timeout=30):
    try:
        process_order("12345")
    finally:
        lock.release()
else:
    print("Could not acquire lock")
```
ZooKeeper locks work by creating ephemeral sequential znodes. The client with the lowest sequence number holds the lock. Others watch the znode immediately before theirs, creating an efficient notification chain.
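That ordering scheme can be sketched in plain Python (the sequence numbers are illustrative; real znode names look like lock-0000000042):

```python
def lock_order(sequence_numbers):
    """Given each client's znode sequence number, return who holds the
    lock and which znode each waiting client watches."""
    ordered = sorted(sequence_numbers)
    holder = ordered[0]  # lowest sequence number holds the lock
    # Each waiter watches only the znode immediately before its own,
    # so a release notifies exactly one client (no thundering herd).
    watches = {ordered[i]: ordered[i - 1] for i in range(1, len(ordered))}
    return holder, watches

holder, watches = lock_order([42, 40, 41])
assert holder == 40
assert watches == {41: 40, 42: 41}
```
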
Fencing Tokens
A critical problem with distributed locks is that a lock holder might be paused (GC pause, network delay) and continue executing after the lock has expired and been acquired by another client. Fencing tokens solve this:
```python
import uuid

class FencedLock:
    def __init__(self, redis_client, lock_name, ttl=10):
        self.redis = redis_client
        self.lock_name = lock_name
        self.ttl = ttl

    def acquire(self):
        lock_value = str(uuid.uuid4())
        if self.redis.set(f"lock:{self.lock_name}", lock_value,
                          nx=True, ex=self.ttl):
            # INCR is atomic, so each successful acquisition receives
            # a strictly increasing fencing token
            token = self.redis.incr(f"fence:{self.lock_name}")
            return lock_value, token
        return None, None

# The storage system checks the fencing token
# (`storage` and StaleTokenError are illustrative)
def write_to_storage(key, value, fencing_token):
    current_token = storage.get_fencing_token(key)
    if fencing_token > current_token:
        storage.write(key, value, fencing_token)
    else:
        raise StaleTokenError("Fencing token is outdated")
```
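The GC-pause scenario plays out like this in a minimal in-memory stand-in for the storage service (FencedStore and the token values are illustrative):

```python
class FencedStore:
    """In-memory stand-in for token-checking storage; a real system
    would persist the highest token seen per key."""
    def __init__(self):
        self.data = {}
        self.tokens = {}

    def write(self, key, value, token):
        # Reject any write carrying a token we have already seen
        if token <= self.tokens.get(key, 0):
            raise ValueError("stale fencing token")
        self.data[key] = value
        self.tokens[key] = token

store = FencedStore()
# Client B acquired the lock after A's expired, receiving token 34
store.write("order:12345", "processed-by-B", token=34)
try:
    # Client A resumes from a long pause still holding the older token 33
    store.write("order:12345", "processed-by-A", token=33)
except ValueError:
    pass  # A's stale write is fenced off
assert store.data["order:12345"] == "processed-by-B"
```
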
Comparison of Distributed Lock Implementations
| Feature | Redis (Single) | Redlock | ZooKeeper | etcd |
|---|---|---|---|---|
| Consistency | Weak | Better | Strong (ZAB) | Strong (Raft) |
| Performance | Very High | High | Moderate | Moderate |
| Fault Tolerance | Single point of failure | Tolerates minority failures | Tolerates minority failures | Tolerates minority failures |
| Auto-Release | TTL-based | TTL-based | Session-based | Lease-based |
Common Pitfalls
- Clock skew: Redlock's safety depends on bounded clock drift across Redis nodes. A large clock jump (e.g., an NTP step or a resumed VM) can expire a lock early and let multiple clients hold it simultaneously.
- Long GC pauses: A client might pause after acquiring the lock, resume after it expires, and corrupt shared state. Use fencing tokens.
- Forgetting to release: Always use try/finally or context managers to ensure lock release.
- Too-short TTL: If the TTL is shorter than the critical section, the lock expires while work is in progress.
- No retry backoff: Tight retry loops on lock acquisition create thundering herd problems.
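For the last pitfall, the usual remedy is exponential backoff with jitter between acquisition attempts (a sketch; the constants are illustrative):

```python
import random

def backoff_delays(base=0.05, cap=2.0, attempts=6):
    """Full-jitter exponential backoff: before retry i, sleep a random
    amount between 0 and min(cap, base * 2**i)."""
    return [random.uniform(0, min(cap, base * 2 ** i))
            for i in range(attempts)]

# Replace the fixed time.sleep(0.1) in the acquire loop with these
# delays so competing clients spread out instead of retrying in lockstep.
delays = backoff_delays()
assert len(delays) == 6
assert all(0 <= d <= 2.0 for d in delays)
```
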
For related patterns, see idempotency as an alternative to locking, and distributed transactions for coordinating multi-step operations.
Frequently Asked Questions
Q: Should I use Redis or ZooKeeper for distributed locks?
Use Redis when performance is critical and you can tolerate a small window of unsafe behavior (e.g., during Redis failover). Use ZooKeeper or etcd when correctness is paramount — they provide stronger consistency guarantees through consensus protocols.
Q: Is Redlock safe?
There is a famous debate between Martin Kleppmann and Salvatore Sanfilippo about Redlock's safety. Kleppmann argues that Redlock is not safe because it depends on timing assumptions. For correctness-critical applications, use a consensus-based system like ZooKeeper or etcd, and always use fencing tokens.
Q: How do I handle lock renewal for long-running operations?
Implement a background thread that periodically extends the lock TTL before it expires. This is sometimes called a "lock watchdog." The Redisson library for Java implements this pattern automatically.
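A minimal watchdog sketch (the extend callback, which in practice would issue something like a Redis PEXPIRE on the lock key, is left as an assumption):

```python
import threading

class LockWatchdog:
    """Periodically calls `extend` to renew a lock's TTL until stopped."""
    def __init__(self, extend, interval):
        self._extend = extend      # hypothetical callable that renews the TTL
        self._interval = interval  # should be well under the lock TTL
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        # Event.wait doubles as an interruptible sleep between renewals
        while not self._stop.wait(self._interval):
            self._extend()

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop.set()
        self._thread.join()
```

Start the watchdog right after acquiring the lock and stop it in the same finally block that releases the lock, so renewal never outlives ownership.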
Q: Can I avoid distributed locks entirely?
Sometimes. Use idempotent operations with optimistic concurrency control (compare-and-swap) when possible. Use vector clocks or CRDTs for conflict-free concurrent updates. Locks are necessary when operations must be strictly serialized.
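For simple updates, compare-and-swap can replace a lock entirely (an in-memory sketch; a real store would expose a conditional-write API):

```python
class VersionedCell:
    """A value with a version; a write succeeds only if the caller
    read the latest version (optimistic concurrency control)."""
    def __init__(self, value):
        self.value, self.version = value, 0

    def read(self):
        return self.value, self.version

    def compare_and_swap(self, expected_version, new_value):
        if self.version != expected_version:
            return False  # another writer got there first; caller retries
        self.value = new_value
        self.version += 1
        return True

cell = VersionedCell(10)          # e.g., 10 items left in stock
value, version = cell.read()
assert cell.compare_and_swap(version, value - 1)      # first writer wins
assert not cell.compare_and_swap(version, value - 2)  # stale version loses
assert cell.read() == (9, 1)
```
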