Cache Invalidation: Solving the Hardest Problem in Computer Science
Phil Karlton famously said, "There are only two hard things in Computer Science: cache invalidation and naming things." While the quote is often shared as a joke, cache invalidation genuinely is one of the trickiest challenges in system design. Get it wrong, and your users see stale data, your systems become inconsistent, and debugging becomes a nightmare.
Cache invalidation is the process of removing or updating cached data when the underlying source data changes. This guide covers every major invalidation strategy, their trade-offs, and practical code examples to help you choose the right approach for your system.
Why Is Cache Invalidation So Hard?
The fundamental tension is between performance and consistency. Caching exists to avoid expensive operations, but the moment the source data changes, the cached copy becomes stale. The challenges multiply in distributed systems:
- Multiple cache layers: Browser cache, CDN, application cache, database query cache — which ones need invalidation?
- Race conditions: Two concurrent writes can leave the cache in an inconsistent state.
- Distributed coordination: Invalidating across multiple cache nodes requires reliable messaging.
- Cascading dependencies: Updating a user's name might require invalidating their profile cache, their comments cache, their friends' feed caches, and more.
Invalidation Strategies
1. TTL-Based Invalidation (Time to Live)
The simplest approach: every cached entry has an expiration time. After the TTL expires, the entry is automatically removed, and the next request fetches fresh data from the source.
```
# Redis TTL commands
SET user:1001 '{"name":"Alice","role":"admin"}' EX 300     # Expires in 5 minutes
SETEX product:5001 3600 '{"name":"Widget","price":29.99}'  # Expires in 1 hour

# Check remaining TTL
TTL user:1001   # Returns seconds remaining (e.g., 287)
PTTL user:1001  # Returns milliseconds remaining

# Update TTL without changing value
EXPIRE user:1001 600  # Reset to 10 minutes

# Remove expiration (make persistent)
PERSIST user:1001
```
Choosing TTL values depends on your tolerance for staleness:
| Data Type | Suggested TTL | Reasoning |
|---|---|---|
| User session | 15-30 minutes | Security — limit exposure window |
| Product catalog | 1-24 hours | Changes infrequently, high read volume |
| Social media feed | 30-60 seconds | Users expect near-real-time updates |
| Configuration/feature flags | 1-5 minutes | Balance between freshness and reducing config service load |
| DNS records | 5 minutes - 48 hours | DNS propagation latency is expected |
Pros: Dead simple, self-healing (stale data eventually disappears), no coordination needed.
Cons: Data can be stale for up to the full TTL duration, not suitable when freshness is critical.
2. Event-Based Invalidation
Instead of waiting for TTL expiry, the system actively invalidates cache entries when the source data changes. This is typically implemented using a publish/subscribe mechanism or message queue.
```python
# Event-based invalidation with Redis Pub/Sub
import redis

# decode_responses=True so message payloads arrive as str, not bytes
redis_client = redis.Redis(decode_responses=True)

# Publisher (runs when data changes)
def update_product(product_id, new_data):
    # 1. Update database
    database.update("products", product_id, new_data)
    # 2. Publish invalidation event
    redis_client.publish("cache:invalidate", f"product:{product_id}")

# Subscriber (runs on each cache-holding server)
def listen_for_invalidations():
    pubsub = redis_client.pubsub()
    pubsub.subscribe("cache:invalidate")
    for message in pubsub.listen():
        if message["type"] == "message":
            cache_key = message["data"]
            local_cache.delete(cache_key)   # clear this server's in-process cache
            redis_client.delete(cache_key)  # clear the shared Redis entry
            print(f"Invalidated: {cache_key}")
```
Pros: Near-real-time invalidation, data freshness is excellent.
Cons: Added complexity, requires reliable messaging infrastructure, risk of missed events.
3. Write-Through Invalidation
In the write-through pattern, every write operation updates both the cache and the database simultaneously. The cache is always consistent with the database.
```python
def write_through_update(key, value):
    # Write to the database first, then the cache.
    # Note: without distributed transactions this is not truly atomic.
    try:
        # Step 1: Write to database
        database.update(key, value)
        # Step 2: Update cache with new value
        cache.set(key, value, ttl=3600)
        return True
    except DatabaseError:
        # If the DB write fails, remove any stale cache entry
        cache.delete(key)
        raise
```
Pros: Strong consistency between cache and database, simple mental model.
Cons: Higher write latency (must write to both), not truly atomic without distributed transactions.
4. Write-Behind (Write-Back) Invalidation
The write-behind pattern writes to the cache immediately and asynchronously writes to the database later. This provides low write latency but risks data loss if the cache fails before the database write completes.
Pros: Very fast writes, batching reduces database load.
Cons: Risk of data loss, eventual consistency, complex implementation.
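A minimal write-behind sketch, using a plain dict as a stand-in for the database and an in-process queue as the write buffer (a real implementation would use a persistent queue and a background flush worker):

```python
import queue
import threading

class WriteBehindCache:
    """Write-behind sketch: writes land in the in-memory cache immediately
    and are flushed to the 'database' (a dict stand-in here) later."""

    def __init__(self, database):
        self.cache = {}
        self.database = database        # stand-in for a real DB
        self.pending = queue.Queue()    # buffered writes awaiting flush
        self.lock = threading.Lock()

    def set(self, key, value):
        with self.lock:
            self.cache[key] = value     # fast path: cache only
        self.pending.put((key, value))  # DB write is deferred

    def get(self, key):
        with self.lock:
            if key in self.cache:
                return self.cache[key]
        return self.database.get(key)

    def flush(self):
        """Drain buffered writes to the database.
        In practice this runs periodically on a background thread."""
        while not self.pending.empty():
            key, value = self.pending.get()
            self.database[key] = value
```

Anything still sitting in `pending` when the process dies is lost, which is exactly the data-loss risk noted above.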
5. Version-Based Invalidation
Instead of deleting cache entries, you change the cache key by appending a version number. When data changes, increment the version — old cache entries become orphaned and are eventually removed by the cache's eviction policy (e.g., LRU) or their TTL.
```python
# Version-based cache keys
def get_product(product_id):
    version = cache.get(f"product_version:{product_id}") or "1"
    cache_key = f"product:{product_id}:v{version}"
    cached = cache.get(cache_key)
    if cached:
        return cached
    product = database.get_product(product_id)
    cache.set(cache_key, product, ttl=86400)
    return product

def update_product(product_id, new_data):
    database.update_product(product_id, new_data)
    # Simply increment version — old cache entry becomes orphaned
    cache.incr(f"product_version:{product_id}")
```
6. Tag-Based Invalidation
Group related cache entries under tags, then invalidate all entries with a specific tag at once. This is powerful for invalidating cascading dependencies.
```python
# Tag-based invalidation
def cache_with_tags(key, value, tags, ttl=3600):
    cache.set(key, value, ttl=ttl)
    for tag in tags:
        cache.sadd(f"tag:{tag}", key)  # Add key to tag set

def invalidate_tag(tag):
    keys = cache.smembers(f"tag:{tag}")
    if keys:
        cache.delete(*keys)        # Delete all tagged entries
    cache.delete(f"tag:{tag}")     # Clean up tag set

# Usage
cache_with_tags("user:1001:profile", profile_data, ["user:1001"])
cache_with_tags("user:1001:orders", orders_data, ["user:1001", "orders"])

# Invalidate everything related to user 1001
invalidate_tag("user:1001")
```
Comparison of Invalidation Strategies
| Strategy | Consistency | Complexity | Performance Impact | Best For |
|---|---|---|---|---|
| TTL-Based | Eventual (up to TTL) | Low | None | General purpose, low-stakes data |
| Event-Based | Near-real-time | High | Minimal (async) | High-freshness requirements |
| Write-Through | Strong | Medium | Higher write latency | Read-heavy, consistency-critical |
| Write-Behind | Eventual | High | Fastest writes | Write-heavy, tolerates data loss risk |
| Version-Based | Strong (new reads) | Low-Medium | Extra key lookup | Avoiding explicit deletes |
| Tag-Based | Strong (on invalidation) | Medium | Batch delete overhead | Complex dependency graphs |
Common Pitfalls and Race Conditions
The Delete-Then-Update Race
A classic race condition occurs when one process reads stale data from the database just before another process completes a write:
```
# Dangerous race condition
# Thread A: delete cache, then update DB
# Thread B: read from DB (gets OLD value), write to cache
#
# Timeline:
# T1: Thread A deletes cache key "user:1001"
# T2: Thread B cache miss, reads OLD value from DB
# T3: Thread A updates DB with NEW value
# T4: Thread B writes OLD value to cache
#
# Result: Cache has STALE data, DB has FRESH data
```
Solutions include using a short TTL as a safety net, adding a brief delay before cache repopulation, or using distributed locks.
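The lock-based approach can be sketched as follows. A process-local `threading.Lock` stands in for a real distributed lock (which would typically be a Redis `SET NX` lock or similar), and plain dicts stand in for the database and cache:

```python
import threading

# Stand-ins: in production the lock would be distributed (e.g. Redis SET NX),
# and db/cache would be real stores, not dicts.
db = {"user:1001": "old"}
cache = {}
lock = threading.Lock()

def update(key, value):
    # Writer holds the lock across the DB write and the cache delete,
    # so no reader can repopulate the cache from a half-updated state.
    with lock:
        db[key] = value
        cache.pop(key, None)  # invalidate only after the DB is updated

def read(key):
    if key in cache:
        return cache[key]
    with lock:  # cache repopulation is serialized against writers
        if key in cache:      # another reader may have filled it meanwhile
            return cache[key]
        value = db[key]
        cache[key] = value
        return value
```

Because the reader cannot repopulate the cache while a writer holds the lock, the T2/T4 interleaving above becomes impossible, at the cost of serializing misses.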
Thundering Herd on Expiry
When a popular cache entry expires, hundreds of concurrent requests all experience a cache miss simultaneously and flood the database. This is the cache stampede problem. Solutions include probabilistic early expiration and request coalescing.
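Probabilistic early expiration can be sketched in a few lines. Each reader independently rolls a die weighted by how close the entry is to expiry and how costly a recompute is; with high probability exactly one request refreshes the entry before it actually expires (this follows the "XFetch" idea; the parameter names below are illustrative):

```python
import math
import random

def should_refresh_early(ttl_remaining, recompute_cost, beta=1.0):
    """Decide whether this reader should recompute the entry now,
    slightly before its real expiry, to avoid a stampede at T=0.

    ttl_remaining   -- seconds until the cached entry expires
    recompute_cost  -- seconds a recompute typically takes
    beta            -- >1 favors earlier refresh, <1 later
    """
    # -log(U) for U in (0, 1] is an exponential random variable, so the
    # refresh probability rises sharply as ttl_remaining approaches zero
    # and scales with the cost of recomputing.
    return recompute_cost * beta * -math.log(1.0 - random.random()) >= ttl_remaining
```

A cache read would call this on every hit and, when it returns `True`, recompute and re-set the entry while serving the still-valid cached value to everyone else.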
Real-World Invalidation Strategies
E-Commerce Product Catalog: Use TTL-based invalidation with a 1-hour TTL as the baseline. When an admin updates a product, publish an event-based invalidation to immediately refresh the cache. This hybrid approach provides both eventual consistency guarantees and real-time updates when needed.
Social Media Feeds: Use short TTLs (30-60 seconds) combined with write-through for posts. When a user creates a new post, update their followers' feed caches via fan-out. For viral content, use hot key strategies to avoid overwhelming a single cache node.
User Sessions: Use Redis with sliding TTL expiration. Each time a user makes a request, refresh the TTL using the EXPIRE command. Sessions that are idle for the TTL duration are automatically cleaned up.
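The sliding-expiration logic can be sketched in memory; with Redis the same pattern is a `GET` followed by `EXPIRE` (or a single `GETEX key EX ttl` on Redis 6.2+). The dict store and the explicit `now` parameter are stand-ins for illustration:

```python
import time

SESSION_TTL = 1800  # 30 minutes

sessions = {}  # session_id -> (data, expires_at); dict stands in for Redis

def create_session(session_id, data, now=None):
    now = time.time() if now is None else now
    sessions[session_id] = (data, now + SESSION_TTL)

def touch_session(session_id, now=None):
    """Called on every request: return session data and slide the TTL."""
    now = time.time() if now is None else now
    entry = sessions.get(session_id)
    if entry is None or entry[1] <= now:   # missing or idle past the TTL
        sessions.pop(session_id, None)
        return None
    data, _ = entry
    sessions[session_id] = (data, now + SESSION_TTL)  # reset the expiry
    return data
```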
Frequently Asked Questions
What happens if my invalidation event is lost?
This is why most systems use TTL as a safety net even with event-based invalidation. If an invalidation event is lost, the stale data will still expire when the TTL runs out. For critical data, consider using a reliable message queue with at-least-once delivery guarantees instead of fire-and-forget pub/sub.
Should I delete the cache entry or update it with the new value?
Deleting is generally safer and simpler. Updating (write-through) requires you to have the new value ready, and in distributed systems, an out-of-order update can overwrite newer data with older data. Deleting lets the next read naturally fetch fresh data from the source. However, for hot keys, updating avoids the cache miss thundering herd problem.
How do I invalidate across multiple cache layers?
Work from the outermost layer inward: invalidate CDN first, then distributed cache, then application-level cache. Use event-based invalidation with a fan-out pattern. Each cache layer subscribes to invalidation events and clears its own entries independently.
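The fan-out pattern can be sketched with an in-process subscriber list; in production the bus would be Redis Pub/Sub, Kafka, or similar, and the three dicts stand in for the CDN, distributed cache, and application cache:

```python
# Fan-out invalidation sketch: one published event clears every layer.
cdn_cache, dist_cache, app_cache = {}, {}, {}
subscribers = []

def subscribe(handler):
    subscribers.append(handler)

def publish_invalidation(key):
    for handler in subscribers:  # fan-out: every layer sees the event
        handler(key)

# Each layer registers a handler that clears only its own entries.
subscribe(lambda key: cdn_cache.pop(key, None))
subscribe(lambda key: dist_cache.pop(key, None))
subscribe(lambda key: app_cache.pop(key, None))
```

Each layer acting independently on the same event is what keeps the layers decoupled: adding a fourth cache layer is just one more `subscribe` call.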
Is cache invalidation different from cache eviction?
Yes. Invalidation is the deliberate removal of data because it is stale or outdated. Eviction is the automatic removal of data to make room for new entries when the cache is full. Invalidation is driven by data changes; eviction is driven by memory pressure.