Cache Invalidation: Solving the Hardest Problem in Computer Science
Phil Karlton famously said, "There are only two hard things in Computer Science: cache invalidation and naming things." While the quote is often shared as a joke, cache invalidation genuinely is one of the trickiest challenges in system design. Get it wrong, and your users see stale data, your systems become inconsistent, and debugging becomes a nightmare.
Cache invalidation is the process of removing or updating cached data when the underlying source data changes. This guide covers every major invalidation strategy, their trade-offs, and practical code examples to help you choose the right approach for your system.
Why Is Cache Invalidation So Hard?
The fundamental tension is between performance and consistency. Caching exists to avoid expensive operations, but the moment the source data changes, the cached copy becomes stale. The challenges multiply in distributed systems:
- Multiple cache layers: Browser cache, CDN, application cache, database query cache — which ones need invalidation?
- Race conditions: Two concurrent writes can leave the cache in an inconsistent state.
- Distributed coordination: Invalidating across multiple cache nodes requires reliable messaging.
- Cascading dependencies: Updating a user's name might require invalidating their profile cache, their comments cache, their friends' feed caches, and more.
Invalidation Strategies
1. TTL-Based Invalidation (Time to Live)
The simplest approach: every cached entry has an expiration time. After the TTL expires, the entry is automatically removed, and the next request fetches fresh data from the source.
```
# Redis TTL commands
SET user:1001 '{"name":"Alice","role":"admin"}' EX 300     # Expires in 5 minutes
SETEX product:5001 3600 '{"name":"Widget","price":29.99}'  # Expires in 1 hour

# Check remaining TTL
TTL user:1001   # Returns seconds remaining (e.g., 287)
PTTL user:1001  # Returns milliseconds remaining

# Update TTL without changing value
EXPIRE user:1001 600  # Reset to 10 minutes

# Remove expiration (make persistent)
PERSIST user:1001
```
Choosing TTL values depends on your tolerance for staleness:
| Data Type | Suggested TTL | Reasoning |
|---|---|---|
| User session | 15-30 minutes | Security — limit exposure window |
| Product catalog | 1-24 hours | Changes infrequently, high read volume |
| Social media feed | 30-60 seconds | Users expect near-real-time updates |
| Configuration/feature flags | 1-5 minutes | Balance between freshness and reducing config service load |
| DNS records | 5 minutes - 48 hours | DNS propagation latency is expected |
Pros: Dead simple, self-healing (stale data eventually disappears), no coordination needed.
Cons: Data can be stale for up to the full TTL duration, not suitable when freshness is critical.
2. Event-Based Invalidation
Instead of waiting for TTL expiry, the system actively invalidates cache entries when the source data changes. This is typically implemented using a publish/subscribe mechanism or message queue.
```python
# Event-based invalidation with Redis Pub/Sub
import redis

# decode_responses=True so message payloads arrive as str, not bytes
redis_client = redis.Redis(decode_responses=True)

# Publisher (runs when data changes)
def update_product(product_id, new_data):
    # 1. Update database
    database.update("products", product_id, new_data)
    # 2. Publish invalidation event
    redis_client.publish("cache:invalidate", f"product:{product_id}")

# Subscriber (runs on each cache-holding server)
def listen_for_invalidations():
    pubsub = redis_client.pubsub()
    pubsub.subscribe("cache:invalidate")
    for message in pubsub.listen():
        if message["type"] == "message":
            cache_key = message["data"]
            local_cache.delete(cache_key)   # clear this server's in-process cache
            redis_client.delete(cache_key)  # clear the shared Redis entry
            print(f"Invalidated: {cache_key}")
```
Pros: Near-real-time invalidation, data freshness is excellent.
Cons: Added complexity, requires reliable messaging infrastructure, risk of missed events.
3. Write-Through Invalidation
In the write-through pattern, every write operation updates both the cache and the database simultaneously. The cache is always consistent with the database.
```python
def write_through_update(key, value):
    # Write to the database first, then the cache.
    # Note: without distributed transactions this is not truly atomic.
    try:
        # Step 1: Write to database
        database.update(key, value)
        # Step 2: Update cache with new value
        cache.set(key, value, ttl=3600)
        return True
    except DatabaseError:
        # If the DB write fails, remove any stale cache entry
        cache.delete(key)
        raise
```
Pros: Strong consistency between cache and database, simple mental model.
Cons: Higher write latency (must write to both), not truly atomic without distributed transactions.
4. Write-Behind (Write-Back) Invalidation
The write-behind pattern writes to the cache immediately and asynchronously writes to the database later. This provides low write latency but risks data loss if the cache fails before the database write completes.
Pros: Very fast writes, batching reduces database load.
Cons: Risk of data loss, eventual consistency, complex implementation.
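A minimal write-behind sketch, using a plain dict as a stand-in for the database and an in-process queue as the write buffer (a real implementation would use a persistent queue and a background flush worker):

```python
import queue
import threading

class WriteBehindCache:
    """Write-behind sketch: writes land in the in-memory cache immediately
    and are flushed to the 'database' (a dict stand-in here) later."""

    def __init__(self, database):
        self.cache = {}
        self.database = database        # stand-in for a real DB
        self.pending = queue.Queue()    # buffered writes awaiting flush
        self.lock = threading.Lock()

    def set(self, key, value):
        with self.lock:
            self.cache[key] = value     # fast path: cache only
        self.pending.put((key, value))  # DB write is deferred

    def get(self, key):
        with self.lock:
            if key in self.cache:
                return self.cache[key]
        return self.database.get(key)

    def flush(self):
        """Drain buffered writes to the database.
        In practice this runs periodically on a background thread."""
        while not self.pending.empty():
            key, value = self.pending.get()
            self.database[key] = value
```

Anything still sitting in `pending` when the process dies is lost, which is exactly the data-loss risk noted above.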
5. Version-Based Invalidation
Instead of deleting cache entries, you change the cache key by appending a version number. When data changes, increment the version — old cache entries become orphaned and are eventually removed by the cache's eviction policy (e.g., LRU) or their TTL.
```python
# Version-based cache keys
def get_product(product_id):
    version = cache.get(f"product_version:{product_id}") or "1"
    cache_key = f"product:{product_id}:v{version}"
    cached = cache.get(cache_key)
    if cached:
        return cached
    product = database.get_product(product_id)
    cache.set(cache_key, product, ttl=86400)
    return product

def update_product(product_id, new_data):
    database.update_product(product_id, new_data)
    # Simply increment version — old cache entry becomes orphaned
    cache.incr(f"product_version:{product_id}")
```
6. Tag-Based Invalidation
Group related cache entries under tags, then invalidate all entries with a specific tag at once. This is powerful for invalidating cascading dependencies.
```python
# Tag-based invalidation
def cache_with_tags(key, value, tags, ttl=3600):
    cache.set(key, value, ttl=ttl)
    for tag in tags:
        cache.sadd(f"tag:{tag}", key)  # Add key to tag set

def invalidate_tag(tag):
    keys = cache.smembers(f"tag:{tag}")
    if keys:
        cache.delete(*keys)        # Delete all tagged entries
    cache.delete(f"tag:{tag}")     # Clean up tag set

# Usage
cache_with_tags("user:1001:profile", profile_data, ["user:1001"])
cache_with_tags("user:1001:orders", orders_data, ["user:1001", "orders"])

# Invalidate everything related to user 1001
invalidate_tag("user:1001")
```
Comparison of Invalidation Strategies
| Strategy | Consistency | Complexity | Performance Impact | Best For |
|---|---|---|---|---|
| TTL-Based | Eventual (up to TTL) | Low | None | General purpose, low-stakes data |
| Event-Based | Near-real-time | High | Minimal (async) | High-freshness requirements |
| Write-Through | Strong | Medium | Higher write latency | Read-heavy, consistency-critical |
| Write-Behind | Eventual | High | Fastest writes | Write-heavy, tolerates data loss risk |
| Version-Based | Strong (new reads) | Low-Medium | Extra key lookup | Avoiding explicit deletes |
| Tag-Based | Strong (on invalidation) | Medium | Batch delete overhead | Complex dependency graphs |
Common Pitfalls and Race Conditions
The Delete-Then-Update Race
A classic race condition occurs when one process reads stale data from the database just before another process completes a write:
```
# Dangerous race condition
# Thread A: delete cache, then update DB
# Thread B: read from DB (gets OLD value), write to cache
#
# Timeline:
# T1: Thread A deletes cache key "user:1001"
# T2: Thread B cache miss, reads OLD value from DB
# T3: Thread A updates DB with NEW value
# T4: Thread B writes OLD value to cache
#
# Result: Cache has STALE data, DB has FRESH data
```
Solutions include using a short TTL as a safety net, adding a brief delay before cache repopulation, or using distributed locks.
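The lock-based approach can be sketched as follows. A process-local `threading.Lock` stands in for a real distributed lock (which would typically be a Redis `SET NX` lock or similar), and plain dicts stand in for the database and cache:

```python
import threading

# Stand-ins: in production the lock would be distributed (e.g. Redis SET NX),
# and db/cache would be real stores, not dicts.
db = {"user:1001": "old"}
cache = {}
lock = threading.Lock()

def update(key, value):
    # Writer holds the lock across the DB write and the cache delete,
    # so no reader can repopulate the cache from a half-updated state.
    with lock:
        db[key] = value
        cache.pop(key, None)  # invalidate only after the DB is updated

def read(key):
    if key in cache:
        return cache[key]
    with lock:  # cache repopulation is serialized against writers
        if key in cache:      # another reader may have filled it meanwhile
            return cache[key]
        value = db[key]
        cache[key] = value
        return value
```

Because the reader cannot repopulate the cache while a writer holds the lock, the T2/T4 interleaving above becomes impossible, at the cost of serializing misses.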
Thundering Herd on Expiry
When a popular cache entry expires, hundreds of concurrent requests all experience a cache miss simultaneously and flood the database. This is the cache stampede problem. Solutions include probabilistic early expiration and request coalescing.
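Probabilistic early expiration can be sketched in a few lines. Each reader independently rolls a die weighted by how close the entry is to expiry and how costly a recompute is; with high probability exactly one request refreshes the entry before it actually expires (this follows the "XFetch" idea; the parameter names below are illustrative):

```python
import math
import random

def should_refresh_early(ttl_remaining, recompute_cost, beta=1.0):
    """Decide whether this reader should recompute the entry now,
    slightly before its real expiry, to avoid a stampede at T=0.

    ttl_remaining   -- seconds until the cached entry expires
    recompute_cost  -- seconds a recompute typically takes
    beta            -- >1 favors earlier refresh, <1 later
    """
    # -log(U) for U in (0, 1] is an exponential random variable, so the
    # refresh probability rises sharply as ttl_remaining approaches zero
    # and scales with the cost of recomputing.
    return recompute_cost * beta * -math.log(1.0 - random.random()) >= ttl_remaining
```

A cache read would call this on every hit and, when it returns `True`, recompute and re-set the entry while serving the still-valid cached value to everyone else.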
Real-World Invalidation Strategies
E-Commerce Product Catalog: Use TTL-based invalidation with a 1-hour TTL as the baseline. When an admin updates a product, publish an event-based invalidation to immediately refresh the cache. This hybrid approach provides both eventual consistency guarantees and real-time updates when needed.
Social Media Feeds: Use short TTLs (30-60 seconds) combined with write-through for posts. When a user creates a new post, update their followers' feed caches via fan-out. For viral content, use hot key strategies to avoid overwhelming a single cache node.
User Sessions: Use Redis with sliding TTL expiration. Each time a user makes a request, refresh the TTL using the EXPIRE command. Sessions that are idle for the TTL duration are automatically cleaned up.
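The sliding-expiration logic can be sketched in memory; with Redis the same pattern is a `GET` followed by `EXPIRE` (or a single `GETEX key EX ttl` on Redis 6.2+). The dict store and the explicit `now` parameter are stand-ins for illustration:

```python
import time

SESSION_TTL = 1800  # 30 minutes

sessions = {}  # session_id -> (data, expires_at); dict stands in for Redis

def create_session(session_id, data, now=None):
    now = time.time() if now is None else now
    sessions[session_id] = (data, now + SESSION_TTL)

def touch_session(session_id, now=None):
    """Called on every request: return session data and slide the TTL."""
    now = time.time() if now is None else now
    entry = sessions.get(session_id)
    if entry is None or entry[1] <= now:   # missing or idle past the TTL
        sessions.pop(session_id, None)
        return None
    data, _ = entry
    sessions[session_id] = (data, now + SESSION_TTL)  # reset the expiry
    return data
```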
Frequently Asked Questions
What happens if my invalidation event is lost?
This is why most systems use TTL as a safety net even with event-based invalidation. If an invalidation event is lost, the stale data will still expire when the TTL runs out. For critical data, consider using a reliable message queue with at-least-once delivery guarantees instead of fire-and-forget pub/sub.
Should I delete the cache entry or update it with the new value?
Deleting is generally safer and simpler. Updating (write-through) requires you to have the new value ready, and in distributed systems, an out-of-order update can overwrite newer data with older data. Deleting lets the next read naturally fetch fresh data from the source. However, for hot keys, updating avoids the cache miss thundering herd problem.
How do I invalidate across multiple cache layers?
Work from the outermost layer inward: invalidate CDN first, then distributed cache, then application-level cache. Use event-based invalidation with a fan-out pattern. Each cache layer subscribes to invalidation events and clears its own entries independently.
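The fan-out pattern can be sketched with an in-process subscriber list; in production the bus would be Redis Pub/Sub, Kafka, or similar, and the three dicts stand in for the CDN, distributed cache, and application cache:

```python
# Fan-out invalidation sketch: one published event clears every layer.
cdn_cache, dist_cache, app_cache = {}, {}, {}
subscribers = []

def subscribe(handler):
    subscribers.append(handler)

def publish_invalidation(key):
    for handler in subscribers:  # fan-out: every layer sees the event
        handler(key)

# Each layer registers a handler that clears only its own entries.
subscribe(lambda key: cdn_cache.pop(key, None))
subscribe(lambda key: dist_cache.pop(key, None))
subscribe(lambda key: app_cache.pop(key, None))
```

Each layer acting independently on the same event is what keeps the layers decoupled: adding a fourth cache layer is just one more `subscribe` call.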
Is cache invalidation different from cache eviction?
Yes. Invalidation is the deliberate removal of data because it is stale or outdated. Eviction is the automatic removal of data to make room for new entries when the cache is full. Invalidation is driven by data changes; eviction is driven by memory pressure.