A cache stampede occurs when a heavily requested cache key expires, causing hundreds or thousands of concurrent requests to hit your database at once. This massive spike often leads to connection pool exhaustion, increased latency, and total system failure. If your API serves high traffic, relying on simple "get or set" logic is a recipe for disaster.
You will learn how to implement distributed locks and probabilistic jitter so your backend remains stable even during massive traffic surges. Done correctly, these strategies collapse thousands of redundant origin queries per expiry into a single cache refresh.
TL;DR — Prevent cache stampedes by using a distributed lock (a simple SET NX, or Redlock for multi-node guarantees) so that only one request refreshes the cache, and apply random jitter to TTLs so multiple keys don't expire at the exact same moment.
1. What is a Redis Cache Stampede?
💡 Analogy: Imagine a popular coffee shop where the "Special of the Day" sign falls down. Instead of one customer telling the barista, 500 customers rush the counter simultaneously to ask what the special is. The lone barista is crushed by the crowd and the shop shuts down.
In technical terms, this is often called "dog-piling." When a "hot key" expires in Redis, every application instance perceives a cache miss at the same moment, and they all attempt to fetch the same data from the origin database to repopulate the cache.
The core problem isn't the cache miss itself, but the lack of coordination between your API nodes. Without a synchronization mechanism, your database receives a 1:1 ratio of requests to users for that specific millisecond, defeating the purpose of having a cache layer entirely.
2. Why High-Traffic APIs Fail Without Protection
In high-concurrency environments, even a 100ms database query can trigger a disaster. If you have 2,000 requests per second, a single cache expiry results in 2,000 simultaneous SQL queries. This leads to Database Connection Pool Exhaustion, where the application can no longer talk to the DB for any request, even those unrelated to the expired key.
You face this risk when dealing with global configuration keys, product catalog data during flash sales, or authentication tokens. If your TTL (Time to Live) values are static (e.g., exactly 3600 seconds), you also risk "expiry synchronization," where multiple related keys expire at once, creating a massive overhead spike every hour.
3. Implementing Distributed Locks and Jitter
You can solve this using a combination of mutual exclusion and probabilistic variance.
Step 1. Distributed Locking with Lua

Use a distributed lock to ensure only one worker updates the cache. Acquiring the lock is a single atomic command (`SET lock_key unique_id NX PX ttl_ms`), so no script is needed there. Where a Lua script becomes essential is the release: you must delete the lock only if it still holds your unique ID, otherwise a slow worker could delete a lock that another worker has since acquired.

```lua
-- unlock.lua
-- KEYS[1] = lock_key, ARGV[1] = unique_id
if redis.call('GET', KEYS[1]) == ARGV[1] then
  return redis.call('DEL', KEYS[1])
else
  return 0
end
```
Step 2. Implementation Logic
The application should follow this pattern: Check cache -> If miss, try to acquire lock -> If lock acquired, fetch from DB and update cache -> If lock fails, wait and retry or return stale data.
```javascript
const { randomUUID } = require("crypto");

// Safe release: delete the lock only if we still own it.
const UNLOCK_SCRIPT = `
if redis.call('GET', KEYS[1]) == ARGV[1] then
  return redis.call('DEL', KEYS[1])
else
  return 0
end`;

async function getCachedData(key, retriesLeft = 50) {
  const data = await redis.get(key);
  if (data) return JSON.parse(data);
  if (retriesLeft <= 0) throw new Error(`Timed out waiting for lock on ${key}`);

  // Try to acquire the lock with a unique owner ID.
  const lockKey = `lock:${key}`;
  const lockId = randomUUID();
  const locked = await redis.set(lockKey, lockId, "NX", "PX", 5000);

  if (locked) {
    try {
      const dbData = await fetchFromDB(key);
      // Add jitter to TTL (e.g., 60s + random 0-10s)
      const ttl = 60 + Math.floor(Math.random() * 10);
      await redis.set(key, JSON.stringify(dbData), "EX", ttl);
      return dbData;
    } finally {
      // Release the lock even if fetchFromDB throws.
      await redis.eval(UNLOCK_SCRIPT, 1, lockKey, lockId);
    }
  }
  // Lock held by another worker: wait 100ms and retry.
  await new Promise((res) => setTimeout(res, 100));
  return getCachedData(key, retriesLeft - 1);
}
```
Step 3. Verifying with Jitter
Jitter prevents "thundering herds" of expiries. Instead of all keys expiring at 12:00:00, they expire between 11:59:50 and 12:00:10. You can verify this by checking the TTL of multiple keys using `TTL key_name` in the Redis CLI.
4. Distributed Locks vs. Jitter Optimization
Choosing the right method depends on your latency requirements and data consistency needs.
| Criteria | Distributed Locks | Random Jitter |
|---|---|---|
| Primary Goal | Strict Mutex (One DB hit) | Flattening Expiry Spikes |
| Implementation | Medium (Requires Lua/Redlock) | Easy (Math.random) |
| Latency | Slight increase on cache miss | No impact |
| Best For | Extremely heavy DB queries | General cache distribution |
If you have a massive dataset where keys are set at the same time, use Jitter. If your database query for a single key is expensive, use Distributed Locks.
5. Common Implementation Pitfalls
⚠️ Common Mistake: Setting a lock TTL that is shorter than the database query time. This results in the lock expiring while the query is still running, allowing a second "stampede" to start.
Another error is failing to release the lock in a `finally` block. If your application crashes after acquiring the lock but before releasing it, that key will remain un-refreshable until the lock TTL expires.
Troubleshooting by Error

- Error: Response timeout while waiting for lock
- Cause: The lock is held by a slow DB query or a dead process
- Fix: Increase the lock TTL and implement a "watchdog" that extends the lock while the query is still running
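The watchdog can be sketched as a timer that periodically extends the lock's TTL, but only while this process still owns it. This is a minimal sketch assuming an ioredis-style client; `startWatchdog` and `EXTEND_SCRIPT` are illustrative names, not a standard API:

```javascript
// Extend the lock's TTL only if we still own it (same ownership
// check as the safe-release script, but with PEXPIRE instead of DEL).
const EXTEND_SCRIPT = `
if redis.call('GET', KEYS[1]) == ARGV[1] then
  return redis.call('PEXPIRE', KEYS[1], ARGV[2])
else
  return 0
end`;

function startWatchdog(redis, lockKey, lockId, ttlMs) {
  // Renew at half the TTL so the lock never lapses mid-query.
  const timer = setInterval(() => {
    redis.eval(EXTEND_SCRIPT, 1, lockKey, lockId, ttlMs).catch(() => {});
  }, ttlMs / 2);
  // Caller invokes the returned function after the refresh finishes.
  return () => clearInterval(timer);
}
```

Stop the watchdog in the same `finally` block that releases the lock, so a crashed refresh never keeps the lock alive forever.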
6. Production-Ready Best Practices
Apply a jitter of 10% to 20% of your base TTL. If your TTL is 300 seconds, add a random value between 0 and 60 seconds. This simple change distributes the load effectively across time.
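The rule above fits in a small helper (`jitteredTtl` is an illustrative name):

```javascript
// Spread expiries over time by adding 0-20% random slack to a base TTL.
// jitteredTtl(300) returns an integer in [300, 360).
function jitteredTtl(baseSeconds, jitterFraction = 0.2) {
  const jitter = Math.floor(Math.random() * baseSeconds * jitterFraction);
  return baseSeconds + jitter;
}
```

Use the result wherever you would otherwise pass a fixed TTL to `SET ... EX`.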
For mission-critical "hot keys," consider Probabilistic Early Recomputation (X-Fetch). Instead of waiting for expiry, the application decides to refresh the cache early based on a probability function as the TTL nears zero.
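The X-Fetch decision is usually expressed as: recompute when `now - delta * beta * ln(rand()) >= expiry`, where `delta` is how long the last recomputation took and `beta >= 1` makes early refresh more aggressive. A sketch under those assumptions (`shouldRecomputeEarly` is an illustrative name):

```javascript
// Probabilistic early recomputation (X-Fetch).
// nowMs/expiryMs: current time and cache-entry expiry, in ms.
// deltaMs: duration of the last recomputation, in ms.
// Math.log(Math.random()) is negative, so the subtraction pushes
// "now" forward by a random amount scaled by deltaMs * beta; the
// probability of an early refresh rises as expiry approaches.
function shouldRecomputeEarly(nowMs, expiryMs, deltaMs, beta = 1.0) {
  return nowMs - deltaMs * beta * Math.log(Math.random()) >= expiryMs;
}
```

On a `true` result the caller recomputes and rewrites the key before it actually expires, so other readers keep hitting the still-valid cache.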
📌 Key Takeaways
- Use SET NX with a unique ID for simple distributed locking.
- Always add Random Jitter to TTLs to prevent synchronized expiry spikes.
- Return stale data during a cache miss if a lock is already held to minimize user-facing latency.
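The stale-data takeaway can be sketched by keeping a longer-lived copy alongside the fresh key. The `stale:` prefix, the helper name, and the injected `fetchFresh` callback below are all illustrative; the idea is simply to write two keys on every refresh, with the stale copy given a much longer TTL:

```javascript
// Serve a stale copy instead of blocking when another worker
// already holds the refresh lock.
async function getWithStaleFallback(redis, key, fetchFresh) {
  const fresh = await redis.get(key);
  if (fresh) return JSON.parse(fresh);

  const locked = await redis.set(`lock:${key}`, "1", "NX", "PX", 5000);
  if (!locked) {
    // Someone else is refreshing: return stale data immediately.
    const stale = await redis.get(`stale:${key}`);
    if (stale) return JSON.parse(stale);
  }
  // We hold the lock (or no stale copy exists): hit the origin.
  return fetchFresh(key);
}
```

This trades a bounded window of staleness for flat tail latency during refreshes.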
Frequently Asked Questions
Q. Is Redlock necessary for cache stampede?
A. Usually not. A single-instance SET NX lock is enough for stampede protection, because the worst case of a lost lock is one extra database query. Redlock only matters when lock safety must survive the failure of a Redis node.
Q. How much jitter is optimal?
A. A common rule of thumb is 10% to 20% of the base TTL, matching the guidance above.
Q. Should I use background refresh?
A. Yes. Background workers that refresh hot keys before they expire mean user requests never see a miss, which makes this one of the most reliable approaches for known hot keys.