TL;DR
4 Cache Disasters & Go Solutions:
- Thundering Herd: Random TTL jitter prevents mass expiration
- Cache Penetration: Cache "null" results for non-existent data
- Cache Breakdown: Never expire hot keys, use background refresh
- Cache Crash: Circuit breakers + rate limiting for graceful degradation
Golden Rule: Your cache strategy must work when caches fail, not just when they succeed.
Table of Contents
- Thundering Herd Problem: When Cache Misses Attack in Waves
- Cache Penetration: The Non-Existent Key Problem
- Cache Breakdown: When Hot Keys Expire
- Cache Crash: Building Resilient Systems
- Best Practices for Robust Caching
- Conclusion
Caching is one of the most powerful tools in a developer's arsenal for improving application performance. By storing frequently accessed data in fast, temporary storage, we can dramatically reduce response times and database load. However, caching isn't a silver bullet – when implemented incorrectly, it can create serious performance bottlenecks and system failures.
In this post, we'll explore four critical caching problems that can bring your system to its knees, along with practical solutions to prevent them.
1. Thundering Herd Problem: When Cache Misses Attack in Waves
The Problem
The thundering herd problem occurs when a large number of cache keys expire simultaneously, causing a massive wave of concurrent requests to hit your database all at once. Imagine your Redis cache contains thousands of user session keys that were all created during peak traffic hours and set with the same TTL (Time To Live).
When these keys expire simultaneously:
- Multiple application instances detect cache misses
- All instances simultaneously query the database for the same data
- Database gets overwhelmed with concurrent queries
- System performance degrades significantly
The Solution
Set Random Expiry Times: Instead of using fixed TTL values, add randomization to prevent synchronized expiration:
```go
package main

import (
	"context"
	"math/rand"
	"time"

	"github.com/go-redis/redis/v8"
)

type CacheService struct {
	client *redis.Client
}

func NewCacheService() *CacheService {
	rdb := redis.NewClient(&redis.Options{
		Addr: "localhost:6379",
	})
	return &CacheService{client: rdb}
}

func (c *CacheService) SetCacheWithJitter(ctx context.Context, key, value string, baseTTL time.Duration) error {
	// Add 0-20% jitter to the base TTL so keys written together expire apart
	jitterPercent := rand.Float64() * 0.2
	jitter := time.Duration(float64(baseTTL) * jitterPercent)
	actualTTL := baseTTL + jitter

	return c.client.Set(ctx, key, value, actualTTL).Err()
}
```
This simple technique spreads cache expiration over time, preventing the thundering herd effect.
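For instance, a session write using this helper might look like the following sketch, building on the CacheService above (the `session:` key scheme and one-hour base TTL are illustrative assumptions, and `fmt` would need to be added to that file's imports):

```go
// storeSession caches a session payload with a jittered TTL.
// With a 1h base TTL and 0-20% jitter, the effective TTL lands
// anywhere in [1h, 1h12m), so sessions created in the same burst
// expire at different times instead of all at once.
func storeSession(ctx context.Context, c *CacheService, sessionID, payload string) error {
	key := fmt.Sprintf("session:%s", sessionID) // hypothetical key scheme
	return c.SetCacheWithJitter(ctx, key, payload, time.Hour)
}
```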
2. Cache Penetration: The Non-Existent Key Problem
The Problem
Cache penetration happens when your application repeatedly requests data that doesn't exist in either the cache or the database. This creates a perfect storm:
- Application checks cache → miss
- Application queries database → no results
- No data gets cached (because it doesn't exist)
- Process repeats for every request
Malicious users can exploit this by repeatedly requesting non-existent resources, effectively bypassing your cache layer and hammering your database directly.
The Solution
Cache Empty Results and Use Bloom Filters:
First, cache null/empty results for a short period:
```go
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"time"

	"github.com/go-redis/redis/v8"
)

type User struct {
	ID   int    `json:"id"`
	Name string `json:"name"`
}

type UserService struct {
	cache *redis.Client
	db    Database // Assume this interface exists
}

func (s *UserService) GetUserData(ctx context.Context, userID int) (*User, error) {
	key := fmt.Sprintf("user:%d", userID)

	// Check cache first
	cachedData, err := s.cache.Get(ctx, key).Result()
	if err == nil {
		if cachedData == "null" {
			return nil, nil // Cached miss
		}
		var user User
		if err := json.Unmarshal([]byte(cachedData), &user); err == nil {
			return &user, nil
		}
	}

	// Query database
	user, err := s.db.GetUser(ctx, userID)
	if err != nil {
		return nil, err
	}

	// Cache the result (even if nil)
	if user != nil {
		userData, _ := json.Marshal(user)
		s.cache.Set(ctx, key, string(userData), time.Hour)
	} else {
		// Cache the miss for 5 minutes so repeated lookups skip the database
		s.cache.Set(ctx, key, "null", 5*time.Minute)
	}

	return user, nil
}
```
Second, implement a Bloom filter to quickly identify keys that definitely don't exist, before they ever reach the database:
```go
package main

import (
	"context"
	"fmt"
	"hash/fnv"
	"time"

	"github.com/go-redis/redis/v8"
)

// Simple Bloom filter implementation
type BloomFilter struct {
	bitArray      []bool
	size          uint
	hashFunctions int
}

func NewBloomFilter(size uint, hashFunctions int) *BloomFilter {
	return &BloomFilter{
		bitArray:      make([]bool, size),
		size:          size,
		hashFunctions: hashFunctions,
	}
}

func (bf *BloomFilter) Add(item string) {
	for i := 0; i < bf.hashFunctions; i++ {
		hash := bf.hash(item, uint(i)) % bf.size
		bf.bitArray[hash] = true
	}
}

func (bf *BloomFilter) MightContain(item string) bool {
	for i := 0; i < bf.hashFunctions; i++ {
		hash := bf.hash(item, uint(i)) % bf.size
		if !bf.bitArray[hash] {
			return false
		}
	}
	return true
}

func (bf *BloomFilter) hash(item string, seed uint) uint {
	h := fnv.New32a()
	h.Write([]byte(item))
	h.Write([]byte{byte(seed)})
	return uint(h.Sum32())
}

// Enhanced UserService with Bloom filter (User and Database as above)
type EnhancedUserService struct {
	cache       *redis.Client
	db          Database
	bloomFilter *BloomFilter
}

func (s *EnhancedUserService) GetUserData(ctx context.Context, userID int) (*User, error) {
	key := fmt.Sprintf("user:%d", userID)

	// Check the Bloom filter first
	if !s.bloomFilter.MightContain(key) {
		// Definitely doesn't exist, cache the miss
		s.cache.Set(ctx, key, "null", 5*time.Minute)
		return nil, nil
	}

	// Continue with the normal cache/database flow from the previous example
	return s.getUserDataNormal(ctx, userID)
}
```
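One open question is how the filter gets populated. A minimal sketch, assuming a hypothetical ListUserIDs method on the Database interface, is to load all existing user keys at startup:

```go
// NewEnhancedUserService warms the Bloom filter at startup so lookups
// for real users pass the filter. ListUserIDs is a hypothetical method
// assumed here for illustration.
func NewEnhancedUserService(ctx context.Context, cache *redis.Client, db Database) (*EnhancedUserService, error) {
	ids, err := db.ListUserIDs(ctx) // hypothetical: returns all known user IDs
	if err != nil {
		return nil, err
	}

	// Size the filter generously relative to the key count to keep the
	// false-positive rate low; ~10 bits per key with 3 hashes is a rough choice.
	bf := NewBloomFilter(uint(len(ids)*10+1), 3)
	for _, id := range ids {
		bf.Add(fmt.Sprintf("user:%d", id))
	}

	return &EnhancedUserService{cache: cache, db: db, bloomFilter: bf}, nil
}
```

Note that this simple filter has no deletion support, so it suits append-mostly key spaces; newly created users must be Add-ed to the filter as they appear.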
3. Cache Breakdown: When Hot Keys Expire
The Problem
Cache breakdown occurs when a highly accessed "hot key" expires, causing a sudden surge of requests to the database for that specific piece of data. Unlike the thundering herd problem (which affects many keys at once), cache breakdown concerns a single, critical piece of data.
Consider a popular product page on an e-commerce site. When its cache entry expires:
- Hundreds of concurrent users request the same product
- All requests miss the cache
- Database gets bombarded with identical queries
- System performance suffers until the cache is repopulated
The Solution
Never Set Expiry for Hot Keys: For critical, frequently accessed data, consider these strategies:
No expiration with manual invalidation:
```go
package main

import (
	"context"
	"encoding/json"
	"fmt"

	"github.com/go-redis/redis/v8"
)

type HotDataService struct {
	cache *redis.Client
	db    Database
}

func (s *HotDataService) UpdateHotData(ctx context.Context, key string, newValue interface{}) error {
	// Update the database first
	if err := s.db.Update(ctx, key, newValue); err != nil {
		return err
	}

	// Then update the cache without expiry
	data, err := json.Marshal(newValue)
	if err != nil {
		return err
	}

	hotKey := fmt.Sprintf("hot:%s", key)
	return s.cache.Set(ctx, hotKey, string(data), 0).Err() // 0 = no expiration
}
```
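The read path is then a plain Get with a database fallback. A minimal sketch, reusing HotDataService and the `hot:` key scheme above (and assuming the same Database.Get method used later in this section):

```go
// GetHotData reads a never-expiring hot key and returns its JSON payload.
// Even "no expiry" keys can vanish under memory pressure or a flush, so
// a miss falls back to the source of truth and re-primes the cache.
func (s *HotDataService) GetHotData(ctx context.Context, key string) (string, error) {
	hotKey := fmt.Sprintf("hot:%s", key)

	val, err := s.cache.Get(ctx, hotKey).Result()
	if err == nil {
		return val, nil
	}
	if err != redis.Nil {
		return "", err // real cache error, not a miss
	}

	// Miss: repopulate from the database
	data, err := s.db.Get(ctx, key)
	if err != nil {
		return "", err
	}
	jsonData, err := json.Marshal(data)
	if err != nil {
		return "", err
	}
	if err := s.cache.Set(ctx, hotKey, string(jsonData), 0).Err(); err != nil {
		return "", err
	}
	return string(jsonData), nil
}
```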
Background refresh before expiration:
```go
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"time"

	"github.com/go-redis/redis/v8"
)

type BackgroundRefreshService struct {
	cache *redis.Client
	db    Database
}

func (s *BackgroundRefreshService) GetHotDataWithRefresh(ctx context.Context, key string) (interface{}, error) {
	hotKey := fmt.Sprintf("hot:%s", key)
	refreshKey := fmt.Sprintf("refresh:%s", key)

	// Check if data exists in cache
	cachedData, err := s.cache.Get(ctx, hotKey).Result()
	if err == nil {
		// Check if a refresh is needed (before the actual expiry)
		lastRefresh, err := s.cache.Get(ctx, refreshKey).Result()
		if err != nil || s.shouldRefresh(lastRefresh) {
			// Trigger a background refresh
			go s.refreshHotKey(context.Background(), key)
		}

		var result interface{}
		json.Unmarshal([]byte(cachedData), &result)
		return result, nil
	}

	// Fall back to the database if the cache entry is truly missing
	return s.getFromDatabaseAndCache(ctx, key)
}

func (s *BackgroundRefreshService) shouldRefresh(lastRefreshStr string) bool {
	if lastRefreshStr == "" {
		return true
	}
	lastRefresh, err := time.Parse(time.RFC3339, lastRefreshStr)
	if err != nil {
		return true
	}
	// Refresh 10 minutes before the 1-hour TTL runs out
	return time.Since(lastRefresh) > 50*time.Minute
}

func (s *BackgroundRefreshService) refreshHotKey(ctx context.Context, key string) {
	// Reload from the database and re-prime both the data and refresh markers
	data, err := s.db.Get(ctx, key)
	if err != nil {
		return
	}

	jsonData, _ := json.Marshal(data)
	hotKey := fmt.Sprintf("hot:%s", key)
	refreshKey := fmt.Sprintf("refresh:%s", key)

	s.cache.Set(ctx, hotKey, string(jsonData), time.Hour)
	s.cache.Set(ctx, refreshKey, time.Now().Format(time.RFC3339), time.Hour)
}

func (s *BackgroundRefreshService) getFromDatabaseAndCache(ctx context.Context, key string) (interface{}, error) {
	// Fallback: load from the database and cache the result
	data, err := s.db.Get(ctx, key)
	if err != nil {
		return nil, err
	}

	jsonData, _ := json.Marshal(data)
	hotKey := fmt.Sprintf("hot:%s", key)
	s.cache.Set(ctx, hotKey, string(jsonData), time.Hour)

	return data, nil
}
```
4. Cache Crash: Building Resilient Systems
The Problem
Cache crash is perhaps the most catastrophic scenario – your entire cache system (Redis cluster, Memcached, etc.) becomes unavailable. When this happens:
- All cache requests fail
- Traffic redirects entirely to your database
- Database becomes overwhelmed and may crash
- Cascading failures spread throughout your system
The Solution
Implement Circuit Breakers and Highly Available Cache Clusters:
Circuit Breaker Pattern:
```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"

	"github.com/go-redis/redis/v8"
)

type CircuitState int

const (
	StateClosed CircuitState = iota
	StateOpen
	StateHalfOpen
)

type CacheCircuitBreaker struct {
	client           *redis.Client
	failureThreshold int
	timeout          time.Duration
	failureCount     int
	lastFailureTime  time.Time
	state            CircuitState
	mutex            sync.RWMutex
}

func NewCacheCircuitBreaker(client *redis.Client) *CacheCircuitBreaker {
	return &CacheCircuitBreaker{
		client:           client,
		failureThreshold: 5,
		timeout:          60 * time.Second,
		state:            StateClosed,
	}
}

func (cb *CacheCircuitBreaker) Get(ctx context.Context, key string) (string, error) {
	cb.mutex.RLock()
	state := cb.state
	cb.mutex.RUnlock()

	if state == StateOpen {
		cb.mutex.RLock()
		timeSinceLastFailure := time.Since(cb.lastFailureTime)
		cb.mutex.RUnlock()

		if timeSinceLastFailure > cb.timeout {
			// Allow a trial request through
			cb.mutex.Lock()
			cb.state = StateHalfOpen
			cb.mutex.Unlock()
			state = StateHalfOpen
		} else {
			return "", fmt.Errorf("circuit breaker is open")
		}
	}

	result, err := cb.client.Get(ctx, key).Result()
	if err != nil {
		// redis.Nil is a cache miss, not an infrastructure failure
		if err != redis.Nil {
			cb.recordFailure()
		}
		return "", err
	}

	// A successful trial request closes the circuit again
	if state == StateHalfOpen {
		cb.reset()
	}

	return result, nil
}

func (cb *CacheCircuitBreaker) recordFailure() {
	cb.mutex.Lock()
	defer cb.mutex.Unlock()

	cb.failureCount++
	cb.lastFailureTime = time.Now()

	if cb.failureCount >= cb.failureThreshold {
		cb.state = StateOpen
	}
}

func (cb *CacheCircuitBreaker) reset() {
	cb.mutex.Lock()
	defer cb.mutex.Unlock()

	cb.state = StateClosed
	cb.failureCount = 0
}
```
Graceful Degradation:
```go
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"time"

	"golang.org/x/time/rate"
)

type FallbackService struct {
	circuitBreaker *CacheCircuitBreaker
	rateLimiter    *rate.Limiter
	db             Database
}

func NewFallbackService(cb *CacheCircuitBreaker, db Database) *FallbackService {
	// Rate limiter: 100 requests per second with a burst of 10
	limiter := rate.NewLimiter(rate.Limit(100), 10)

	return &FallbackService{
		circuitBreaker: cb,
		rateLimiter:    limiter,
		db:             db,
	}
}

func (s *FallbackService) GetDataWithFallback(ctx context.Context, key string) (interface{}, error) {
	// Try the cache through the circuit breaker
	cachedData, err := s.circuitBreaker.Get(ctx, key)
	if err == nil {
		var result interface{}
		if err := json.Unmarshal([]byte(cachedData), &result); err == nil {
			return result, nil
		}
	}

	// Fall back to the database, rate-limited so it can't be overwhelmed
	if !s.rateLimiter.Allow() {
		return nil, fmt.Errorf("rate limit exceeded for database fallback")
	}

	return s.getFromDatabaseWithRateLimit(ctx, key)
}

func (s *FallbackService) getFromDatabaseWithRateLimit(ctx context.Context, key string) (interface{}, error) {
	// Bound database calls with a timeout
	dbCtx, cancel := context.WithTimeout(ctx, 5*time.Second)
	defer cancel()

	return s.db.Get(dbCtx, key)
}
```
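Wiring these pieces together might look like the following sketch (the Redis address, key name, and the injected Database implementation are assumptions for illustration):

```go
// demo shows the read path end-to-end: cache reads flow through the
// circuit breaker while Redis is healthy; once the breaker opens,
// reads fall back to the rate-limited database path.
func demo(ctx context.Context, db Database) {
	rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})
	cb := NewCacheCircuitBreaker(rdb)
	svc := NewFallbackService(cb, db)

	data, err := svc.GetDataWithFallback(ctx, "user:42") // hypothetical key
	if err != nil {
		log.Printf("read failed (cache down and fallback rate-limited?): %v", err)
		return
	}
	log.Printf("got: %v", data)
}
```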
Best Practices for Robust Caching
- Monitor Cache Hit Rates: Maintain visibility into your cache performance (see the sketch after this list)
- Implement Cache Warming: Pre-populate cache with critical data
- Use Multi-Level Caching: Combine local and distributed caches
- Plan for Cache Invalidation: Design clear strategies for updating stale data
- Load Test Cache Failure Scenarios: Regularly test how your system behaves when caches fail
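As an example of the first point, Redis exposes cumulative hit and miss counters in its INFO stats, from which a hit rate can be derived. A minimal sketch using go-redis:

```go
package main

import (
	"context"
	"fmt"
	"strconv"
	"strings"

	"github.com/go-redis/redis/v8"
)

// hitRate derives the server-wide cache hit rate from Redis INFO stats.
// keyspace_hits and keyspace_misses are cumulative since server start,
// so production monitoring should sample deltas over time instead.
func hitRate(ctx context.Context, client *redis.Client) (float64, error) {
	info, err := client.Info(ctx, "stats").Result()
	if err != nil {
		return 0, err
	}

	// Parse "name:value" lines; non-numeric lines are skipped.
	stats := map[string]float64{}
	for _, line := range strings.Split(info, "\r\n") {
		if parts := strings.SplitN(line, ":", 2); len(parts) == 2 {
			if v, err := strconv.ParseFloat(parts[1], 64); err == nil {
				stats[parts[0]] = v
			}
		}
	}

	total := stats["keyspace_hits"] + stats["keyspace_misses"]
	if total == 0 {
		return 0, fmt.Errorf("no keyspace stats recorded yet")
	}
	return stats["keyspace_hits"] / total, nil
}
```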
Conclusion
Caching is incredibly powerful, but these four problems – Thundering Herd, Cache Penetration, Cache Breakdown, and Cache Crash – can turn your performance optimization into a performance nightmare. By understanding these issues and implementing the solutions we've discussed, you can build more resilient, performant systems that gracefully handle cache failures.
Note: The best cache strategy is one that works well both when the cache is healthy and when it's not. Plan for failure, monitor your systems, and always have fallback mechanisms in place.
Have you encountered any of these caching problems in your systems? Share your experiences and solutions in the comments below.