When working with massive streams of data like hashtags from tweets, storing and processing every individual hashtag becomes impractical due to high memory usage and real-time constraints.
👉 The goal:
Find the top-K trending hashtags in real time, from a continuous stream of tweets.
✅ Problem Statement
A large-scale tweet stream sends thousands of hashtags per second.
Your task: Continuously identify the most popular hashtags trending right now.
✅ Challenges:
- Huge volume of hashtags makes exact counting infeasible.
- Limited memory and need for fast processing.
- Real-time approximate results with good accuracy.
🧐 Two Practical Approaches
1️⃣ Multiple CMS Time Window Approach
✔️ What It Solves:
Enables answering questions like:
- “What are the top trending hashtags in the last 15 minutes?”
- “What are the top trending hashtags in the last 30 minutes?”
- “What are the top trending hashtags in the last 1 hour?”
✔️ How It Works:
- Maintain multiple CMS instances, one per fixed time window (e.g., one CMS per minute).
- Keep only the latest N CMS instances corresponding to the desired time window (sliding window).
- At query time, sum counts across the relevant CMS instances.
✅ Java Implementation Example:
public class SlidingWindowCMS { private final int windowSize; private final LinkedList<CountMinSketch> cmsList; public SlidingWindowCMS(int windowSize, int rows, int cols) { this.windowSize = windowSize; this.cmsList = new LinkedList<>(); } public void addNewMinuteCMS(CountMinSketch newCms) { if (cmsList.size() >= windowSize) { cmsList.removeFirst(); } cmsList.addLast(newCms); } public int query(String key) { int totalCount = 0; for (CountMinSketch cms : cmsList) { totalCount += cms.count(key); } return totalCount; } } 2️⃣ Decaying Count Approach
✔️ What It Solves:
Answers questions such as:
- “What are the currently trending hashtags, giving more weight to recent data?”
✔️ How It Works:
- Use a single CMS instance.
- Periodically apply a decay factor to all counters (e.g., multiply by 0.99 every minute).
- Recent hashtags remain significant while older counts fade automatically.
- Eliminates the need for storing multiple CMS instances.
✅ Java Implementation Example:
public class DecayingCMS { private final CountMinSketch cms; private final double decayFactor; public DecayingCMS(int rows, int cols, double decayFactor) { this.cms = new CountMinSketch(rows, cols); this.decayFactor = decayFactor; } public void add(String key) { cms.add(key); } public int query(String key) { return cms.count(key); } public void applyDecay() { cms.applyDecay(decayFactor); } } ✅ Count-Min Sketch Implementation Example:
public class CountMinSketch { private final int[][] table; private final int[] seeds; private final int rows; private final int cols; private final Random rand = new Random(); public CountMinSketch(int rows, int cols) { this.rows = rows; this.cols = cols; this.table = new int[rows][cols]; this.seeds = new int[rows]; for (int i = 0; i < rows; i++) seeds[i] = rand.nextInt(); } public void add(String key) { for (int i = 0; i < rows; i++) { int index = Math.abs(hash(key, seeds[i]) % cols); table[i][index]++; } } public int count(String key) { int min = Integer.MAX_VALUE; for (int i = 0; i < rows; i++) { int index = Math.abs(hash(key, seeds[i]) % cols); min = Math.min(min, table[i][index]); } return min; } public void applyDecay(double factor) { for (int i = 0; i < rows; i++) for (int j = 0; j < cols; j++) table[i][j] *= factor; } private int hash(String key, int seed) { int hash = 0; for (char c : key.toCharArray()) { hash = hash * seed + c; } return hash; } } ✅ Conclusion
When processing streams of hashtags in real time:
- ✅ Use Multiple CMS Time Windows if you need exact top-K in fixed time windows (e.g., last 15 min, 1 hour).
- ✅ Use Decaying Counts CMS for smooth, approximate real-time trending insights without separate windows.
Both approaches solve real-world problems depending on your requirements.
👉 For a deeper overview of how Count-Min Sketch works in general, check out our detailed CMS overview 👉 Count-Min Sketch Overview.
Top comments (0)