DEV Community

Cover image for How to Find Top-K Trending Hashtags from a Stream of Tweets Using Count-Min Sketch
VipraTech Solutions for VipraTech Solutions

Posted on • Edited on

How to Find Top-K Trending Hashtags from a Stream of Tweets Using Count-Min Sketch

When working with massive streams of data like hashtags from tweets, storing and processing every individual hashtag becomes impractical due to high memory usage and real-time constraints.

👉 The goal:

Find the top-K trending hashtags in real time, from a continuous stream of tweets.


✅ Problem Statement

A large-scale tweet stream sends thousands of hashtags per second.

Your task: Continuously identify the most popular hashtags trending right now.

✅ Challenges:

  • Huge volume of hashtags makes exact counting infeasible.
  • Limited memory and need for fast processing.
  • Real-time approximate results with good accuracy.

🧐 Two Practical Approaches

1️⃣ Multiple CMS Time Window Approach

✔️ What It Solves:

Enables answering questions like:

  • “What are the top trending hashtags in the last 15 minutes?”
  • “What are the top trending hashtags in the last 30 minutes?”
  • “What are the top trending hashtags in the last 1 hour?”

✔️ How It Works:

  • Maintain multiple CMS instances, one per fixed time window (e.g., one CMS per minute).
  • Keep only the latest N CMS instances corresponding to the desired time window (sliding window).
  • At query time, sum counts across the relevant CMS instances.

✅ Java Implementation Example:

public class SlidingWindowCMS { private final int windowSize; private final LinkedList<CountMinSketch> cmsList; public SlidingWindowCMS(int windowSize, int rows, int cols) { this.windowSize = windowSize; this.cmsList = new LinkedList<>(); } public void addNewMinuteCMS(CountMinSketch newCms) { if (cmsList.size() >= windowSize) { cmsList.removeFirst(); } cmsList.addLast(newCms); } public int query(String key) { int totalCount = 0; for (CountMinSketch cms : cmsList) { totalCount += cms.count(key); } return totalCount; } } 
Enter fullscreen mode Exit fullscreen mode

2️⃣ Decaying Count Approach

✔️ What It Solves:

Answers questions such as:

  • “What are the currently trending hashtags, giving more weight to recent data?”

✔️ How It Works:

  • Use a single CMS instance.
  • Periodically apply a decay factor to all counters (e.g., multiply by 0.99 every minute).
  • Recent hashtags remain significant while older counts fade automatically.
  • Eliminates the need for storing multiple CMS instances.

✅ Java Implementation Example:

public class DecayingCMS { private final CountMinSketch cms; private final double decayFactor; public DecayingCMS(int rows, int cols, double decayFactor) { this.cms = new CountMinSketch(rows, cols); this.decayFactor = decayFactor; } public void add(String key) { cms.add(key); } public int query(String key) { return cms.count(key); } public void applyDecay() { cms.applyDecay(decayFactor); } } 
Enter fullscreen mode Exit fullscreen mode

✅ Count-Min Sketch Implementation Example:

public class CountMinSketch { private final int[][] table; private final int[] seeds; private final int rows; private final int cols; private final Random rand = new Random(); public CountMinSketch(int rows, int cols) { this.rows = rows; this.cols = cols; this.table = new int[rows][cols]; this.seeds = new int[rows]; for (int i = 0; i < rows; i++) seeds[i] = rand.nextInt(); } public void add(String key) { for (int i = 0; i < rows; i++) { int index = Math.abs(hash(key, seeds[i]) % cols); table[i][index]++; } } public int count(String key) { int min = Integer.MAX_VALUE; for (int i = 0; i < rows; i++) { int index = Math.abs(hash(key, seeds[i]) % cols); min = Math.min(min, table[i][index]); } return min; } public void applyDecay(double factor) { for (int i = 0; i < rows; i++) for (int j = 0; j < cols; j++) table[i][j] *= factor; } private int hash(String key, int seed) { int hash = 0; for (char c : key.toCharArray()) { hash = hash * seed + c; } return hash; } } 
Enter fullscreen mode Exit fullscreen mode

✅ Conclusion

When processing streams of hashtags in real time:

  • ✅ Use Multiple CMS Time Windows if you need exact top-K in fixed time windows (e.g., last 15 min, 1 hour).
  • ✅ Use Decaying Counts CMS for smooth, approximate real-time trending insights without separate windows.

Both approaches solve real-world problems depending on your requirements.

👉 For a deeper overview of how Count-Min Sketch works in general, check out our detailed CMS overview 👉 Count-Min Sketch Overview.

Top comments (0)