README.md: 73 additions & 51 deletions
@@ -55,23 +55,25 @@ Usage
"Hello World"
-------------

-    --- Make a dummy table
-    CREATE TABLE helloworld (
-        id      integer,
-        set     hll
-    );
+```sql
+--- Make a dummy table
+CREATE TABLE helloworld (
+    id      integer,
+    set     hll
+);

-    --- Insert an empty HLL
-    INSERT INTO helloworld(id, set) VALUES (1, hll_empty());
+--- Insert an empty HLL
+INSERT INTO helloworld(id, set) VALUES (1, hll_empty());

-    --- Add a hashed integer to the HLL
-    UPDATE helloworld SET set = hll_add(set, hll_hash_integer(12345)) WHERE id = 1;
+--- Add a hashed integer to the HLL
+UPDATE helloworld SET set = hll_add(set, hll_hash_integer(12345)) WHERE id = 1;

-    --- Or add a hashed string to the HLL
-    UPDATE helloworld SET set = hll_add(set, hll_hash_text('hello world')) WHERE id = 1;
+--- Or add a hashed string to the HLL
+UPDATE helloworld SET set = hll_add(set, hll_hash_text('hello world')) WHERE id = 1;

-    --- Get the cardinality of the HLL
-    SELECT hll_cardinality(set) FROM helloworld WHERE id = 1;
+--- Get the cardinality of the HLL
+SELECT hll_cardinality(set) FROM helloworld WHERE id = 1;
+```

Now with the silly stuff out of the way, here's a more realistic use case.

@@ -80,56 +82,70 @@ Data Warehouse Use Case

Let's assume I've got a fact table that records users' visits to my site, what they did, and where they came from. It's got hundreds of millions of rows. Table scans take minutes (or at least lots and lots of seconds).

-    CREATE TABLE facts (
-        date            date,
-        user_id         integer,
-        activity_type   smallint,
-        referrer        varchar(255)
-    );
+```sql
+CREATE TABLE facts (
+    date            date,
+    user_id         integer,
+    activity_type   smallint,
+    referrer        varchar(255)
+);
+```

I'd really like a quick (milliseconds) idea of how many unique users are visiting per day for my dashboard. No problem, let's set up an aggregate table:
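
The lines that define the aggregate table are collapsed in this diff view. A minimal sketch of what such a table and its load step might look like, assuming a `daily_uniques` table with `date` and `users` columns (the names the later queries rely on) and the `hll_add_agg` aggregate described under "Aggregate functions"; the `UNIQUE` constraint is an illustrative choice, not something the visible diff confirms:

```sql
-- Assumes the hll extension is installed and the facts table above exists.
-- Destination table: one hll of hashed user_ids per day.
CREATE TABLE daily_uniques (
    date   date UNIQUE,  -- illustrative constraint: one row per day
    users  hll
);

-- Hash each user_id, then fold the hashes into one hll per date.
INSERT INTO daily_uniques (date, users)
    SELECT date, hll_add_agg(hll_hash_integer(user_id))
    FROM facts
    GROUP BY 1;
```
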
We're first hashing the `user_id`, then aggregating those hashed values into one `hll` per day. Now we can ask for the cardinality of the `hll` for each day:

-    SELECT date, hll_cardinality(users) FROM daily_uniques;
+```sql
+SELECT date, hll_cardinality(users) FROM daily_uniques;
+```

You're probably thinking, "But I could have done this with `COUNT DISTINCT`!" And you're right, you could have. But then you only ever answer a single question: "How many unique users did I see each day?"
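
For comparison, the exact `COUNT DISTINCT` version of the daily question could be written roughly as follows. This is a sketch against the `facts` table defined earlier; the `uniques` alias is illustrative.

```sql
-- Plain PostgreSQL, no hll: exact per-day uniques straight from the fact table.
SELECT date, COUNT(DISTINCT user_id) AS uniques
FROM facts
GROUP BY date
ORDER BY date;
```

The per-day counts this returns cannot be added together to get weekly or monthly uniques, because a user who visits on two different days would be counted twice. An `hll` per day avoids that, since `hll`s can be unioned before the cardinality is taken.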
What if you wanted to see this week's uniques?

-    SELECT hll_cardinality(hll_union_agg(users)) FROM daily_uniques WHERE date >= '2012-01-02'::date AND date <= '2012-01-08'::date;
+```sql
+SELECT hll_cardinality(hll_union_agg(users)) FROM daily_uniques WHERE date >= '2012-01-02'::date AND date <= '2012-01-08'::date;
+```

Or the monthly uniques for this year?

-    SELECT EXTRACT(MONTH FROM date) AS month, hll_cardinality(hll_union_agg(users))
-    FROM daily_uniques
-    WHERE date >= '2012-01-01' AND
-          date < '2013-01-01'
-    GROUP BY 1;
+```sql
+SELECT EXTRACT(MONTH FROM date) AS month, hll_cardinality(hll_union_agg(users))
+FROM daily_uniques
+WHERE date >= '2012-01-01' AND
+      date < '2013-01-01'
+GROUP BY 1;
+```

Or how about a sliding window of uniques over the past 7 days (each day plus the 6 before it)?

-    SELECT date, #hll_union_agg(users) OVER seven_days
-    FROM daily_uniques
-    WINDOW seven_days AS (ORDER BY date ASC ROWS 6 PRECEDING);
+```sql
+SELECT date, #hll_union_agg(users) OVER seven_days
+FROM daily_uniques
+WINDOW seven_days AS (ORDER BY date ASC ROWS 6 PRECEDING);
+```

Or the number of uniques you saw yesterday that you didn't see today?

-    SELECT date, (#hll_union_agg(users) OVER two_days) - #users AS lost_uniques
-    FROM daily_uniques
-    WINDOW two_days AS (ORDER BY date ASC ROWS 1 PRECEDING);
+```sql
+SELECT date, (#hll_union_agg(users) OVER two_days) - #users AS lost_uniques
+FROM daily_uniques
+WINDOW two_days AS (ORDER BY date ASC ROWS 1 PRECEDING);
+```

These are just a few examples of the types of queries that would return in milliseconds in an `hll` world from a single aggregate, but would require either completely separate pre-built aggregates or self-joins or `generate_series` trickery in a `COUNT DISTINCT` world.
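
To make the contrast concrete, here is one way the seven-day sliding window might be written without `hll`, using `generate_series` and a correlated subquery. It is a sketch in plain PostgreSQL against the `facts` table; the date range and the `uniques` alias are illustrative assumptions.

```sql
-- Plain PostgreSQL, no hll: exact seven-day sliding uniques.
-- The date range is illustrative. Every output row re-reads up to
-- seven days of the fact table, which is what makes this slow at scale.
SELECT d::date AS date,
       (SELECT COUNT(DISTINCT f.user_id)
        FROM facts f
        WHERE f.date BETWEEN d::date - 6 AND d::date) AS uniques
FROM generate_series('2012-01-01'::date, '2012-01-31'::date, interval '1 day') AS d
ORDER BY 1;
```
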

@@ -278,23 +294,29 @@ Aggregate functions

If you want to create a `hll` from a table or result set, use `hll_add_agg`. The naming here isn't particularly creative: it's an **agg**regate function that **add**s the values to an empty `hll`.
The above example will give you a `hll` for each date that contains each day's users.
If you want to summarize a list of `hll`s that you already have stored into a single `hll`, use `hll_union_agg`. Again: it's an **agg**regate function that **union**s the values into an empty `hll`.

-    SELECT EXTRACT(MONTH FROM date), hll_cardinality(hll_union_agg(users))
Sliding windows are another prime example of the power of `hll`s. Doing sliding window unique counting typically involves some `generate_series` trickery, but it's quite simple with the `hll`s you've already computed for your roll-ups.

-    SELECT date, #hll_union_agg(users) OVER seven_days
-    FROM daily_uniques
-    WINDOW seven_days AS (ORDER BY date ASC ROWS 6 PRECEDING);
+```sql
+SELECT date, #hll_union_agg(users) OVER seven_days
+FROM daily_uniques
+WINDOW seven_days AS (ORDER BY date ASC ROWS 6 PRECEDING);