Calculating nested aggregates

Use case

Sometimes, there's a need to calculate a double aggregation over a fact table. For example, if you have a line_items table that has store_id, order_id, and sales columns, you might wonder what is the median of sales per product for each store.

With an ad-hoc SQL query, this double aggregation would probably be expressed as follows:

WITH sales_per_store_product AS (  SELECT store_id, product_id, SUM(sales) AS sales  FROM line_items  GROUP BY 1, 2 )   SELECT store_id, PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY sales) AS sales_median FROM sales_per_store_product GROUP BY 1

Data modeling

In Cube, measures are used to define aggregates. However, a single measure can only contain a single aggregation, e.g., SUM, APPROX_COUNT_DISTINCT, or PERCENTILE_CONT.

If you'd like to define a double aggregation, e.g., a median of a sum of values, the outer aggregation would need to be defined in a separate cube and the inner aggregation (measure) would need to be brought to that cube as a subquery dimension. Also, these cubes would need to have a join definition between them.

Consider the following data model:

cubes:  - name: nested_agg_sales  sql: |  SELECT 1 AS id, 1 AS store_id, 1 AS product_id, 10 AS sales UNION ALL  SELECT 2 AS id, 1 AS store_id, 1 AS product_id, 20 AS sales UNION ALL  SELECT 3 AS id, 1 AS store_id, 2 AS product_id, 30 AS sales UNION ALL  SELECT 4 AS id, 1 AS store_id, 2 AS product_id, 40 AS sales UNION ALL  SELECT 5 AS id, 2 AS store_id, 1 AS product_id, 50 AS sales UNION ALL  SELECT 6 AS id, 2 AS store_id, 1 AS product_id, 60 AS sales UNION ALL  SELECT 7 AS id, 2 AS store_id, 2 AS product_id, 70 AS sales UNION ALL  SELECT 8 AS id, 2 AS store_id, 2 AS product_id, 80 AS sales    dimensions:  - name: id  sql: id  type: number  primary_key: true    - name: store_id  sql: store_id  type: number    - name: product_id  sql: product_id  type: number    - name: store_product_id  sql: "CONCAT({store_id}, '-', {product_id})"  type: string    measures:  - name: sales  sql: sales  type: sum    - name: nested_agg_stores_orders  sql: |  SELECT store_id, product_id  FROM (  SELECT 1 AS id, 1 AS store_id, 1 AS product_id, 10 AS sales UNION ALL  SELECT 2 AS id, 1 AS store_id, 1 AS product_id, 20 AS sales UNION ALL  SELECT 3 AS id, 1 AS store_id, 2 AS product_id, 30 AS sales UNION ALL  SELECT 4 AS id, 1 AS store_id, 2 AS product_id, 40 AS sales UNION ALL  SELECT 5 AS id, 2 AS store_id, 1 AS product_id, 50 AS sales UNION ALL  SELECT 6 AS id, 2 AS store_id, 1 AS product_id, 60 AS sales UNION ALL  SELECT 7 AS id, 2 AS store_id, 2 AS product_id, 70 AS sales UNION ALL  SELECT 8 AS id, 2 AS store_id, 2 AS product_id, 80 AS sales  ) AS raw  GROUP BY 1, 2    joins:  - name: nested_agg_sales  sql: "{nested_agg_stores_orders.store_product_id} = {nested_agg_sales.store_product_id}"  relationship: one_to_many    dimensions:  - name: store_id  sql: store_id  type: number    - name: product_id  sql: product_id  type: number    - name: store_product_id  sql: "CONCAT({store_id}, '-', {product_id})"  type: string  primary_key: true    - name: sales_sum  sql: "{nested_agg_sales.sales}"  type: number  sub_query: true    measures:  - name: median_sales  sql: "PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY {sales_sum})"  type: number

As you can see, the sum of sales for per store and per product is defined in the nested_agg_sales cube as the sales measure. Then, it is brought to the nested_agg_stores_orders cube as sales_sum that is defined as a subquery dimension. Also, a join is defined between both cubes.

Then, the median of sales is defined as the median_sales measure in the nested_agg_stores_orders cube. It’s OK to reference sales_sum in this measure because now it's a dimension; referencing a measure from another cube here would not work.

Result

Querying the median_sales measure would give the expected result:

We can verify that it's correct by adding one more dimension to the query:

Averages and percentiles Filtered aggregates