Lambda pre-aggregations
Lambda pre-aggregations follow the Lambda architecture design to union real-time and batch data. Cube acts as a serving layer and uses pre-aggregations as a batch layer and source data or other pre-aggregations, usually streaming, as a speed layer. Due to this design, lambda pre-aggregations only work with data that is newer than the existing batched pre-aggregations.
Lambda pre-aggregations only work with Cube Store.
Use cases
Below, we look at the most common examples of using lambda pre-aggregations.
Batch and source data
In this scenario, batch data comes from a pre-aggregation and real-time data comes from the data source.
First, you need to create a pre-aggregation that will contain your batch data. In the following example, we call it batch. Please note that it must have time_dimension and partition_granularity specified. Cube will use these properties to union batch data with freshly retrieved source data.

You may also control the batch part of your data with the build_range_start and build_range_end properties of a pre-aggregation to determine a specific window for your batched data.
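For instance, the batch layer can be kept rolling up to roughly the start of the current day, letting the speed layer cover everything newer. The snippet below is only a sketch: the users cube and its members are assumed, and the exact SQL expression for the build range end depends on your database dialect.

```yaml
cubes:
  - name: users
    # ...

    pre_aggregations:
      - name: batch
        measures:
          - CUBE.count
        dimensions:
          - CUBE.name
        time_dimension: CUBE.created_at
        granularity: day
        partition_granularity: day
        build_range_start:
          sql: SELECT '2020-01-01'
        build_range_end:
          # Dialect-dependent; shown here for a Postgres-like database
          sql: SELECT NOW() - INTERVAL '1 day'
```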
Next, you need to create a lambda pre-aggregation. To do that, create a pre-aggregation with the rollup_lambda type, specify the rollups you would like to use with the rollups property, and finally set union_with_source_data: true to use source data as the real-time layer.

Please make sure that the lambda pre-aggregation definition comes first when defining your pre-aggregations.
```yaml
cubes:
  - name: users
    # ...

    pre_aggregations:
      - name: lambda
        type: rollup_lambda
        union_with_source_data: true
        rollups:
          - CUBE.batch

      - name: batch
        measures:
          - users.count
        dimensions:
          - users.name
        time_dimension: users.created_at
        granularity: day
        partition_granularity: day
        build_range_start:
          sql: SELECT '2020-01-01'
        build_range_end:
          sql: SELECT '2022-05-30'
```
Batch and streaming data
In this scenario, batch data comes from one pre-aggregation and real-time data comes from a streaming pre-aggregation.
You can use lambda pre-aggregations to combine data from multiple pre-aggregations, where one pre-aggregation holds batch data and another holds streaming data. Please note that the build ranges of all rollups referenced by a lambda rollup should overlap enough to account for the partition build times of those rollups. Cube will maximize the coverage of the requested date range with partitions from different rollups: the first rollup in the list of referenced rollups that has a fully built partition for a particular date range will be used to serve that date range, and the last rollup in the list will be used to cover the remaining, uncovered part of the date range. Partitions of the last rollup will be used even if they are not completely built.
```yaml
cubes:
  - name: streaming_users
    # This cube uses a streaming SQL data source such as ksqlDB
    # ...

    pre_aggregations:
      - name: streaming
        type: rollup
        measures:
          - CUBE.count
        dimensions:
          - CUBE.name
        time_dimension: CUBE.created_at
        granularity: day
        partition_granularity: day

  - name: users
    # This cube uses a data source such as ClickHouse or BigQuery
    # ...

    pre_aggregations:
      - name: batch_streaming_lambda
        type: rollup_lambda
        rollups:
          - users.batch
          - streaming_users.streaming

      - name: batch
        type: rollup
        measures:
          - users.count
        dimensions:
          - users.name
        time_dimension: users.created_at
        granularity: day
        partition_granularity: day
        build_range_start:
          sql: SELECT '2020-01-01'
        build_range_end:
          sql: SELECT '2022-05-30'
```
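The definition of the streaming_users cube itself is elided above. As a rough sketch only, assuming a ksqlDB data source named ksql and a stream called users_stream (both hypothetical names, configured separately in your Cube deployment), such a cube could look like this:

```yaml
cubes:
  - name: streaming_users
    # Hypothetical stream and data source names; adjust to your deployment
    sql: SELECT * FROM users_stream
    data_source: ksql

    measures:
      - name: count
        type: count

    dimensions:
      - name: name
        sql: name
        type: string

      - name: created_at
        sql: created_at
        type: time

    # The `streaming` pre-aggregation from the example above would be
    # defined here under `pre_aggregations`
```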