Tiered Storages in ClickHouse
Quoting from the MergeTree documentation:
MergeTree family table engines can store data on multiple block devices. For example, it can be useful when the data of a certain table are implicitly split into “hot” and “cold”. The most recent data is regularly requested but requires only a small amount of space. On the contrary, the fat-tailed historical data is requested rarely.
When used with remote storage backends such as Amazon S3, this makes a very efficient storage scheme. ClickHouse supports storage policies, which let data stay on local disks for a period of time and then move to object storage.
An example configuration can look like this:
<storage_configuration>
    <disks>
        <fast_ssd>
            <path>/mnt/fast_ssd/clickhouse/</path>
        </fast_ssd>
        <gcs>
            <support_batch_delete>false</support_batch_delete>
            <type>s3</type>
            <endpoint>https://storage.googleapis.com/${BUCKET_NAME}/${ROOT_FOLDER}/</endpoint>
            <access_key_id>${SERVICE_ACCOUNT_HMAC_KEY}</access_key_id>
            <secret_access_key>${SERVICE_ACCOUNT_HMAC_SECRET}</secret_access_key>
            <metadata_path>/var/lib/clickhouse/disks/gcs/</metadata_path>
        </gcs>
        ...
    </disks>
    ...
    <policies>
        <move_from_local_disks_to_gcs> <!-- policy name -->
            <volumes>
                <hot> <!-- volume name -->
                    <disk>fast_ssd</disk> <!-- disk name -->
                </hot>
                <cold>
                    <disk>gcs</disk>
                </cold>
            </volumes>
            <move_factor>0.2</move_factor>
            <!-- The move factor determines when to move data from hot volume to cold.
                 See ClickHouse docs for more details. -->
        </move_from_local_disks_to_gcs>
        ....
    </policies>
</storage_configuration>
In this storage policy, two volumes are defined: hot and cold. Once the free space on the hot volume drops below disk_size * move_factor (with a move_factor of 0.2, once the volume is more than 80% full), ClickHouse starts moving data to Google Cloud Storage (GCS).
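To confirm that the policy was loaded and to see how much free space the hot disk has, you can query the ClickHouse system tables. The following is a minimal sketch, assuming the policy and disk names from the example configuration above. The free_ratio alias is introduced here for illustration only.

```sql
-- List the volumes, their disks, and the move factor of the example policy.
SELECT policy_name, volume_name, volume_priority, disks, move_factor
FROM system.storage_policies
WHERE policy_name = 'move_from_local_disks_to_gcs';

-- Show free and total space per disk. free_ratio is an illustrative alias
-- that can be compared against the policy's move_factor.
-- Object storage disks may report very large or zero totals,
-- so the ratio is mainly meaningful for local disks such as fast_ssd.
SELECT
    name,
    path,
    formatReadableSize(free_space) AS free,
    formatReadableSize(total_space) AS total,
    round(free_space / total_space, 2) AS free_ratio
FROM system.disks;
```

If the first query returns no rows, the server has most likely not picked up the storage configuration.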
If this storage policy is not the default, attach it explicitly when you create a table. For example:
CREATE TABLE key_value_table (
    event_date Date,
    key String,
    value String
)
ENGINE = MergeTree
ORDER BY (key)
PARTITION BY toYYYYMM(event_date)
SETTINGS storage_policy = 'move_from_local_disks_to_gcs'
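After the table exists, you may want to check which policy it uses and where its data parts currently live. The sketch below assumes the table was created in the current database. It relies on the system.tables and system.parts system tables.

```sql
-- Confirm which storage policy the table was created with.
SELECT name, storage_policy
FROM system.tables
WHERE database = currentDatabase() AND name = 'key_value_table';

-- Summarize active data parts per partition and disk,
-- to see which partitions are on fast_ssd and which are on gcs.
SELECT
    partition,
    disk_name,
    count() AS parts,
    formatReadableSize(sum(bytes_on_disk)) AS size
FROM system.parts
WHERE database = currentDatabase()
  AND table = 'key_value_table'
  AND active
GROUP BY partition, disk_name
ORDER BY partition, disk_name;
```

Moves happen in the background, so the second query can lag behind the state you expect.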
Data can also be tiered with table TTLs, which the MergeTree table engine supports as well.
The ClickHouse documentation covers this feature in detail in its example of implementing a hot-warm-cold architecture.
You can take a similar approach for the example shown above. First, adjust the storage policy:
<storage_configuration>
    ...
    <policies>
        <local_disk_and_gcs> <!-- policy name -->
            <volumes>
                <hot> <!-- volume name -->
                    <disk>fast_ssd</disk> <!-- disk name -->
                </hot>
                <cold>
                    <disk>gcs</disk>
                </cold>
            </volumes>
        </local_disk_and_gcs>
        ....
    </policies>
</storage_configuration>
Then create the table as:
CREATE TABLE another_key_value_table (
    event_date Date,
    key String,
    value String
)
ENGINE = MergeTree
ORDER BY (key)
PARTITION BY toYYYYMM(event_date)
TTL
    event_date TO VOLUME 'hot',
    event_date + INTERVAL 1 YEAR TO VOLUME 'cold'
SETTINGS storage_policy = 'local_disk_and_gcs';
This creates the table so that data older than 1 year (evaluated against the event_date column) is moved to GCS. Such a storage policy can be helpful for append-only tables (like audit events) where only the most recent data is accessed frequently. You can also use TTLs to drop the data altogether, which can be a regulatory requirement.
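Dropping old data can be expressed in the same TTL clause with a DELETE rule. The following is a sketch only: the table name audit_events_with_retention and the five-year retention period are hypothetical and not part of the original example.

```sql
-- Hypothetical table: rows move to the cold volume after 1 year
-- and are deleted after 5 years (assumed retention period).
CREATE TABLE audit_events_with_retention
(
    event_date Date,
    key String,
    value String
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(event_date)
ORDER BY (key)
TTL
    event_date + INTERVAL 1 YEAR TO VOLUME 'cold',
    event_date + INTERVAL 5 YEAR DELETE
SETTINGS storage_policy = 'local_disk_and_gcs';
```

ClickHouse applies TTL rules in the background during merges, so expired rows may not disappear at the exact moment they cross the threshold.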
This guide does not cover modifying TTLs, but that is possible as well. See the ClickHouse documentation on modifying TTL for details.