Better Than You Think JSON Data in ClickHouse Robert Hodges Altinity 1 V 2021-09
© 2021 Altinity, Inc. Presenter and Company Bio www.altinity.com Enterprise provider for ClickHouse, a popular, open source data warehouse. We run Altinity.Cloud, the first managed ClickHouse in AWS and GCP Robert Hodges - Altinity CEO 30+ years on DBMS plus virtualization and security. Using ClickHouse since 2019. 2
© 2021 Altinity, Inc. JSON is pervasive as raw data Web site access log in JSON format: { "@timestamp": 895873059, "clientip": "54.72.5.0", "request": "GET /images/home_bg_stars.gif HTTP/1.1", "status": 200, "size": 2557 }
© 2021 Altinity, Inc. How to model JSON data in ClickHouse SQL Table Tabular: every key is a column SQL Table Array of Keys Arrays: Header values with key-value pairs Array of Values SQL Table JSON Blob JSON: Header values with JSON string (“blob”)
© 2021 Altinity, Inc. Simplest design maps all keys to columns CREATE TABLE http_logs_tabular ( `@timestamp` DateTime, `clientip` IPv4, `status` UInt16, `request` String, `size` UInt32 ) ENGINE = MergeTree PARTITION BY toStartOfDay(`@timestamp`) ORDER BY `@timestamp` SETTINGS index_granularity = 8192
© 2021 Altinity, Inc. Loading *might* be this easy head http_logs.json {"@timestamp": 895873059, "clientip":"54.72.5.0", "request": "GET /images/home_bg_stars.gif HTTP/1.1", "status": 200, "size": 2557} {"@timestamp": 895873059, "clientip":"53.72.5.0", "request": "GET /images/home_tool.gif HTTP/1.0", "status": 200, "size": 327} ... clickhouse-client --query 'INSERT INTO http_logs_tabular Format JSONEachRow' < http_logs_tabular
© 2021 Altinity, Inc. But what happens if... You don’t know the columns up front? JSON keys and values differ between records? JSON values are dirty or don’t map trivially to columns? 7 How can I handle that in ClickHouse??
© 2021 Altinity, Inc. Start by storing the JSON as text CREATE TABLE http_logs ( `file` String, `message` String ) ENGINE = MergeTree PARTITION BY file ORDER BY tuple() SETTINGS index_granularity = 8192 “Blob”
© 2021 Altinity, Inc. Load data in whatever way is easiest... head http_logs.csv "file","message" "documents-211998.json","{""@timestamp"": 895873059, ""clientip"":""54.72.5.0"", ""request"": ""GET /images/home_bg_stars.gif HTTP/1.1"", ""status"": 200, ""size"": 2557}" "documents-211998.json","{""@timestamp"": 895873059, ""clientip"":""53.72.5.0"", ""request"": ""GET /images/home_tool.gif HTTP/1.0"", ""status"": 200, ""size"": 327}" ... clickhouse-client --query 'INSERT INTO http_logs Format CSVWithNames' < http_logs.csv
© 2021 Altinity, Inc. You can query using JSON* functions -- Get a JSON string value SELECT JSONExtractString(message, 'request') AS request FROM http_logs LIMIT 3 -- Get a JSON numeric value SELECT JSONExtractInt(message, 'status') AS status FROM http_logs LIMIT 3 -- Use values to answer useful questions. SELECT JSONExtractInt(message, 'status') AS status, count() as count FROM http_logs WHERE status >= 400 WHERE toDateTime(JSONExtractUInt32(message, '@timestamp') BETWEEN '1998-05-20 00:00:00' AND '1998-05-20 23:59:59' GROUP BY status ORDER BY status
© 2021 Altinity, Inc. -- Get using JSON function SELECT JSONExtractString(message, 'request') FROM http_logs LIMIT 3 -- Get it with proper type. SELECT visitParamExtractString(message, 'request') FROM http_logs LIMIT 3 JSON* vs visitParam functions SLOWER Complete JSON parser FASTER But cannot distinguish same name in different structures
© 2021 Altinity, Inc. We can improve usability by ordering data CREATE TABLE http_logs_sorted ( `file` String, `message` String, timestamp DateTime DEFAULT toDateTime(JSONExtractUInt(message, '@timestamp')) ) ENGINE = MergeTree PARTITION BY toStartOfMonth(timestamp) ORDER BY timestamp INSERT INTO http_logs_sorted SELECT file, message FROM http_logs
© 2021 Altinity, Inc. And still further by adding more columns ALTER TABLE http_logs_sorted ADD COLUMN `status` Int16 DEFAULT JSONExtractInt(message, 'status') CODEC(ZSTD(1)) ALTER TABLE http_logs_sorted ADD COLUMN `request` String DEFAULT JSONExtractString(message, 'request') -- Force columns to be materialized ALTER TABLE http_logs_sorted UPDATE status=status, request=request WHERE 1
© 2021 Altinity, Inc. Our query is now simpler... SELECT status, count() as count FROM http_logs_sorted WHERE status >= 400 AND timestamp BETWEEN '1998-05-20 00:00:00' AND '1998-05-20 23:59:59' GROUP BY status ORDER BY status
© 2021 Altinity, Inc. And MUCH faster! SELECT status, count() as count FROM http_logs_sorted WHERE status >= 400 AND timestamp BETWEEN '1998-05-20 00:00:00' AND '1998-05-20 23:59:59' GROUP BY status ORDER BY status 0.014 seconds vs 9.8 seconds! Can use primary key index to drop blocks 100x less I/O to read
© 2021 Altinity, Inc. Extracted columns are like indexes 16
© 2021 Altinity, Inc. (Here’s how we got that column size data) -- System tables are your friends SELECT table, name, data_compressed_bytes, formatReadableSize(data_compressed_bytes) AS tc, formatReadableSize(data_uncompressed_bytes) AS tu, data_compressed_bytes / data_uncompressed_bytes AS ratio, type, compression_codec FROM system.columns WHERE database = currentDatabase() AND table LIKE 'http%' ORDER BY table, name
© 2021 Altinity, Inc. Another way to store JSON objects: Maps CREATE TABLE http_logs_map ( `file` String, `message` Map(String, String), timestamp DateTime DEFAULT toDateTime(toUInt32(message['@timestamp'])) CODEC(Delta, ZSTD(1)) ) ENGINE = MergeTree PARTITION BY toStartOfMonth(timestamp) ORDER BY timestamp
© 2021 Altinity, Inc. Loading and querying JSON in Maps -- Load data INSERT into http_logs_map(file, message) SELECT file, JSONExtractKeysAndValues(message, 'String') message FROM http_logs -- Run a query. SELECT message['status'] status, count() FROM http_logs_map GROUP BY status ORDER BY status 4-5x faster than accessing JSON string objects
© 2021 Altinity, Inc. Storing JSON in paired arrays CREATE TABLE http_logs_arrays ( `file` String, `keys` Array(String), `values` Array(String), timestamp DateTime CODEC(Delta, ZSTD(1)) ) ENGINE = MergeTree PARTITION BY toStartOfMonth(timestamp) ORDER BY timestamp
© 2021 Altinity, Inc. Querying values in arrays -- Run a query. SELECT values[indexOf(keys, 'status')] status, count() FROM http_logs_arrays GROUP BY status ORDER BY status status|count() | ------|--------| 200 |24917090| 206 | 64935| 302 | 1941| 304 | 4899616| 400 | 888| 404 | 115005| 500 | 525| 4-5x faster than accessing JSON string objects
© 2021 Altinity, Inc. So... What do paired arrays look like? SELECT * FROM http_logs_arrays LIMIT 3 Row 1: ────── file: documents-211998.json keys: ['@timestamp','clientip','request','status','size'] values: ['895435201','30.20.6.0','GET /french/index.html HTTP/1.0','200','954'] timestamp: 1998-05-17 20:00:01 ... Key Value
© 2021 Altinity, Inc. Loading JSON to paired arrays -- Load data. Might be better to format outside ClickHouse. INSERT into http_logs_arrays(file, keys, values, timestamp) SELECT file, arrayMap(x -> x.1, JSONExtractKeysAndValues(message, 'String')) keys, arrayMap(x -> x.2, JSONExtractKeysAndValues(message, 'String')) values, toDateTime(JSONExtractUInt(message, '@timestamp')) timestamp FROM http_logs limit 30000000
© 2021 Altinity, Inc. Must we always copy between tables? 24 It seems painful. Surely there is a better way!
© 2021 Altinity, Inc. Use materialized views to help with loading Nginx Logs Web Logs http_logs_etl http_logs_arrays MV INSERT INTO Enrich data with materialized view
© 2021 Altinity, Inc. Create the base ETL table CREATE TABLE http_logs_etl ( `file` String, `message` String ) ENGINE = Null Does not store data but triggers materialized views
© 2021 Altinity, Inc. Create the target table (as before…) CREATE TABLE http_logs_arrays ( `file` String, `keys` Array(String), `values` Array(String), timestamp DateTime CODEC(Delta, ZSTD(1)) ) ENGINE = MergeTree PARTITION BY toStartOfMonth(timestamp) ORDER BY timestamp
© 2021 Altinity, Inc. Create materialized view for ETL -- Fires whenever a block arrives at http_logs_etl. CREATE MATERIALIZED VIEW http_logs_etl_mv TO http_logs_arrays AS SELECT file, arrayMap(x -> x.1, JSONExtractKeysAndValues(message, 'String')) keys, arrayMap(x -> x.2, JSONExtractKeysAndValues(message, 'String')) values, toDateTime(JSONExtractUInt(message, '@timestamp')) timestamp FROM http_logs_etl Add WHERE conditions to filter data Do any transforms you like!
© 2021 Altinity, Inc. Now load data and have at it... -- Clear the target table to start from scratch. TRUNCATE TABLE http_logs_arrays -- Load data via ETL table. INSERT INTO http_logs_etl(file, message) SELECT file, message FROM http_logs LIMIT 1000000 -- Confirm rows arrived in the arrays table. SELECT count() FROM http_logs_arrays count()| -------| 1000000| New data are instantly queryable
© 2021 Altinity, Inc. Roadmap and more information 30
© 2021 Altinity, Inc. ClickHouse JSON support is not perfect Complex JSON is hard to access! SELECT JSONExtractUInt( JSONExtractRaw( JSONExtractRaw( '{"a": {"b": {"c": 1}}}', 'a'), 'b'), 'c') AS anInt JSON performance is hard to optimize! 31
© 2021 Altinity, Inc. What’s coming to help? Semi-structured data! ClickHouse will map JSON structure to columnar storage and provide convenient query syntax https://github.com/ClickHouse/ClickHouse/issues/17623 32
© 2021 Altinity, Inc. You can go a long way in the meantime! Lots of options to represent JSON in ClickHouse tables ● Transform fully to columns ● Store JSON blob and add columns with indexes as needed ● Transform to Map ● Transform to paired arrays Load JSON blobs to staging table and transform with materialized views Mix and match to process JSON in a way that works best for your apps! 33 Fan Favorite
© 2021 Altinity, Inc. Questions? Thank you! Altinity https://altinity.com ClickHouse https://github.com/ClickH ouse/ClickHouse Altinity.Cloud https://altinity.com/cloud- database/ We are hiring! 34

Better than you think: Handling JSON data in ClickHouse

  • 1.
    Better Than YouThink JSON Data in ClickHouse Robert Hodges Altinity 1 V 2021-09
  • 2.
    © 2021 Altinity,Inc. Presenter and Company Bio www.altinity.com Enterprise provider for ClickHouse, a popular, open source data warehouse. We run Altinity.Cloud, the first managed ClickHouse in AWS and GCP Robert Hodges - Altinity CEO 30+ years on DBMS plus virtualization and security. Using ClickHouse since 2019. 2
  • 3.
    © 2021 Altinity,Inc. JSON is pervasive as raw data Web site access log in JSON format: { "@timestamp": 895873059, "clientip": "54.72.5.0", "request": "GET /images/home_bg_stars.gif HTTP/1.1", "status": 200, "size": 2557 }
  • 4.
    © 2021 Altinity,Inc. How to model JSON data in ClickHouse SQL Table Tabular: every key is a column SQL Table Array of Keys Arrays: Header values with key-value pairs Array of Values SQL Table JSON Blob JSON: Header values with JSON string (“blob”)
  • 5.
    © 2021 Altinity,Inc. Simplest design maps all keys to columns CREATE TABLE http_logs_tabular ( `@timestamp` DateTime, `clientip` IPv4, `status` UInt16, `request` String, `size` UInt32 ) ENGINE = MergeTree PARTITION BY toStartOfDay(`@timestamp`) ORDER BY `@timestamp` SETTINGS index_granularity = 8192
  • 6.
    © 2021 Altinity,Inc. Loading *might* be this easy head http_logs.json {"@timestamp": 895873059, "clientip":"54.72.5.0", "request": "GET /images/home_bg_stars.gif HTTP/1.1", "status": 200, "size": 2557} {"@timestamp": 895873059, "clientip":"53.72.5.0", "request": "GET /images/home_tool.gif HTTP/1.0", "status": 200, "size": 327} ... clickhouse-client --query 'INSERT INTO http_logs_tabular Format JSONEachRow' < http_logs_tabular
  • 7.
    © 2021 Altinity,Inc. But what happens if... You don’t know the columns up front? JSON keys and values differ between records? JSON values are dirty or don’t map trivially to columns? 7 How can I handle that in ClickHouse??
  • 8.
    © 2021 Altinity,Inc. Start by storing the JSON as text CREATE TABLE http_logs ( `file` String, `message` String ) ENGINE = MergeTree PARTITION BY file ORDER BY tuple() SETTINGS index_granularity = 8192 “Blob”
  • 9.
    © 2021 Altinity,Inc. Load data in whatever way is easiest... head http_logs.csv "file","message" "documents-211998.json","{""@timestamp"": 895873059, ""clientip"":""54.72.5.0"", ""request"": ""GET /images/home_bg_stars.gif HTTP/1.1"", ""status"": 200, ""size"": 2557}" "documents-211998.json","{""@timestamp"": 895873059, ""clientip"":""53.72.5.0"", ""request"": ""GET /images/home_tool.gif HTTP/1.0"", ""status"": 200, ""size"": 327}" ... clickhouse-client --query 'INSERT INTO http_logs Format CSVWithNames' < http_logs.csv
  • 10.
    © 2021 Altinity,Inc. You can query using JSON* functions -- Get a JSON string value SELECT JSONExtractString(message, 'request') AS request FROM http_logs LIMIT 3 -- Get a JSON numeric value SELECT JSONExtractInt(message, 'status') AS status FROM http_logs LIMIT 3 -- Use values to answer useful questions. SELECT JSONExtractInt(message, 'status') AS status, count() as count FROM http_logs WHERE status >= 400 WHERE toDateTime(JSONExtractUInt32(message, '@timestamp') BETWEEN '1998-05-20 00:00:00' AND '1998-05-20 23:59:59' GROUP BY status ORDER BY status
  • 11.
    © 2021 Altinity,Inc. -- Get using JSON function SELECT JSONExtractString(message, 'request') FROM http_logs LIMIT 3 -- Get it with proper type. SELECT visitParamExtractString(message, 'request') FROM http_logs LIMIT 3 JSON* vs visitParam functions SLOWER Complete JSON parser FASTER But cannot distinguish same name in different structures
  • 12.
    © 2021 Altinity,Inc. We can improve usability by ordering data CREATE TABLE http_logs_sorted ( `file` String, `message` String, timestamp DateTime DEFAULT toDateTime(JSONExtractUInt(message, '@timestamp')) ) ENGINE = MergeTree PARTITION BY toStartOfMonth(timestamp) ORDER BY timestamp INSERT INTO http_logs_sorted SELECT file, message FROM http_logs
  • 13.
    © 2021 Altinity,Inc. And still further by adding more columns ALTER TABLE http_logs_sorted ADD COLUMN `status` Int16 DEFAULT JSONExtractInt(message, 'status') CODEC(ZSTD(1)) ALTER TABLE http_logs_sorted ADD COLUMN `request` String DEFAULT JSONExtractString(message, 'request') -- Force columns to be materialized ALTER TABLE http_logs_sorted UPDATE status=status, request=request WHERE 1
  • 14.
    © 2021 Altinity,Inc. Our query is now simpler... SELECT status, count() as count FROM http_logs_sorted WHERE status >= 400 AND timestamp BETWEEN '1998-05-20 00:00:00' AND '1998-05-20 23:59:59' GROUP BY status ORDER BY status
  • 15.
    © 2021 Altinity,Inc. And MUCH faster! SELECT status, count() as count FROM http_logs_sorted WHERE status >= 400 AND timestamp BETWEEN '1998-05-20 00:00:00' AND '1998-05-20 23:59:59' GROUP BY status ORDER BY status 0.014 seconds vs 9.8 seconds! Can use primary key index to drop blocks 100x less I/O to read
  • 16.
    © 2021 Altinity,Inc. Extracted columns are like indexes 16
  • 17.
    © 2021 Altinity,Inc. (Here’s how we got that column size data) -- System tables are your friends SELECT table, name, data_compressed_bytes, formatReadableSize(data_compressed_bytes) AS tc, formatReadableSize(data_uncompressed_bytes) AS tu, data_compressed_bytes / data_uncompressed_bytes AS ratio, type, compression_codec FROM system.columns WHERE database = currentDatabase() AND table LIKE 'http%' ORDER BY table, name
  • 18.
    © 2021 Altinity,Inc. Another way to store JSON objects: Maps CREATE TABLE http_logs_map ( `file` String, `message` Map(String, String), timestamp DateTime DEFAULT toDateTime(toUInt32(message['@timestamp'])) CODEC(Delta, ZSTD(1)) ) ENGINE = MergeTree PARTITION BY toStartOfMonth(timestamp) ORDER BY timestamp
  • 19.
    © 2021 Altinity,Inc. Loading and querying JSON in Maps -- Load data INSERT into http_logs_map(file, message) SELECT file, JSONExtractKeysAndValues(message, 'String') message FROM http_logs -- Run a query. SELECT message['status'] status, count() FROM http_logs_map GROUP BY status ORDER BY status 4-5x faster than accessing JSON string objects
  • 20.
    © 2021 Altinity,Inc. Storing JSON in paired arrays CREATE TABLE http_logs_arrays ( `file` String, `keys` Array(String), `values` Array(String), timestamp DateTime CODEC(Delta, ZSTD(1)) ) ENGINE = MergeTree PARTITION BY toStartOfMonth(timestamp) ORDER BY timestamp
  • 21.
    © 2021 Altinity,Inc. Querying values in arrays -- Run a query. SELECT values[indexOf(keys, 'status')] status, count() FROM http_logs_arrays GROUP BY status ORDER BY status status|count() | ------|--------| 200 |24917090| 206 | 64935| 302 | 1941| 304 | 4899616| 400 | 888| 404 | 115005| 500 | 525| 4-5x faster than accessing JSON string objects
  • 22.
    © 2021 Altinity,Inc. So... What do paired arrays look like? SELECT * FROM http_logs_arrays LIMIT 3 Row 1: ────── file: documents-211998.json keys: ['@timestamp','clientip','request','status','size'] values: ['895435201','30.20.6.0','GET /french/index.html HTTP/1.0','200','954'] timestamp: 1998-05-17 20:00:01 ... Key Value
  • 23.
    © 2021 Altinity,Inc. Loading JSON to paired arrays -- Load data. Might be better to format outside ClickHouse. INSERT into http_logs_arrays(file, keys, values, timestamp) SELECT file, arrayMap(x -> x.1, JSONExtractKeysAndValues(message, 'String')) keys, arrayMap(x -> x.2, JSONExtractKeysAndValues(message, 'String')) values, toDateTime(JSONExtractUInt(message, '@timestamp')) timestamp FROM http_logs limit 30000000
  • 24.
    © 2021 Altinity,Inc. Must we always copy between tables? 24 It seems painful. Surely there is a better way!
  • 25.
    © 2021 Altinity,Inc. Use materialized views to help with loading Nginx Logs Web Logs http_logs_etl http_logs_arrays MV INSERT INTO Enrich data with materialized view
  • 26.
    © 2021 Altinity,Inc. Create the base ETL table CREATE TABLE http_logs_etl ( `file` String, `message` String ) ENGINE = Null Does not store data but triggers materialized views
  • 27.
    © 2021 Altinity,Inc. Create the target table (as before…) CREATE TABLE http_logs_arrays ( `file` String, `keys` Array(String), `values` Array(String), timestamp DateTime CODEC(Delta, ZSTD(1)) ) ENGINE = MergeTree PARTITION BY toStartOfMonth(timestamp) ORDER BY timestamp
  • 28.
    © 2021 Altinity,Inc. Create materialized view for ETL -- Fires whenever a block arrives at http_logs_etl. CREATE MATERIALIZED VIEW http_logs_etl_mv TO http_logs_arrays AS SELECT file, arrayMap(x -> x.1, JSONExtractKeysAndValues(message, 'String')) keys, arrayMap(x -> x.2, JSONExtractKeysAndValues(message, 'String')) values, toDateTime(JSONExtractUInt(message, '@timestamp')) timestamp FROM http_logs_etl Add WHERE conditions to filter data Do any transforms you like!
  • 29.
    © 2021 Altinity,Inc. Now load data and have at it... -- Clear the target table to start from scratch. TRUNCATE TABLE http_logs_arrays -- Load data via ETL table. INSERT INTO http_logs_etl(file, message) SELECT file, message FROM http_logs LIMIT 1000000 -- Confirm rows arrived in the arrays table. SELECT count() FROM http_logs_arrays count()| -------| 1000000| New data are instantly queryable
  • 30.
    © 2021 Altinity,Inc. Roadmap and more information 30
  • 31.
    © 2021 Altinity,Inc. ClickHouse JSON support is not perfect Complex JSON is hard to access! SELECT JSONExtractUInt( JSONExtractRaw( JSONExtractRaw( '{"a": {"b": {"c": 1}}}', 'a'), 'b'), 'c') AS anInt JSON performance is hard to optimize! 31
  • 32.
    © 2021 Altinity,Inc. What’s coming to help? Semi-structured data! ClickHouse will map JSON structure to columnar storage and provide convenient query syntax https://github.com/ClickHouse/ClickHouse/issues/17623 32
  • 33.
    © 2021 Altinity,Inc. You can go a long way in the meantime! Lots of options to represent JSON in ClickHouse tables ● Transform fully to columns ● Store JSON blob and add columns with indexes as needed ● Transform to Map ● Transform to paired arrays Load JSON blobs to staging table and transform with materialized views Mix and match to process JSON in a way that works best for your apps! 33 Fan Favorite
  • 34.
    © 2021 Altinity,Inc. Questions? Thank you! Altinity https://altinity.com ClickHouse https://github.com/ClickH ouse/ClickHouse Altinity.Cloud https://altinity.com/cloud- database/ We are hiring! 34