1Confidential KSQL The Open Source Streaming SQL Engine for Apache Kafka Kai Waehner Technology Evangelist kontakt@kai-waehner.de LinkedIn @KaiWaehner www.confluent.io www.kai-waehner.de
KSQLis the Streaming SQL Engine for Apache Kafka
3Confidential 1.0 Enterprise Ready J A Brief History of Apache Kafka and Confluent 0.11 Exactly-once semantics 0.10 Data processing (Streams API) 0.9 Data integration (Connect API) Intra-cluster replication 0.8 2012 2014 Cluster mirroring0.7 2015 2016 20172013 2018 CP 4.1 KSQL GA
4Confidential Agenda – KSQL Deep Dive 1) Apache Kafka Ecosystem 2) Kafka Streams as Foundation for KSQL 3) Motivation for KSQL 4) KSQL Concepts 5) Live Demo #1 – Intro to KSQL 6) KSQL Architecture 7) Live Demo #2 - Clickstream Analysis 8) Building a User Defined Function (Example: Machine Learning) 9) Getting Started
5KSQL- Streaming SQL for Apache Kafka Agenda – KSQL Deep Dive 1) Apache Kafka Ecosystem 2) Kafka Streams as Foundation for KSQL 3) Motivation for KSQL 4) KSQL Concepts 5) Live Demo #1 – Intro to KSQL 6) KSQL Architecture 7) Live Demo #2 - Clickstream Analysis 8) Building a User Defined Function (Example: Machine Learning) 9) Getting Started
6KSQL- Streaming SQL for Apache Kafka Apache Kafka - A Distributed, Scalable Commit Log
7KSQL- Streaming SQL for Apache Kafka Anatomy of a Topic
8KSQL- Streaming SQL for Apache Kafka Apache Kafka at Large Scale à No need to do a POC https://conferences.oreilly.com/strata/strata-ca/public/schedule/detail/63921 https://qconlondon.com/london2018/presentation/cloud-native-and-scalable-kafka-architecture (2018) (2018)
9KSQL- Streaming SQL for Apache Kafka The Log ConnectorsConnectors Producer Consumer Streaming Engine Apache Kafka – The Rise of a Streaming Platform
10KSQL- Streaming SQL for Apache Kafka Apache Kafka – The Rise of a Streaming Platform
11KSQL- Streaming SQL for Apache Kafka KSQL – The Streaming SQL Engine for Apache Kafka
12KSQL- Streaming SQL for Apache Kafka Agenda – KSQL Deep Dive 1) Apache Kafka Ecosystem 2) Kafka Streams as Foundation for KSQL 3) Motivation for KSQL 4) KSQL Concepts 5) Live Demo #1 – Intro to KSQL 6) KSQL Architecture 7) Live Demo #2 - Clickstream Analysis 8) Building a User Defined Function (Example: Machine Learning) 9) Getting Started
13KSQL- Streaming SQL for Apache Kafka Kafka Streams - Part of Apache Kafka
14KSQL- Streaming SQL for Apache Kafka Stream Processing Data at Rest Data in Motion
15KSQL- Streaming SQL for Apache Kafka Key concepts
16KSQL- Streaming SQL for Apache Kafka Independent Dev / Test / Prod of independent Apps
17KSQL- Streaming SQL for Apache Kafka No Matter Where it Runs
18KSQL- Streaming SQL for Apache Kafka Kafka Streams - Processor Topology Read input from Kafka Operator DAG: • Filter / map / aggregation / joins • Operators can be stateful Write result back to Kafka
19KSQL- Streaming SQL for Apache Kafka Kafka Streams - Processor Topology
20KSQL- Streaming SQL for Apache Kafka Kafka Streams - Runtime
21KSQL- Streaming SQL for Apache Kafka Kafka Streams - Distributed State
22KSQL- Streaming SQL for Apache Kafka Kafka Streams - Scaling
23KSQL- Streaming SQL for Apache Kafka Pluggable State Store. Default Strategy: - In-memory (fast access) - Local disc (for fast recovery) - Replicated to Kafka (for resilience) https://www.infoq.com/presentations/kafka-streams-spring-cloud State Management
24KSQL- Streaming SQL for Apache Kafka Kafka Streams - Streams and Tables
25KSQL- Streaming SQL for Apache Kafka Kafka Streams - Streams and Tables
26KSQL- Streaming SQL for Apache Kafka // Example: reading data from Kafka KStream<byte[], String> textLines = builder.stream("textlines-topic", Consumed.with( Serdes.ByteArray(), Serdes.String())); // Example: transforming data KStream<byte[], String> upperCasedLines= rawRatings.mapValues(String::toUpperCase)); KStream
27KSQL- Streaming SQL for Apache Kafka // Example: aggregating data KTable<String, Long> wordCounts = textLines .flatMapValues(textLine -> Arrays.asList(textLine.toLowerCase().split("W+"))) .groupBy((key, word) -> word) .count(); KTable
28KSQL- Streaming SQL for Apache Kafka Kafka Streams A complete streaming microservices, ready for production at large-scale App configuration Define processing (here: WordCount) Start processing
29KSQL- Streaming SQL for Apache Kafka Agenda – KSQL Deep Dive 1) Apache Kafka Ecosystem 2) Kafka Streams as Foundation for KSQL 3) Motivation for KSQL 4) KSQL Concepts 5) Live Demo #1 – Intro to KSQL 6) KSQL Architecture 7) Live Demo #2 - Clickstream Analysis 8) Building a User Defined Function (Example: Machine Learning) 9) Getting Started
30KSQL- Streaming SQL for Apache Kafka Why KSQL? Population CodingSophistication Realm of Stream Processing New, Expanded Realm BI Analysts Core Developers Data Engineers Core Developers who don’t like Java Kafka Streams KSQL
31KSQL- Streaming SQL for Apache Kafka Trade-Offs • subscribe() • poll() • send() • flush() • mapValues() • filter() • punctuate() • Select…from… • Join…where… • Group by.. Flexibility Simplicity Kafka Streams KSQL Consumer Producer
32KSQL- Streaming SQL for Apache Kafka What is it for ? Streaming ETL • Kafka is popular for data pipelines • KSQL enables easy transformations of data within the pipe CREATE STREAM vip_actions AS SELECT userid, page, action FROM clickstream c LEFT JOIN users u ON c.userid = u.user_id WHERE u.level = 'Platinum';
33KSQL- Streaming SQL for Apache Kafka What is it for ? Analytics, e.g. Anomaly Detection • Identifying patterns or anomalies in real-time data, surfaced in milliseconds CREATE TABLE possible_fraud AS SELECT card_number, count(*) FROM authorization_attempts WINDOW TUMBLING (SIZE 5 MINUTES) GROUP BY card_number HAVING count(*) > 3;
34KSQL- Streaming SQL for Apache Kafka What is it for ? Real Time Monitoring • Log data monitoring, tracking and alerting • Sensor / IoT data CREATE TABLE error_counts AS SELECT error_code, count(*) FROM monitoring_stream WINDOW TUMBLING (SIZE 1 MINUTE) WHERE type = 'ERROR' GROUP BY error_code;
35KSQL- Streaming SQL for Apache Kafka What is it for ? Simple Derivations of Existing Topics • One-liner to re-partition and / or re-key a topic for new uses CREATE STREAM views_by_userid WITH (PARTITIONS=6, VALUE_FORMAT=‘JSON’, TIMESTAMP=‘view_time’) AS SELECT * FROM clickstream PARTITION BY user_id;
36KSQL- Streaming SQL for Apache Kafka What is it for ? Easily Convert between Data Formats in Real Time • JSON-to-Avro conversion CREATE STREAM sensor_events_json (sensor_id VARCHAR, temperature INTEGER, ...) WITH (KAFKA_TOPIC='events-topic', VALUE_FORMAT='JSON'); CREATE STREAM sensor_events_avro WITH (VALUE_FORMAT='AVRO') AS SELECT * FROM sensor_events_json;
37KSQL- Streaming SQL for Apache Kafka Where is KSQL not such a great fit (at least not yet)? Powerful ad-hoc query ○ Limited span of time usually retained in Kafka ○ No indexes BI reports (Tableau etc.) ○ No indexes ○ No JDBC (most Bi tools are not good with continuous results!) Keep in mind: Kafka is a streaming platform!
38KSQL- Streaming SQL for Apache Kafka Agenda – KSQL Deep Dive 1) Apache Kafka Ecosystem 2) Kafka Streams as Foundation for KSQL 3) Motivation for KSQL 4) KSQL Concepts 5) Live Demo #1 – Intro to KSQL 6) KSQL Architecture 7) Live Demo #2 - Clickstream Analysis 8) Building a User Defined Function (Example: Machine Learning) 9) Getting Started
39KSQL- Streaming SQL for Apache Kafka KSQL – A Streaming SQL Engine for Apache Kafka
40KSQL- Streaming SQL for Apache Kafka KSQL is Equally viable for S / M / L / XL / XXL use cases Ok. Ok. Ok. … and KSQL is ready for production, including 24/7 support!
41KSQL- Streaming SQL for Apache Kafka KSQL is Equally viable for S / M / L / XL / XXL use cases
42KSQL- Streaming SQL for Apache Kafka How do you deploy applications?
43KSQL- Streaming SQL for Apache Kafka Where to develop and operate your applications?
44KSQL- Streaming SQL for Apache Kafka KSQL Concepts ● No need for source code • Zero, none at all, not even one line. • No SerDes, no generics, no lambdas, ... ● All the Kafka Streams “magic” out-of-the-box • Exactly Once Semantics • Windowing • Event-time aggregation • Late-arriving data • Distributed, fault-tolerant, scalable, ...
45KSQL- Streaming SQL for Apache Kafka STREAM and TABLE as first-class citizens
46KSQL- Streaming SQL for Apache Kafka SELECT statement syntax SELECT `select_expr` [, ...] FROM `from_item` [, ...] [ WINDOW `window_expression` ] [ WHERE `condition` ] [ GROUP BY `grouping expression` ] [ HAVING `having_expression` ] [ LIMIT n ] where from_item is one of the following: stream_or_table_name [ [ AS ] alias] from_item LEFT JOIN from_item ON join_condition
47KSQL- Streaming SQL for Apache Kafka CREATE STREAM AS syntax CREATE STREAM `stream_name` [WITH (`property = expression` [, …] ) ] AS SELECT `select_expr` [, ...] FROM `from_item` [, ...] [ WHERE `condition` ] [ PARTITION BY `column_name` ] ● where property can be any of the following: KAFKA_TOPIC = name - what to call the sink topic FORMAT = DELIMITED | JSON | AVRO - defaults to format of input stream AVROSCHEMAFILE = path/to/file - if FORMAT=AVRO, where the output schema file will be written to PARTITIONS = # - number of partitions in sink topic TIMESTAMP = column - The name of the column to use as the timestamp. This can be used to define the event time.
48KSQL- Streaming SQL for Apache Kafka CREATE TABLE AS syntax CREATE TABLE `stream_name` [WITH ( `property_name = expression` [, ...] )] AS SELECT `select_expr` [, ...] FROM `from_item` [, ...] [ WINDOW `window_expression` ] [ WHERE `condition` ] [ GROUP BY `grouping expression` ] [ HAVING `having_expression` ] ● where property values are same as for ‚Create Streams as Select‘
49KSQL- Streaming SQL for Apache Kafka Automatic Inference of Topic Schema (leveraging Confluent Schema Registry) https://www.matthowlett.com/2017-12-23-exploring-wikipedia-ksql.html
50KSQL- Streaming SQL for Apache Kafka WINDOWing ● Not ANSI SQL ! à Continuous Queries ● Three types supported (same as Kafka Streams): • TUMBLING (repeats at a non-overlapping interval) • SELECT appname, ip, COUNT(appname) AS problem_count FROM logstream WINDOW TUMBLING (size 1 minute) WHERE loglevel='ERROR' GROUP BY appname, ip; • HOPPING (similar to tumbling, but hopping generally has an overlapping interval) • SELECT itemid, SUM(arraycol[0]) FROM orders WINDOW HOPPING ( size 20 second, advance by 5 second) GROUP BY itemid; • SESSION (groups elements by sessions of activity, do not overlap and do not have a fixed start and end time, closes when it does not receive elements for a certain period of time, i.e., when a gap of inactivity occurred) • SELECT itemid, SUM(sales_price) FROM orders WINDOW SESSION (20 second) GROUP BY itemid;
51KSQL- Streaming SQL for Apache Kafka Agenda – KSQL Deep Dive 1) Apache Kafka Ecosystem 2) Kafka Streams as Foundation for KSQL 3) Motivation for KSQL 4) KSQL Concepts 5) Live Demo #1 – Intro to KSQL 6) KSQL Architecture 7) Live Demo #2 - Clickstream Analysis 8) Building a User Defined Function (Example: Machine Learning) 9) Getting Started
52KSQL- Streaming SQL for Apache Kafka KSQL CLI ksql> CREATE STREAM pageviews_original (viewtime bigint, userid varchar, pageid varchar) WITH (kafka_topic='pageviews', value_format='DELIMITED'); ksql> CREATE TABLE users_original (registertime bigint, gender varchar, regionid varchar, userid varchar) WITH (kafka_topic='users', value_format='JSON'); ksql> SELECT pageid FROM pageviews_original LIMIT 10; ksql> CREATE STREAM pageviews_female AS SELECT users_original.userid AS userid, pageid, regionid, gender FROM pageviews_original LEFT JOIN users_original ON pageviews_original.userid = users_original.userid WHERE gender = 'FEMALE'; ksql/bin/ksql-server-start ksql/bin/ksql
53KSQL- Streaming SQL for Apache Kafka KSQL Web UI
54KSQL- Streaming SQL for Apache Kafka Live Demo – KSQL Getting Started
55KSQL- Streaming SQL for Apache Kafka Break
56KSQL- Streaming SQL for Apache Kafka Agenda – KSQL Deep Dive 1) Apache Kafka Ecosystem 2) Kafka Streams as Foundation for KSQL 3) Motivation for KSQL 4) KSQL Concepts 5) Live Demo #1 – Intro to KSQL 6) KSQL Architecture 7) Live Demo #2 - Clickstream Analysis 8) Building a User Defined Function (Example: Machine Learning) 9) Getting Started
57KSQL- Streaming SQL for Apache Kafka KSQL - Components KSQL has 3 main components: 1. The Engine which actually runs the Kafka Streams topologies 2. The REST server interface enables an Engine to receive instructions from the CLI or any other client 3. The CLI, designed to be familiar to users of MySQL, Postgres etc. (Note that you also need a Kafka Cluster… KSQL is deployed independently)
58KSQL- Streaming SQL for Apache Kafka How to run KSQL è #1 Client – Server (Interactive Mode) JVM KSQL Server KSQL CLI / any REST Client JVM KSQL Server JVM KSQL Server Kafka Cluster
59KSQL- Streaming SQL for Apache Kafka How to run KSQL è #1 Client – Server (Interactive Mode) • Start any number of server nodes bin/ksql-server-start • Start one or more CLIs or REST Clients and point them to a server bin/ksql https://myksqlserver:8090 • All servers share the processing load Technically, instances of the same Kafka Streams Applications scale up / down without restart
60KSQL- Streaming SQL for Apache Kafka How to run KSQL è #2 as Standalone Application (Headless Mode) JVM KSQL Server JVM KSQL Server JVM KSQL Server Kafka Cluster
61KSQL- Streaming SQL for Apache Kafka How to run KSQL è #2 as Standalone Application (Headless Mode) • Start any number of server nodes Pass a file of KSQL statement to execute bin/ksql-node query-file=foo/bar.sql • Ideal for streaming ETL application deployment Version-control your queries and transformations as code • All running engines share the processing load Technically, instances of the same Kafka Streams Applications scale up / down without restart
62KSQL- Streaming SQL for Apache Kafka How to run KSQL è #3 Embedded in an Application (JVM Mode) JVM App Instance KSQL Engine Application Code JVM App Instance KSQL Engine Application Code JVM App Instance KSQL Engine Application Code Kafka Cluster
63KSQL- Streaming SQL for Apache Kafka How to run KSQL è #3 Embedded in an Application (JVM Mode) • Embed directly in your Java application • Generate and execute KSQL queries through the Java API Version-control your queries and transformations as code • All running application instances share the processing load Technically, instances of the same Kafka Streams Applications scale up / down without restart
64KSQL- Streaming SQL for Apache Kafka Dedicating resources Join Engines to the same ‘service pool’ by means of the ksql.service.id property
65KSQL- Streaming SQL for Apache Kafka Sizing of KSQL Servers 3 Categories of KSQL Queries • Project / Filter, e.g. SELECT <columns> FROM <table/stream> WHERE <condition> • Joins, e.g. Stream-Table Joins • Aggregations, e.g. SUM, COUNT, TOPK, TOPKDISTINCT Sizing Questions • How much Memory, CPU, Disk, Network? • When and how to scale up / down? • How to tune performance? • Etc… Guide for Capacity Planning https://docs.confluent.io/current/ksql/docs/ capacity-planning.html https://docs.confluent.io/current/streams/ sizing.html#streams-sizing
66KSQL- Streaming SQL for Apache Kafka Agenda – KSQL Deep Dive 1) Apache Kafka Ecosystem 2) Kafka Streams as Foundation for KSQL 3) Motivation for KSQL 4) KSQL Concepts 5) Live Demo #1 – Intro to KSQL 6) KSQL Architecture 7) Live Demo #2 - Clickstream Analysis 8) Building a User Defined Function (Example: Machine Learning) 9) Getting Started
67KSQL- Streaming SQL for Apache Kafka Demo: Clickstream Analysis Kafka Producer Elastic search Grafana Kafka Cluster Kafka Connect KSQL Stream of Log Events Kafka Ecosystem Other Components
68KSQL- Streaming SQL for Apache Kafka Demo - Clickstream Analysis • https://docs.confluent.io/current/ksql/docs/tutorials/clickstream-docker.html#ksql-clickstream- docker • Leverages Apache Kafka, Kafka Connect, KSQL, Elasticsearch and Grafana • 5min screencast: https://www.youtube.com/watch?v=A45uRzJiv7I • Setup in 5 minutes (with or without Docker) SELECT STREAM CEIL(timestamp TO HOUR) AS timeWindow, productId, COUNT(*) AS hourlyOrders, SUM(units) AS units FROM Orders GROUP BY CEIL(timestamp TO HOUR), productId; timeWindow | productId | hourlyOrders | units ------------+-----------+--------------+------- 08:00:00 | 10 | 2 | 5 08:00:00 | 20 | 1 | 8 09:00:00 | 10 | 4 | 22 09:00:00 | 40 | 1 | 45 ... | ... | ... | ...
69KSQL- Streaming SQL for Apache Kafka Live Demo – KSQL Clickstream Analysis
70KSQL- Streaming SQL for Apache Kafka Agenda – KSQL Deep Dive 1) Apache Kafka Ecosystem 2) Kafka Streams as Foundation for KSQL 3) Motivation for KSQL 4) KSQL Concepts 5) Live Demo #1 – Intro to KSQL 6) KSQL Architecture 7) Live Demo #2 - Clickstream Analysis 8) Building a User Defined Function (Example: Machine Learning) 9) Getting Started
71KSQL- Streaming SQL for Apache Kafka Advanced Use Case - Machine Learning UDF for Real Time Sensor Analytics Kafka Producer Elastic search Grafana Kafka Cluster Kafka Connect KSQL Heath Check Sensor Kafka Ecosystem Other Components Emergency System All Data Apply Analytic Model Filter Predictions
72KSQL- Streaming SQL for Apache Kafka KSQL and Deep Learning (Autoencoder) for IoT Sensor Analytics https://www.confluent.io/blog/write-user-defined-function-udf-ksql/ https://github.com/kaiwaehner/ksql-machine-learning-udf “SELECT event_id, anomaly(SENSORINPUT) FROM health_sensor;“ KSQL UDF using an analytic model under the hood à Write once, use in any KSQL statement
73KSQL- Streaming SQL for Apache Kafka KSQL UDF https://www.confluent.io/blog/write-user-defined-function-udf-ksql/ https://github.com/kaiwaehner/ksql-machine-learning-udf
74KSQL- Streaming SQL for Apache Kafka Use Case: Anomaly Detection (Sensor Healthcheck) Machine Learning Algorithm: Autoencoder built with H2O Streaming Platform: Apache Kafka and KSQL Live Demo – Prebuilt Model Embedded in KSQL Function
75KSQL- Streaming SQL for Apache Kafka Agenda – KSQL Deep Dive 1) Apache Kafka Ecosystem 2) Kafka Streams as Foundation for KSQL 3) Motivation for KSQL 4) KSQL Concepts 5) Live Demo #1 – Intro to KSQL 6) KSQL Architecture 7) Live Demo #2 - Clickstream Analysis 8) Building a User Defined Function (Example: Machine Learning) 9) Getting Started
76KSQL- Streaming SQL for Apache Kafka KSQL Quick Start github.com/confluentinc/ksql Local runtime or Docker container
77KSQL- Streaming SQL for Apache Kafka Resources and Next Steps Get Involved • Try the Quickstart on GitHub • Check out the code • Play with the examples KSQL is GA… You can already use it for production deployments! https://github.com/confluentinc/ksql http://confluent.io/ksql https://slackpass.io/confluentcommunity #ksql
KSQLis the Streaming SQL Engine for Apache Kafka
79KSQL- Streaming SQL for Apache Kafka Kai Waehner Technology Evangelist kontakt@kai-waehner.de @KaiWaehner www.confluent.io www.kai-waehner.de LinkedIn Questions? Feedback? Please contact me…

KSQL Deep Dive - The Open Source Streaming Engine for Apache Kafka

  • 1.
    1Confidential KSQL The Open SourceStreaming SQL Engine for Apache Kafka Kai Waehner Technology Evangelist kontakt@kai-waehner.de LinkedIn @KaiWaehner www.confluent.io www.kai-waehner.de
  • 2.
  • 3.
    3Confidential 1.0 Enterprise Ready J ABrief History of Apache Kafka and Confluent 0.11 Exactly-once semantics 0.10 Data processing (Streams API) 0.9 Data integration (Connect API) Intra-cluster replication 0.8 2012 2014 Cluster mirroring0.7 2015 2016 20172013 2018 CP 4.1 KSQL GA
  • 4.
    4Confidential Agenda – KSQLDeep Dive 1) Apache Kafka Ecosystem 2) Kafka Streams as Foundation for KSQL 3) Motivation for KSQL 4) KSQL Concepts 5) Live Demo #1 – Intro to KSQL 6) KSQL Architecture 7) Live Demo #2 - Clickstream Analysis 8) Building a User Defined Function (Example: Machine Learning) 9) Getting Started
  • 5.
    5KSQL- Streaming SQLfor Apache Kafka Agenda – KSQL Deep Dive 1) Apache Kafka Ecosystem 2) Kafka Streams as Foundation for KSQL 3) Motivation for KSQL 4) KSQL Concepts 5) Live Demo #1 – Intro to KSQL 6) KSQL Architecture 7) Live Demo #2 - Clickstream Analysis 8) Building a User Defined Function (Example: Machine Learning) 9) Getting Started
  • 6.
    6KSQL- Streaming SQLfor Apache Kafka Apache Kafka - A Distributed, Scalable Commit Log
  • 7.
    7KSQL- Streaming SQLfor Apache Kafka Anatomy of a Topic
  • 8.
    8KSQL- Streaming SQLfor Apache Kafka Apache Kafka at Large Scale à No need to do a POC https://conferences.oreilly.com/strata/strata-ca/public/schedule/detail/63921 https://qconlondon.com/london2018/presentation/cloud-native-and-scalable-kafka-architecture (2018) (2018)
  • 9.
    9KSQL- Streaming SQLfor Apache Kafka The Log ConnectorsConnectors Producer Consumer Streaming Engine Apache Kafka – The Rise of a Streaming Platform
  • 10.
    10KSQL- Streaming SQLfor Apache Kafka Apache Kafka – The Rise of a Streaming Platform
  • 11.
    11KSQL- Streaming SQLfor Apache Kafka KSQL – The Streaming SQL Engine for Apache Kafka
  • 12.
    12KSQL- Streaming SQLfor Apache Kafka Agenda – KSQL Deep Dive 1) Apache Kafka Ecosystem 2) Kafka Streams as Foundation for KSQL 3) Motivation for KSQL 4) KSQL Concepts 5) Live Demo #1 – Intro to KSQL 6) KSQL Architecture 7) Live Demo #2 - Clickstream Analysis 8) Building a User Defined Function (Example: Machine Learning) 9) Getting Started
  • 13.
    13KSQL- Streaming SQLfor Apache Kafka Kafka Streams - Part of Apache Kafka
  • 14.
    14KSQL- Streaming SQLfor Apache Kafka Stream Processing Data at Rest Data in Motion
  • 15.
    15KSQL- Streaming SQLfor Apache Kafka Key concepts
  • 16.
    16KSQL- Streaming SQLfor Apache Kafka Independent Dev / Test / Prod of independent Apps
  • 17.
    17KSQL- Streaming SQLfor Apache Kafka No Matter Where it Runs
  • 18.
    18KSQL- Streaming SQLfor Apache Kafka Kafka Streams - Processor Topology Read input from Kafka Operator DAG: • Filter / map / aggregation / joins • Operators can be stateful Write result back to Kafka
  • 19.
    19KSQL- Streaming SQLfor Apache Kafka Kafka Streams - Processor Topology
  • 20.
    20KSQL- Streaming SQLfor Apache Kafka Kafka Streams - Runtime
  • 21.
    21KSQL- Streaming SQLfor Apache Kafka Kafka Streams - Distributed State
  • 22.
    22KSQL- Streaming SQLfor Apache Kafka Kafka Streams - Scaling
  • 23.
    23KSQL- Streaming SQLfor Apache Kafka Pluggable State Store. Default Strategy: - In-memory (fast access) - Local disc (for fast recovery) - Replicated to Kafka (for resilience) https://www.infoq.com/presentations/kafka-streams-spring-cloud State Management
  • 24.
    24KSQL- Streaming SQLfor Apache Kafka Kafka Streams - Streams and Tables
  • 25.
    25KSQL- Streaming SQLfor Apache Kafka Kafka Streams - Streams and Tables
  • 26.
    26KSQL- Streaming SQLfor Apache Kafka // Example: reading data from Kafka KStream<byte[], String> textLines = builder.stream("textlines-topic", Consumed.with( Serdes.ByteArray(), Serdes.String())); // Example: transforming data KStream<byte[], String> upperCasedLines= rawRatings.mapValues(String::toUpperCase)); KStream
  • 27.
    27KSQL- Streaming SQLfor Apache Kafka // Example: aggregating data KTable<String, Long> wordCounts = textLines .flatMapValues(textLine -> Arrays.asList(textLine.toLowerCase().split("W+"))) .groupBy((key, word) -> word) .count(); KTable
  • 28.
    28KSQL- Streaming SQLfor Apache Kafka Kafka Streams A complete streaming microservices, ready for production at large-scale App configuration Define processing (here: WordCount) Start processing
  • 29.
    29KSQL- Streaming SQLfor Apache Kafka Agenda – KSQL Deep Dive 1) Apache Kafka Ecosystem 2) Kafka Streams as Foundation for KSQL 3) Motivation for KSQL 4) KSQL Concepts 5) Live Demo #1 – Intro to KSQL 6) KSQL Architecture 7) Live Demo #2 - Clickstream Analysis 8) Building a User Defined Function (Example: Machine Learning) 9) Getting Started
  • 30.
    30KSQL- Streaming SQLfor Apache Kafka Why KSQL? Population CodingSophistication Realm of Stream Processing New, Expanded Realm BI Analysts Core Developers Data Engineers Core Developers who don’t like Java Kafka Streams KSQL
  • 31.
    31KSQL- Streaming SQLfor Apache Kafka Trade-Offs • subscribe() • poll() • send() • flush() • mapValues() • filter() • punctuate() • Select…from… • Join…where… • Group by.. Flexibility Simplicity Kafka Streams KSQL Consumer Producer
  • 32.
    32KSQL- Streaming SQLfor Apache Kafka What is it for ? Streaming ETL • Kafka is popular for data pipelines • KSQL enables easy transformations of data within the pipe CREATE STREAM vip_actions AS SELECT userid, page, action FROM clickstream c LEFT JOIN users u ON c.userid = u.user_id WHERE u.level = 'Platinum';
  • 33.
    33KSQL- Streaming SQLfor Apache Kafka What is it for ? Analytics, e.g. Anomaly Detection • Identifying patterns or anomalies in real-time data, surfaced in milliseconds CREATE TABLE possible_fraud AS SELECT card_number, count(*) FROM authorization_attempts WINDOW TUMBLING (SIZE 5 MINUTES) GROUP BY card_number HAVING count(*) > 3;
  • 34.
    34KSQL- Streaming SQLfor Apache Kafka What is it for ? Real Time Monitoring • Log data monitoring, tracking and alerting • Sensor / IoT data CREATE TABLE error_counts AS SELECT error_code, count(*) FROM monitoring_stream WINDOW TUMBLING (SIZE 1 MINUTE) WHERE type = 'ERROR' GROUP BY error_code;
  • 35.
    35KSQL- Streaming SQLfor Apache Kafka What is it for ? Simple Derivations of Existing Topics • One-liner to re-partition and / or re-key a topic for new uses CREATE STREAM views_by_userid WITH (PARTITIONS=6, VALUE_FORMAT=‘JSON’, TIMESTAMP=‘view_time’) AS SELECT * FROM clickstream PARTITION BY user_id;
  • 36.
    36KSQL- Streaming SQLfor Apache Kafka What is it for ? Easily Convert between Data Formats in Real Time • JSON-to-Avro conversion CREATE STREAM sensor_events_json (sensor_id VARCHAR, temperature INTEGER, ...) WITH (KAFKA_TOPIC='events-topic', VALUE_FORMAT='JSON'); CREATE STREAM sensor_events_avro WITH (VALUE_FORMAT='AVRO') AS SELECT * FROM sensor_events_json;
  • 37.
    37KSQL- Streaming SQLfor Apache Kafka Where is KSQL not such a great fit (at least not yet)? Powerful ad-hoc query ○ Limited span of time usually retained in Kafka ○ No indexes BI reports (Tableau etc.) ○ No indexes ○ No JDBC (most Bi tools are not good with continuous results!) Keep in mind: Kafka is a streaming platform!
  • 38.
    38KSQL- Streaming SQLfor Apache Kafka Agenda – KSQL Deep Dive 1) Apache Kafka Ecosystem 2) Kafka Streams as Foundation for KSQL 3) Motivation for KSQL 4) KSQL Concepts 5) Live Demo #1 – Intro to KSQL 6) KSQL Architecture 7) Live Demo #2 - Clickstream Analysis 8) Building a User Defined Function (Example: Machine Learning) 9) Getting Started
  • 39.
    39KSQL- Streaming SQLfor Apache Kafka KSQL – A Streaming SQL Engine for Apache Kafka
  • 40.
    40KSQL- Streaming SQLfor Apache Kafka KSQL is Equally viable for S / M / L / XL / XXL use cases Ok. Ok. Ok. … and KSQL is ready for production, including 24/7 support!
  • 41.
    41KSQL- Streaming SQLfor Apache Kafka KSQL is Equally viable for S / M / L / XL / XXL use cases
  • 42.
    42KSQL- Streaming SQLfor Apache Kafka How do you deploy applications?
  • 43.
    43KSQL- Streaming SQLfor Apache Kafka Where to develop and operate your applications?
  • 44.
    44KSQL- Streaming SQLfor Apache Kafka KSQL Concepts ● No need for source code • Zero, none at all, not even one line. • No SerDes, no generics, no lambdas, ... ● All the Kafka Streams “magic” out-of-the-box • Exactly Once Semantics • Windowing • Event-time aggregation • Late-arriving data • Distributed, fault-tolerant, scalable, ...
  • 45.
    45KSQL- Streaming SQLfor Apache Kafka STREAM and TABLE as first-class citizens
  • 46.
    46KSQL- Streaming SQLfor Apache Kafka SELECT statement syntax SELECT `select_expr` [, ...] FROM `from_item` [, ...] [ WINDOW `window_expression` ] [ WHERE `condition` ] [ GROUP BY `grouping expression` ] [ HAVING `having_expression` ] [ LIMIT n ] where from_item is one of the following: stream_or_table_name [ [ AS ] alias] from_item LEFT JOIN from_item ON join_condition
  • 47.
    47KSQL- Streaming SQLfor Apache Kafka CREATE STREAM AS syntax CREATE STREAM `stream_name` [WITH (`property = expression` [, …] ) ] AS SELECT `select_expr` [, ...] FROM `from_item` [, ...] [ WHERE `condition` ] [ PARTITION BY `column_name` ] ● where property can be any of the following: KAFKA_TOPIC = name - what to call the sink topic FORMAT = DELIMITED | JSON | AVRO - defaults to format of input stream AVROSCHEMAFILE = path/to/file - if FORMAT=AVRO, where the output schema file will be written to PARTITIONS = # - number of partitions in sink topic TIMESTAMP = column - The name of the column to use as the timestamp. This can be used to define the event time.
  • 48.
    48KSQL- Streaming SQLfor Apache Kafka CREATE TABLE AS syntax CREATE TABLE `stream_name` [WITH ( `property_name = expression` [, ...] )] AS SELECT `select_expr` [, ...] FROM `from_item` [, ...] [ WINDOW `window_expression` ] [ WHERE `condition` ] [ GROUP BY `grouping expression` ] [ HAVING `having_expression` ] ● where property values are same as for ‚Create Streams as Select‘
  • 49.
    49KSQL- Streaming SQLfor Apache Kafka Automatic Inference of Topic Schema (leveraging Confluent Schema Registry) https://www.matthowlett.com/2017-12-23-exploring-wikipedia-ksql.html
  • 50.
    50KSQL- Streaming SQLfor Apache Kafka WINDOWing ● Not ANSI SQL ! à Continuous Queries ● Three types supported (same as Kafka Streams): • TUMBLING (repeats at a non-overlapping interval) • SELECT appname, ip, COUNT(appname) AS problem_count FROM logstream WINDOW TUMBLING (size 1 minute) WHERE loglevel='ERROR' GROUP BY appname, ip; • HOPPING (similar to tumbling, but hopping generally has an overlapping interval) • SELECT itemid, SUM(arraycol[0]) FROM orders WINDOW HOPPING ( size 20 second, advance by 5 second) GROUP BY itemid; • SESSION (groups elements by sessions of activity, do not overlap and do not have a fixed start and end time, closes when it does not receive elements for a certain period of time, i.e., when a gap of inactivity occurred) • SELECT itemid, SUM(sales_price) FROM orders WINDOW SESSION (20 second) GROUP BY itemid;
  • 51.
    51KSQL- Streaming SQLfor Apache Kafka Agenda – KSQL Deep Dive 1) Apache Kafka Ecosystem 2) Kafka Streams as Foundation for KSQL 3) Motivation for KSQL 4) KSQL Concepts 5) Live Demo #1 – Intro to KSQL 6) KSQL Architecture 7) Live Demo #2 - Clickstream Analysis 8) Building a User Defined Function (Example: Machine Learning) 9) Getting Started
  • 52.
    52KSQL- Streaming SQLfor Apache Kafka KSQL CLI ksql> CREATE STREAM pageviews_original (viewtime bigint, userid varchar, pageid varchar) WITH (kafka_topic='pageviews', value_format='DELIMITED'); ksql> CREATE TABLE users_original (registertime bigint, gender varchar, regionid varchar, userid varchar) WITH (kafka_topic='users', value_format='JSON'); ksql> SELECT pageid FROM pageviews_original LIMIT 10; ksql> CREATE STREAM pageviews_female AS SELECT users_original.userid AS userid, pageid, regionid, gender FROM pageviews_original LEFT JOIN users_original ON pageviews_original.userid = users_original.userid WHERE gender = 'FEMALE'; ksql/bin/ksql-server-start ksql/bin/ksql
  • 53.
    53KSQL- Streaming SQLfor Apache Kafka KSQL Web UI
  • 54.
    54KSQL- Streaming SQLfor Apache Kafka Live Demo – KSQL Getting Started
  • 55.
    55KSQL- Streaming SQLfor Apache Kafka Break
  • 56.
    56KSQL- Streaming SQLfor Apache Kafka Agenda – KSQL Deep Dive 1) Apache Kafka Ecosystem 2) Kafka Streams as Foundation for KSQL 3) Motivation for KSQL 4) KSQL Concepts 5) Live Demo #1 – Intro to KSQL 6) KSQL Architecture 7) Live Demo #2 - Clickstream Analysis 8) Building a User Defined Function (Example: Machine Learning) 9) Getting Started
  • 57.
    57KSQL- Streaming SQLfor Apache Kafka KSQL - Components KSQL has 3 main components: 1. The Engine which actually runs the Kafka Streams topologies 2. The REST server interface enables an Engine to receive instructions from the CLI or any other client 3. The CLI, designed to be familiar to users of MySQL, Postgres etc. (Note that you also need a Kafka Cluster… KSQL is deployed independently)
  • 58.
    58KSQL- Streaming SQLfor Apache Kafka How to run KSQL è #1 Client – Server (Interactive Mode) JVM KSQL Server KSQL CLI / any REST Client JVM KSQL Server JVM KSQL Server Kafka Cluster
  • 59.
    59KSQL- Streaming SQLfor Apache Kafka How to run KSQL è #1 Client – Server (Interactive Mode) • Start any number of server nodes bin/ksql-server-start • Start one or more CLIs or REST Clients and point them to a server bin/ksql https://myksqlserver:8090 • All servers share the processing load Technically, instances of the same Kafka Streams Applications scale up / down without restart
  • 60.
    60KSQL- Streaming SQLfor Apache Kafka How to run KSQL è #2 as Standalone Application (Headless Mode) JVM KSQL Server JVM KSQL Server JVM KSQL Server Kafka Cluster
  • 61.
    61KSQL- Streaming SQLfor Apache Kafka How to run KSQL è #2 as Standalone Application (Headless Mode) • Start any number of server nodes Pass a file of KSQL statement to execute bin/ksql-node query-file=foo/bar.sql • Ideal for streaming ETL application deployment Version-control your queries and transformations as code • All running engines share the processing load Technically, instances of the same Kafka Streams Applications scale up / down without restart
  • 62.
    62KSQL- Streaming SQLfor Apache Kafka How to run KSQL è #3 Embedded in an Application (JVM Mode) JVM App Instance KSQL Engine Application Code JVM App Instance KSQL Engine Application Code JVM App Instance KSQL Engine Application Code Kafka Cluster
  • 63.
    63KSQL- Streaming SQLfor Apache Kafka How to run KSQL è #3 Embedded in an Application (JVM Mode) • Embed directly in your Java application • Generate and execute KSQL queries through the Java API Version-control your queries and transformations as code • All running application instances share the processing load Technically, instances of the same Kafka Streams Applications scale up / down without restart
  • 64.
    64KSQL- Streaming SQLfor Apache Kafka Dedicating resources Join Engines to the same ‘service pool’ by means of the ksql.service.id property
  • 65.
    65KSQL- Streaming SQLfor Apache Kafka Sizing of KSQL Servers 3 Categories of KSQL Queries • Project / Filter, e.g. SELECT <columns> FROM <table/stream> WHERE <condition> • Joins, e.g. Stream-Table Joins • Aggregations, e.g. SUM, COUNT, TOPK, TOPKDISTINCT Sizing Questions • How much Memory, CPU, Disk, Network? • When and how to scale up / down? • How to tune performance? • Etc… Guide for Capacity Planning https://docs.confluent.io/current/ksql/docs/ capacity-planning.html https://docs.confluent.io/current/streams/ sizing.html#streams-sizing
  • 66.
    66KSQL- Streaming SQLfor Apache Kafka Agenda – KSQL Deep Dive 1) Apache Kafka Ecosystem 2) Kafka Streams as Foundation for KSQL 3) Motivation for KSQL 4) KSQL Concepts 5) Live Demo #1 – Intro to KSQL 6) KSQL Architecture 7) Live Demo #2 - Clickstream Analysis 8) Building a User Defined Function (Example: Machine Learning) 9) Getting Started
  • 67.
    67KSQL- Streaming SQLfor Apache Kafka Demo: Clickstream Analysis Kafka Producer Elastic search Grafana Kafka Cluster Kafka Connect KSQL Stream of Log Events Kafka Ecosystem Other Components
  • 68.
    68KSQL- Streaming SQLfor Apache Kafka Demo - Clickstream Analysis • https://docs.confluent.io/current/ksql/docs/tutorials/clickstream-docker.html#ksql-clickstream- docker • Leverages Apache Kafka, Kafka Connect, KSQL, Elasticsearch and Grafana • 5min screencast: https://www.youtube.com/watch?v=A45uRzJiv7I • Setup in 5 minutes (with or without Docker) SELECT STREAM CEIL(timestamp TO HOUR) AS timeWindow, productId, COUNT(*) AS hourlyOrders, SUM(units) AS units FROM Orders GROUP BY CEIL(timestamp TO HOUR), productId; timeWindow | productId | hourlyOrders | units ------------+-----------+--------------+------- 08:00:00 | 10 | 2 | 5 08:00:00 | 20 | 1 | 8 09:00:00 | 10 | 4 | 22 09:00:00 | 40 | 1 | 45 ... | ... | ... | ...
  • 69.
    69KSQL- Streaming SQLfor Apache Kafka Live Demo – KSQL Clickstream Analysis
  • 70.
    70KSQL- Streaming SQLfor Apache Kafka Agenda – KSQL Deep Dive 1) Apache Kafka Ecosystem 2) Kafka Streams as Foundation for KSQL 3) Motivation for KSQL 4) KSQL Concepts 5) Live Demo #1 – Intro to KSQL 6) KSQL Architecture 7) Live Demo #2 - Clickstream Analysis 8) Building a User Defined Function (Example: Machine Learning) 9) Getting Started
  • 71.
    71KSQL- Streaming SQLfor Apache Kafka Advanced Use Case - Machine Learning UDF for Real Time Sensor Analytics Kafka Producer Elastic search Grafana Kafka Cluster Kafka Connect KSQL Heath Check Sensor Kafka Ecosystem Other Components Emergency System All Data Apply Analytic Model Filter Predictions
  • 72.
    72KSQL- Streaming SQLfor Apache Kafka KSQL and Deep Learning (Autoencoder) for IoT Sensor Analytics https://www.confluent.io/blog/write-user-defined-function-udf-ksql/ https://github.com/kaiwaehner/ksql-machine-learning-udf “SELECT event_id, anomaly(SENSORINPUT) FROM health_sensor;“ KSQL UDF using an analytic model under the hood à Write once, use in any KSQL statement
  • 73.
    73KSQL- Streaming SQLfor Apache Kafka KSQL UDF https://www.confluent.io/blog/write-user-defined-function-udf-ksql/ https://github.com/kaiwaehner/ksql-machine-learning-udf
  • 74.
    74KSQL- Streaming SQLfor Apache Kafka Use Case: Anomaly Detection (Sensor Healthcheck) Machine Learning Algorithm: Autoencoder built with H2O Streaming Platform: Apache Kafka and KSQL Live Demo – Prebuilt Model Embedded in KSQL Function
  • 75.
    75KSQL- Streaming SQLfor Apache Kafka Agenda – KSQL Deep Dive 1) Apache Kafka Ecosystem 2) Kafka Streams as Foundation for KSQL 3) Motivation for KSQL 4) KSQL Concepts 5) Live Demo #1 – Intro to KSQL 6) KSQL Architecture 7) Live Demo #2 - Clickstream Analysis 8) Building a User Defined Function (Example: Machine Learning) 9) Getting Started
  • 76.
    76KSQL- Streaming SQLfor Apache Kafka KSQL Quick Start github.com/confluentinc/ksql Local runtime or Docker container
  • 77.
    77KSQL- Streaming SQLfor Apache Kafka Resources and Next Steps Get Involved • Try the Quickstart on GitHub • Check out the code • Play with the examples KSQL is GA… You can already use it for production deployments! https://github.com/confluentinc/ksql http://confluent.io/ksql https://slackpass.io/confluentcommunity #ksql
  • 78.
  • 79.
    79KSQL- Streaming SQLfor Apache Kafka Kai Waehner Technology Evangelist kontakt@kai-waehner.de @KaiWaehner www.confluent.io www.kai-waehner.de LinkedIn Questions? Feedback? Please contact me…