1 What is Apache Kafka and What is an Event Streaming Platform? Bern Apache Kafka® Meetup
2 Join the Confluent Community Slack Channel Subscribe to the Confluent blog cnfl.io/community-slack cnfl.io/read Welcome to the Apache Kafka® Meetup in Bern! 6:00pm Doors open 6:00pm - 6:30pm Food, Drinks and Networking 6:30pm – 6:50pm Matthias Imsand, Amanox Solutions 6:50pm - 7:35pm Gabriel Schenker, Confluent 7:35pm - 8:00pm Additional Q&A & Networking Apache, Apache Kafka, Kafka and the Kafka logo are trademarks of the Apache Software Foundation. The Apache Software Foundation has no affiliation with and does not endorse the materials provided at this event.
3 About Me ● Gabriel N. Schenker ● Lead Curriculum Developer @ Confluent ● Formerly at Docker, Alienvault, … ● Lives in Appenzell, AI ● GitHub: github.com/gnschenker ● Twitter: @gnschenker
4 What is an Event Streaming Platform?
5 Event Streaming Platforms should do two things: Reliably store streams of events Process streams of events
6 The Event Streaming Paradigm
7 ETL/Data Integration vs. Messaging: ETL/Data Integration is Durable, Persistent, and Maintains Order, but it is Batch, Expensive, Time Consuming, and Difficult to Scale; Messaging offers High Throughput and is Fast (Low Latency), but there is No Persistence, Data Loss, and No Replay.
9 The Event Streaming Paradigm: instead of stored records (ETL/Data Integration) or transient messages (Messaging), it combines the strengths of both: High Throughput, Durable, Persistent, Maintains Order, Fast (Low Latency).
10 Event Streaming Paradigm: rethink data not as stored records or transient messages, but as a continually updating stream of events.
16 Universal Event Pipeline: data stores (Mainframes, Hadoop, Data Warehouse, ...), logs (Device Logs, Splunk, ...), custom apps/microservices, and 3rd-party apps all feed Apache Kafka® (CONNECT, CLIENTS), which powers contextual event-driven apps (STREAMS): Real-Time Inventory, Real-Time Fraud Detection, Real-Time Customer 360, Machine Learning Models, Real-Time Data Transformation, ...
18 A Modern, Distributed Platform for Data Streams
19 Apache Kafka® is made up of distributed, immutable, append-only commit logs
20 Writers → Kafka cluster → Readers
21 Kafka: Scalability of a filesystem • hundreds of MB/s throughput • many TB per server • commodity hardware
22 Kafka: Guarantees of a Database • Strict ordering • Persistence
23 Kafka: Rewind and Replay • Reset to any point in the shared narrative
24 Kafka: Distributed by design • Replication • Fault Tolerance • Partitioning • Elastic Scaling
25 Kafka Topics: a topic (my-topic) is divided into partitions (my-topic-partition-0, -1, -2) that are spread across brokers (broker-1, broker-2, broker-3)
26 Creating a topic
$ kafka-topics --bootstrap-server broker101:9092 --create --topic my-topic --replication-factor 3 --partitions 3
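The same topic can also be created from code; a minimal sketch (not from the deck) using the Java AdminClient, with the broker address and counts mirroring the CLI example above:

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopic {
  public static void main(String[] args) throws Exception {
    Properties props = new Properties();
    props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker101:9092");

    try (AdminClient admin = AdminClient.create(props)) {
      // my-topic with 3 partitions and replication factor 3, as in the CLI command
      NewTopic topic = new NewTopic("my-topic", 3, (short) 3);
      admin.createTopics(Collections.singletonList(topic)).all().get();
    }
  }
}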
27 Producing to Kafka
29 Partition Leadership and Replication: each partition of Topic1 (partition1 through partition4) has one leader replica and follower replicas spread across Brokers 1 to 4.
30 Partition Leadership and Replication, node failure: if a broker fails, a follower on another broker takes over leadership for the affected partitions.
31 Producing to Kafka
32 Producer Clients, Producer Design: a ProducerRecord (Topic, [Partition], [Timestamp], [Headers], [Key], Value) is passed to send(), run through the Serializer and Partitioner, collected into per-partition batches (Topic A / Partition 0, Topic B / Partition 1, ...), and sent to the Kafka broker. On success the producer returns the record metadata; on a retriable failure it retries; if it can't retry, it throws an exception.
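As a minimal sketch of that flow (not part of the original deck; topic, key, and value are placeholders and the stock string serializers are assumed), a producer builds a ProducerRecord and sends it asynchronously:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class SimpleProducer {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker101:9092");
    props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
    props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

    try (Producer<String, String> producer = new KafkaProducer<>(props)) {
      // Key and value are serialized, the partitioner picks a partition,
      // the record is batched and sent; the callback reports metadata or an error.
      ProducerRecord<String, String> record =
          new ProducerRecord<>("my-topic", "some-key", "some-value");
      producer.send(record, (metadata, exception) -> {
        if (exception != null) {
          exception.printStackTrace(); // send failed after any retries
        } else {
          System.out.printf("wrote to %s-%d @ offset %d%n",
              metadata.topic(), metadata.partition(), metadata.offset());
        }
      });
    }
  }
}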
33 The Serializer: Kafka doesn't care what you send to it, as long as it has been converted to bytes beforehand. Serializers exist for JSON, CSV, Avro, Protobuf, XML (if you must), and more. Reference: https://kafka.apache.org/10/documentation/streams/developer-guide/datatypes.html
34 The Serializer
private Properties settings = new Properties();
settings.put("bootstrap.servers", "broker1:9092,broker2:9092");
// Keys are plain strings, values are Avro-encoded Invoice records
settings.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
settings.put("value.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer");
// The Avro serializer registers and looks up schemas in the Schema Registry
settings.put("schema.registry.url", "https://schema-registry:8083");
producer = new KafkaProducer<String, Invoice>(settings);
Reference: https://kafka.apache.org/10/documentation/streams/developer-guide/datatypes.html
35 Record Keys and why they're important, Ordering: with the default Kafka partitioner, the record key determines the partition (ProducerRecord: Topic, [Partition], [Key], Value). If no key is provided, records are distributed across partitions in a round-robin fashion.
36 Record Keys and why they're important, Ordering: keys are used in the default partitioning algorithm, partition = hash(key) % numPartitions, so all records with the same key (AAAA, BBBB, CCCC, DDDD, ...) land in the same partition, which guarantees ordering per key.
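A rough sketch of that default behaviour (illustrative only, not the deck's code): the Java client's default partitioner hashes the serialized key bytes with murmur2 and takes the result modulo the partition count, so the same key always maps to the same partition:

import java.nio.charset.StandardCharsets;
import org.apache.kafka.common.utils.Utils;

public class PartitionForKey {
  // Same idea as the default partitioner: partition = hash(key) % numPartitions
  static int partitionFor(String key, int numPartitions) {
    byte[] keyBytes = key.getBytes(StandardCharsets.UTF_8);
    return Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
  }

  public static void main(String[] args) {
    // Every record keyed AAAA lands in the same partition, so its order is preserved
    System.out.println(partitionFor("AAAA", 3));
    System.out.println(partitionFor("BBBB", 3));
  }
}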
40 Record Keys and why they're important, Key Cardinality: key cardinality (the number of distinct keys) affects how much work each consumer in a group has to do; a poor key choice can lead to uneven workloads. Keys in Kafka don't have to be primitives like strings or ints; like values, they can be anything: JSON, Avro, etc. So choose a key that distributes groups of records evenly across the partitions. Car·di·nal·i·ty /ˌkärdəˈnalədē/ Noun: the number of elements in a set or other grouping, as a property of that grouping.
41 You don't have to, but... use a Schema! The Data Producer Service sends JSON such as { "Name": "John Smith", "Address": "123 Apple St.", "Zip": "19101" }, while the Data Consumer Service expects { "Name": "John Smith", "Address": "123 Apple St.", "City": "Philadelphia", "State": "PA", "Zip": "19101" } and asks: "Where's record.City?" Reference: https://www.confluent.io/blog/schema-registry-kafka-stream-processing-yes-virginia-you-really-need-one/
42 Schema Registry: Make Data Backwards Compatible and Future-Proof (open source feature) ● Define the expected fields for each Kafka topic ● Automatically handle schema changes (e.g. new fields) ● Prevent backwards-incompatible changes ● Support multi-datacenter environments. Example consumers reading with registered schemas: Elastic, Cassandra, HDFS.
43 Developing with Confluent Schema Registry: Confluent provides a Maven plugin with several goals for developing against the Schema Registry ● download - download a subject's schema to your project ● register - register a new schema to the Schema Registry from your development environment ● test-compatibility - test changes made to a schema against the compatibility rules set by the Schema Registry
<plugin>
  <groupId>io.confluent</groupId>
  <artifactId>kafka-schema-registry-maven-plugin</artifactId>
  <version>5.0.0</version>
  <configuration>
    <schemaRegistryUrls>
      <param>http://192.168.99.100:8081</param>
    </schemaRegistryUrls>
    <outputDirectory>src/main/avro</outputDirectory>
    <subjectPatterns>
      <param>^TestSubject000-(key|value)$</param>
    </subjectPatterns>
  </configuration>
</plugin>
Reference: https://docs.confluent.io/current/schema-registry/docs/maven-plugin.html
44 Avro allows for evolution of schemas: the Data Producer Service and Data Consumer Service can use different versions of a schema (both registered in the Schema Registry) and still interoperate, because fields missing from the writer's version are filled with the defaults declared in the newer version. For example, a record written as { "Name": "John Smith", "Address": "123 Apple St.", "Zip": "19101" } can be read with the newer schema as { "Name": "John Smith", "Address": "123 Apple St.", "Zip": "19101", "City": "NA", "State": "NA" }. Reference: https://www.confluent.io/blog/schema-registry-kafka-stream-processing-yes-virginia-you-really-need-one/
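A sketch of that evolution using Avro's SchemaBuilder (the Customer record name and the "NA" defaults are illustrative, not taken from the deck): version 2 adds City and State with defaults, which keeps the change backward compatible:

import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;

public class CustomerSchemas {
  public static void main(String[] args) {
    // Version 1: Name, Address, Zip
    Schema v1 = SchemaBuilder.record("Customer").fields()
        .requiredString("Name")
        .requiredString("Address")
        .requiredString("Zip")
        .endRecord();

    // Version 2 adds City and State, each with a default, so older records can still be read
    Schema v2 = SchemaBuilder.record("Customer").fields()
        .requiredString("Name")
        .requiredString("Address")
        .requiredString("Zip")
        .name("City").type().stringType().stringDefault("NA")
        .name("State").type().stringType().stringDefault("NA")
        .endRecord();

    System.out.println(v1.toString(true));
    System.out.println(v2.toString(true));
  }
}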
45 Use Kafka's Headers: headers are key/value pairs (a String key and a byte[] value) carried on the ProducerRecord ([Headers] alongside Topic, [Partition], [Timestamp], [Key], Value) and exposed as an iterable collection. Example use cases: ● Data lineage: reference previous topic partitions/offsets ● Producing host/application/owner ● Message routing ● Encryption metadata (which key pair was this message payload encrypted with?) Reference: https://cwiki.apache.org/confluence/display/KAFKA/KIP-82+-+Add+Record+Headers
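A small sketch of attaching headers to a record (the header names and values are made up for illustration):

import java.nio.charset.StandardCharsets;
import org.apache.kafka.clients.producer.ProducerRecord;

public class RecordWithHeaders {
  public static void main(String[] args) {
    ProducerRecord<String, String> record =
        new ProducerRecord<>("my-topic", "some-key", "some-value");

    // Headers are String -> byte[] pairs carried alongside the key and value
    record.headers()
          .add("producing-host", "app-host-01".getBytes(StandardCharsets.UTF_8))
          .add("lineage-source", "orders-0:3345".getBytes(StandardCharsets.UTF_8));
  }
}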
46 Producer Guarantees, acks=0: the producer sends to the leader for Topic1 partition1 and does not wait for any acknowledgement (fire and forget). Reference: https://www.confluent.io/blog/exactly-once-semantics-are-possible-heres-how-apache-kafka-does-it/
47 Producer Guarantees, acks=1: the leader acknowledges the write as soon as it has it locally, without waiting for the followers.
48 Producer Guarantees, acks=all with min.insync.replicas=2: the leader acknowledges only after the write has been replicated to the required number of in-sync replicas.
49 Producer Guarantees, without exactly-once guarantees: even with acks=all and min.insync.replicas=2, a write ({key: 1234, data: abcd} at offset 3345) can succeed on the brokers while the ack back to the producer is lost.
50 The producer retries, and the same record is written again at offset 3346: a duplicate. Reference: https://www.confluent.io/blog/exactly-once-semantics-are-possible-heres-how-apache-kafka-does-it/
51 Producer Guarantees, with exactly-once guarantees: enable.idempotence=true, max.in.flight.requests.per.connection<=5, acks=all, retries>0 (preferably MAX_INT). The broker tracks a (producer id, sequence number) pair per partition: (100, 1) {key: 1234, data: abcd} is written at offset 3345; when the retry arrives with the same (100, 1) it is rejected and only the ack is re-sent; (100, 2) {key: 5678, data: efgh} is written at offset 3346. No dupe! Reference: https://www.confluent.io/blog/exactly-once-semantics-are-possible-heres-how-apache-kafka-does-it/
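Expressed as producer configuration, the idempotent setup sketched above looks roughly like this (a sketch with a placeholder broker address; min.insync.replicas is a broker/topic-side setting and appears only as a comment):

import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;

public class IdempotentProducerConfig {
  public static Properties build() {
    Properties props = new Properties();
    props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker101:9092");
    props.put(ProducerConfig.ACKS_CONFIG, "all");                       // wait for all in-sync replicas
    props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true);          // broker de-duplicates on (pid, seq)
    props.put(ProducerConfig.RETRIES_CONFIG, Integer.MAX_VALUE);        // keep retrying until delivery or a fatal error
    props.put(ProducerConfig.MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION, 5); // must stay <= 5 with idempotence
    // Broker/topic side: min.insync.replicas=2
    return props;
  }
}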
52 Transactional Producer
Producer:
KafkaProducer producer = createKafkaProducer(
  "bootstrap.servers", "broker:9092",
  "transactional.id", "my-transactional-id");
producer.initTransactions();
-- send some records --
producer.commitTransaction();
Consumer:
KafkaConsumer consumer = createKafkaConsumer(
  "bootstrap.servers", "broker:9092",
  "group.id", "my-group-id",
  "isolation.level", "read_committed");
Reference: https://www.confluent.io/blog/transactions-apache-kafka/
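A slightly fuller sketch of the producer side (serializer choices, topic name, and record contents are assumptions, not from the deck):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.KafkaException;
import org.apache.kafka.common.serialization.StringSerializer;

public class TransactionalProducerSketch {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");
    props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "my-transactional-id");
    props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
    props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

    try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
      producer.initTransactions();
      try {
        producer.beginTransaction();
        for (int i = 0; i < 5; i++) {
          producer.send(new ProducerRecord<>("my-topic", Integer.toString(i), "value-" + i));
        }
        producer.commitTransaction(); // all five records become visible atomically to read_committed consumers
      } catch (KafkaException e) {
        // For fatal errors (e.g. ProducerFencedException) a real application closes the producer instead
        producer.abortTransaction();
        throw e;
      }
    }
  }
}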
53 Consuming from Kafka
54 A basic Java Consumer
final Consumer<String, String> consumer = new KafkaConsumer<String, String>(props);
consumer.subscribe(Arrays.asList(topic));
try {
  while (true) {
    // Poll for up to 100 ms, then process whatever records arrived
    ConsumerRecords<String, String> records = consumer.poll(100);
    for (ConsumerRecord<String, String> record : records) {
      // Do Some Work …
    }
  }
} finally {
  consumer.close();
}
55 Consuming From Kafka - Single Consumer: a single consumer (C) reads from all partitions of the topic
56 Consuming From Kafka - Grouped Consumers: independent consumer groups (C1, C2) each receive the full stream
57 Consuming From Kafka - Grouped Consumers: consumers within one group divide the topic's partitions among themselves
58 Consuming From Kafka - Grouped Consumers: with four partitions (0, 1, 2, 3) and four consumers in the group, each consumer owns one partition
60 Consuming From Kafka - Grouped Consumers: if a consumer drops out, its partitions are reassigned, so one consumer may own several (e.g. partitions 0 and 3)
61 Resources: Free E-Books from Confluent! https://www.confluent.io/apache-kafka-stream-processing-book-bundle Confluent Blog: https://www.confluent.io/blog Thank You! gabriel@confluent.io @gnschenker
62 Thank You!
63 25% off! KS19Comm25