1. Introduction to Apache Kafka and Confluent ... and why they matter! Kafka Meetup - Johannesburg, Tuesday, March 20th 2018, 18:00 – 20:00, SSA - Maxwell Office Park, Magwa Cres, Waterfall City, Midrand, 2090 https://www.meetup.com/Johannesburg-Kafka-Meetup/events/248465767/
2. How Organizations Handle Data Flows: a Giant Mess — Data Warehouse, Hadoop, NoSQL, Oracle, SFDC, Logging, Bloomberg, OLTP databases, ActiveMQ, caches, Web, Custom Apps, Microservices, Monitoring, Analytics, …any sink/source, and more, all wired together point to point
3. Apache Kafka™: A Distributed Streaming Platform — Offline Batch (> 1 hour) · Near-Real Time (> 100s ms) · Real Time (0-100 ms) — connecting Data Warehouse, Hadoop, NoSQL, Oracle, SFDC, Twitter, Bloomberg, Web, Custom Apps, Microservices, Monitoring, Analytics, …any sink/source, and more
4. More than 1 petabyte of data in Kafka · Over 1.2 trillion messages per day · Thousands of data streams · Source of all data warehouse & Hadoop data · Over 300 billion user-related events per day
5. Over 35% of the Fortune 500 are using Apache Kafka™ — 6 of the top 10 travel companies · 7 of the top 10 global banks · 8 of the top 10 insurance companies · 9 of the top 10 telecom companies
6. Industry Trends… and why Apache Kafka matters! 1. From ‘big data’ (batch) to ‘fast data’ (stream processing) 2. Internet of Things (IoT) and sensor data 3. Microservices and asynchronous communication (coordination messages and data streams) between loosely coupled and fine-grained services
7. Apache Kafka APIs – A UNIX Analogy: $ cat < in.txt | grep "apache" | tr a-z A-Z > out.txt — Connect APIs · Streams APIs · Producer / Consumer APIs
8. Apache Kafka APIs – An ETL Analogy: Source → Connect API (Extract) → Streams API (Transform) → Connect API (Load) → Sink
9. Apache Kafka 101: Internals and Core Concepts
10. Apache Kafka Concepts: Persistent Log — a Data Producer writes records at offsets 0 through 12; one Data Consumer reads at offset = 7 while another reads at offset = 11, each tracking its own position
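The log abstraction on this slide can be sketched in a few lines: an append-only list where the broker assigns offsets, reads are non-destructive, and each consumer keeps its own position (a Python sketch of the concept, not the real client API):

```python
class PartitionLog:
    """Append-only log: producers write to the end, consumers read by offset."""
    def __init__(self):
        self.records = []

    def append(self, record):
        self.records.append(record)
        return len(self.records) - 1  # the offset assigned to this record

    def read(self, offset, max_records=100):
        # Reading never removes data; each consumer tracks its own offset.
        return self.records[offset:offset + max_records]

log = PartitionLog()
for i in range(13):           # offsets 0..12, as on the slide
    log.append(f"event-{i}")

slow_consumer = log.read(7)   # the consumer at offset = 7
fast_consumer = log.read(11)  # the consumer at offset = 11, same log
```

Because reads are just index lookups, any number of consumers can share one log without interfering with each other.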
11. Apache Kafka Concepts: Anatomy of a Topic — partition 0: offsets 0-12 · partition 1: offsets 0-7 · partition 2: offsets 0-5; writes always append to the end of a partition
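Which partition a record lands in is derived from its key, so records with the same key always go to the same partition and stay ordered relative to each other. A sketch of that mapping (Kafka's Java client actually uses murmur2 hashing; crc32 here is only a stand-in for a deterministic hash):

```python
import zlib

def partition_for(key: bytes, num_partitions: int) -> int:
    # Deterministic: the same key always maps to the same partition,
    # which is what preserves per-key ordering.
    return zlib.crc32(key) % num_partitions
```

Records without a key are instead spread across partitions (round-robin or sticky, depending on client version).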
12. Apache Kafka Concepts: Log Storage — the log is split into segments, each with an offset index and a timestamp index: offsets 0 - 10000 · offsets 10001 - 20000 · offsets 20001 - 30000
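Finding the segment that holds a given offset is a binary search over the segments' base offsets; only then is the per-segment offset index consulted. A sketch using the base offsets from the slide:

```python
import bisect

# Base offsets of the three segments shown on the slide
segment_base_offsets = [0, 10001, 20001]

def segment_for(offset: int) -> int:
    """Return the base offset of the segment that contains `offset`."""
    i = bisect.bisect_right(segment_base_offsets, offset) - 1
    return segment_base_offsets[i]
```

In the real broker, each segment is a file named after its base offset, so this lookup maps an offset straight to a file on disk.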
13. Apache Kafka Concepts: Message Format — offset (8 bytes) · length (4 bytes) · CRC (4 bytes) · magic byte (1 byte) · attributes (1 byte) · timestamp (8 bytes) · key length (4 bytes) · key content (varies) · value length (4 bytes) · value content (varies)
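The field list above (the pre-0.11, magic=1 message format) can be packed with plain struct code; in the real format the CRC covers everything after the CRC field, which this simplified sketch mirrors (it is not the actual Kafka serializer):

```python
import struct
import zlib

def encode_message_v1(offset: int, timestamp_ms: int, key: bytes, value: bytes) -> bytes:
    # magic=1, attributes=0, then the timestamp and length-prefixed key/value
    body = struct.pack(">bbq", 1, 0, timestamp_ms)
    body += struct.pack(">i", len(key)) + key
    body += struct.pack(">i", len(value)) + value
    crc = zlib.crc32(body) & 0xFFFFFFFF          # covers everything after the CRC field
    payload = struct.pack(">I", crc) + body
    # Framing: 8-byte offset, then the 4-byte length of the rest
    return struct.pack(">qi", offset, len(payload)) + payload
```

Kafka 0.11+ replaced this per-message layout with record batches, but the fixed-header-plus-length-prefixed-payload idea is the same.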
14. Apache Kafka Concepts: Producers and Consumers — multiple Producers write to a cluster of Brokers, from which multiple Consumers read
15. Apache Kafka Concepts: Topics and Partitions — partitions T0:P0, T0:P1, T0:P2, T0:P3, T1:P0 and T1:P1 are spread across the Brokers
16. Apache Kafka Concepts: Fault Tolerance and Replication — T0:P0 and T1:P0 each have a Replica 1 hosted on a different Broker
17. Apache Kafka Concepts: Consumer Groups — the partitions T0:P0-P3, T1:P0 and T1:P1 are divided among the Consumers of a group
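Within a consumer group, each partition is assigned to exactly one member, so a group scales reads up to the partition count. A round-robin assignment sketch (one of the strategies Kafka ships; the real rebalance is coordinated through a broker acting as group coordinator):

```python
def assign_partitions(partitions, members):
    """Round-robin: each partition goes to exactly one group member."""
    assignment = {m: [] for m in members}
    for i, p in enumerate(sorted(partitions)):
        assignment[members[i % len(members)]].append(p)
    return assignment

# Four partitions of topic T0 shared by a two-member group
groups = assign_partitions(["T0-P0", "T0-P1", "T0-P2", "T0-P3"], ["c1", "c2"])
```

With more members than partitions, the extra members would simply sit idle, which is why partition count caps a group's parallelism.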
18. The Connect API of Apache Kafka® — Reliable and scalable integration of Kafka with other systems – no coding required. • Centralized management and configuration • Support for hundreds of technologies including RDBMS, Elasticsearch, HDFS, S3 • Supports CDC ingest of events from RDBMS • Preserves data schema • Fault tolerant and automatically load balanced • Extensible API • Single Message Transforms • Part of Apache Kafka, included in Confluent Open Source. Example configuration: { "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector", "connection.url": "jdbc:mysql://localhost:3306/demo?user=rmoff&password=foo", "table.whitelist": "sales,orders,customers" } https://docs.confluent.io/current/connect/
19. Build Applications, not Clusters: <dependency> <groupId>org.apache.kafka</groupId> <artifactId>kafka-streams</artifactId> <version>1.0.0</version> </dependency>
20. Spot the Difference(s)!
21. How do I run in production?
22. How do I run in production? Uncool vs. Cool
23. How do I run in production? http://docs.confluent.io/current/streams/introduction.html
24–26. Elastic and Scalable: http://docs.confluent.io/current/streams/developer-guide.html#elastic-scaling-of-your-application
27–30. Typical High Level Architecture: Real-time Data Ingestion → Stream Processing → Storage → Data Publishing / Visualization
31. How many clusters do you count? NoSQL (Cassandra, HBase, Couchbase, MongoDB, …) or Elasticsearch, Solr, … · Storm, Flink, Spark Streaming, Ignite, Akka Streams, Apex, … · HDFS, NFS, Ceph, GlusterFS, Lustre, … · Apache Kafka
32. Simplicity is the Ultimate Sophistication — Apache Kafka, a Distributed Streaming Platform: Publish & Subscribe to streams of data like a messaging system · Store streams of data safely in a distributed replicated cluster · Process streams of data efficiently and in real-time
33–34. Duality of Streams and Tables: http://docs.confluent.io/current/streams/concepts.html#duality-of-streams-and-tables
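The duality in one sketch: replaying a stream of key/value updates (a changelog) yields a table, and every table update is itself a changelog record, so you can go back and forth between the two views (a conceptual helper, not a Kafka API):

```python
def materialize(changelog):
    """Stream -> table: replay the updates; the latest value per key wins."""
    table = {}
    for key, value in changelog:
        if value is None:
            table.pop(key, None)   # a tombstone record deletes the key
        else:
            table[key] = value
    return table

# Four stream records collapse into a one-row table
changelog = [("alice", 1), ("bob", 2), ("alice", 3), ("bob", None)]
table = materialize(changelog)
```

This is exactly why a KTable can be rebuilt from its changelog topic after a restart: the table is just the stream, folded.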
35–36. Interactive Queries: http://docs.confluent.io/current/streams/developer-guide.html#streams-developer-guide-interactive-queries
37. Kafka Streams DSL: http://docs.confluent.io/current/streams/developer-guide.html#kafka-streams-dsl
38. WordCount (and Java 8+) — WordCountLambdaExample.java:
final Properties streamsConfiguration = new Properties();
streamsConfiguration.put(StreamsConfig.APPLICATION_ID_CONFIG, "wordcount-lambda-example");
streamsConfiguration.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
...
final Serde<String> stringSerde = Serdes.String();
final Serde<Long> longSerde = Serdes.Long();
final KStreamBuilder builder = new KStreamBuilder();
final KStream<String, String> textLines = builder.stream(stringSerde, stringSerde, "TextLinesTopic");
final Pattern pattern = Pattern.compile("\\W+", Pattern.UNICODE_CHARACTER_CLASS);
final KTable<String, Long> wordCounts = textLines
    .flatMapValues(value -> Arrays.asList(pattern.split(value.toLowerCase())))
    .groupBy((key, word) -> word)
    .count("Counts");
wordCounts.to(stringSerde, longSerde, "WordsWithCountsTopic");
final KafkaStreams streams = new KafkaStreams(builder, streamsConfiguration);
streams.cleanUp();
streams.start();
Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
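What that topology computes, stripped of the Kafka machinery: each input record updates a running per-word count, and every update is emitted downstream as a (word, new_count) record, KTable-changelog style (a plain-Python sketch):

```python
import re
from collections import Counter

def word_count_updates(lines):
    """Yield (word, updated_count) for each word occurrence, in arrival order."""
    counts = Counter()
    pattern = re.compile(r"\W+", re.UNICODE)
    for line in lines:                          # each line = one input record
        for word in pattern.split(line.lower()):
            if word:
                counts[word] += 1
                yield (word, counts[word])      # changelog-style update

updates = list(word_count_updates(["hello kafka streams", "hello kafka"]))
```

Note the output is a stream of revisions, not final totals; downstream, the latest update per word is the table.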
39. Easy to Develop with, Easy to Test — WordCountLambdaIntegrationTest.java:
EmbeddedSingleNodeKafkaCluster CLUSTER = new EmbeddedSingleNodeKafkaCluster();
...
CLUSTER.createTopic(inputTopic);
...
Properties producerConfig = new Properties();
producerConfig.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, CLUSTER.bootstrapServers());
40. The Streams API of Apache Kafka® — Write standard Java applications and microservices to process your data in real-time. • No separate processing cluster required • Develop on Mac, Linux, Windows • Deploy to containers, VMs, bare metal, cloud • Powered by Kafka: elastic, scalable, distributed, battle-tested • Perfect for small, medium, large use cases • Fully integrated with Kafka security • Exactly-once processing semantics • Part of Apache Kafka, included in Confluent Open Source.
KStream<User, PageViewEvent> pageViews = builder.stream("pageviews-topic");
KTable<Windowed<User>, Long> viewsPerUserSession = pageViews
    .groupByKey()
    .count(SessionWindows.with(TimeUnit.MINUTES.toMillis(5)), "session-views");
https://docs.confluent.io/current/streams/
41. KSQL: a Streaming SQL Engine for Apache Kafka® from Confluent — KSQL is the simplest way to process streams of data in real-time. • No coding required, all you need is SQL • No separate processing cluster required • Powered by Kafka: elastic, scalable, distributed, battle-tested • Perfect for streaming ETL, anomaly detection, event monitoring, and more • Part of Confluent Open Source.
CREATE TABLE possible_fraud AS
  SELECT card_number, count(*)
  FROM authorization_attempts
  WINDOW TUMBLING (SIZE 5 SECONDS)
  GROUP BY card_number
  HAVING count(*) > 3;
CREATE STREAM vip_actions AS
  SELECT userid, page, action
  FROM clickstream c
  LEFT JOIN users u ON c.userid = u.userid
  WHERE u.level = 'Platinum';
https://github.com/confluentinc/ksql
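The TUMBLING window in the fraud query buckets events into fixed, non-overlapping intervals: each event belongs to exactly one window. A sketch of counting attempts per card per 5-second window (illustrative only; KSQL maintains these counts continuously as events arrive):

```python
def tumbling_counts(events, window_size_ms):
    """events: (timestamp_ms, key) pairs -> {(window_start_ms, key): count}."""
    counts = {}
    for ts, key in events:
        window_start = ts - ts % window_size_ms   # floor to the window boundary
        counts[(window_start, key)] = counts.get((window_start, key), 0) + 1
    return counts

# Three attempts in the first 5s window, one in the next
attempts = [(1000, "card-1"), (2000, "card-1"), (4500, "card-1"), (6000, "card-1")]
per_window = tumbling_counts(attempts, 5000)
```

The HAVING count(*) > 3 clause in the slide's query then fires only for (window, card) pairs whose count exceeds the threshold.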
42. Do you think that’s a table you are querying?
43. KSQL in less than 5 minutes: https://www.youtube.com/watch?v=A45uRzJiv7I
44. Confluent Enterprise: Logical Architecture — Kafka Cluster (Zookeeper, Kafka Brokers); Kafka Connect Servers for Mainframe, RDBMS and Files, and for Hadoop, Cassandra and Elasticsearch; Producer and Consumer Applications (Kafka Producer/Consumer APIs); REST Proxy Servers with REST Clients; Control Center Servers; Schema Registry Servers; Stream Processing Applications 1 and 2 (Stream Clients)
45. Confluent Enterprise: Physical Architecture — Rack 1: Kafka Brokers #1 and #4, Zookeepers #1 and #4, Schema Registry #1, Kafka Connect #1, REST Proxy #1, ToR Switches · Rack 2: Kafka Brokers #2 and #5, Zookeepers #2 and #5, Schema Registry #2, Kafka Connect #2, ToR Switches · Rack 3: Kafka Broker #3, Zookeeper #3, Kafka Connect #3, REST Proxy #2, ToR Switches · plus Core Switches, Load Balancers, and Control Centers #1 and #2
46. Confluent Completes Kafka — Feature | Benefit (availability: Apache Kafka / Confluent Open Source / Confluent Enterprise):
• Apache Kafka — High throughput, low latency, high availability, secure distributed streaming platform
• Kafka Connect API — Advanced API for connecting external sources/destinations into Kafka
• Kafka Streams API — Simple library that enables streaming application development within the Kafka framework
• Additional Clients — Supports non-Java clients: C, C++, Python, .NET and several others
• REST Proxy — Provides universal access to Kafka from any network-connected device via HTTP
• Schema Registry — Central registry for the format of Kafka data – guarantees all data is always consumable
• Pre-Built Connectors — HDFS, JDBC, Elasticsearch, Amazon S3 and other connectors fully certified and supported by Confluent
• JMS Client — Support for legacy Java Message Service (JMS) applications consuming and producing directly from Kafka (Enterprise)
• Confluent Control Center — Enables easy connector management, monitoring and alerting for a Kafka cluster (Enterprise)
• Auto Data Balancer — Rebalances data across the cluster to remove bottlenecks (Enterprise)
• Replicator — Multi-datacenter replication that simplifies and automates multi-DC Kafka clusters (Enterprise)
• Support — Community for Apache Kafka and Confluent Open Source; 24x7x365 enterprise-class support with Confluent Enterprise
47. Big Data and Fast Data Ecosystems — Synchronous Req/Response (0 – 100s ms) · Near Real Time (> 100s ms) · Offline Batch (> 1 hour). The Apache Kafka Stream Data Platform feeds Search, RDBMS, Apps, Monitoring, Real-time Analytics, NoSQL and Stream Processing; the Confluent HDFS Connector (exactly-once semantics) loads the Apache Hadoop Data Lake (Impala, DWH, Hive, Spark, Map-Reduce). https://www.confluent.io/blog/the-value-of-apache-kafka-in-big-data-ecosystem/
48. Building a Microservices Ecosystem with Kafka Streams and KSQL — https://www.confluent.io/blog/building-a-microservices-ecosystem-with-kafka-streams-and-ksql/ and https://github.com/confluentinc/kafka-streams-examples/tree/3.3.0-post/src/main/java/io/confluent/examples/streams/microservices
49. Microservices: References — Blog post series:
Part 1: The Data Dichotomy: Rethinking the Way We Treat Data and Services https://www.confluent.io/blog/data-dichotomy-rethinking-the-way-we-treat-data-and-services/
Part 2: Build Services on a Backbone of Events https://www.confluent.io/blog/build-services-backbone-events/
Part 3: Using Apache Kafka as a Scalable, Event-Driven Backbone for Service Architectures https://www.confluent.io/blog/apache-kafka-for-service-architectures/
Part 4: Chain Services with Exactly Once Guarantees https://www.confluent.io/blog/chain-services-exactly-guarantees/
Part 5: Messaging as the Single Source of Truth https://www.confluent.io/blog/messaging-single-source-truth/
Part 6: Leveraging the Power of a Database Unbundled https://www.confluent.io/blog/leveraging-power-database-unbundled/
Part 7: Building a Microservices Ecosystem with Kafka Streams and KSQL https://www.confluent.io/blog/building-a-microservices-ecosystem-with-kafka-streams-and-ksql/
Whitepaper: Microservices in the Apache Kafka™ Ecosystem https://www.confluent.io/resources/microservices-in-the-apache-kafka-ecosystem/
50. Apache Kafka Security — Kafka helps meet security requirements by supporting:
• Security drivers — processing customer data, regulatory requirements, legal compliance, internal security policies; the need is not limited to industries such as finance, healthcare, or governmental services
• Authentication — e.g. “Only certain applications may talk to the production Kafka cluster”; client authentication via SASL, e.g. Kerberos / Active Directory
• Authorization — e.g. “Only certain applications may read data from sensitive Kafka topics”; restrict who can create, write to, read from topics, and more
• Encryption — e.g. “Data-in-transit between apps and Kafka clusters must be encrypted”; SSL encrypts data exchanged between Kafka brokers and between brokers and Kafka clients/apps
51. Enterprise-Ready Multi-Datacenter Replication for Kafka — Data Center in USA: Kafka Cluster (USA) with Kafka Brokers 1-3, ZooKeepers 1-3, Control Center, and a Kafka Connect Cluster running Replicators 1 and 2; Data Center in EMEA: Kafka Cluster (EU) with the same layout. Replicator is available only with Confluent Enterprise; the clusters themselves run Apache Kafka / Confluent Open Source.
52. Cloud Synchronization and Migrations with Confluent Enterprise: Before — DC1 (DB1-DB3, KV, KV2, KV3, DWH, App1-App4) and AWS (App1-v2, App2-v2, App5, App7, App8, DWH) linked by ad-hoc pipelines. Challenges: • Each team/department must execute its own cloud migration • The same data may be moved multiple times • Each pipeline requires its own development, testing, deployment, monitoring and maintenance
53. Cloud Synchronization and Migrations with Confluent Enterprise: After — a Kafka cluster in DC1 and a Kafka cluster in AWS carry all data flows between the same systems. Benefits: • Continuous low-latency synchronization • Centralized manageability and monitoring – track data produced in all data centers at the event level • Security and governance – track and control where data comes from and who is accessing it • Cost savings – move data once
54. About Confluent and Apache Kafka™ — Founded September 2014 by the creators of Apache Kafka; technology developed while at LinkedIn; employs 70% of active Kafka committers
55. Apache Kafka: PMC members and committers — https://kafka.apache.org/committers
56. Download the Confluent Platform: the easiest way to get started — https://www.confluent.io/download/
57. Books: get all three in PDF format from the Confluent website! https://www.confluent.io/apache-kafka-stream-processing-book-bundle
58. Kafka Summit — https://kafka-summit.org/ — Discount code: kacom17
