Introduction to Apache Kafka

Contents
❏ Introduction
❏ First use case: LinkedIn
❏ Architecture
❏ Main components
❏ Kafka protocol
❏ Kafka ecosystem
❏ Kafka Connect
❏ Schema Registry
❏ Installation/Configuration
❏ Datalab Use Cases
❏ Demo
Introduction: Maslow's hierarchy of human needs.
Introduction: the data hierarchy of needs (pyramid, top to bottom): Vision/Mission, Products, Data Science, Data Infrastructure, Data Access.
Introduction - Data Sources
Introduction - Data Ingestion: a tangle of ad-hoc scripts, one per source (scripts to read data, aggregation scripts, a Tweet fetch script, a REST API script).
Introduction - Data Ingestion: to take something in; to absorb something.
LinkedIn data pipeline problem
They had a lot of data:
● User activity tracking
● Server logs and metrics
● Messaging
● Analytics
They built products on data:
● Newsfeed
● Recommendations
● Search
● Metrics and monitoring
Problem: how to integrate this variety of data and make it available to all their products?
LinkedIn data pipeline problem (diagram): a direct inter-process communication channel between the Frontend Server and the Metrics Server.
LinkedIn data pipeline problem: many publishers using direct connections (diagram): Frontend Servers, Database Server, Chat Server, and Backend Server each connect directly to Metrics Analysis, Metrics UI, Database Monitor, and Active Monitoring.
LinkedIn data pipeline problem: a publish/subscribe system (diagram): the same servers now publish to a single Metrics pub/sub, which Metrics Analysis, Metrics UI, Database Monitor, and Active Monitoring consume.
Publish-Subscribe (diagram): publishers send messages to a Topic; subscribers receive them. Pattern → Protocol → Technology: protocols such as AMQP and MQTT implement the pattern, and concrete technologies implement the protocols.
LinkedIn data pipeline problem: multiple publish/subscribe systems (diagram): separate Metrics, Logging, and Tracking pub/sub systems feed Metrics Analysis/Monitoring, Log Search, and Offline Processing respectively.
LinkedIn data pipeline problem: custom infrastructure for the data pipeline (diagram): replacing the separate Metrics, Logging, and Tracking pub/sub systems with a single, purpose-built system.
LinkedIn data pipeline problem - Kafka goals
● Decouple data pipelines
● Provide persistence for message data to allow multiple consumers
● Optimize for high throughput of messages
● Allow horizontal scaling of the system to grow as the data streams grow
Kafka Architecture - Elements
Kafka → a distributed, replicated commit log. Producers (e.g. Frontend Servers) write to a Kafka Cluster of Brokers, and Consumers (e.g. Metrics Analysis, Log Search) read from it; each Broker hosts topic partitions (Partition X / Topic Y).
Kafka Architecture - Broker
The commit log: the Producer appends records to the end of the log, while Consumer 1 and Consumer 2 each read at their own offset.
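The commit log above can be sketched in a few lines of Python. This is a toy model, not broker code: records are appended at increasing offsets, and each consumer keeps its own read position.

```python
# Toy sketch of a broker's commit log: append-only storage where
# each consumer tracks its own offset independently.

class CommitLog:
    def __init__(self):
        self._records = []               # append-only storage

    def append(self, record):
        self._records.append(record)
        return len(self._records) - 1    # offset of the new record

    def read(self, offset, max_records=10):
        return self._records[offset:offset + max_records]

log = CommitLog()
for msg in ["a", "b", "c"]:
    log.append(msg)

# Two consumers read independently: each keeps its own offset.
offset_c1, offset_c2 = 0, 2
print(log.read(offset_c1))   # ['a', 'b', 'c']
print(log.read(offset_c2))   # ['c']
```

Because the log is never mutated in place, a slow consumer simply lags behind; it does not block producers or other consumers.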
Kafka Architecture - Broker
A Kafka Cluster of three Brokers, each hosting topic partitions (e.g. Partition 0 / Topic A, Partition 1 / Topic A, Partition 0 / Topic B): partitions are distributed across brokers and replicated between them.
Kafka Architecture - Producer/Consumer
Basic concepts
● Latency
● Throughput
● Quality of service: at most once, at least once, exactly once
Use case requirements
○ Quality of service vs. latency
○ Throughput vs. latency
Producer/Consumer technology
○ Ingestion technologies
○ Kafka Client API
○ Kafka Connect
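The quality-of-service options above largely come down to when a consumer commits its offset relative to processing. This small simulation (a hypothetical helper, no real Kafka involved) shows the difference when a consumer crashes mid-stream:

```python
# Sketch: commit-before-processing gives at-most-once delivery,
# commit-after-processing gives at-least-once. We simulate a crash.

def consume(messages, start, commit_before_processing, crash_at=None):
    """Return (processed, committed_offset). Crashes before processing
    the message at index `crash_at`, after any pre-commit."""
    processed, committed = [], start
    for i in range(start, len(messages)):
        if commit_before_processing:
            committed = i + 1            # at-most-once: commit first
        if i == crash_at:
            return processed, committed  # crash: message i never processed
        processed.append(messages[i])
        if not commit_before_processing:
            committed = i + 1            # at-least-once: commit after
    return processed, committed

msgs = ["m0", "m1", "m2", "m3"]

# At-most-once: crash while handling m1 -> restart skips m1 (lost).
p1, off = consume(msgs, 0, commit_before_processing=True, crash_at=1)
p2, _ = consume(msgs, off, commit_before_processing=True)
print(p1 + p2)   # ['m0', 'm2', 'm3'] -> m1 lost

# At-least-once: crash while handling m1 -> restart redelivers from m1.
p1, off = consume(msgs, 0, commit_before_processing=False, crash_at=1)
p2, _ = consume(msgs, off, commit_before_processing=False)
print(p1 + p2)   # ['m0', 'm1', 'm2', 'm3'] -> nothing lost
```

At-least-once can still deliver duplicates (a crash after processing but before commit redelivers the last message), which is why exactly-once needs extra machinery on top.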
Kafka Architecture - Producer / Consumer
Kafka Architecture - Producer/Consumer API (diagram): a Producer on the Frontend Server writes to the Kafka Cluster; a Consumer feeding Log Search reads from it.
Kafka Protocol
Kafka Protocol - Producer
A Producer Record holds Topic, [Partition], [Key], and Value. Producer.send(record) delivers it to the Broker hosting the target partition (Partition 0 / Topic A) and returns metadata on success or an exception on failure.
Kafka Protocol - Producer (internals)
The Kafka Producer serializes the Producer Record (Topic, [Partition], [Key], Value), lets the Partitioner pick a partition, and appends the record to a per-topic/partition batch in the buffer (e.g. Batch 0 / Topic A / Partition 0). A sender thread sends the batches to the broker: on success it returns metadata (topic, partition, offset); on failure it retries while retries remain, otherwise it raises an exception.
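The Partitioner step can be sketched as follows. This is illustrative Python only: the real Java client hashes keys with murmur2, and newer versions use sticky rather than round-robin assignment for keyless records.

```python
# Sketch of producer-side partitioning: records with a key always hash
# to the same partition (preserving per-key order); keyless records are
# spread round-robin. Plain CRC hashing here is an assumption.

import zlib
from itertools import count

_round_robin = count()

def choose_partition(key, num_partitions):
    if key is None:
        return next(_round_robin) % num_partitions   # no key: round-robin
    # keyed: deterministic hash, so the same key -> same partition
    return zlib.crc32(key.encode("utf-8")) % num_partitions

# Same key always lands on the same partition.
assert choose_partition("user-42", 4) == choose_partition("user-42", 4)

# Keyless records rotate across all partitions.
print([choose_partition(None, 4) for _ in range(6)])   # [0, 1, 2, 3, 0, 1]
```

Keyed partitioning is what makes the per-partition ordering guarantee useful: all events for one key land in one partition and are read in order.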
Kafka Protocol - Consumer
(Broker, Topic A, Partition 0; the numbers 1-9 represent offsets, not messages.)
● Subscribe(topic) & poll
● Reads by topic, partition, and offset
● Order is guaranteed only within a partition
● Data is kept only for a limited, configurable time
● Deserializes data
Kafka Protocol - Consumer
A Kafka Consumer Group reading Topic A on a Broker: three consumers (Consumer 0-2) share four partitions (Partition 0-3); each partition is read by exactly one consumer in the group, each at its own offset.
Kafka Protocol - Consumer
Group sizing (diagrams): with four partitions and four consumers, each consumer reads one partition; with four partitions and two consumers, each reads two; with two partitions and four consumers, two consumers sit idle.
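The partition-to-consumer mapping in these diagrams can be sketched with a simple round-robin assignment (illustrative only; Kafka's actual assignors, such as range, round-robin, and sticky, are pluggable):

```python
# Sketch of partition assignment in a consumer group: each partition
# goes to exactly one consumer; extra consumers receive nothing.

def assign(partitions, consumers):
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

partitions = [0, 1, 2, 3]

print(assign(partitions, ["c0", "c1"]))
# {'c0': [0, 2], 'c1': [1, 3]}

print(assign(partitions, ["c0", "c1", "c2", "c3", "c4", "c5"]))
# c4 and c5 get no partitions: more consumers than partitions -> idle
```

This is why partition count caps the parallelism of a consumer group: adding consumers beyond the number of partitions gains nothing.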
Kafka Protocol - Consumer
AMQP vs. Kafka: in AMQP, producers publish to an Exchange on the Server, which routes messages into queues (aemet, google places 1, google places 2), and each consumer consumes a queue: a low-latency scenario. In Kafka, producers publish to topics (aemet, google places) on the Broker, and a Consumer Group consumes a topic: a high-throughput scenario.
Kafka Ecosystem
Kafka Connect
Kafka Connect is a framework included in Apache Kafka that integrates Kafka with other systems, making it easy to add new systems to your scalable and secure stream data pipelines. Source connectors bring data from external systems into Kafka; sink connectors deliver data from Kafka to external systems.
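As a concrete example, a standalone source connector is just a small properties file. This is the FileStreamSource quickstart connector shipped with Apache Kafka; the file and topic names are illustrative.

```properties
# Standalone source connector: tails a file and streams each line
# into the connect-test topic.
name=local-file-source
connector.class=FileStreamSource
tasks.max=1
file=test.txt
topic=connect-test
```

It is started with the standalone worker, e.g. `bin/connect-standalone.sh config/connect-standalone.properties connect-file-source.properties`.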
Kafka Connect
A Connector, defined by config properties, is split into Tasks that run as task threads inside Workers (standalone or distributed mode); source tasks wrap a Kafka Producer and sink tasks wrap a Kafka Consumer.
Schema Registry
In the Connect dataflow, the source Connector's Tasks emit streams of Source Records; in sendRecords() the Worker's Converter registers each record's schema with the Schema Registry (stored as a Subject: topic, schema, version), receives back a schema id, and produces Producer Records carrying that id. On the sink side, pollConsumer() reads Consumer Records, the Converter resolves the embedded schema id back to its schema, and the Worker hands Sink Records to the Connector's Tasks.
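Wiring the Converter to a Schema Registry instance is a worker configuration concern; a typical fragment looks like this (the URL is illustrative):

```properties
# Connect worker settings: (de)serialize record keys and values as Avro,
# registering/resolving schemas against a Schema Registry instance.
key.converter=io.confluent.connect.avro.AvroConverter
key.converter.schema.registry.url=http://localhost:8081
value.converter=io.confluent.connect.avro.AvroConverter
value.converter.schema.registry.url=http://localhost:8081
```

With this in place, messages on the wire carry only the compact schema id, not the full schema.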
Installation
Configuration: one properties file per service
✓ zookeeper.properties → Zookeeper
✓ server.properties → Kafka Server
✓ schema-registry.properties → Schema Registry
✓ connect-distributed.properties / connect-standalone.properties → Kafka Connect
✓ kafka-rest.properties → Kafka REST
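The services are typically started in the order listed, each reading its matching properties file. The commands below are a sketch; the paths assume a Confluent Platform layout and should be adjusted to your install.

```shell
# Typical startup order, one service per properties file.
bin/zookeeper-server-start etc/kafka/zookeeper.properties
bin/kafka-server-start etc/kafka/server.properties
bin/schema-registry-start etc/schema-registry/schema-registry.properties
bin/connect-distributed etc/kafka/connect-distributed.properties
bin/kafka-rest-start etc/kafka-rest/kafka-rest.properties
```

Zookeeper must be up before the Kafka server, and the Kafka server before Schema Registry, Connect, and the REST proxy, since each depends on the one before it.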
Demo
Demo
Sources → Kafka Producer → single Broker, single instance → Kafka Connect → Schema Registry.