The Best of Apache Kafka Architecture Ranganathan Balashanmugam @ran_than Apache: Big Data 2015
Helló Budapest
About Me ❏ Graduated as Civil Engineer. ❏ <dev> 10+ years </dev> ❏ <Thoughtworker from=”India”/> ❏ Organizer of Hyderabad Scalability Meetup with 2000+ members.
“Form follows function.” - Louis Sullivan
Gravity Dam Indirasagar Dam, India img src: http://www.montanhydraulik.in
Forces on a gravity dam Dam weight Head Water Tail Water Uplift
❏ publish-subscribe messaging service ❏ distributed commit/write-ahead log “producers produce, consumers consume, in large distributed reliable way -- real time”
Why Kafka? ❏ DBs ❏ Logs ❏ Brokers ❏ HDFS “For highly distributed messages, Kafka stands out.”
Kafka Vs ________ src: https://softwaremill.com/mqperf/
Timeline 2011 2012 2013 2014 2015 Open sourced by LinkedIn, as version 0.6 Graduated from Apache Incubator Latest stable - 0.8.2.1 Several engineers who built Kafka create Confluent
A Kafka Message kafka.message.Message CRC | magic | attributes | key length | key | message length | message content Change requested: KAFKA-2511
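The field layout on this slide can be sketched as a small encoder and decoder. This is only an illustration, not the code in kafka.message.Message: the function names are hypothetical, and the byte widths (4-byte CRC and lengths, 1-byte magic and attributes), big-endian order, the CRC covering everything after the CRC field, and -1 as the length of a null key are assumptions based on the 0.8 wire format.

```python
import struct
import zlib


def encode_message(key, value, magic=0, attributes=0):
    """Pack CRC | magic | attributes | key length | key | message length | message content."""
    body = struct.pack(">bb", magic, attributes)
    if key is None:
        body += struct.pack(">i", -1)  # assumption: -1 marks a null key
    else:
        body += struct.pack(">i", len(key)) + key
    body += struct.pack(">i", len(value)) + value
    crc = zlib.crc32(body) & 0xFFFFFFFF  # checksum over everything after the CRC field
    return struct.pack(">I", crc) + body


def decode_message(buf):
    """Return (magic, attributes, key, value); raise ValueError on a CRC mismatch."""
    crc, magic, attributes, key_len = struct.unpack_from(">Ibbi", buf, 0)
    if zlib.crc32(buf[4:]) & 0xFFFFFFFF != crc:
        raise ValueError("CRC mismatch")
    pos = 10  # 4 (CRC) + 1 (magic) + 1 (attributes) + 4 (key length)
    key = None
    if key_len >= 0:
        key = buf[pos:pos + key_len]
        pos += key_len
    (value_len,) = struct.unpack_from(">i", buf, pos)
    pos += 4
    return magic, attributes, key, buf[pos:pos + value_len]
```

Storing the checksum first lets a reader detect a corrupted message before trusting any of the length fields.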
Producers - push Kafka Broker org.apache.kafka.clients.producer.KafkaProducer Request => RequiredAcks Timeout [TopicName [Partition MessageSetSize MessageSet]] Response => [TopicName [Partition ErrorCode Offset]]
Topic kafka.common.Topic Remove messages based on: ❏ number of messages ❏ time ❏ size
Partitions kafka.cluster.Partition Serves: Horizontal scaling, Parallel consumer reads
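One common way to spread keyed messages over partitions is to hash the key and take it modulo the partition count. The sketch below illustrates that idea only; choose_partition is a hypothetical name, and CRC32 is used as a stand-in hash, not the function Kafka's own partitioners use.

```python
import zlib


def choose_partition(key, num_partitions):
    """Map a message key (bytes) to a partition index in [0, num_partitions)."""
    if num_partitions <= 0:
        raise ValueError("num_partitions must be positive")
    # Stand-in hash: the same key always lands on the same partition.
    return zlib.crc32(key) % num_partitions
```

Because the mapping is deterministic, all messages for one key stay in one partition and keep their relative order, while different keys can be read by different consumers in parallel.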
Consumers - pull kafka.consumer.ConsumerConnector, kafka.consumer.SimpleConsumer Consumer 1 Consumer 2
Consumer offsets committing and fetching consumer offsets img src: http://www.reynanprinting.com/photos/undefined/impresion-offset1.jpg
kafka:// - protocol ● Metadata ● Send ● Fetch ● Offsets ● Offset commit ● Offset fetch “Binary protocol over TCP”
Mechanical Sympathy "The most amazing achievement of the computer software industry is its continuing cancellation of the steady and staggering gains made by the computer hardware industry." - Henry Petroski Image source: http://www.theguide2surrey.com
Persistence “Everything is faster till the disk IO.”
Sequential disk access can be faster than random RAM access src: http://queue.acm.org/detail.cfm?id=1563874
Linear Reads & Writes At a high level there are only two operations: ❏ append to the end of the log ❏ fetch messages from a partition, beginning from a particular message id Both are sequential file I/O
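The two operations can be sketched with an in-memory list standing in for the on-disk log. PartitionLog is a hypothetical class for illustration; a real broker appends to segment files, but the interface has the same shape.

```python
class PartitionLog:
    """In-memory stand-in for one partition's append-only log."""

    def __init__(self):
        self._messages = []

    def append(self, message):
        """Append to the end of the log and return the offset assigned to the message."""
        self._messages.append(message)
        return len(self._messages) - 1

    def fetch(self, offset, max_messages=100):
        """Return up to max_messages messages, starting at the given offset."""
        if offset < 0 or offset > len(self._messages):
            raise IndexError("offset out of range")
        return self._messages[offset:offset + max_messages]
```

Nothing is ever updated in place: writers only touch the tail, and readers scan forward from an offset, which is what keeps the disk access pattern sequential.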
“Let us play pictionary”
Linux Page Cache “Kafka ate my RAM”
ZeroCopy src: http://www.ibm.com/developerworks/library/j-zerocopy/
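As a rough illustration of the zero-copy idea, Python's socket.sendfile hands a file to a connected socket and uses the operating system's sendfile call where it is available, so the bytes do not have to pass through a user-space buffer. The helper name send_file_over_socket is hypothetical, and this is a sketch of the technique, not Kafka's broker code.

```python
import socket


def send_file_over_socket(path, sock):
    """Send a whole file over a connected stream socket and return the bytes sent.

    socket.sendfile uses os.sendfile when the platform supports it and
    falls back to ordinary send() calls otherwise.
    """
    with open(path, "rb") as f:
        return sock.sendfile(f)
```

Combined with the page cache, this lets log data go from the kernel's cache to the network socket without being copied into the application.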
Batching small latency to improve throughput img src: https://prashanthpanduranga.files.wordpress.com/2015/05/tirupati.jpg
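The trade-off on this slide can be sketched as a sender that holds messages back until a batch is full. BatchingSender and its batch_size parameter are hypothetical names; a real producer would typically also flush after a time limit, which is left out here to keep the sketch deterministic.

```python
class BatchingSender:
    """Collect messages and hand them to `send` in batches."""

    def __init__(self, send, batch_size=3):
        self._send = send              # callable that receives a list of messages
        self._batch_size = batch_size
        self._pending = []

    def add(self, message):
        """Queue a message; flush automatically once the batch is full."""
        self._pending.append(message)
        if len(self._pending) >= self._batch_size:
            self.flush()

    def flush(self):
        """Send whatever is pending as one batch (no-op when empty)."""
        if self._pending:
            batch, self._pending = self._pending, []
            self._send(batch)
```

Each message may wait a little for its batch to fill, but the number of network requests drops by roughly the batch size.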
Compression Bandwidth between data centers is more expensive per byte to scale than disk I/O, CPU, or network bandwidth capacity within a facility kafka.message.CompressionCodec
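Compression pays off most when it is applied to a batch of similar messages, because the repetition across messages is what the compressor exploits. The sketch below compares the two approaches using zlib (DEFLATE) as a stand-in codec; compressed_sizes is a hypothetical helper, not part of kafka.message.CompressionCodec.

```python
import zlib


def compressed_sizes(messages):
    """Return (bytes when each message is compressed alone, bytes when compressed as one batch)."""
    individually = sum(len(zlib.compress(m)) for m in messages)
    as_batch = len(zlib.compress(b"".join(messages)))
    return individually, as_batch
```

For a stream of messages that share field names and structure, the batch figure is typically far smaller than the per-message total.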
Log compaction img src: http://kafka.apache.org/083/documentation.html kafka.log.LogCleaner, LogCleanerManager
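The core idea of log compaction is to retain only the most recent record for each key. The sketch below shows that idea on a list of (offset, key, value) tuples; compact is a hypothetical function and a simplification, since the real cleaner works segment by segment and also handles delete markers.

```python
def compact(records):
    """Keep only the latest record per key, preserving offset order.

    `records` is a list of (offset, key, value) tuples in ascending offset order.
    """
    latest = {}
    for offset, key, value in records:
        latest[key] = (offset, key, value)  # later records overwrite earlier ones
    return sorted(latest.values(), key=lambda record: record[0])
```

Surviving records keep their original offsets, so consumers that stored an offset before compaction can still resume from it.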
Message Delivery At least once At most once Exactly once
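The difference between at-least-once and at-most-once comes down to whether a consumer commits its offset before or after processing a message. The simulation below is a sketch under that assumption; run_consumer and its parameters are hypothetical, and the "crash" is just an early return between the two steps.

```python
def run_consumer(messages, start_offset, commit_before_processing, crash_at=None):
    """Consume from start_offset; return (processed messages, committed offset).

    If crash_at is set, the consumer stops at that offset between the
    commit step and the processing step, whichever order they run in.
    """
    processed = []
    committed = start_offset
    for offset in range(start_offset, len(messages)):
        if commit_before_processing:
            committed = offset + 1               # at most once: commit first
            if offset == crash_at:
                return processed, committed      # message is never processed
            processed.append(messages[offset])
        else:
            processed.append(messages[offset])   # at least once: process first
            if offset == crash_at:
                return processed, committed      # message will be redelivered
            committed = offset + 1
    return processed, committed
```

Restarting from the committed offset after the crash either skips the in-flight message (at most once) or delivers it a second time (at least once).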
Replication un-replicated = replication factor of one
Quorum based ● Better latency ● To tolerate “f” failures, need “2f+1” replicas
Primary-backup replication Broker 1 Broker 2 Broker 3 Broker 4 Topic 1 Topic 1 Topic 1 Topic 2 Topic 2 Topic 2 Topic 3 Topic 3 Topic 3
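The replica counts behind the two approaches can be compared with a little arithmetic. The 2f+1 figure for quorums is from the previous slide; the f+1 figure for primary-backup replication is not on the slides and is taken from how the Kafka documentation describes its in-sync replica approach. Both function names are hypothetical.

```python
def replicas_for_quorum(f):
    """Replicas a majority-quorum scheme needs to tolerate f failures."""
    return 2 * f + 1


def replicas_for_primary_backup(f):
    """Replicas a primary-backup scheme needs to tolerate f failures."""
    return f + 1
```

Tolerating two failures, for example, takes five replicas with a quorum but three with primary-backup, at the cost of waiting for every in-sync replica on each write.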
ZooKeeper cluster coordinator
THANK YOU For questions or suggestions: Ran.ga.na.than B ranganab@thoughtworks.com @ran_than
