1. Apache Kafka at Rocana: Persistent Machine Data Collection at Scale
2. Who am I?
Platform Engineer, based in Ottawa
alan@rocana.com • @alanctgardner
3. Working at Rocana
4. Rocana Ops
5. Kafka Principles
6. History
• Designed at LinkedIn
• Documented in a 2013 blog post by Jay Kreps
• LinkedIn moved from a monolith to multiple data stores and services
https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying
7. Complexity
https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying
8. Complexity
9. Centralized Data Bus
https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying
10. Centralized Data Bus
11. Design Goals
A centralized data bus that:
• Scales horizontally
• Delivers (some) events in order
• Decouples producers and consumers
• Has low latency end-to-end
12. A Horizontally Scalable Log
https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying
13. Asynchronous Consumers
14. Low-Latency, Durable Writes
• Kafka writes all events to disk
• Events are stored on disk in the same format as the wire protocol
• Zero-copy reads and writes avoid events ever entering user space (a sketch follows)
• Kafka relies on the page cache for low-latency serving of recent events
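The zero-copy point refers to sendfile-style transfers: in Java that is FileChannel.transferTo, which Kafka's own design documentation describes as the mechanism for moving log-segment bytes from the page cache to a socket. A generic sketch of the technique, not Kafka's actual code; the file name and port are placeholders:

    import java.io.FileInputStream;
    import java.net.InetSocketAddress;
    import java.nio.channels.FileChannel;
    import java.nio.channels.SocketChannel;

    public class ZeroCopy {
        public static void main(String[] args) throws Exception {
            try (FileChannel segment = new FileInputStream("segment.log").getChannel();
                 SocketChannel sock = SocketChannel.open(new InetSocketAddress("localhost", 9092))) {
                // transferTo delegates the copy to the kernel (sendfile on Linux):
                // bytes move page cache -> socket without ever entering user space.
                long sent = segment.transferTo(0, segment.size(), sock);
            }
        }
    }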
15. Putting it all together
16. Our Experience
17.
18. Resource Constraints
• Customer machines are doing real work
• Agent footprint must be small
• Can't depend on availability of back-end services
• Batching is crucial (a producer sketch follows)
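A minimal sketch of what batching looks like with the new Java producer that shipped in Kafka 0.8.2. The broker address, topic name and tuning values are illustrative assumptions, not Rocana's production settings:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class BatchingAgent {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "broker1:9092"); // placeholder address
            props.put("key.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
            // Batching knobs: buffer up to 64 KB per partition, or wait up to 50 ms,
            // whichever comes first. Bigger batches amortize per-request overhead,
            // which keeps the agent's footprint small on busy customer machines.
            props.put("batch.size", "65536");
            props.put("linger.ms", "50");
            props.put("acks", "1"); // leader-only ack: low latency at some durability cost

            KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(props);
            byte[] event = "host=web01 service=nginx body=...".getBytes();
            // send() is asynchronous: it appends to an in-memory batch and returns,
            // so the agent never blocks on the network per event.
            producer.send(new ProducerRecord<byte[], byte[]>("events", event));
            producer.close(); // flushes any outstanding batches
        }
    }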
19. Independent Consumers
• Consumers aren't coupled to each other
• Maintenance and upgrades are simplified
• Horizontal scale per consumer (a sketch follows)
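A sketch of that decoupling using the 0.8-era high-level consumer; the ZooKeeper address, group and topic names are hypothetical. Each downstream system runs under its own group.id, so each tracks its own offsets and can lag, restart or upgrade without affecting the others:

    import java.util.Collections;
    import java.util.List;
    import java.util.Map;
    import java.util.Properties;
    import kafka.consumer.Consumer;
    import kafka.consumer.ConsumerConfig;
    import kafka.consumer.ConsumerIterator;
    import kafka.consumer.KafkaStream;
    import kafka.javaapi.consumer.ConsumerConnector;

    public class IndexerConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("zookeeper.connect", "zk1:2181");  // placeholder ZK address
            // Offsets are tracked per group.id: a "metrics" group and a
            // "solr-indexer" group read the same partitions at independent rates.
            props.put("group.id", "solr-indexer");
            props.put("auto.offset.reset", "smallest");  // new groups start at the oldest retained event

            ConsumerConnector connector =
                Consumer.createJavaConsumerConnector(new ConsumerConfig(props));
            Map<String, List<KafkaStream<byte[], byte[]>>> streams =
                connector.createMessageStreams(Collections.singletonMap("events", 1));
            ConsumerIterator<byte[], byte[]> it = streams.get("events").get(0).iterator();
            while (it.hasNext()) {
                byte[] event = it.next().message();
                // hand the event to this consumer's sink (e.g. Solr)
            }
        }
    }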
20. Vendor Support
21.
22. Shamelessly stolen from https://aphyr.com/
23. Ephemeral Source / Durable Source

Ephemeral source (syslog):
{
  "syslog_arrival_ts": "1444489076463",
  "syslog_conn_dns": "localhost",
  "syslog_conn_port": "57788",
  "body": …,
  "id": "KLE5GZF7WB2WSA5…",
  …
}

Durable source (tailed file):
{
  "tailed_file_inode": "2371810",
  "tailed_file_offset": "384930",
  "timestamp": "",
  "body": …,
  "id": "73XXMLRJNHKA76…",
  …
}
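Per the speaker notes, the id is a hash of event fields, used for partition assignment, the deduplication filter, and idempotent inserts into Solr. A minimal sketch of the durable-source case; the field choice mirrors the slide, but the hash algorithm and hex encoding are assumptions:

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;

    public class EventId {
        // Hashing (inode, offset) gives a stable id: re-reading the same file
        // bytes after an agent crash reproduces the same id, so downstream
        // consumers can detect and drop the duplicate.
        static String durableId(String inode, String offset) throws Exception {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            md.update(inode.getBytes(StandardCharsets.UTF_8));
            md.update(offset.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder();
            for (byte b : md.digest()) sb.append(String.format("%02X", b));
            return sb.toString();
        }

        public static void main(String[] args) throws Exception {
            System.out.println(durableId("2371810", "384930"));
        }
    }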
24. Durability
25. Unclean Elections
• Kafka maintains a set of up-to-date replicas in ZK: the "in-sync replicas", or ISR
• The ISR can dynamically grow or shrink; by default Kafka will accept writes with a single in-sync replica
• It is possible for the ISR to shrink to 0 nodes, which leads to either:
  • partition unavailability until an in-sync replica returns to life
  • OR data loss when an out-of-sync node begins accepting writes
• This is tunable with the "unclean leader election" property, which defaults to true in 0.8.2 (a config sketch follows)
http://blog.empathybox.com/post/62279088548/a-few-notes-on-kafka-and-jepsen
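A sketch of the broker settings involved, assuming the 0.8.2 property names; the values shown trade availability for consistency, which is the opposite of the default behavior the slide describes:

    # server.properties (sketch)
    # Never elect an out-of-sync replica as leader: the partition stays
    # unavailable until an in-sync replica returns, but no acked write is lost.
    unclean.leader.election.enable=false
    # For producers using acks=-1, reject writes unless at least
    # 2 replicas are in sync.
    min.insync.replicas=2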
26.
27. Schema Versioning
• Schemas are absolutely necessary
• Have a plan for how to evolve the schema before v1 (an Avro example follows)
• A schema registry is a good investment
http://www.confluent.io/blog/schema-registry-kafka-stream-processing-yes-virginia-you-really-need-one
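The speaker notes mention Rocana uses Avro, so here is a generic sketch of what an evolution plan can look like; the record and field names are hypothetical, not Rocana's schema. Giving the new field a default means readers on the v2 schema can still decode v1 records:

    {
      "type": "record",
      "name": "Event",
      "fields": [
        {"name": "id",   "type": "string"},
        {"name": "body", "type": "bytes"},
        {"name": "location", "type": ["null", "string"], "default": null}
      ]
    }

The "location" field is the v2 addition. Avro resolution still requires the writer's schema to decode a record, which is exactly the gap a schema registry fills.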
28. Security
• No encryption or authentication in 0.8.x
• stunnel or encryption at the app layer are possible workarounds (a sketch follows)
• Should be fixed in 0.9.0
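A sketch of the app-layer workaround: encrypt each event body before producing it and decrypt in the consumer, so the broker only ever handles opaque bytes. Key handling is deliberately waved away; generating the key inline is purely for illustration:

    import javax.crypto.Cipher;
    import javax.crypto.KeyGenerator;
    import javax.crypto.SecretKey;

    public class AppLayerCrypto {
        public static void main(String[] args) throws Exception {
            // In reality the key must reach agents and consumers out of band,
            // and a cipher mode with an IV should be chosen deliberately.
            SecretKey key = KeyGenerator.getInstance("AES").generateKey();

            Cipher enc = Cipher.getInstance("AES");
            enc.init(Cipher.ENCRYPT_MODE, key);
            byte[] ciphertext = enc.doFinal("event body".getBytes());
            // ...produce ciphertext to Kafka; the broker sees only opaque bytes...

            Cipher dec = Cipher.getInstance("AES");
            dec.init(Cipher.DECRYPT_MODE, key);
            byte[] plaintext = dec.doFinal(ciphertext);
        }
    }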
29. Replication
• Cross-DC clusters are not recommended
• Kafka includes MirrorMaker for replication between two clusters
• Replication is asynchronous
• Offsets aren't consistent between clusters, so consumers can't fail over
30. Operations
• Everything is manual:
  • Rebalancing partitions
  • Rebalancing leaders
  • Decommissioning nodes
• Watch for lagging consumers
31. Sizing
• Consider both throughput and retention time
• Overprovision the number of partitions
• Rebalancing is easy, but re-sharding breaks consistent hashing (a sketch follows)
http://www.confluent.io/blog/how-to-choose-the-number-of-topicspartitions-in-a-kafka-cluster/
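A minimal illustration of why growing the partition count "breaks consistent hashing": keyed events are routed by hash modulo partition count, so changing the count silently reroutes existing keys, along with any per-key ordering or dedup state. The hash below stands in for the partitioner's real function; the key value is hypothetical:

    public class PartitionRemap {
        // hash-mod-N routing, the shape of Kafka's default keyed partitioning
        static int partition(String key, int numPartitions) {
            return (key.hashCode() & 0x7fffffff) % numPartitions;
        }

        public static void main(String[] args) {
            String key = "host-042";
            System.out.println(partition(key, 24)); // route with 24 partitions...
            System.out.println(partition(key, 48)); // ...usually different with 48
        }
    }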
32. Performance
• Jay Kreps ran an on-premises benchmark
• 18 spindles and 18 cores across 3 boxes could absorb 2.5M events/sec
• Aggressive batching is necessary
• Requiring synchronous ACKs from all replicas halved throughput
https://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines
33. Performance
• Reproduced on AWS with 3- and 5-node clusters
• d2.xlarge broker nodes: 3 spindles, 4 cores, 30.5 GB RAM
• 5 producers on m3.xlarge instances
• 3 nodes accepted 2.6M events/s (24 partitions, one replica, one ACK); dropped to 1.7M with 3x replication and 1 ACK
• 5 nodes accepted 3.6M events/s (48 partitions, one replica, one ACK); dropped to 2.16M with 3x replication and 1 ACK
34. Thank You!


Editor's Notes

  • #2 I'm Alan Gardner, here to talk about our use of Apache Kafka at Rocana.
  • #3 Platform engineer at Rocana. I work on data ingest, storage and processing; distributed open-source systems (Hadoop, Kafka, Solr); and systems programming work as well. I work remotely from Ottawa, Canada. This is my cat.
  • #4 Working at Rocana is great: everybody is remote; very smart, very nice people; quarterly onsites.
  • #5 What is Rocana Ops? A platform for IT operations data, designed for tens of thousands of servers in multiple data centers. It distills the entire organization's IT infrastructure down to a single screen: "what's wrong?" A scalable collection framework covers host data and app logs out of the box; the event data warehouse is built on open source technologies and open schemas; visualization, anomaly detection and machine learning provide guided root cause analysis, as opposed to a wall of graphs or a pile of logs. Apache Kafka is the "Enterprise Data Bus"; I'm going to talk about why we chose Kafka in that role.
  • #6 To explain why we chose Kafka, I’m going to start with how Kafka works and why it’s designed the way it is.
  • #7 Designed at LinkedIn to handle the explosion of different systems being created. Jay Kreps' blog post describes Kafka from first principles, including motivation. Some of these images are cribbed from that post, where appropriate.
  • #8 LinkedIn's problem: lots of front-end services, lots of back-end services; hooking them together produces this complex spaghetti of dependencies. The front-end has to be highly available and low-latency; if you write synchronously, you can only be as fast as your slowest backend service.
  • #10 Kafka acts as a central bus for data: every front-end service writes all events into Kafka; backend services can take only the events they're interested in; data doesn't live in Kafka forever. Kafka is run as a utility within LinkedIn. This solves one goal, a centralized data bus; we still need horizontal scale and durability.
  • #11 This is much better
  • #13 Kafka is fundamentally a collection of logs: events are only appended, and events are always consumed in the same order. A single partition is a log: an ordered set of events, where every event has an offset. Partitions are the units of scale, like shards. Log operations are constrained, so we can make them fast. The example is sharding on users.
  • #14 Consuming and producing are completely decoupled: consumers maintain their own logical offset, representing the last event they consumed; different consumers can consume at different rates; producers continue to append new events in order. Events are retained until an expiry time or a max log size; Kafka is not a durable long-term store. Consumers can go offline for extended time, or start from scratch and consume all available events. Events are durably written and replicated.
  • #15 Kafka writes all data to disk, with lots of good tricks: low latency for recent data from the page cache; data on the wire is the same as on disk; no GC overhead for the page cache; zero-copy ops.
  • #16 This is an overview of a typical Kafka system: multiple producers, brokers and consumers. Each broker has ownership of a set of partitions (it's the primary); broker lists and partition assignment are stored in ZK. Consumers are using ZK to store offsets here, but that's not the only way.
  • #18 Let's revisit the Rocana architecture: thousands of agents writing into Kafka; events are distributed across multiple partitions, written durably to disk; multiple, separate consumers are decoupled from the producers and each other.
  • #19 Resource limits on producer machines: these machines are doing real work that's important to the business, so our agent needs to quickly encode events and produce them. Batching is important to ensure efficiency; latency to write to Kafka is still very low.
  • #20 Consumers don't affect each other: each maintains its own offsets; one consumer can be taken offline, can be slow, etc. with little impact; upgrades are very easy; a single consumer can even be rewound (theoretically); consumers can scale horizontally with the number of partitions.
  • #21 Kafka has critical mass within the industry: Cloudera, Hortonworks and MapR all support it, and Confluent has all the designers of Kafka working on a commercial stream processing platform.
  • #22 Those are all good things, but there are some sharp edges to watch out for.
  • #23 Kingsbury tire fire slide. Exactly-once delivery is very hard, and not all of our consumers are doing something idempotent. You can play back the whole partition to find the last message which was written.
  • #24 Overview of a Rocana Event which would be published into Kafka: fixed fields and key-value pairs. The ID is a hash of an event's fields, used for duplicate detection. For durable sources we can use offset and inode, which gets 99% of the way; for ephemeral sources we use arrival time + internal fields. The ID is used for three things: assignment to a partition, the deduplication filter, and the ID in Solr for idempotent inserts.
  • #25 Kafka "writes every message to disk", but it defaults to fsyncing every 10k messages, or every 3 seconds (at most); the ACK happens when a message is written but not fsynced. OK, so I'll replicate data across multiple machines.
  • #26 The default in Kafka is to continue making progress in the presence of node failures (AP): unclean elections allow a replica which has not seen all writes to become the leader when the ISR shrinks to 0; the minimum ISR size to accept writes is only 1 by default; when a previously in-sync replica comes back, the records only it had are lost. It can be disabled; see Jay's blog for more discussion.
  • #27 Some things aren't hard, but you need to look out for them:
  • #28 Data you put in Kafka really needs to have a schema; schemas really need to have an evolution strategy; and you probably want some notion of a schema registry. Gwen's post is great. We use Avro, where the consumer has to know the writer schema; we tried to mitigate this with nullable fields, no luck.
  • #29 There isn't any security: no encryption on disk or in flight, and no authentication. You can use stunnel, or you could encrypt each byte buffer and decrypt on the client side. These will probably both be fixed in 0.9.0 this month.
  • #30 MirrorMaker is basically just a consumer/producer which pumps data between clusters: it doesn't preserve offsets, so consumers can't fail over; you can send events between two different-sized clusters; you can merge streams from two data centres.
  • #31 Kafka operations are pretty basic; it comes with a giant `bin` dir full of tools, including a CLI for rebalancing partitions and leaders. Leaders and partitions rebalance on node failure; adding nodes requires reassignment; decommissioning nodes is a giant pain right now. There is a tool for finding lagging consumers.
  • #32 Factors to consider when sizing a cluster: I/O throughput; retention time frame (throughput over time); partitions limit the concurrency of consumers; future growth (in terms of setting the number of partitions). Growing a cluster online is manual but possible in 0.8.2; growing the number of partitions breaks consistent hashing! (http://www.confluent.io/blog/how-to-choose-the-number-of-topicspartitions-in-a-kafka-cluster/)
  • #33 Jay Kreps has a blog post about this, using 3 commodity broker boxes with 6 cores and 6 spindles each. It's a little weird: he only uses 6 partitions, so he never exercises all the spindles in the cluster, and he batches small messages really aggressively (8k batches of 100-byte messages). His setup is on-premises; he hits 2.5M records/sec producing and consuming. Requiring 3 acks for every message halved throughput.
  • #34 I used a similar methodology on EC2 to get some sizing numbers. I used 4k batch sizes; results were broadly similar (1k and 2k hurt perf). Over-provisioning partitions by 2x spindles doesn't give a benefit, but doesn't slow things down either; over-provisioning by 2x and adding 3x replication did cause a slowdown. One partition actually hit 700k events/s; there may be coordination issues in the producer. Synchronous acks were brutal, a 10x performance hit; this is almost definitely due to AWS network latency. Each node is ~$500/month. At 250MB/sec we'd only get ~18 hours of retention; we've seen instances of only 12 hours of retention.