Real-Time log analysis with Mesos, Docker, Kafka, Spark, Cassandra and Solr at scale
whoami
CEO of Elodina (http://www.elodina.net/), a big data as a service platform built on top of open source software.
The Elodina platform enables customers to analyze data streams and programmatically react to the results in real time. We solve today’s data analytics needs by providing the tools and support necessary to use open source technologies.
As users, contributors and committers, we also provide support for frameworks that run on Mesos, including Apache Kafka, Exhibitor (ZooKeeper), Apache Storm, Apache Cassandra and a whole lot more!
Apache Kafka Committer & PMC Member
LinkedIn: http://linkedin.com/in/charmalloc
Twitter: @allthingshadoop
© 2015. All Rights Reserved.
1 Intro to Mesos, Kafka, etc.
2 Architecture Overview
3 Breaking it down into pieces
4 Questions?
Apache Mesos
Mesos Papers
● Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center http://static.usenix.org/event/nsdi11/tech/full_papers/Hindman_new.pdf
● Google Borg https://research.google.com/pubs/pub43438.html
● Google Omega: flexible, scalable schedulers for large compute clusters http://eurosys2013.tudos.org/wp-content/uploads/2013/paper/Schwarzkopf.pdf
Static Partitioning
Fine-Grained Resource Elasticity
On server utilization: "If people knew how low it really is, we’d all get fired."
https://gigaom.com/2013/11/30/the-sorry-state-of-server-utilization-and-the-impending-post-hypervisor-era/
An operating system for your data center
EVERYTHING ON MESOS
How it works
Marathon
https://github.com/mesosphere/marathon
Cluster-wide init and control system for services in cgroups or Docker, based on Apache Mesos
Docker on Marathon
    {
      "id": "basic-3",
      "cmd": "python3 -m http.server 8080",
      "cpus": 0.5,
      "mem": 32.0,
      "container": {
        "type": "DOCKER",
        "docker": {
          "image": "python:3",
          "network": "BRIDGE",
          "portMappings": [
            { "containerPort": 8080, "hostPort": 0 }
          ]
        }
      }
    }
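An app definition like this one can be built and sanity-checked in code before it is submitted. What follows is a minimal sketch in Python, assuming submission through Marathon's `POST /v2/apps` endpoint. The helper names `marathon_app` and `build_request` and the address `marathon.example:8080` are hypothetical and not from the slides. The request is only constructed here, never sent.

```python
import json
import urllib.request


def marathon_app(app_id, image, cmd, container_port, cpus=0.5, mem=32.0):
    """Build a Marathon app definition for a Docker container in BRIDGE mode."""
    return {
        "id": app_id,
        "cmd": cmd,
        "cpus": cpus,
        "mem": mem,
        "container": {
            "type": "DOCKER",
            "docker": {
                "image": image,
                "network": "BRIDGE",
                # hostPort 0 asks for a dynamically assigned host port.
                "portMappings": [
                    {"containerPort": container_port, "hostPort": 0}
                ],
            },
        },
    }


def build_request(marathon_url, app):
    """Prepare (but do not send) the HTTP request that would create the app."""
    return urllib.request.Request(
        url=marathon_url.rstrip("/") + "/v2/apps",
        data=json.dumps(app).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


app = marathon_app("basic-3", "python:3", "python3 -m http.server 8080", 8080)
# Placeholder address (an assumption); replace with a real Marathon endpoint.
request = build_request("http://marathon.example:8080", app)
```

Passing `request` to `urllib.request.urlopen` would submit it. Keeping construction separate from sending makes the payload easy to inspect first.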
Apache Kafka
Kafka papers
Apache Kafka was first open sourced by LinkedIn in 2011
Papers
● Building a Replicated Logging System with Apache Kafka http://www.vldb.org/pvldb/vol8/p1654-wang.pdf
● Kafka: A Distributed Messaging System for Log Processing http://research.microsoft.com/en-us/um/people/srikanth/netdb11/netdb11papers/netdb11-final12.pdf
● Building LinkedIn’s Real-time Activity Data Pipeline http://sites.computer.org/debull/A12june/pipeline.pdf
● The Log: What Every Software Engineer Should Know About Real-time Data's Unifying Abstraction http://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying
http://kafka.apache.org/
How Big Data Starts
More Big Data! More!
uhhhh
eeesh
Kafka decouples data pipelines
Distributed Replicated Log
● Read & Write
● In real time
● As much as you want
● As fast as your network
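The log abstraction behind these points can be sketched in a few lines. This is a toy, single-process illustration in Python, not Kafka's implementation: replication, partitioning and persistence are deliberately left out, and the class names `Log` and `Consumer` are hypothetical. It shows only the core idea that writers append, records get sequential offsets, and each reader tracks its own position.

```python
class Log:
    """Append-only log: records get sequential offsets and are never modified."""

    def __init__(self):
        self._records = []

    def append(self, record):
        """Append a record and return the offset it was written at."""
        self._records.append(record)
        return len(self._records) - 1

    def read(self, offset, max_records=10):
        """Return up to max_records records starting at offset."""
        return self._records[offset:offset + max_records]


class Consumer:
    """Each consumer keeps its own offset, independent of other readers."""

    def __init__(self, log):
        self._log = log
        self.offset = 0

    def poll(self, max_records=10):
        """Read the next batch and advance this consumer's offset."""
        batch = self._log.read(self.offset, max_records)
        self.offset += len(batch)
        return batch


log = Log()
for line in ["boot", "login", "logout"]:
    log.append(line)

fast, slow = Consumer(log), Consumer(log)
fast_batch = fast.poll()               # reads everything written so far
slow_batch = slow.poll(max_records=1)  # reads one record and stays behind
```

Because the slow consumer's position is just a number it owns, falling behind does not affect the writer or the fast consumer. That is the sense in which the log decouples pipelines.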
Reference Architecture
Producers
● syslog → Kafka via docker https://hub.docker.com/r/stealthly/syslog/
● syslog → Kafka scheduler https://github.com/stealthly/syslog-service
● statsd → Kafka scheduler https://github.com/stealthly/statsd-mesos-kafka
● system stats collection → Kafka scheduler https://github.com/stealthly/syscol
● tailf → Kafka https://github.com/stealthly/go_kafka_client/tree/master/producers/tailf
● Any language https://cwiki.apache.org/confluence/display/KAFKA/Clients
Reference Architecture
Kafka on Mesos
https://github.com/mesos/kafka
Kafka on Mesos
● smart broker.id assignment.
● preservation of broker placement (through constraints and/or new features).
● ability to make configuration changes.
● rolling restarts (for things like configuration changes).
● scaling the cluster up and down with automatic, programmatic and manual options.
● smart partition assignment via constraints vis-à-vis roles, resources and attributes.
CLI & REST API
● scheduler - starts the scheduler.
● broker
  ○ add - adds one or more brokers to the cluster.
  ○ update - changes resources, constraints or broker properties of one or more brokers.
  ○ remove - takes a broker out of the cluster.
  ○ start - starts a broker up.
  ○ stop - either shuts a broker down gracefully or force kills it (./kafka-mesos.sh help stop).
● topic
  ○ list - lists topics in the cluster.
  ○ add - adds new topics to the cluster.
  ○ update - changes topics in the cluster.
  ○ rebalance - rebalances a cluster, selecting either the brokers or the topics to rebalance. Manual assignment is still possible using the Apache Kafka project tools. Rebalance can also change the replication factor of a topic.
● help - ./kafka-mesos.sh help || ./kafka-mesos.sh help {command}
Reference Architecture
Schema
Avro or Protobuf
- https://github.com/stealthly/go_kafka_client/blob/master/syslog/syslog_proto/logline.proto
- https://github.com/stealthly/go_kafka_client/blob/master/logline.avsc
logline
• line
• logtypeid
• source
• tags (k/v pairs)
• timings (k/v pairs)
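To make the shape of a logline concrete, here is a minimal sketch of the record as a Python dataclass. Only the five field names come from the slide. The field types (string line, integer logtypeid, string-to-string tags, string-to-integer timings) are assumptions, and JSON stands in for Avro or Protobuf purely so the example has no dependencies. The linked `.proto` and `.avsc` files remain the authoritative schema.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class LogLine:
    # Field names follow the slide; the types are assumptions.
    line: str
    logtypeid: int
    source: str
    tags: dict = field(default_factory=dict)     # k/v pairs
    timings: dict = field(default_factory=dict)  # k/v pairs

    def to_json(self):
        """Serialize to JSON (a stand-in for Avro/Protobuf encoding)."""
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, payload):
        """Rebuild a LogLine from its JSON form."""
        return cls(**json.loads(payload))


original = LogLine(
    line="GET /index.html 200",
    logtypeid=1,
    source="web-1",
    tags={"env": "prod"},
    timings={"received": 1445000000},
)
decoded = LogLine.from_json(original.to_json())
```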
Consume from Kafka → Write to Cassandra
Implement the CQL write here https://github.com/stealthly/go_kafka_client/blob/master/consumers/consumers.go#L186-L194 with https://github.com/gocql/gocql
The Go Kafka Client fans work out to processors, and a rebalance doesn’t upset consumers that are already reading.
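The consume-then-write step can be sketched as follows. This is not the Go Kafka Client's code. It is a hedged Python illustration of fanning a batch of consumed messages out to workers that each issue a parameterized CQL insert. The table `logs.loglines`, its columns, and the `write` callable (standing in for a real Cassandra session) are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical table and columns; adjust to the real keyspace and schema.
INSERT_CQL = (
    "INSERT INTO logs.loglines (source, logtypeid, line, tags, timings) "
    "VALUES (?, ?, ?, ?, ?)"
)


def to_cql(logline):
    """Map one consumed logline (a dict) to a statement and its bound values."""
    params = (
        logline["source"],
        logline["logtypeid"],
        logline["line"],
        logline.get("tags", {}),
        logline.get("timings", {}),
    )
    return INSERT_CQL, params


def process_batch(messages, write, workers=4):
    """Fan a batch out to worker threads; each calls write(statement, params)."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda m: write(*to_cql(m)), messages))


written = []


def fake_write(statement, params):
    """Stand-in for a Cassandra session call; records what would be written."""
    written.append(params)
    return True


batch = [
    {"source": "web-1", "logtypeid": 1, "line": "GET / 200"},
    {"source": "web-2", "logtypeid": 1, "line": "GET /health 200",
     "tags": {"env": "prod"}},
]
results = process_batch(batch, fake_write)
```

In a real consumer, `write` would execute a prepared statement against Cassandra, and offsets would be committed only after the whole batch succeeds.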
Reference Architecture
Sample Spark Job → Cassandra
https://github.com/stealthly/gauntlet
Uses the Cassandra Spark Connector https://github.com/datastax/spark-cassandra-connector
Use DataStax Enterprise to enable Search
http://docs.datastax.com/en/datastax_enterprise/4.8/datastax_enterprise/srch/srchOverview.html
Questions?
http://www.elodina.net
Thank you

Big Data Open Source Security LLC: Realtime log analysis with Mesos, Docker, Kafka, Spark, Cassandra and Solr at scale
