Onyx - Data Processing The Clojure Way By Bahadir Cambel @bahadircambel
Raise your hands if you have used - Cascalog - Hadoop - Spark - Flink - Samza - Storm - Sqoop
What is it good for? - Realtime event stream processing - Continuous computation - Extract, transform, load (ETL) - Data transformation à la map-reduce - Data ingestion and storage medium transfer - Data cleaning
Data Structures + Simple Functions - Sounds familiar ?
Hadoop
Cascading https://github.com/Cascading/cascading.samples/blob/master/wordcount/src/java/wordcount/Main.java
Storm https://github.com/nathanmarz/storm-starter/blob/master/src/clj/storm/starter/clj/word_count.clj
Spark http://spark.apache.org/examples.html SCALA JAVA dev having a productive day
Cascalog http://cascalog.org/articles/getting_started.html ?- <- :>
Internet Scale MongoDB could save you?!
Onyx
Workflow - articulate the paths that data flows through the cluster at runtime - DAG
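A workflow is plain data: a vector of [from to] edges. A minimal sketch, assuming hypothetical task names :in, :inc and :out:

```clojure
;; Each inner vector is one directed edge of the DAG: [from-task to-task].
;; The task names are hypothetical.
(def workflow
  [[:in :inc]
   [:inc :out]])

;; Because the workflow is just data, ordinary functions can inspect it.
(def task-names
  (set (flatten workflow)))
```

Since nothing here is a macro or an object, a workflow can be generated, merged or validated with the usual sequence functions.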
Catalog - Describe and configure workflow items
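A catalog is a vector of maps, one per workflow task. The sketch below follows the key names used in the Onyx documentation for the core.async plugin as far as I recall them; the :my.app namespace and the batch sizes are assumptions, so check the entries against the release you run.

```clojure
(def catalog
  [{:onyx/name :in
    :onyx/plugin :onyx.plugin.core-async/input
    :onyx/type :input
    :onyx/medium :core.async
    :onyx/batch-size 10
    :onyx/max-peers 1
    :onyx/doc "Reads segments from a core.async channel"}

   {:onyx/name :inc
    :onyx/fn :my.app/increment-n      ; hypothetical fully qualified fn name
    :onyx/type :function
    :onyx/batch-size 10}

   {:onyx/name :out
    :onyx/plugin :onyx.plugin.core-async/output
    :onyx/type :output
    :onyx/medium :core.async
    :onyx/batch-size 10
    :onyx/max-peers 1
    :onyx/doc "Writes segments to a core.async channel"}])
```

Each :onyx/name matches a task name used in the workflow; the function is referenced by keyword, not by value, so the catalog stays serializable.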
Function(s)
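Functions are ordinary Clojure functions that take a segment (a map) and return a segment. A minimal sketch, with a hypothetical :n key:

```clojure
(defn increment-n
  "Takes one segment and returns a new segment with :n incremented."
  [segment]
  (update segment :n inc))
```

The function knows nothing about Onyx, so it can be tested at the REPL with a plain map such as {:n 41}.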
Flow Conditions From -> To (if the predicate holds). Flow conditions isolate the logic that decides whether segments pass through to different tasks in a workflow. They also handle exceptions and support a rich degree of composition with runtime parameterization.
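A flow condition pairs a from/to route with a predicate. The sketch below assumes the predicate signature described in the Onyx documentation (event, old segment, new segment, all new segments, then any parameters); the task names, the :my/min-age parameter and the :my.app namespace are hypothetical.

```clojure
(defn adult?
  "Flow predicate: true when the new segment should be routed onward.
   min-age is supplied through the :flow/predicate vector below."
  [event old-segment new-segment all-new min-age]
  (>= (:age new-segment) min-age))

(def flow-conditions
  [{:flow/from :in
    :flow/to [:process-adults]
    :my/min-age 18                                ; runtime parameter
    :flow/predicate [:my.app/adult? :my/min-age]  ; fn keyword + parameter keys
    :flow/doc "Only segments with :age >= 18 reach :process-adults"}])
```

Keeping the routing decision here means the task functions themselves stay free of branching logic.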
Windows / Triggers Windowing partitions a possibly unbounded sequence of data into finite pieces, allowing aggregations to be specified. Triggers: - Timer - Segment - Punctuation - Watermark
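Windows and triggers are declared as data alongside the catalog. This is a sketch in the style of the Onyx releases from around the time of this talk; the values accepted for :trigger/on and :trigger/refinement changed in later releases, and the ids, the :event-time key and :my.app/write-counts! are hypothetical.

```clojure
(def windows
  [{:window/id :count-events
    :window/task :inc                   ; task whose segments are windowed
    :window/type :fixed
    :window/aggregation :onyx.windowing.aggregation/count
    :window/window-key :event-time      ; segment key holding the timestamp
    :window/range [5 :minutes]
    :window/doc "Counts segments in fixed five-minute windows"}])

(def triggers
  [{:trigger/window-id :count-events    ; must match a :window/id
    :trigger/refinement :accumulating
    :trigger/on :segment                ; fire after a number of segments
    :trigger/threshold [5 :elements]
    :trigger/sync :my.app/write-counts! ; hypothetical fn receiving the state
    :trigger/doc "Fires every five segments"}])
```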
Lifecycles let you hook in and execute arbitrary code at critical points during a task (kind of like middleware): :lifecycle/start-task? :lifecycle/before-task-start :lifecycle/before-batch :lifecycle/after-read-batch :lifecycle/after-batch :lifecycle/after-task-stop :lifecycle/after-ack-segment :lifecycle/after-retry-segment
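A lifecycle hook is a function of the event map and the lifecycle entry, collected in a "calls" map that the job refers to by keyword. A minimal sketch, assuming a hypothetical :my.app namespace and the :inc task:

```clojure
(defn log-task-start
  "Runs once before the task starts. The returned map is merged
   into the event map, so an empty map changes nothing."
  [event lifecycle]
  (println "Starting task" (:lifecycle/task lifecycle))
  {})

;; Map from lifecycle hook name to the function to call.
(def calls
  {:lifecycle/before-task-start log-task-start})

;; Lifecycle entry in the job: which task, and where the calls map lives.
(def lifecycles
  [{:lifecycle/task :inc
    :lifecycle/calls :my.app/calls}])
```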
Job A job is translated into multiple tasks, and peers take care of these tasks. If the number of tasks is greater than the number of available peers, the job won’t complete (buy me a beer or 10).
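A job is a single map that bundles the pieces above. The sketch below also spells out the peer arithmetic from this slide; task-count and enough-peers? are hypothetical helpers, not part of Onyx, and the catalog entries are elided.

```clojure
(def job
  {:workflow [[:in :inc] [:inc :out]]
   :catalog []                          ; catalog entries elided here
   :task-scheduler :onyx.task-scheduler/balanced})

;; With Onyx on the classpath and a peer configuration map in hand,
;; the job would be submitted with:
;; (onyx.api/submit-job peer-config job)

(defn task-count
  "Number of distinct tasks named in a workflow."
  [workflow]
  (count (set (flatten workflow))))

(defn enough-peers?
  "Every task needs at least one peer, so n-peers must cover all tasks."
  [n-peers workflow]
  (>= n-peers (task-count workflow)))
```

With the three-task workflow above, two virtual peers are not enough; three are.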
Bulk functions apply a fn to a whole batch of segments at once, which is more efficient than processing one segment at a time (e.g. writing to a DB). Onyx ignores the output of your function and passes the same segments that you received downstream.
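A bulk function receives the whole batch and is used for its side effects. This sketch assumes the :onyx/bulk? catalog key from the Onyx releases of this period (later releases may name it differently); write-batch! and its println are stand-ins for a real database write.

```clojure
(defn write-batch!
  "Receives a batch of segments. The return value is ignored by Onyx,
   which passes the original segments downstream."
  [segments]
  (println "writing" (count segments) "segments") ; stand-in for a DB write
  nil)

(def bulk-entry
  {:onyx/name :write-to-db
   :onyx/fn :my.app/write-batch!        ; hypothetical namespace
   :onyx/type :function
   :onyx/bulk? true
   :onyx/batch-size 100})
```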
Group by: “like” values are always routed to the same virtual peer. - Group by key - Group by a fn. Specify it in the catalog!
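Grouping is switched on per task in the catalog. The entry below uses :onyx/group-by-key together with a flux policy and a minimum peer count, as I recall the Onyx documentation requiring for grouped tasks; treat the exact companion keys as an assumption. The route function is only an illustration of why equal keys land on the same peer and is not how Onyx picks peers internally.

```clojure
(def sum-by-user
  {:onyx/name :sum-by-user
   :onyx/fn :my.app/sum-balance         ; hypothetical fn
   :onyx/type :function
   :onyx/group-by-key :user-id          ; or :onyx/group-by-fn :my.app/some-fn
   :onyx/min-peers 3
   :onyx/flux-policy :kill
   :onyx/batch-size 100})

(defn route
  "Illustration only: a deterministic function of the grouping key
   sends equal keys to the same peer index."
  [n-peers segment]
  (mod (hash (:user-id segment)) n-peers))
```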
Fixed Windows: a data point falls into exactly one instance of a window (often called an extent in the literature). How many events happened between t1=0 and t2=4? Between t1=5 and t2=9? Between t1=10 and t2=14? And so on.
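The extent arithmetic on this slide can be sketched in a few lines. fixed-extent is a hypothetical helper, not an Onyx function; it assumes non-negative integer timestamps, extents starting at 0 and inclusive bounds, as in the slide's numbers.

```clojure
(defn fixed-extent
  "Returns [lower upper] (inclusive) of the one fixed window
   of size range-size that contains timestamp t."
  [range-size t]
  (let [lower (* range-size (quot t range-size))]
    [lower (+ lower (dec range-size))]))

;; With a range of 5, timestamp 7 belongs to the extent from 5 to 9.
(fixed-extent 5 7)
```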
Sliding Window: a slide value sets how long to wait before spawning a new window extent, so extents overlap. How many events happened between t1=0 and t2=14? Between t1=5 and t2=19? Between t1=10 and t2=24?
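With a range of 15 and a slide of 5, as in the numbers on this slide, one data point belongs to several overlapping extents. sliding-extents is a hypothetical helper, not an Onyx function; it assumes non-negative integer timestamps, inclusive bounds, and that the first extent starts at 0.

```clojure
(defn sliding-extents
  "Returns every [lower upper] extent (inclusive) that contains t,
   for windows of size range-size spawned every slide units from 0."
  [range-size slide t]
  (let [last-lower (* slide (quot t slide))]
    (->> (iterate #(- % slide) last-lower)
         ;; walk backwards while the extent still covers t
         (take-while #(and (>= % 0) (> (+ % range-size) t)))
         (map (fn [lower] [lower (+ lower (dec range-size))]))
         reverse
         vec)))

;; Timestamp 12 is covered by all three extents named on the slide.
(sliding-extents 15 5 12)
```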
Global Window
Session Window: session windows dynamically resize their upper and lower bounds in reaction to incoming data. Sessions capture a time span of activity for a specific key, such as a user ID. If no activity occurs within a timeout gap, the session closes. If an event occurs within the bounds of a session, the window is fused with the new event, and the session is extended by its timeout gap in either the forward or the backward direction.
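The fusing behaviour can be sketched for a single key as a batch computation. sessionize is a hypothetical helper and a simplification: it sorts the timestamps up front, whereas a streaming engine has to merge sessions as out-of-order events arrive.

```clojure
(defn sessionize
  "Groups timestamps into [lower upper] sessions. A timestamp joins the
   current session when it is at most gap after the session's upper bound;
   otherwise it opens a new session."
  [gap timestamps]
  (reduce (fn [sessions t]
            (let [[lower upper] (peek sessions)]
              (if (and upper (<= (- t upper) gap))
                (conj (pop sessions) [lower t])   ; extend the open session
                (conj sessions [t t]))))          ; start a new session
          []
          (sort timestamps)))

;; Activity at 1 and 3, then at 20 and 22, then at 40, with a gap of 5.
(sessionize 5 [1 3 20 22 40])
```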
Aggregation :onyx.windowing.aggregation/conj :onyx.windowing.aggregation/count :onyx.windowing.aggregation/sum :onyx.windowing.aggregation/min :onyx.windowing.aggregation/max :onyx.windowing.aggregation/average
Architecture
Peer A peer is a node in the cluster responsible for processing data
Virtual Peer A Virtual Peer refers to a single peer process running on a single physical machine. A single Virtual Peer executes at most one task at a time.
ZooKeeper Watches Peers
Aeron Efficient reliable UDP unicast, UDP multicast, and IPC message transport Messaging layer takes care of the direct peer to peer transfer of segment batches, acks, segment completion and segment retries to the relevant virtual peers.
The Log
Scheduling If there is no master, how does scheduling work ? Peers contend to work on tasks.
Types of Job Schedulers - Greedy (I need ALL!!!! Gimme all!!) - Balanced, round robin (Fair play) - Percentage (Not so fair play)
Types of Task Schedulers - Balanced - Percentage - Colocation (assigns them to the peers on a single physical machine, low latency, min network)
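The two kinds of scheduler are chosen in different places: the job scheduler in the peer configuration, the task scheduler in each job map. A sketch using the key names as I recall them from the Onyx documentation; the ZooKeeper address is hypothetical and the other peer configuration keys a real cluster needs are omitted.

```clojure
;; Job scheduler: one choice for the whole cluster, set in the peer config.
(def peer-config
  {:zookeeper/address "127.0.0.1:2181"  ; hypothetical address
   :onyx.peer/job-scheduler :onyx.job-scheduler/balanced})

;; Task scheduler: chosen per job, set in the job map.
(def job
  {:workflow [[:in :inc] [:inc :out]]
   :catalog []                          ; catalog entries elided here
   :task-scheduler :onyx.task-scheduler/colocated})
```

All peers in a cluster are expected to agree on the job scheduler, while each submitted job can pick its own task scheduler.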
Tags: some machines in your cluster are privileged, so some tasks must run on specific machines. Declare a peer with capabilities: - Datomic - Special hardware (GPU, memory) - Network
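Tags are declared on the peer and required by the task. The key names :onyx.peer/tags and :onyx/required-tags are as I recall them from the Onyx documentation, so verify them against your release. The can-run? helper is hypothetical and only illustrates the matching rule.

```clojure
(require '[clojure.set :as set])

;; Peer configuration fragment: this peer advertises a Datomic license.
(def peer-config-fragment
  {:onyx.peer/tags [:datomic]})

;; Catalog entry fragment: this task may only run on peers tagged :datomic.
(def tagged-entry
  {:onyx/name :read-datoms
   :onyx/required-tags [:datomic]})

(defn can-run?
  "Illustration only: a peer qualifies when it has every required tag."
  [peer-tags required-tags]
  (set/subset? (set required-tags) (set peer-tags)))
```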
Onyx Plugins onyx-core-async onyx-kafka onyx-datomic onyx-redis onyx-sql onyx-bookkeeper onyx-seq onyx-durable-queue onyx-elasticsearch
Example - Let’s process some logs 49556677821280438558577372995495836672945903576549425154
Check out the repository https://github.com/bcambel/onyx-test
End users configure what workflows should look like. - Language agnostic - Location agnostic - Tolerant to machine generation - Temporally agnostic (should wait for a time to be realized)
If you are not enjoying your experience There is something fundamentally wrong with the tool Think about Apple’s smooth product experience. Pain detected, thought through. (Pain -> Pleasure)
Tools - Onyx ETL https://github.com/onyx-platform/onyx-etl - Dashboard https://github.com/onyx-platform/onyx-dashboard - Replica Cons https://github.com/onyx-platform/onyx-console-dashboard - Ansible Playbook https://github.com/onyx-platform/ansible-onyx - Metrics Suite https://github.com/onyx-platform/onyx-metrics - Benchmark Suite https://github.com/onyx-platform/onyx-benchmark - Jepsen https://github.com/onyx-platform/onyx-jepsen
Onyx Dashboard https://github.com/onyx-platform/onyx-dashboard
Questions - How does Onyx distribute reads (input tasks)? Parallelization? - Evenly break up a database table into chunks which can be read by multiple peers - Segments realized. Tasks created. Peers get into action
Helpful Links http://www.onyxplatform.org/ https://gitter.im/onyx-platform/onyx https://clojurians.slack.com/messages/onyx https://github.com/onyx-platform/onyx-examples https://github.com/onyx-platform/learn-onyx https://github.com/Yuppiechef/cqrs-server (Open source project using Onyx)
Shameless self plug http://www.bahadir.io/ https://twitter.com/bahadircambel https://www.strava.com/athletes/8258974
Thanks Michael Drogalis - https://twitter.com/michaeldrogalis Lucas Bradstreet - https://twitter.com/ghaz
Thanks for using Clojure!
