Introduction to Real-time Big Data with Apache Spark
About Me https://ua.linkedin.com/in/tarasmatyashovsky
Spark: a fast and general-purpose cluster computing platform for large-scale data processing
Why Spark? As of mid-2014, Spark is the most active Big Data project (chart: contributors per month to Spark). http://www.slideshare.net/databricks/new-direction-for-spark-in-2015-spark-summit-east
History
Time to Sort 100TB http://www.slideshare.net/databricks/new-direction-for-spark-in-2015-spark-summit-east
Why Is Spark Faster? Spark can process data in memory, while Hadoop MapReduce persists intermediate results back to disk after each map/reduce step
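As an illustration (a minimal sketch assuming Spark 1.x with Java 8; the input file name is hypothetical, and this is not code from the talk), caching is what keeps data in memory across repeated actions:

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class CachingExample {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("caching-example").setMaster("local[*]");
            JavaSparkContext sc = new JavaSparkContext(conf);

            // events.csv is a hypothetical input file
            JavaRDD<String> lines = sc.textFile("events.csv");

            // cache() keeps the RDD in memory once it has been computed,
            // so the second action below does not re-read the file from disk
            lines.cache();

            long total = lines.count();                               // computes and caches
            long nonEmpty = lines.filter(l -> !l.isEmpty()).count();  // served from memory

            System.out.println(total + " lines, " + nonEmpty + " non-empty");
            sc.stop();
        }
    }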
Powered by Spark https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark
Components Stack https://databricks.com/blog/2015/02/09/learning-spark-book-available-from-oreilly.html
Core Concepts: Spark automatically distributes data across the cluster and parallelizes the operations performed on it
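For instance (a sketch assuming an existing JavaSparkContext named sc), parallelize() splits a local collection into partitions, and subsequent operations run on those partitions in parallel:

    import java.util.Arrays;
    import org.apache.spark.api.java.JavaRDD;

    // distribute a local collection across the cluster as 3 partitions
    JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5, 6), 3);

    // map() runs on each partition in parallel; reduce() combines the results
    int sumOfSquares = numbers.map(x -> x * x).reduce((a, b) -> a + b);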
Distributed Application https://databricks.com/blog/2015/02/09/learning-spark-book-available-from-oreilly.html
Spark Core Abstraction
RDD API Transformations: • filter() • map() • flatMap() • distinct() • union() • intersection() • subtract() • etc. Actions: • collect() • reduce() • count() • countByValue() • first() • take() • top() • etc. http://www.slideshare.net/databricks/strata-sj-everyday-im-shuffling-tips-for-writing-better-spark-programs
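A short sketch of the distinction (again assuming an existing JavaSparkContext sc): transformations only build the lineage lazily, and nothing executes until an action is called:

    import java.util.Arrays;
    import java.util.List;
    import org.apache.spark.api.java.JavaRDD;

    JavaRDD<String> words = sc.parallelize(Arrays.asList("spark", "hadoop", "spark", "flink"));

    // transformations: lazy, nothing runs yet
    JavaRDD<String> distinctS = words.distinct().filter(w -> w.startsWith("s"));

    // actions: trigger the actual computation
    long count = distinctS.count();
    List<String> sample = distinctS.take(1);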
Sample Application https://github.com/tmatyashovsky/spark-samples-jeeconf-kyiv
Requirements Analytics about Morning@Lohika events: • unique participants by companies • most loyal participants • participants by position • etc. https://github.com/tmatyashovsky/spark-samples-jeeconf-kyiv
Data Format: simple CSV files; all fields are optional

First Name | Last Name    | Company      | Position          | Email                             | Present
Vladimir   | Tsukur       | GlobalLogic  | Tech/Team Lead    | flushdia@gmail.com                | 1
Mikalai    | Alimenkou    | XP Injection | Tech Lead         | mikalai.alimenkou@xpinjection.com | 1
Taras      | Matyashovsky | Lohika       | Software Engineer | taras.matyashovsky@gmail.com      | 0

https://github.com/tmatyashovsky/spark-samples-jeeconf-kyiv
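One possible way to load such rows (a hypothetical sketch, not necessarily the repo's actual code; class and field names are illustrative) is to map each CSV line onto a small serializable record, guarding against the optional fields:

    // hypothetical record for one CSV row
    public class Participant implements java.io.Serializable {
        public final String firstName, lastName, company, position, email;
        public final boolean present;

        public Participant(String[] f) {
            // every field is optional, so guard against short rows
            firstName = f.length > 0 ? f[0] : "";
            lastName  = f.length > 1 ? f[1] : "";
            company   = f.length > 2 ? f[2] : "";
            position  = f.length > 3 ? f[3] : "";
            email     = f.length > 4 ? f[4] : "";
            present   = f.length > 5 && "1".equals(f[5].trim());
        }
    }

    // a naive comma split is enough for this simple format; sc is a JavaSparkContext
    JavaRDD<Participant> participants = sc.textFile("participants.csv")
            .map(line -> new Participant(line.split(",", -1)));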
Demo Time https://github.com/tmatyashovsky/spark-samples-jeeconf-kyiv
Demo Explained: cluster overview (the Driver's Spark Context connects to a Cluster Manager, which allocates Executors on Workers; each Executor runs Tasks) http://spark.apache.org/docs/latest/cluster-overview.html
Spark SQL: structured data processing
DataFrame: a distributed collection of data organized into named columns
DataFrame API • selecting columns • joining different data sources • aggregation, e.g. sum, count, average • filtering
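A minimal sketch of those operations (assuming Spark 1.4+, an existing JavaSparkContext sc, and a hypothetical participants data source; column names are illustrative):

    import org.apache.spark.sql.DataFrame;
    import org.apache.spark.sql.SQLContext;
    import static org.apache.spark.sql.functions.col;

    SQLContext sqlContext = new SQLContext(sc);
    DataFrame participants = sqlContext.read().json("participants.json"); // hypothetical source

    DataFrame byCompany = participants
            .select("company", "present")         // selecting columns
            .filter(col("present").equalTo(1))    // filtering
            .groupBy("company")                   // aggregation:
            .count();                             // one row per company with a "count" column

    byCompany.show();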
Plan Optimization & Execution http://web.eecs.umich.edu/~prabal/teaching/resources/eecs582/armbrust15sparksql.pdf
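Continuing the sketch above, explain(true) prints the analyzed, optimized, and physical plans, which is an easy way to see what the Spark SQL optimizer did with a query:

    // prints logical and physical plans, e.g. showing predicate pushdown
    byCompany.explain(true);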
Faster than RDD http://www.slideshare.net/databricks/spark-sqlsse2015public
Demo Time https://github.com/tmatyashovsky/spark-samples-jeeconf-kyiv
https://spark.apache.org/docs/latest/tuning.html
Our Spark Integration
Product Cloud-based analytics application
Use Cases • supplement the Neo4j database used to store/query big dimensions • supplement an RDBMS for querying high volumes of data
Use Cases • represent an existing computational graph as a flow of Spark-based operations • predictive analytics based on the Spark MLlib component
Lessons Learned • Spark's simplicity is deceptive • Each use case is unique • Stay really aware: Databricks blog, mailing lists & Jira, pull requests. Spark is kind of magic
Spark Is on the Rise
http://www.techrepublic.com/article/can-anything-dim-apache-spark/
Project Tungsten • the largest change to Spark’s execution engine since the project’s inception • focuses on substantially improving the efficiency of memory and CPU for Spark applications • sun.misc.Unsafe https://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.html
Thank you! Taras Matyashovsky taras.matyashovsky@gmail.com @tmatyashovsky http://www.filevych.com/
References
https://www.linkedin.com/pulse/decoding-buzzwords-big-data-predictive-analytics-business-gordon
http://www.ibmbigdatahub.com/infographic/four-vs-big-data
http://www.thoughtworks.com/insights/blog/hadoop-or-not-hadoop
http://blog.cloudera.com/blog/2014/05/apache-spark-resource-management-and-yarn-app-models/
Learning Spark, by Holden Karau, Andy Konwinski, Patrick Wendell and Matei Zaharia (early release ebook from O'Reilly Media)
https://spark-prs.appspot.com/#all
https://www.gitbook.com/book/databricks/databricks-spark-knowledge-base/details
http://insidebigdata.com/2015/03/06/8-reasons-apache-spark-hot/
https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark
http://databricks.com/blog/2014/10/10/spark-petabyte-sort.html
http://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html
http://web.eecs.umich.edu/~prabal/teaching/resources/eecs582/armbrust15sparksql.pdf
http://www.slideshare.net/databricks/strata-sj-everyday-im-shuffling-tips-for-writing-better-spark-programs
http://www.slideshare.net/databricks/new-direction-for-spark-in-2015-spark-summit-east
http://www.slideshare.net/databricks/spark-sqlsse2015public
https://spark.apache.org/docs/latest/running-on-mesos.html
http://spark.apache.org/docs/latest/cluster-overview.html
http://www.techrepublic.com/article/can-anything-dim-apache-spark/
http://spark-packages.org/


Editor's Notes

  • #9 Real-time/streaming processing; data structures that cannot be decomposed into key-value pairs; jobs/algorithms that do not yield to the MapReduce programming model
  • #22 Functional programming API; drawback: limited opportunities for automatic optimization
  • #32 Cluster manager: Standalone, Apache Mesos, or Hadoop YARN; it should be chosen and configured properly. Monitoring is done via web UI(s) and metrics. Web UIs: master web UI; worker web UI; driver web UI (available only during execution); history server (requires spark.eventLog.enabled = true). Metrics are based on the Coda Hale Metrics library and can be reported via HTTP, JMX, and CSV files.
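For example (a sketch; the master URL and log directory are assumed values), the history server can only replay applications that wrote an event log:

    SparkConf conf = new SparkConf()
            .setAppName("my-app")
            // the cluster manager is chosen via the master URL:
            // "local[*]", "spark://host:7077", "mesos://host:5050", or "yarn-client"
            .setMaster("spark://master:7077")
            // persist UI events so the history server can show finished applications
            .set("spark.eventLog.enabled", "true")
            .set("spark.eventLog.dir", "hdfs:///spark-logs"); // assumed path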
  • #33 Serialization: default Java serialization or Kryo. Tune the executor memory fractions: RDD storage (60%), shuffle and aggregation buffers (20%), user code (20%). Tune the storage level: store in memory and/or on disk; store as deserialized or serialized objects; replicate each partition on 1 or 2 cluster nodes; store in Tachyon. Level of parallelism: spark.task.cpus (1 task per partition, using 1 core to execute); spark.default.parallelism; can be controlled via the repartition() and coalesce() functions; degree of parallelism as an operation parameter; the storage system matters. Data locality: check data locality via the UI; configure data locality settings if needed (spark.locality.wait timeout); execute certain jobs on the driver (spark.localExecution.enabled).
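A sketch of a few of these knobs together (all values and file names are illustrative, not recommendations):

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.storage.StorageLevel;

    SparkConf conf = new SparkConf()
            .setAppName("tuned-app")
            // Kryo instead of default Java serialization
            .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
            // default partition count for shuffle operations (illustrative value)
            .set("spark.default.parallelism", "32");
    JavaSparkContext sc = new JavaSparkContext(conf);

    JavaRDD<String> lines = sc.textFile("input.txt"); // assumed input

    // storage level: keep serialized in memory, spill to disk when it does not fit
    lines.persist(StorageLevel.MEMORY_AND_DISK_SER());

    // control the degree of parallelism explicitly
    JavaRDD<String> wider = lines.repartition(64);  // full shuffle
    JavaRDD<String> narrower = wider.coalesce(8);   // avoids a full shuffle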
  • #34 Parts of the API can be experimental or intended just for development use. The Spark Java API may not be as up-to-date as the Scala API, which is the main focus.