Introduction to Real-time Big Data with Apache Spark
Introduction
About Me https://ua.linkedin.com/in/tarasmatyashovsky
Agenda • Buzzwords • Spark in a Nutshell • Spark Concepts • Spark Core • Live Demo • Spark SQL • Live Demo • Road to Production • Spark Drawbacks • Our Spark Integration • Spark Is on the Rise
Big Data: a buzzword for data sets so large and complex that they are difficult to process using on-hand database management tools or traditional data processing applications https://www.linkedin.com/pulse/decoding-buzzwords-big-data-predictive-analytics-business-gordon
http://www.ibmbigdatahub.com/infographic/four-vs-big-data
Jesus Christ, It is Big Data, Get Hadoop! by Sergey Shelpuk (https://ua.linkedin.com/in/shelpuk) at AI Club Meetup in Lviv
To Hadoop? http://www.thoughtworks.com/insights/blog/hadoop-or-not-hadoop • Batch mode, not real-time • Unstructured or semi-structured data • MapReduce programming model, e.g. key/value pairs
Not to Hadoop? • Real-time, streaming workloads • Data structures that cannot be decomposed into key/value pairs • Jobs/algorithms that do not fit the MapReduce programming model http://www.thoughtworks.com/insights/blog/hadoop-or-not-hadoop
Not to Hadoop? • A subset of the data is enough: remove excessive complexity or shrink the data set via other processing techniques, e.g. hashing or clustering • Random, interactive access to data: for well-structured data, plenty of scalable, mature (No)SQL solutions already exist (HBase, Cassandra, columnar scalable DW engines) • Sensitive data: security in Hadoop is still very challenging and immature
Why Spark? As of mid-2014, Spark is the most active Big Data project http://www.slideshare.net/databricks/new-direction-for-spark-in-2015-spark-summit-east (chart: contributors per month to Spark)
Spark: a fast and general-purpose cluster computing platform for large-scale data processing
History
Time to Sort 100TB http://www.slideshare.net/databricks/new-direction-for-spark-in-2015-spark-summit-east
Why Is Spark Faster? Spark processes data in-memory, while Hadoop persists results back to disk after each map/reduce step
Powered by Spark https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark
Components Stack https://databricks.com/blog/2015/02/09/learning-spark-book-available-from-oreilly.html
Core Concepts: automatically distribute data across the cluster and parallelize the operations performed on it
Distributed Application https://databricks.com/blog/2015/02/09/learning-spark-book-available-from-oreilly.html
http://spark.apache.org/docs/latest/cluster-overview.html
Spark Core Abstractions
RDD API Transformations: • filter() • map() • flatMap() • distinct() • union() • intersection() • subtract() • etc. Actions: • collect() • reduce() • count() • countByValue() • first() • take() • top() • etc.
RDD Operations • transformations are executed on workers • actions may transfer data from the workers to the driver • collect() sends all partitions to the single driver http://www.slideshare.net/databricks/strata-sj-everyday-im-shuffling-tips-for-writing-better-spark-programs
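A minimal Java sketch of the transformation/action distinction; the file name and the company column index are assumptions based on the sample CSV described later:

```java
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class RddBasics {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("rdd-basics").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Transformations are lazy: nothing is computed yet
        JavaRDD<String> lines = sc.textFile("participants.csv");
        JavaRDD<String> companies = lines
                .map(line -> line.split(",")[2]) // company column (index is an assumption)
                .distinct();

        // Actions trigger execution; collect() ships all partitions to the driver
        List<String> result = companies.collect();
        result.forEach(System.out::println);

        sc.stop();
    }
}
```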
Pair RDD Transformations: • reduceByKey() • groupByKey() • sortByKey() • keys() • values() • join() • etc. Actions: • countByKey() • collectAsMap() • lookup() • etc.
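A minimal pair-RDD sketch counting participants per company, assuming the `lines` RDD from the previous sketch:

```java
import java.util.Map;

import org.apache.spark.api.java.JavaPairRDD;
import scala.Tuple2;

// assuming the JavaRDD<String> `lines` from the previous sketch
JavaPairRDD<String, Integer> participantsPerCompany = lines
        .mapToPair(line -> new Tuple2<>(line.split(",")[2], 1)) // (company, 1)
        .reduceByKey((a, b) -> a + b);                           // transformation: sum per key

// collectAsMap() is an action: the result is materialized on the driver
Map<String, Integer> counts = participantsPerCompany.collectAsMap();
counts.forEach((company, total) -> System.out.println(company + ": " + total));
```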
Sample Application https://github.com/tmatyashovsky/spark-samples-jeeconf-kyiv
Requirements Analytics about Morning@Lohika events: • unique participants by companies • most loyal participants • participants by position • etc. https://github.com/tmatyashovsky/spark-samples-jeeconf-kyiv
Data Format Simple CSV files, all fields optional. Columns: First Name, Last Name, Company, Position, Email, Present. Sample rows: Vladimir Tsukur, GlobalLogic, Tech/Team Lead, flushdia@gmail.com, 1 | Mikalai Alimenkou, XP Injection, Tech Lead, mikalai.alimenkou@xpinjection.com, 1 | Taras Matyashovsky, Lohika, Software Engineer, taras.matyashovsky@gmail.com, 0 https://github.com/tmatyashovsky/spark-samples-jeeconf-kyiv
Technologies Technologies: • Spring Boot 1.2.3.RELEASE • Spark 1.3.1 - released April 17, 2015 • 2 Spark jar dependencies • Apache 2.0 license, i.e. free to use https://github.com/tmatyashovsky/spark-samples-jeeconf-kyiv
Features • simple HTTP-based API • file system: local and HDFS • data formats: CSV and Parquet • 3 compatible implementations based on: • RDD (Spark Core) • DataFrame DSL (Spark SQL) • DataFrame SQL (Spark SQL) • serialization: default Java and Kryo https://github.com/tmatyashovsky/spark-samples-jeeconf-kyiv
Demo Time https://github.com/tmatyashovsky/spark-samples-jeeconf-kyiv
Demo Explained: cluster overview (Driver with SparkContext → Cluster Manager → Workers, each running Executors with Tasks) http://spark.apache.org/docs/latest/cluster-overview.html
Functional Programming API Drawback: limited opportunities for automatic optimization
Spark SQL: structured data processing
DataFrame: distributed collection of data organized into named columns
DataFrame API • selecting columns • joining different data sources • aggregation, e.g. sum, count, average • filtering
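A minimal DataFrame sketch against the sample data, assuming an SQLContext, the `sc` context from the earlier sketch, and a registered `participants` table with the CSV columns shown before (the table name is an assumption):

```java
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

// assuming the JavaSparkContext `sc` from the earlier sketch
SQLContext sqlContext = new SQLContext(sc);
DataFrame participants = sqlContext.table("participants"); // hypothetical registered table

DataFrame presentByCompany = participants
        .filter(participants.col("present").equalTo(1))  // filtering
        .groupBy("company")                               // aggregation
        .count()
        .orderBy("count");

presentByCompany.show();
```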
Plan Optimization & Execution http://web.eecs.umich.edu/~prabal/teaching/resources/eecs582/armbrust15sparksql.pdf
Faster than RDD http://www.slideshare.net/databricks/spark-sqlsse2015public
Demo Time https://github.com/tmatyashovsky/spark-samples-jeeconf-kyiv
Persistence & Caching • by default stores the data in the JVM heap as unserialized objects • possibility to store on disk as unserialized/serialized objects • off-heap caching is experimental and uses Tachyon
https://spark.apache.org/docs/latest/running-on-mesos.html
Cluster Manager (Standalone, Apache Mesos, or Hadoop YARN) should be chosen and configured properly
Monitoring via web UI(s) and metrics
Monitoring • master web UI • worker web UI • driver web UI (available only during execution) • history server (requires spark.eventLog.enabled = true)
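A minimal sketch of enabling event logging so the history server can show finished applications; the application name and log directory are assumptions:

```java
import org.apache.spark.SparkConf;

// event log directory must exist and be readable by the history server
SparkConf conf = new SparkConf()
        .setAppName("morning-at-lohika-analytics")
        .set("spark.eventLog.enabled", "true")
        .set("spark.eventLog.dir", "hdfs:///spark-events");
```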
Metrics • based on Coda Hale Metrics library • can be reported via HTTP, JMX, and CSV files
https://spark.apache.org/docs/latest/tuning.html
Serialization https://spark.apache.org/docs/latest/configuration.html#compression-and-serialization
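A minimal sketch switching from the default Java serialization to Kryo; `Participant` stands in for a hypothetical domain class of the sample application:

```java
import org.apache.spark.SparkConf;

// registering classes up front avoids writing full class names with every object
SparkConf conf = new SparkConf()
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        .registerKryoClasses(new Class<?>[]{Participant.class}); // hypothetical class
```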
Memory Management: tune executor memory fractions • RDD storage (60%) • shuffle and aggregation buffers (20%) • user code (20%) https://spark.apache.org/docs/latest/configuration.html#shuffle-behavior
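A sketch of tuning those fractions with the Spark 1.x configuration keys; the values shown are the defaults:

```java
import org.apache.spark.SparkConf;

// lower spark.storage.memoryFraction if user code needs more of the heap
SparkConf conf = new SparkConf()
        .set("spark.storage.memoryFraction", "0.6")   // cached RDDs
        .set("spark.shuffle.memoryFraction", "0.2");  // shuffle and aggregation buffers
```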
Memory Management Tune storage level: • store in memory and/or on disk • store as unserialized/serialized objects • replicate each partition on 1 or 2 cluster nodes • store in Tachyon https://spark.apache.org/docs/latest/programming-guide.html#which-storage-level-to-choose
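A sketch of choosing a storage level, reusing the pair RDD from the earlier sketch:

```java
import org.apache.spark.storage.StorageLevel;

// keep the RDD as serialized bytes in memory and spill to disk when it does not fit
participantsPerCompany.persist(StorageLevel.MEMORY_AND_DISK_SER());

// alternatively, replicate each partition on two nodes for faster fault recovery:
// participantsPerCompany.persist(StorageLevel.MEMORY_AND_DISK_2());

// StorageLevel.OFF_HEAP() stores the data in Tachyon (experimental in Spark 1.x)
```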
Level of Parallelism • spark.task.cpus • 1 task per partition, using 1 core to execute • spark.default.parallelism • can be controlled via: • repartition() and coalesce() functions • degree of parallelism as an operation parameter • the storage system matters
Data Locality • check data locality via the UI • configure data locality settings if needed • spark.locality.wait timeout • execute certain jobs on the driver • spark.localExecution.enabled
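A sketch of the parallelism knobs, reusing the RDDs from the earlier sketches; the partition counts are illustrative:

```java
import org.apache.spark.api.java.JavaPairRDD;
import scala.Tuple2;

// repartition() triggers a full shuffle to the requested number of partitions,
// coalesce() reduces the partition count without a full shuffle
JavaPairRDD<String, Integer> wider = participantsPerCompany.repartition(100);
JavaPairRDD<String, Integer> narrower = participantsPerCompany.coalesce(10);

// the degree of parallelism can also be passed directly to shuffle operations
JavaPairRDD<String, Integer> counts = lines
        .mapToPair(line -> new Tuple2<>(line.split(",")[2], 1))
        .reduceByKey((a, b) -> a + b, 48); // 48 output partitions
```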
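A sketch of the locality settings for Spark 1.x; the wait value is illustrative:

```java
import org.apache.spark.SparkConf;

SparkConf conf = new SparkConf()
        // how long to wait for a data-local slot before falling back to a less local one
        // (milliseconds in Spark 1.x)
        .set("spark.locality.wait", "3000")
        // run short jobs such as first()/take() directly on the driver (Spark 1.x option)
        .set("spark.localExecution.enabled", "true");
```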
Java API Drawbacks • parts of the API are experimental or intended for development only • the Spark Java API can lag behind, as the Scala API is the main focus
Our Spark Integration
Product: cloud-based analytics application
Use Cases • supplement the Neo4j database used to store/query big dimensions • supplement the RDBMS for querying high volumes of data
Use Cases • represent an existing computational graph as a flow of Spark-based operations • predictive analytics based on the Spark MLlib component
Lessons Learned • Spark's simplicity is deceptive • each use case is unique • be really aware of: • Databricks blog • mailing lists & Jira • pull requests Spark is kind of magic
Spark Is on the Rise
http://www.techrepublic.com/article/can-anything-dim-apache-spark/
Project Tungsten • the largest change to Spark’s execution engine since the project’s inception • focuses on substantially improving the efficiency of memory and CPU for Spark applications • relies on sun.misc.Unsafe https://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.html
Thank you! Taras Matyashovsky taras.matyashovsky@gmail.com @tmatyashovsky http://www.filevych.com/
References
https://www.linkedin.com/pulse/decoding-buzzwords-big-data-predictive-analytics-business-gordon
http://www.ibmbigdatahub.com/infographic/four-vs-big-data
http://www.thoughtworks.com/insights/blog/hadoop-or-not-hadoop
http://blog.cloudera.com/blog/2014/05/apache-spark-resource-management-and-yarn-app-models/
Learning Spark, by Holden Karau, Andy Konwinski, Patrick Wendell and Matei Zaharia (early release ebook from O'Reilly Media)
https://spark-prs.appspot.com/#all
https://www.gitbook.com/book/databricks/databricks-spark-knowledge-base/details
http://insidebigdata.com/2015/03/06/8-reasons-apache-spark-hot/
https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark
http://databricks.com/blog/2014/10/10/spark-petabyte-sort.html
http://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html
http://web.eecs.umich.edu/~prabal/teaching/resources/eecs582/armbrust15sparksql.pdf
http://www.slideshare.net/databricks/strata-sj-everyday-im-shuffling-tips-for-writing-better-spark-programs
http://www.slideshare.net/databricks/new-direction-for-spark-in-2015-spark-summit-east
http://www.slideshare.net/databricks/spark-sqlsse2015public
https://spark.apache.org/docs/latest/running-on-mesos.html
http://spark.apache.org/docs/latest/cluster-overview.html
http://www.techrepublic.com/article/can-anything-dim-apache-spark/
http://spark-packages.org/
