Introduction to Real-time Big Data with Apache Spark
Introduction
About Me https://ua.linkedin.com/in/tarasmatyashovsky
Agenda • Buzzwords • Spark in a Nutshell • Spark Concepts • Spark Core • Live Demo • Spark SQL • Live Demo • Road to Production • Spark Drawbacks • Our Spark Integration • Spark Is on the Rise
Big Data: a buzzword for data sets so large and complex that they are difficult to process using on-hand database management tools or traditional data processing applications https://www.linkedin.com/pulse/decoding-buzzwords-big-data-predictive-analytics-business-gordon
http://www.ibmbigdatahub.com/infographic/four-vs-big-data
Jesus Christ, It is Big Data, Get Hadoop! by Sergey Shelpuk (https://ua.linkedin.com/in/shelpuk) at AI Club Meetup in Lviv
To Hadoop? http://www.thoughtworks.com/insights/blog/hadoop-or-not-hadoop • Batch mode, not real-time • Unstructured or semi-structured data • MapReduce programming model, e.g. key/value pairs
Not to Hadoop? • Real-time, streaming workloads • Data structures that cannot be decomposed into key/value pairs • Jobs/algorithms that do not fit the MapReduce programming model http://www.thoughtworks.com/insights/blog/hadoop-or-not-hadoop
Not to Hadoop? • A subset of the data is enough: remove excessive complexity or shrink the data set via other processing techniques, e.g. hashing or clustering • Random, interactive access to data: for well-structured data, plenty of scalable, mature (No)SQL solutions already exist (HBase, Cassandra, columnar scalable DW engines) • Sensitive data: security in Hadoop is still very challenging and immature
Why Spark? As of mid-2014, Spark is the most active Big Data project http://www.slideshare.net/databricks/new-direction-for-spark-in-2015-spark-summit-east (chart: contributors per month to Spark)
Spark: a fast and general-purpose cluster computing platform for large-scale data processing
History
Time to Sort 100TB http://www.slideshare.net/databricks/new-direction-for-spark-in-2015-spark-summit-east
Why Is Spark Faster? Spark processes data in-memory, while Hadoop persists results back to disk after each map/reduce step
Powered by Spark https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark
Components Stack https://databricks.com/blog/2015/02/09/learning-spark-book-available-from-oreilly.html
Core Concepts: automatically distribute data across the cluster and parallelize the operations performed on it
Distributed Application https://databricks.com/blog/2015/02/09/learning-spark-book-available-from-oreilly.html
http://spark.apache.org/docs/latest/cluster-overview.html
Spark Core Abstractions
RDD API Transformations: • filter() • map() • flatMap() • distinct() • union() • intersection() • subtract() • etc. Actions: • collect() • reduce() • count() • countByValue() • first() • take() • top() • etc.
RDD Operations • transformations are executed on workers • actions may transfer data from the workers to the driver • collect() sends all partitions to the single driver http://www.slideshare.net/databricks/strata-sj-everyday-im-shuffling-tips-for-writing-better-spark-programs
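A minimal Java sketch of the transformation/action distinction; the file name and the company column index are assumptions based on the sample CSV described later:

```java
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class RddBasics {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("rdd-basics").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Transformations are lazy: nothing is computed yet
        JavaRDD<String> lines = sc.textFile("participants.csv");
        JavaRDD<String> companies = lines
                .map(line -> line.split(",")[2]) // company column (index is an assumption)
                .distinct();

        // Actions trigger execution; collect() ships all partitions to the driver
        List<String> result = companies.collect();
        result.forEach(System.out::println);

        sc.stop();
    }
}
```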
Pair RDD Transformations: • reduceByKey() • groupByKey() • sortByKey() • keys() • values() • join() • etc. Actions: • countByKey() • collectAsMap() • lookup() • etc.
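A minimal pair-RDD sketch counting participants per company, assuming the `lines` RDD from the previous sketch:

```java
import java.util.Map;

import org.apache.spark.api.java.JavaPairRDD;
import scala.Tuple2;

// assuming the JavaRDD<String> `lines` from the previous sketch
JavaPairRDD<String, Integer> participantsPerCompany = lines
        .mapToPair(line -> new Tuple2<>(line.split(",")[2], 1)) // (company, 1)
        .reduceByKey((a, b) -> a + b);                           // transformation: sum per key

// collectAsMap() is an action: the result is materialized on the driver
Map<String, Integer> counts = participantsPerCompany.collectAsMap();
counts.forEach((company, total) -> System.out.println(company + ": " + total));
```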
Sample Application https://github.com/tmatyashovsky/spark-samples-jeeconf-kyiv
Requirements Analytics about Morning@Lohika events: • unique participants by companies • most loyal participants • participants by position • etc. https://github.com/tmatyashovsky/spark-samples-jeeconf-kyiv
Data Format Simple CSV files, all fields optional. Columns: First Name, Last Name, Company, Position, Email, Present. Sample rows: Vladimir Tsukur, GlobalLogic, Tech/Team Lead, flushdia@gmail.com, 1 | Mikalai Alimenkou, XP Injection, Tech Lead, mikalai.alimenkou@xpinjection.com, 1 | Taras Matyashovsky, Lohika, Software Engineer, taras.matyashovsky@gmail.com, 0 https://github.com/tmatyashovsky/spark-samples-jeeconf-kyiv
Technologies Technologies: • Spring Boot 1.2.3.RELEASE • Spark 1.3.1 - released April 17, 2015 • 2 Spark jar dependencies • Apache 2.0 license, i.e. free to use https://github.com/tmatyashovsky/spark-samples-jeeconf-kyiv
Features • simple HTTP-based API • file system: local and HDFS • data formats: CSV and Parquet • 3 compatible implementations based on: • RDD (Spark Core) • DataFrame DSL (Spark SQL) • DataFrame SQL (Spark SQL) • serialization: default Java and Kryo https://github.com/tmatyashovsky/spark-samples-jeeconf-kyiv
Demo Time https://github.com/tmatyashovsky/spark-samples-jeeconf-kyiv
Demo Explained: cluster overview (Driver with SparkContext → Cluster Manager → Workers, each running Executors with Tasks) http://spark.apache.org/docs/latest/cluster-overview.html
Functional Programming API Drawback: limited opportunities for automatic optimization
Spark SQL: structured data processing
DataFrame: distributed collection of data organized into named columns
DataFrame API • selecting columns • joining different data sources • aggregation, e.g. sum, count, average • filtering
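A minimal DataFrame sketch against the sample data, assuming an SQLContext, the `sc` context from the earlier sketch, and a registered `participants` table with the CSV columns shown before (the table name is an assumption):

```java
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

// assuming the JavaSparkContext `sc` from the earlier sketch
SQLContext sqlContext = new SQLContext(sc);
DataFrame participants = sqlContext.table("participants"); // hypothetical registered table

DataFrame presentByCompany = participants
        .filter(participants.col("present").equalTo(1))  // filtering
        .groupBy("company")                               // aggregation
        .count()
        .orderBy("count");

presentByCompany.show();
```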
Plan Optimization & Execution http://web.eecs.umich.edu/~prabal/teaching/resources/eecs582/armbrust15sparksql.pdf
Faster than RDD http://www.slideshare.net/databricks/spark-sqlsse2015public
Demo Time https://github.com/tmatyashovsky/spark-samples-jeeconf-kyiv
Persistence & Caching • by default stores the data in the JVM heap as unserialized objects • possibility to store on disk as unserialized/serialized objects • off-heap caching is experimental and uses Tachyon
https://spark.apache.org/docs/latest/running-on-mesos.html
Cluster Manager (Standalone, Apache Mesos, or Hadoop YARN) should be chosen and configured properly
Monitoring via web UI(s) and metrics
Monitoring • master web UI • worker web UI • driver web UI (available only during execution) • history server (requires spark.eventLog.enabled = true)
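A minimal sketch of enabling event logging so the history server can show finished applications; the application name and log directory are assumptions:

```java
import org.apache.spark.SparkConf;

// event log directory must exist and be readable by the history server
SparkConf conf = new SparkConf()
        .setAppName("morning-at-lohika-analytics")
        .set("spark.eventLog.enabled", "true")
        .set("spark.eventLog.dir", "hdfs:///spark-events");
```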
Metrics • based on Coda Hale Metrics library • can be reported via HTTP, JMX, and CSV files
https://spark.apache.org/docs/latest/tuning.html
Serialization https://spark.apache.org/docs/latest/configuration.html#compression-and-serialization
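A minimal sketch switching from the default Java serialization to Kryo; `Participant` stands in for a hypothetical domain class of the sample application:

```java
import org.apache.spark.SparkConf;

// registering classes up front avoids writing full class names with every object
SparkConf conf = new SparkConf()
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        .registerKryoClasses(new Class<?>[]{Participant.class}); // hypothetical class
```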
Memory Management: tune executor memory fractions • RDD storage (60%) • shuffle and aggregation buffers (20%) • user code (20%) https://spark.apache.org/docs/latest/configuration.html#shuffle-behavior
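A sketch of tuning those fractions with the Spark 1.x configuration keys; the values shown are the defaults:

```java
import org.apache.spark.SparkConf;

// lower spark.storage.memoryFraction if user code needs more of the heap
SparkConf conf = new SparkConf()
        .set("spark.storage.memoryFraction", "0.6")   // cached RDDs
        .set("spark.shuffle.memoryFraction", "0.2");  // shuffle and aggregation buffers
```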
Memory Management Tune storage level: • store in memory and/or on disk • store as unserialized/serialized objects • replicate each partition on 1 or 2 cluster nodes • store in Tachyon https://spark.apache.org/docs/latest/programming-guide.html#which-storage-level-to-choose
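A sketch of choosing a storage level, reusing the pair RDD from the earlier sketch:

```java
import org.apache.spark.storage.StorageLevel;

// keep the RDD as serialized bytes in memory and spill to disk when it does not fit
participantsPerCompany.persist(StorageLevel.MEMORY_AND_DISK_SER());

// alternatively, replicate each partition on two nodes for faster fault recovery:
// participantsPerCompany.persist(StorageLevel.MEMORY_AND_DISK_2());

// StorageLevel.OFF_HEAP() stores the data in Tachyon (experimental in Spark 1.x)
```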
Level of Parallelism • spark.task.cpus • 1 task per partition, using 1 core to execute • spark.default.parallelism • can be controlled via: • repartition() and coalesce() functions • degree of parallelism as an operation parameter • the storage system matters
Data Locality • check data locality via the UI • configure data locality settings if needed • spark.locality.wait timeout • execute certain jobs on the driver • spark.localExecution.enabled
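A sketch of the parallelism knobs, reusing the RDDs from the earlier sketches; the partition counts are illustrative:

```java
import org.apache.spark.api.java.JavaPairRDD;
import scala.Tuple2;

// repartition() triggers a full shuffle to the requested number of partitions,
// coalesce() reduces the partition count without a full shuffle
JavaPairRDD<String, Integer> wider = participantsPerCompany.repartition(100);
JavaPairRDD<String, Integer> narrower = participantsPerCompany.coalesce(10);

// the degree of parallelism can also be passed directly to shuffle operations
JavaPairRDD<String, Integer> counts = lines
        .mapToPair(line -> new Tuple2<>(line.split(",")[2], 1))
        .reduceByKey((a, b) -> a + b, 48); // 48 output partitions
```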
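A sketch of the locality settings for Spark 1.x; the wait value is illustrative:

```java
import org.apache.spark.SparkConf;

SparkConf conf = new SparkConf()
        // how long to wait for a data-local slot before falling back to a less local one
        // (milliseconds in Spark 1.x)
        .set("spark.locality.wait", "3000")
        // run short jobs such as first()/take() directly on the driver (Spark 1.x option)
        .set("spark.localExecution.enabled", "true");
```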
Java API Drawbacks • parts of the API are experimental or intended for development only • the Spark Java API can lag behind, as the Scala API is the main focus
Our Spark Integration
Product: cloud-based analytics application
Use Cases • supplement the Neo4j database used to store/query big dimensions • supplement the RDBMS for querying high volumes of data
Use Cases • represent an existing computational graph as a flow of Spark-based operations • predictive analytics based on the Spark MLlib component
Lessons Learned • Spark's simplicity is deceptive • each use case is unique • be really aware of: • Databricks blog • mailing lists & Jira • pull requests Spark is kind of magic
Spark Is on the Rise
http://www.techrepublic.com/article/can-anything-dim-apache-spark/
Project Tungsten • the largest change to Spark’s execution engine since the project’s inception • focuses on substantially improving the efficiency of memory and CPU for Spark applications • relies on sun.misc.Unsafe https://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.html
Thank you! Taras Matyashovsky taras.matyashovsky@gmail.com @tmatyashovsky http://www.filevych.com/
References
https://www.linkedin.com/pulse/decoding-buzzwords-big-data-predictive-analytics-business-gordon
http://www.ibmbigdatahub.com/infographic/four-vs-big-data
http://www.thoughtworks.com/insights/blog/hadoop-or-not-hadoop
http://blog.cloudera.com/blog/2014/05/apache-spark-resource-management-and-yarn-app-models/
Learning Spark, by Holden Karau, Andy Konwinski, Patrick Wendell and Matei Zaharia (early release ebook from O'Reilly Media)
https://spark-prs.appspot.com/#all
https://www.gitbook.com/book/databricks/databricks-spark-knowledge-base/details
http://insidebigdata.com/2015/03/06/8-reasons-apache-spark-hot/
https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark
http://databricks.com/blog/2014/10/10/spark-petabyte-sort.html
http://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html
http://web.eecs.umich.edu/~prabal/teaching/resources/eecs582/armbrust15sparksql.pdf
http://www.slideshare.net/databricks/strata-sj-everyday-im-shuffling-tips-for-writing-better-spark-programs
http://www.slideshare.net/databricks/new-direction-for-spark-in-2015-spark-summit-east
http://www.slideshare.net/databricks/spark-sqlsse2015public
https://spark.apache.org/docs/latest/running-on-mesos.html
http://spark.apache.org/docs/latest/cluster-overview.html
http://www.techrepublic.com/article/can-anything-dim-apache-spark/
http://spark-packages.org/
