Introduction to Real-time Big Data with Apache Spark
About Me https://ua.linkedin.com/in/tarasmatyashovsky
Spark: a fast and general-purpose cluster computing platform for large-scale data processing
Why Spark? As of mid-2014, Spark is the most active Big Data project (chart: contributors per month to Spark). http://www.slideshare.net/databricks/new-direction-for-spark-in-2015-spark-summit-east
History
Time to Sort 100TB http://www.slideshare.net/databricks/new-direction-for-spark-in-2015-spark-summit-east
Why Is Spark Faster? Spark can process data in memory, while Hadoop MapReduce persists intermediate results back to disk after each map/reduce step
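As an illustration (a minimal sketch assuming Spark 1.x with Java 8; the input file name is hypothetical, and this is not code from the talk), caching is what keeps data in memory across repeated actions:

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class CachingExample {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("caching-example").setMaster("local[*]");
            JavaSparkContext sc = new JavaSparkContext(conf);

            // events.csv is a hypothetical input file
            JavaRDD<String> lines = sc.textFile("events.csv");

            // cache() keeps the RDD in memory once it has been computed,
            // so the second action below does not re-read the file from disk
            lines.cache();

            long total = lines.count();                               // computes and caches
            long nonEmpty = lines.filter(l -> !l.isEmpty()).count();  // served from memory

            System.out.println(total + " lines, " + nonEmpty + " non-empty");
            sc.stop();
        }
    }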
Powered by Spark https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark
Components Stack https://databricks.com/blog/2015/02/09/learning-spark-book-available-from-oreilly.html
Core Concepts: Spark automatically distributes data across the cluster and parallelizes the operations performed on it
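For instance (a sketch assuming an existing JavaSparkContext named sc), parallelize() splits a local collection into partitions, and subsequent operations run on those partitions in parallel:

    import java.util.Arrays;
    import org.apache.spark.api.java.JavaRDD;

    // distribute a local collection across the cluster as 3 partitions
    JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5, 6), 3);

    // map() runs on each partition in parallel; reduce() combines the results
    int sumOfSquares = numbers.map(x -> x * x).reduce((a, b) -> a + b);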
Distributed Application https://databricks.com/blog/2015/02/09/learning-spark-book-available-from-oreilly.html
Spark Core Abstraction
RDD API Transformations: • filter() • map() • flatMap() • distinct() • union() • intersection() • subtract() • etc. Actions: • collect() • reduce() • count() • countByValue() • first() • take() • top() • etc. http://www.slideshare.net/databricks/strata-sj-everyday-im-shuffling-tips-for-writing-better-spark-programs
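A short sketch of the distinction (again assuming an existing JavaSparkContext sc): transformations only build the lineage lazily, and nothing executes until an action is called:

    import java.util.Arrays;
    import java.util.List;
    import org.apache.spark.api.java.JavaRDD;

    JavaRDD<String> words = sc.parallelize(Arrays.asList("spark", "hadoop", "spark", "flink"));

    // transformations: lazy, nothing runs yet
    JavaRDD<String> distinctS = words.distinct().filter(w -> w.startsWith("s"));

    // actions: trigger the actual computation
    long count = distinctS.count();
    List<String> sample = distinctS.take(1);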
Sample Application https://github.com/tmatyashovsky/spark-samples-jeeconf-kyiv
Requirements Analytics about Morning@Lohika events: • unique participants by companies • most loyal participants • participants by position • etc. https://github.com/tmatyashovsky/spark-samples-jeeconf-kyiv
Data Format: simple CSV files; all fields are optional

First Name | Last Name    | Company      | Position          | Email                             | Present
Vladimir   | Tsukur       | GlobalLogic  | Tech/Team Lead    | flushdia@gmail.com                | 1
Mikalai    | Alimenkou    | XP Injection | Tech Lead         | mikalai.alimenkou@xpinjection.com | 1
Taras      | Matyashovsky | Lohika       | Software Engineer | taras.matyashovsky@gmail.com      | 0

https://github.com/tmatyashovsky/spark-samples-jeeconf-kyiv
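One possible way to load such rows (a hypothetical sketch, not necessarily the repo's actual code; class and field names are illustrative) is to map each CSV line onto a small serializable record, guarding against the optional fields:

    // hypothetical record for one CSV row
    public class Participant implements java.io.Serializable {
        public final String firstName, lastName, company, position, email;
        public final boolean present;

        public Participant(String[] f) {
            // every field is optional, so guard against short rows
            firstName = f.length > 0 ? f[0] : "";
            lastName  = f.length > 1 ? f[1] : "";
            company   = f.length > 2 ? f[2] : "";
            position  = f.length > 3 ? f[3] : "";
            email     = f.length > 4 ? f[4] : "";
            present   = f.length > 5 && "1".equals(f[5].trim());
        }
    }

    // a naive comma split is enough for this simple format; sc is a JavaSparkContext
    JavaRDD<Participant> participants = sc.textFile("participants.csv")
            .map(line -> new Participant(line.split(",", -1)));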
Demo Time https://github.com/tmatyashovsky/spark-samples-jeeconf-kyiv
Demo Explained: cluster overview (the Driver's Spark Context connects to a Cluster Manager, which allocates Executors on Workers; each Executor runs Tasks) http://spark.apache.org/docs/latest/cluster-overview.html
Spark SQL: structured data processing
DataFrame: a distributed collection of data organized into named columns
DataFrame API • selecting columns • joining different data sources • aggregation, e.g. sum, count, average • filtering
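A minimal sketch of those operations (assuming Spark 1.4+, an existing JavaSparkContext sc, and a hypothetical participants data source; column names are illustrative):

    import org.apache.spark.sql.DataFrame;
    import org.apache.spark.sql.SQLContext;
    import static org.apache.spark.sql.functions.col;

    SQLContext sqlContext = new SQLContext(sc);
    DataFrame participants = sqlContext.read().json("participants.json"); // hypothetical source

    DataFrame byCompany = participants
            .select("company", "present")         // selecting columns
            .filter(col("present").equalTo(1))    // filtering
            .groupBy("company")                   // aggregation:
            .count();                             // one row per company with a "count" column

    byCompany.show();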
Plan Optimization & Execution http://web.eecs.umich.edu/~prabal/teaching/resources/eecs582/armbrust15sparksql.pdf
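Continuing the sketch above, explain(true) prints the analyzed, optimized, and physical plans, which is an easy way to see what the Spark SQL optimizer did with a query:

    // prints logical and physical plans, e.g. showing predicate pushdown
    byCompany.explain(true);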
Faster than RDD http://www.slideshare.net/databricks/spark-sqlsse2015public
Demo Time https://github.com/tmatyashovsky/spark-samples-jeeconf-kyiv
https://spark.apache.org/docs/latest/tuning.html
Our Spark Integration
Product Cloud-based analytics application
Use Cases • supplement the Neo4j database used to store/query big dimensions • supplement an RDBMS for querying high volumes of data
Use Cases • represent an existing computational graph as a flow of Spark-based operations • predictive analytics based on the Spark MLlib component
Lessons Learned • Spark's simplicity is deceptive • Each use case is unique • Stay really aware: Databricks blog, mailing lists & Jira, pull requests. Spark is kind of magic
Spark Is on the Rise
http://www.techrepublic.com/article/can-anything-dim-apache-spark/
Project Tungsten • the largest change to Spark’s execution engine since the project’s inception • focuses on substantially improving the efficiency of memory and CPU for Spark applications • sun.misc.Unsafe https://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.html
Thank you! Taras Matyashovsky taras.matyashovsky@gmail.com @tmatyashovsky http://www.filevych.com/
References
https://www.linkedin.com/pulse/decoding-buzzwords-big-data-predictive-analytics-business-gordon
http://www.ibmbigdatahub.com/infographic/four-vs-big-data
http://www.thoughtworks.com/insights/blog/hadoop-or-not-hadoop
http://blog.cloudera.com/blog/2014/05/apache-spark-resource-management-and-yarn-app-models/
Learning Spark, by Holden Karau, Andy Konwinski, Patrick Wendell and Matei Zaharia (early release ebook from O'Reilly Media)
https://spark-prs.appspot.com/#all
https://www.gitbook.com/book/databricks/databricks-spark-knowledge-base/details
http://insidebigdata.com/2015/03/06/8-reasons-apache-spark-hot/
https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark
http://databricks.com/blog/2014/10/10/spark-petabyte-sort.html
http://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html
http://web.eecs.umich.edu/~prabal/teaching/resources/eecs582/armbrust15sparksql.pdf
http://www.slideshare.net/databricks/strata-sj-everyday-im-shuffling-tips-for-writing-better-spark-programs
http://www.slideshare.net/databricks/new-direction-for-spark-in-2015-spark-summit-east
http://www.slideshare.net/databricks/spark-sqlsse2015public
https://spark.apache.org/docs/latest/running-on-mesos.html
http://spark.apache.org/docs/latest/cluster-overview.html
http://www.techrepublic.com/article/can-anything-dim-apache-spark/
http://spark-packages.org/


Editor's Notes

  • #9 Real-time/streaming processing; data structures that cannot be decomposed into key-value pairs; jobs/algorithms that do not yield to the MapReduce programming model
  • #22 Functional programming API; drawback: limited opportunities for automatic optimization
  • #32 Cluster manager: Standalone, Apache Mesos, or Hadoop YARN; it should be chosen and configured properly. Monitoring is done via web UI(s) and metrics. Web UIs: master web UI; worker web UI; driver web UI (available only during execution); history server (requires spark.eventLog.enabled = true). Metrics are based on the Coda Hale Metrics library and can be reported via HTTP, JMX, and CSV files.
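For example (a sketch; the master URL and log directory are assumed values), the history server can only replay applications that wrote an event log:

    SparkConf conf = new SparkConf()
            .setAppName("my-app")
            // the cluster manager is chosen via the master URL:
            // "local[*]", "spark://host:7077", "mesos://host:5050", or "yarn-client"
            .setMaster("spark://master:7077")
            // persist UI events so the history server can show finished applications
            .set("spark.eventLog.enabled", "true")
            .set("spark.eventLog.dir", "hdfs:///spark-logs"); // assumed path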
  • #33 Serialization: default Java serialization or Kryo. Tune the executor memory fractions: RDD storage (60%), shuffle and aggregation buffers (20%), user code (20%). Tune the storage level: store in memory and/or on disk; store as deserialized or serialized objects; replicate each partition on 1 or 2 cluster nodes; store in Tachyon. Level of parallelism: spark.task.cpus (1 task per partition, using 1 core to execute); spark.default.parallelism; can be controlled via the repartition() and coalesce() functions; degree of parallelism as an operation parameter; the storage system matters. Data locality: check data locality via the UI; configure data locality settings if needed (spark.locality.wait timeout); execute certain jobs on the driver (spark.localExecution.enabled).
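A sketch of a few of these knobs together (all values and file names are illustrative, not recommendations):

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.storage.StorageLevel;

    SparkConf conf = new SparkConf()
            .setAppName("tuned-app")
            // Kryo instead of default Java serialization
            .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
            // default partition count for shuffle operations (illustrative value)
            .set("spark.default.parallelism", "32");
    JavaSparkContext sc = new JavaSparkContext(conf);

    JavaRDD<String> lines = sc.textFile("input.txt"); // assumed input

    // storage level: keep serialized in memory, spill to disk when it does not fit
    lines.persist(StorageLevel.MEMORY_AND_DISK_SER());

    // control the degree of parallelism explicitly
    JavaRDD<String> wider = lines.repartition(64);  // full shuffle
    JavaRDD<String> narrower = wider.coalesce(8);   // avoids a full shuffle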
  • #34 Parts of the API can be experimental or intended just for development use. The Spark Java API may not be as up-to-date as the Scala API, which is the main focus.