Walaa Assy Giza Systems Software Developer SPARK LIGHTNING-FAST UNIFIED ANALYTICS ENGINE
HOW DO WE HANDLE EVER-GROWING DATA THAT HAS BECOME BIG DATA?
Basics of Spark Core API  Cluster Managers  Spark Libraries  - SQL  - Streaming  - MLlib  - GraphX  Troubleshooting / Future of Spark AGENDA
 Readability  Expressiveness  Fast  Testability  Interactive  Fault Tolerant  Unify Big Data  Spark officially set a new record in large-scale sorting; rather than computing on disk, Spark makes use of data cached in memory WHY SPARK? TINIER CODE LEADS TO ...
 MapReduce has a very narrow scope, especially batch processing  Each new problem needed a new API to solve it EXPLOSION OF MAPREDUCE
A UNIFIED PLATFORM FOR BIG DATA
SPARK PROGRAMMING LANGUAGES
 The most basic abstraction of Spark  Spark operations fall into two main categories:  Transformations [lazily evaluated, only storing the intent]  Actions  val textFile = sc.textFile("file:///spark/README.md")  textFile.first // action RDD [RESILIENT DISTRIBUTED DATASET]
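A minimal sketch of the transformation/action split, assuming the snippet runs in spark-shell (where sc is predefined) and that the README.md path from the slide exists:

    // Transformations are lazy: nothing is read or computed yet
    val textFile = sc.textFile("file:///spark/README.md")
    val sparkLines = textFile.filter(line => line.contains("Spark"))

    // Actions trigger execution of the stored intent
    val totalLines = textFile.count()   // number of lines in the file
    val firstMatch = sparkLines.first() // first line mentioning "Spark"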
HELLO BIG DATA
 sudo yum install wget  sudo wget https://downloads.lightbend.com/scala/2.13.0-M4/scala-2.13.0-M4.tgz  tar xvf scala-2.13.0-M4.tgz  sudo mv scala-2.13.0-M4 /usr/lib  sudo ln -s /usr/lib/scala-2.13.0-M4 /usr/lib/scala  export PATH=$PATH:/usr/lib/scala/bin SCALA INSTALLATION STEPS
 sudo wget https://www.apache.org/dyn/closer.lua/spark/spark-2.3.1/spark-2.3.1-bin-hadoop2.7.tgz  tar xvf spark-2.3.1-bin-hadoop2.7.tgz  ln -s spark-2.3.1-bin-hadoop2.7 spark  export SPARK_HOME=$HOME/spark-2.3.1-bin-hadoop2.7  export PATH=$PATH:$SPARK_HOME/bin SPARK INSTALLATION – CENTOS 7
SPARK MECHANISM
 A collection of elements partitioned across the nodes of the cluster that can be operated on in parallel  A collection similar to a list or an array from a user's point of view  Processed in parallel to speed up computation, with built-in fault tolerance  RDDs are immutable  Transformations are lazy and stored in a DAG  Actions trigger DAGs  A DAG is a directed acyclic graph of tasks  Each action triggers a fresh execution of the graph, unless the RDD is cached (see the sketch below) RDD
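A small illustration of lazy evaluation and of caching to avoid re-running the whole graph on every action (spark-shell assumed; the data is generated, not from the deck):

    val nums = sc.parallelize(1 to 1000000)
    val squares = nums.map(n => n.toLong * n)  // lazy: only the intent is recorded in the DAG

    squares.count()   // first action: triggers execution of the graph
    squares.count()   // triggers a fresh execution again

    squares.cache()   // mark the RDD for in-memory caching
    squares.count()   // executes once more and materializes the cache
    squares.count()   // now served from memory, no recomputation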
INPUT DATASET TYPES
 Map  Flatmap  Filter  Distinct  Sample  Union  Inttersection  Subtract  Cartesian Transformations return RDDs TRANSFORMATIONS IN MAP REDUCE
 collect()  count()  take(num)  takeOrdered(num)(ordering)  reduce(function)  aggregate(zeroValue)(seqOp, combOp)  foreach(function)  Actions return different types according to each action  saveAsObjectFile(path)  saveAsTextFile(path) // saves as a text file  External connectors:  foreach(T => Unit) // one object at a time  - foreachPartition(Iterator[T] => Unit) // one partition at a time ACTIONS IN SPARK
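The same toy-data style run through the main actions (spark-shell assumed; the output path is hypothetical):

    val nums = sc.parallelize(Seq(5, 3, 8, 1))

    nums.collect()                    // Array(5, 3, 8, 1) brought back to the driver
    nums.count()                      // 4
    nums.take(2)                      // first 2 elements
    nums.takeOrdered(2)               // 2 smallest elements
    nums.reduce(_ + _)                // 17
    nums.aggregate(0)(_ + _, _ + _)   // same sum, expressed as seqOp + combOp
    nums.foreach(n => println(n))     // runs on the executors, one object at a time
    nums.saveAsTextFile("/tmp/nums-output")  // hypothetical output path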
 SQL-like joins on pair RDDs  join  fullOuterJoin  leftOuterJoin  rightOuterJoin  Pair saving  saveAs(NewAPI)HadoopFile  - path  - keyClass  - valueClass  - outputFormatClass  saveAs(NewAPI)HadoopDataset  - conf  saveAsSequenceFile  - saveAsHadoopFile(path, keyClass, valueClass, SequenceFileOutputFormat) PAIR METHODS - CONTD
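A short pair-RDD sketch of the join family on made-up data (spark-shell assumed; the save path is hypothetical):

    val users  = sc.parallelize(Seq((1, "Alice"), (2, "Bob"), (3, "Carol")))
    val orders = sc.parallelize(Seq((1, "book"), (1, "pen"), (3, "lamp")))

    users.join(orders).collect()            // inner join on the key
    users.leftOuterJoin(orders).collect()   // keep all users
    users.rightOuterJoin(orders).collect()  // keep all orders
    users.fullOuterJoin(orders).collect()   // keep everything

    orders.saveAsSequenceFile("/tmp/orders-seq")  // pair saving, hypothetical path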
 Works like a distributed kernel  A basic standalone manager is built into Spark  Hadoop's cluster manager, YARN  Apache Mesos PRIMARY CLUSTER MANAGERS
SPARK-SUBMIT DEMO
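A hedged sketch of what the demo command could look like; the JAR, main class, and master URL are placeholders, not taken from the deck:

    # Submit against the built-in standalone manager; swap --master for
    # yarn or mesos://... when running on those cluster managers
    spark-submit \
      --class com.example.WordCount \
      --master spark://master-host:7077 \
      --deploy-mode client \
      --executor-memory 2G \
      --total-executor-cores 4 \
      target/wordcount-1.0.jar \
      file:///spark/README.md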
SPARK SQL
 Spark SQL is Apache Spark's module for working with structured or semi-structured data.  It is meant to be usable by non-big-data users  As Spark continues to grow, we want to enable wider audiences beyond “Big Data” engineers to leverage the power of distributed processing. Databricks blog (http://bit.ly/17NM70s) SPARK SQL
 Seamlessly mix SQL queries with Spark programs: Spark SQL lets you query structured data inside Spark programs, using either SQL or a familiar DataFrame API  Connect to any data source in the same way  It executes SQL queries  We can read data from an existing Hive installation using Spark SQL  When we run SQL from within another programming language, we get the result back as a Dataset/DataFrame SPARK SQL FEATURES
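A minimal sketch of mixing SQL and the DataFrame API, assuming spark-shell (SparkSession spark predefined) and a hypothetical people.json input:

    import spark.implicits._

    val people = spark.read.json("examples/people.json")
    people.createOrReplaceTempView("people")

    // Plain SQL ...
    val adultsSql = spark.sql("SELECT name, age FROM people WHERE age >= 18")

    // ... or the equivalent DataFrame API, mixed freely in the same program
    val adultsDf = people.filter($"age" >= 18).select("name", "age")

    adultsSql.show()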
DataFrames and SQL provide a common way to access a variety of data sources, including Hive, Avro, Parquet, ORC, JSON, and JDBC. You can even join data across these sources.  Run SQL or HiveQL queries on existing warehouses [Hive integration]  Connect through JDBC or ODBC [standard connectivity]  It is included with Spark DATAFRAMES
 Introduced in the Spark 1.3 release. It is a distributed collection of data organized into named columns. Conceptually it is equivalent to a table in a relational database or a data frame in R/Python. We can create a DataFrame using:  Structured data files  Tables in Hive  External databases  An existing RDD  DataFrame = schema + RDD SPARK DATAFRAME
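A sketch of two of these creation paths (spark-shell assumed; the file path and case class are illustrative):

    import spark.implicits._

    // From a structured data file (hypothetical path)
    val fromJson = spark.read.json("examples/people.json")

    // From an existing RDD of case classes
    case class Person(name: String, age: Int)
    val rdd = spark.sparkContext.parallelize(Seq(Person("Alice", 29), Person("Bob", 41)))
    val fromRdd = rdd.toDF()

    fromRdd.printSchema()  // the named columns / schema are what turn a plain RDD into a DataFrame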
EXAMPLES
SPARK SQL COMPETITION
 Hive  Parquet  Json  Avro  Amazon red shift  Csv  Others It is recommended as a starting point for any spark application As it adds  Predicate push down  Column pruning  Can use SQL & RDD SPARK SQL DATA SOURCES
SPARK STREAMING
 Big & fast data  Gigabytes per second  Real-time fraud detection  Marketing  Spark Streaming makes it easy to build scalable, fault-tolerant streaming applications SPARK STREAMING
SPARK STREAMING COMPETITORS Streaming data sources: • Kafka • Flume • Twitter • Hadoop HDFS • Others (live logs, system telemetry data, IoT device data, etc.)
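A minimal DStream word count as a sketch, assuming spark-shell (SparkContext sc predefined) and a text source on localhost:9999 (e.g. nc -lk 9999); all names are illustrative:

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(sc, Seconds(5))        // 5-second micro-batches
    val lines = ssc.socketTextStream("localhost", 9999)   // one of many possible sources (Kafka, Flume, ...)
    val counts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
    counts.print()

    ssc.start()
    ssc.awaitTermination()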
SPARK MLLIB
 MLlib is a standard component of Spark providing machine learning primitives on top of Spark. SPARK MLLIB
 MATLAB  R  Easy to use but not scalable  Mahout  GraphLab  Scalable but at the cost of ease of use  org.apache.spark.mllib: RDD-based algorithms  org.apache.spark.ml: Pipeline API built on top of DataFrames SPARK MLLIB COMPETITION
 Loading the data  Extracting features  Training the model  Testing the model  The new Pipeline API allows tuning, testing, and early failure detection (a sketch follows) MACHINE LEARNING FLOW
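A sketch of that flow with the Pipeline API (spark.ml); the tiny inline training set is illustrative only:

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

    // Load the data (here: a toy in-memory DataFrame)
    val training = spark.createDataFrame(Seq(
      (0L, "spark is fast", 1.0),
      (1L, "hadoop mapreduce", 0.0)
    )).toDF("id", "text", "label")

    // Extract features and train as one pipeline
    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
    val lr = new LogisticRegression().setMaxIter(10)
    val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

    val model = pipeline.fit(training)                              // training
    model.transform(training).select("text", "prediction").show()  // in practice, test on held-out data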
 Algorithms Classifcation ex: naïve bayes Regression Linear Logistic Filtering by als ,k squares Clustering by k-means Dimensional reduction by SVD singular value decomposition  Feature extraction and transformations Tf-idf : term frequency- inverse document frequency ALGRITHMS IN MLIB
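As one example from this list, a k-means clustering sketch with the DataFrame-based API; the two-dimensional points are made up:

    import org.apache.spark.ml.clustering.KMeans
    import org.apache.spark.ml.linalg.Vectors

    val points = spark.createDataFrame(Seq(
      Tuple1(Vectors.dense(0.0, 0.0)),
      Tuple1(Vectors.dense(0.1, 0.1)),
      Tuple1(Vectors.dense(9.0, 9.0)),
      Tuple1(Vectors.dense(9.1, 9.1))
    )).toDF("features")

    val kmeans = new KMeans().setK(2).setSeed(1L)
    val model = kmeans.fit(points)
    model.clusterCenters.foreach(println)  // one center per cluster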
 Spam filtering  Fraud detection  Recommendation analysis  Speech recognition PRACTICAL USE
 Word2Vec algorithm  This algorithm takes input text and outputs a set of vectors representing a dictionary of words [to see word similarity]  We cache the RDDs because MLlib makes multiple passes over the same data, so the in-memory cache can reduce processing time a lot  Breeze is the numerical processing library used inside Spark  It has the ability to perform mathematical operations on vectors MLLIB DEMO
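A hedged Word2Vec sketch with the DataFrame-based API (spark.ml); the toy documents are illustrative:

    import org.apache.spark.ml.feature.Word2Vec

    val docs = spark.createDataFrame(Seq(
      "spark makes big data simple".split(" "),
      "mllib brings machine learning to spark".split(" "),
      "graphx brings graphs to spark".split(" ")
    ).map(Tuple1.apply)).toDF("text")

    val word2Vec = new Word2Vec()
      .setInputCol("text").setOutputCol("result")
      .setVectorSize(3).setMinCount(0)
    val model = word2Vec.fit(docs)

    model.getVectors.show()                // the learned dictionary of word vectors
    model.findSynonyms("spark", 3).show()  // nearest words by cosine similarity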
SPARK GRAPHX
 GraphX is Apache Spark's API for graphs and graph-parallel computation.  Page ranking  Producing evaluations  It can be used in genetic analysis  ALGORITHMS  PageRank  Connected components  Label propagation  SVD++  Strongly connected components  Triangle count GRAPHX - FROM A TABLE-STRUCTURED TO A GRAPH-STRUCTURED WORLD
COMPETITION End-to-end PageRank performance (20 iterations, 3.7B edges)
 Vertices (joints) each have a unique ID  Each vertex can have properties of a user-defined type and store metadata ARCHITECTURE
 Arrows are the relations between vertices, known as edges; they can store metadata and reference source and destination vertex IDs, which are of type Long  A graph is built of two RDDs: one containing the collection of edges and one containing the collection of vertices
 Another component is the edge triplet: an object that exposes the relation between a pair of vertices and the edge connecting them, containing all the information for each connection (see the sketch below)
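A small GraphX sketch built from a vertex RDD and an edge RDD (spark-shell assumed; the people and "follows" relations are made up):

    import org.apache.spark.graphx.{Edge, Graph, VertexId}

    val vertices = sc.parallelize(Seq[(VertexId, String)](
      (1L, "Alice"), (2L, "Bob"), (3L, "Carol")))
    val edges = sc.parallelize(Seq(
      Edge(1L, 2L, "follows"), Edge(3L, 2L, "follows")))

    val graph = Graph(vertices, edges)

    // Edge triplets expose source vertex, destination vertex, and edge attribute together
    graph.triplets.collect().foreach(t => println(s"${t.srcAttr} ${t.attr} ${t.dstAttr}"))

    // Built-in PageRank over the same graph
    val ranks = graph.pageRank(0.001).vertices
    ranks.join(vertices).collect().foreach(println)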
WHO IS USING SPARK?
 http://spark.apache.org  Tutorials: http://ampcamp.berkeley.edu  Spark Summit: http://spark-summit.org  Github: https://github.com/apache/spark  https://data-flair.training/blogs/spark-sql-tutorial/ REFERENCES
