Walaa Assy Giza Systems Software Developer SPARK LIGHTNING-FAST UNIFIED ANALYTICS ENGINE
HOW DO WE HANDLE EVER-GROWING DATA THAT HAS BECOME BIG DATA?
Basics of Spark Core API  Cluster Managers  Spark Libraries  - SQL  - Streaming  - MLlib  - GraphX  Troubleshooting / Future of Spark AGENDA
 Readability  Expressiveness  Fast  Testability  Interactive  Fault Tolerant  Unify Big Data  Spark officially set a new record in large-scale sorting; rather than computing on disk, Spark makes use of data cached in memory WHY SPARK? TINIER CODE LEADS TO ...
 MapReduce has a very narrow scope, especially batch processing  Each new problem needed a new API to solve it EXPLOSION OF MAPREDUCE
A UNIFIED PLATFORM FOR BIG DATA
SPARK PROGRAMMING LANGUAGES
 The most basic abstraction of Spark  Spark operations fall into two main categories:  Transformations [lazily evaluated, only storing the intent]  Actions  val textFile = sc.textFile("file:///spark/README.md")  textFile.first // action RDD [RESILIENT DISTRIBUTED DATASET]
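A minimal sketch of the transformation/action split, assuming the snippet runs in spark-shell (where sc is predefined) and that the README.md path from the slide exists:

    // Transformations are lazy: nothing is read or computed yet
    val textFile = sc.textFile("file:///spark/README.md")
    val sparkLines = textFile.filter(line => line.contains("Spark"))

    // Actions trigger execution of the stored intent
    val totalLines = textFile.count()   // number of lines in the file
    val firstMatch = sparkLines.first() // first line mentioning "Spark"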
HELLO BIG DATA
 sudo yum install wget  sudo wget https://downloads.lightbend.com/scala/2.13.0-M4/scala-2.13.0-M4.tgz  tar xvf scala-2.13.0-M4.tgz  sudo mv scala-2.13.0-M4 /usr/lib  sudo ln -s /usr/lib/scala-2.13.0-M4 /usr/lib/scala  export PATH=$PATH:/usr/lib/scala/bin SCALA INSTALLATION STEPS
 sudo wget https://www.apache.org/dyn/closer.lua/spark/spark-2.3.1/spark-2.3.1-bin-hadoop2.7.tgz  tar xvf spark-2.3.1-bin-hadoop2.7.tgz  ln -s spark-2.3.1-bin-hadoop2.7 spark  export SPARK_HOME=$HOME/spark-2.3.1-bin-hadoop2.7  export PATH=$PATH:$SPARK_HOME/bin SPARK INSTALLATION – CENTOS 7
SPARK MECHANISM
 A collection of elements partitioned across the nodes of the cluster that can be operated on in parallel  A collection similar to a list or an array from a user's point of view  Processed in parallel to speed up computation, with built-in fault tolerance  RDDs are immutable  Transformations are lazy and stored in a DAG  Actions trigger DAGs  A DAG is a directed acyclic graph of tasks  Each action triggers a fresh execution of the graph, unless the RDD is cached (see the sketch below) RDD
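A small illustration of lazy evaluation and of caching to avoid re-running the whole graph on every action (spark-shell assumed; the data is generated, not from the deck):

    val nums = sc.parallelize(1 to 1000000)
    val squares = nums.map(n => n.toLong * n)  // lazy: only the intent is recorded in the DAG

    squares.count()   // first action: triggers execution of the graph
    squares.count()   // triggers a fresh execution again

    squares.cache()   // mark the RDD for in-memory caching
    squares.count()   // executes once more and materializes the cache
    squares.count()   // now served from memory, no recomputation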
INPUT DATASET TYPES
 Map  Flatmap  Filter  Distinct  Sample  Union  Inttersection  Subtract  Cartesian Transformations return RDDs TRANSFORMATIONS IN MAP REDUCE
 collect()  count()  take(num)  takeOrdered(num)(ordering)  reduce(function)  aggregate(zeroValue)(seqOp, combOp)  foreach(function)  Actions return different types according to each action  saveAsObjectFile(path)  saveAsTextFile(path) // saves as a text file  External connectors:  foreach(T => Unit) // one object at a time  - foreachPartition(Iterator[T] => Unit) // one partition at a time ACTIONS IN SPARK
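The same toy-data style run through the main actions (spark-shell assumed; the output path is hypothetical):

    val nums = sc.parallelize(Seq(5, 3, 8, 1))

    nums.collect()                    // Array(5, 3, 8, 1) brought back to the driver
    nums.count()                      // 4
    nums.take(2)                      // first 2 elements
    nums.takeOrdered(2)               // 2 smallest elements
    nums.reduce(_ + _)                // 17
    nums.aggregate(0)(_ + _, _ + _)   // same sum, expressed as seqOp + combOp
    nums.foreach(n => println(n))     // runs on the executors, one object at a time
    nums.saveAsTextFile("/tmp/nums-output")  // hypothetical output path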
 SQL-like joins on pair RDDs  join  fullOuterJoin  leftOuterJoin  rightOuterJoin  Pair saving  saveAs(NewAPI)HadoopFile  - path  - keyClass  - valueClass  - outputFormatClass  saveAs(NewAPI)HadoopDataset  - conf  saveAsSequenceFile  - saveAsHadoopFile(path, keyClass, valueClass, SequenceFileOutputFormat) PAIR METHODS - CONTD
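A short pair-RDD sketch of the join family on made-up data (spark-shell assumed; the save path is hypothetical):

    val users  = sc.parallelize(Seq((1, "Alice"), (2, "Bob"), (3, "Carol")))
    val orders = sc.parallelize(Seq((1, "book"), (1, "pen"), (3, "lamp")))

    users.join(orders).collect()            // inner join on the key
    users.leftOuterJoin(orders).collect()   // keep all users
    users.rightOuterJoin(orders).collect()  // keep all orders
    users.fullOuterJoin(orders).collect()   // keep everything

    orders.saveAsSequenceFile("/tmp/orders-seq")  // pair saving, hypothetical path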
 Works like a distributed kernel  A basic standalone manager is built into Spark  Hadoop's cluster manager, YARN  Apache Mesos PRIMARY CLUSTER MANAGERS
SPARK-SUBMIT DEMO
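A hedged sketch of what the demo command could look like; the JAR, main class, and master URL are placeholders, not taken from the deck:

    # Submit against the built-in standalone manager; swap --master for
    # yarn or mesos://... when running on those cluster managers
    spark-submit \
      --class com.example.WordCount \
      --master spark://master-host:7077 \
      --deploy-mode client \
      --executor-memory 2G \
      --total-executor-cores 4 \
      target/wordcount-1.0.jar \
      file:///spark/README.md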
SPARK SQL
 Spark SQL is Apache Spark's module for working with structured or semi-structured data.  It is meant to be usable by non-big-data users  As Spark continues to grow, we want to enable wider audiences beyond “Big Data” engineers to leverage the power of distributed processing. Databricks blog (http://bit.ly/17NM70s) SPARK SQL
 Seamlessly mix SQL queries with Spark programs: Spark SQL lets you query structured data inside Spark programs, using either SQL or a familiar DataFrame API  Connect to any data source in the same way  It executes SQL queries  We can read data from an existing Hive installation using Spark SQL  When we run SQL from within another programming language, we get the result back as a Dataset/DataFrame SPARK SQL FEATURES
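A minimal sketch of mixing SQL and the DataFrame API, assuming spark-shell (SparkSession spark predefined) and a hypothetical people.json input:

    import spark.implicits._

    val people = spark.read.json("examples/people.json")
    people.createOrReplaceTempView("people")

    // Plain SQL ...
    val adultsSql = spark.sql("SELECT name, age FROM people WHERE age >= 18")

    // ... or the equivalent DataFrame API, mixed freely in the same program
    val adultsDf = people.filter($"age" >= 18).select("name", "age")

    adultsSql.show()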
DataFrames and SQL provide a common way to access a variety of data sources, including Hive, Avro, Parquet, ORC, JSON, and JDBC. You can even join data across these sources.  Run SQL or HiveQL queries on existing warehouses [Hive integration]  Connect through JDBC or ODBC [standard connectivity]  It is included with Spark DATAFRAMES
 Introduced in the Spark 1.3 release. It is a distributed collection of data organized into named columns. Conceptually it is equivalent to a table in a relational database or a data frame in R/Python. We can create a DataFrame using:  Structured data files  Tables in Hive  External databases  An existing RDD  DataFrame = schema + RDD SPARK DATAFRAME
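A sketch of two of these creation paths (spark-shell assumed; the file path and case class are illustrative):

    import spark.implicits._

    // From a structured data file (hypothetical path)
    val fromJson = spark.read.json("examples/people.json")

    // From an existing RDD of case classes
    case class Person(name: String, age: Int)
    val rdd = spark.sparkContext.parallelize(Seq(Person("Alice", 29), Person("Bob", 41)))
    val fromRdd = rdd.toDF()

    fromRdd.printSchema()  // the named columns / schema are what turn a plain RDD into a DataFrame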
EXAMPLES
SPARK SQL COMPETITION
 Hive  Parquet  Json  Avro  Amazon red shift  Csv  Others It is recommended as a starting point for any spark application As it adds  Predicate push down  Column pruning  Can use SQL & RDD SPARK SQL DATA SOURCES
SPARK STREAMING
 Big & fast data  Gigabytes per second  Real-time fraud detection  Marketing  Spark Streaming makes it easy to build scalable, fault-tolerant streaming applications SPARK STREAMING
SPARK STREAMING COMPETITORS Streaming data sources: • Kafka • Flume • Twitter • Hadoop HDFS • Others (live logs, system telemetry data, IoT device data, etc.)
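A minimal DStream word count as a sketch, assuming spark-shell (SparkContext sc predefined) and a text source on localhost:9999 (e.g. nc -lk 9999); all names are illustrative:

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(sc, Seconds(5))        // 5-second micro-batches
    val lines = ssc.socketTextStream("localhost", 9999)   // one of many possible sources (Kafka, Flume, ...)
    val counts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
    counts.print()

    ssc.start()
    ssc.awaitTermination()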
SPARK MLLIB
 MLlib is a standard component of Spark providing machine learning primitives on top of Spark. SPARK MLLIB
 MATLAB  R  Easy to use but not scalable  Mahout  GraphLab  Scalable but at the cost of ease of use  org.apache.spark.mllib: RDD-based algorithms  org.apache.spark.ml: Pipeline API built on top of DataFrames SPARK MLLIB COMPETITION
 Loading the data  Extracting features  Training the model  Testing the model  The new Pipeline API allows tuning, testing, and early failure detection (a sketch follows) MACHINE LEARNING FLOW
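A sketch of that flow with the Pipeline API (spark.ml); the tiny inline training set is illustrative only:

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

    // Load the data (here: a toy in-memory DataFrame)
    val training = spark.createDataFrame(Seq(
      (0L, "spark is fast", 1.0),
      (1L, "hadoop mapreduce", 0.0)
    )).toDF("id", "text", "label")

    // Extract features and train as one pipeline
    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
    val lr = new LogisticRegression().setMaxIter(10)
    val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

    val model = pipeline.fit(training)                              // training
    model.transform(training).select("text", "prediction").show()  // in practice, test on held-out data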
 Algorithms Classifcation ex: naïve bayes Regression Linear Logistic Filtering by als ,k squares Clustering by k-means Dimensional reduction by SVD singular value decomposition  Feature extraction and transformations Tf-idf : term frequency- inverse document frequency ALGRITHMS IN MLIB
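As one example from this list, a k-means clustering sketch with the DataFrame-based API; the two-dimensional points are made up:

    import org.apache.spark.ml.clustering.KMeans
    import org.apache.spark.ml.linalg.Vectors

    val points = spark.createDataFrame(Seq(
      Tuple1(Vectors.dense(0.0, 0.0)),
      Tuple1(Vectors.dense(0.1, 0.1)),
      Tuple1(Vectors.dense(9.0, 9.0)),
      Tuple1(Vectors.dense(9.1, 9.1))
    )).toDF("features")

    val kmeans = new KMeans().setK(2).setSeed(1L)
    val model = kmeans.fit(points)
    model.clusterCenters.foreach(println)  // one center per cluster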
 Spam filtering  Fraud detection  Recommendation analysis  Speech recognition PRACTICAL USE
 Word2Vec algorithm  This algorithm takes input text and outputs a set of vectors representing a dictionary of words [to see word similarity]  We cache the RDDs because MLlib makes multiple passes over the same data, so the in-memory cache can reduce processing time a lot  Breeze is the numerical processing library used inside Spark  It has the ability to perform mathematical operations on vectors MLLIB DEMO
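A hedged Word2Vec sketch with the DataFrame-based API (spark.ml); the toy documents are illustrative:

    import org.apache.spark.ml.feature.Word2Vec

    val docs = spark.createDataFrame(Seq(
      "spark makes big data simple".split(" "),
      "mllib brings machine learning to spark".split(" "),
      "graphx brings graphs to spark".split(" ")
    ).map(Tuple1.apply)).toDF("text")

    val word2Vec = new Word2Vec()
      .setInputCol("text").setOutputCol("result")
      .setVectorSize(3).setMinCount(0)
    val model = word2Vec.fit(docs)

    model.getVectors.show()                // the learned dictionary of word vectors
    model.findSynonyms("spark", 3).show()  // nearest words by cosine similarity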
SPARK GRAPHX
 GraphX is Apache Spark's API for graphs and graph-parallel computation.  Page ranking  Producing evaluations  It can be used in genetic analysis  ALGORITHMS  PageRank  Connected components  Label propagation  SVD++  Strongly connected components  Triangle count GRAPHX - FROM A TABLE-STRUCTURED TO A GRAPH-STRUCTURED WORLD
COMPETITION End-to-end PageRank performance (20 iterations, 3.7B edges)
 Vertices (joints) each have a unique ID  Each vertex can have properties of a user-defined type and store metadata ARCHITECTURE
 Arrows are the relations between vertices, known as edges; they can store metadata and reference source and destination vertex IDs, which are of type Long  A graph is built of two RDDs: one containing the collection of edges and one containing the collection of vertices
 Another component is the edge triplet: an object that exposes the relation between a pair of vertices and the edge connecting them, containing all the information for each connection (see the sketch below)
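A small GraphX sketch built from a vertex RDD and an edge RDD (spark-shell assumed; the people and "follows" relations are made up):

    import org.apache.spark.graphx.{Edge, Graph, VertexId}

    val vertices = sc.parallelize(Seq[(VertexId, String)](
      (1L, "Alice"), (2L, "Bob"), (3L, "Carol")))
    val edges = sc.parallelize(Seq(
      Edge(1L, 2L, "follows"), Edge(3L, 2L, "follows")))

    val graph = Graph(vertices, edges)

    // Edge triplets expose source vertex, destination vertex, and edge attribute together
    graph.triplets.collect().foreach(t => println(s"${t.srcAttr} ${t.attr} ${t.dstAttr}"))

    // Built-in PageRank over the same graph
    val ranks = graph.pageRank(0.001).vertices
    ranks.join(vertices).collect().foreach(println)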
WHO IS USING SPARK?
 http://spark.apache.org  Tutorials: http://ampcamp.berkeley.edu  Spark Summit: http://spark-summit.org  Github: https://github.com/apache/spark  https://data-flair.training/blogs/spark-sql-tutorial/ REFERENCES
