www.edureka.co/apache-spark-scala-training EDUREKA SPARK CERTIFICATION TRAINING
What to expect?
- Spark Overview
- Hadoop Overview
- Spark vs Hadoop
- Why Spark Hadoop?
- Using Hadoop With Spark
- Use Case
- Conclusion
Spark Overview

What is Spark?
- Apache Spark is an open-source cluster-computing framework for real-time processing, originally developed at UC Berkeley's AMPLab and now maintained by the Apache Software Foundation
- Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance
- It builds on the ideas of Hadoop MapReduce and extends the MapReduce model to efficiently support more types of computation
Figure: Data Parallelism In Spark (serial vs. parallel execution, showing the reduction in time)
Figure: Real Time Processing In Spark
Spark Overview
- Polyglot: can be programmed in Scala, Java, Python and R
- Real-time processing: Spark is used in real-time processing
- Lazy evaluation: delays evaluation until a result is needed
- Low latency: real-time computation and low latency because of in-memory computation
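Lazy evaluation and in-memory caching can be seen in a few lines. The following is a minimal sketch, assuming Spark 2.x or later on the classpath and a local master; the object name `LazyEvalSketch` and the helper `evenSquares` are made up for illustration and are not part of the deck.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession

object LazyEvalSketch {
  // Returns (number of even values in 1..upTo, sum of their squares).
  def evenSquares(sc: SparkContext, upTo: Int): (Long, Long) = {
    // Transformations only record lineage; no job runs at this point.
    val numbers = sc.parallelize(1 to upTo)
    val squares = numbers.filter(_ % 2 == 0).map(n => n.toLong * n)

    // cache() is lazy as well: it only marks the RDD for in-memory reuse.
    squares.cache()

    // Actions trigger execution. The first action computes and caches the
    // partitions; the second can read them back from memory.
    val count = squares.count()
    val total = squares.reduce(_ + _)
    (count, total)
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only.
    val spark = SparkSession.builder()
      .appName("lazy-eval-sketch")
      .master("local[2]")
      .getOrCreate()

    val (count, total) = evenSquares(spark.sparkContext, 10)
    println(s"count=$count total=$total") // count=5 total=220

    spark.stop()
  }
}
```

Nothing is computed until `count()` is called, which is what "delays evaluation until needed" means in practice.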
Spark Ecosystem
- Spark Core Engine: the core engine for the entire Spark framework. Provides utilities and architecture for other components
- Spark SQL (SQL): used for structured data. Can run unmodified Hive queries on an existing Hadoop deployment
- Spark Streaming (Streaming): enables analytical and interactive apps for live streaming data
- MLlib (Machine Learning): machine learning libraries built on top of Spark
- GraphX (Graph Computation): graph computation engine (similar to Giraph). Combines data-parallel and graph-parallel concepts
- SparkR (R on Spark): package for the R language that lets R users leverage Spark from the R shell
Spark Features
- Speed: up to 100x faster than Hadoop MapReduce for large-scale data processing
- Powerful Caching: a simple programming layer provides powerful caching and disk persistence capabilities
- Deployment: can be deployed through Mesos, Hadoop via YARN, or Spark's own cluster manager
- Polyglot: can be programmed in Scala, Java, Python and R
Spark Use Cases
- Twitter Sentiment Analysis With Spark: trending topics can be used to create campaigns and attract a larger audience; sentiment helps in crisis management, service adjustment and target marketing
- NYSE: Real Time Analysis of Stock Market Data
- Banking: Credit Card Fraud Detection
- Genomic Sequencing
Hadoop Overview

What is Hadoop?
Hadoop is a framework that allows us to store and process large data sets in a parallel and distributed fashion.
- HDFS (Storage): allows us to store any kind of data across the cluster
- MapReduce (Processing): allows parallel processing of the data stored in HDFS
Figure: Hadoop Cluster (Master and Slaves)
Hadoop Ecosystem
Hadoop Features
- Flexibility: flexible with all kinds of data
- Scalability: in-built capability of integrating seamlessly with cloud-based services
- Economical: usage of commodity hardware minimizes the cost of ownership
- Reliability: Hadoop infrastructure has in-built fault tolerance features
Hadoop Use Cases
- E-Commerce Data Analytics
- Politics: US Presidential Election
- Banking: Credit Card Fraud Detection
- Healthcare
Spark vs Hadoop

Use Cases For Real Time Analytics
- Banking
- Government
- Healthcare
- Telecommunications
- Stock Market
Our Requirements:
- Process data in real time
- Handle input from multiple sources
- Easy to use
- Faster processing
Spark vs Hadoop
- Spark runs up to 100x faster than Hadoop MapReduce.
- The in-memory processing in Spark is what makes it faster than MapReduce.
- Spark is not considered a replacement for Hadoop but an extension to it.
Figure: PageRank Performance, iteration time (s) for Hadoop, Basic Spark, and Spark + Controlled Partitioning
The best case as per our chart is when Spark is used alongside Hadoop. Let us dive in and use Hadoop with Spark.
Why use Spark with Hadoop?

Why Spark Hadoop?
Using Spark and Hadoop together lets us combine Spark's processing with the best of Hadoop's HDFS and YARN.
Figure: Spark with Hadoop. Labels: sources (Spark Streaming, CSV, Sequence File, Avro, Parquet), HDFS (storage), YARN (resource allocation), Spark (processing), MapReduce (optional processing), input data, output data
Using Hadoop with Spark
- Spark can be used along with MapReduce in the same Hadoop cluster, or separately as a processing framework
- Spark applications can also be run on YARN (Hadoop NextGen)
- MapReduce and Spark can be used together, with MapReduce for batch processing and Spark for real-time processing
- Spark can run on top of HDFS to leverage the distributed replicated storage
YARN Deployment With Spark
YARN Cluster Mode
- In YARN-Cluster mode, the Spark driver runs inside an application master process which is managed by YARN
- The client can go away after initiating the application
YARN Client Mode
- In YARN-Client mode, the Spark driver runs in the client process
- The application master is only used for requesting resources from YARN
Figure: Cluster Deployment Mode
Figure: Client Deployment Mode
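The choice between the two modes is made at submission time. One way to express it from Scala is Spark's `SparkLauncher` API; the sketch below only builds the launchers and does not submit anything. The jar path is hypothetical, and the main class name assumes the `basketball` object from the use case.

```scala
import org.apache.spark.launcher.SparkLauncher

object YarnSubmitSketch {
  // Builds (but does not start) a launcher for the given deploy mode.
  def launcherFor(deployMode: String): SparkLauncher =
    new SparkLauncher()
      .setAppName("basketball")
      .setMaster("yarn")
      .setDeployMode(deployMode)                 // "cluster" or "client"
      .setMainClass("basketball")                // assumed: the object from the use case
      .setAppResource("/path/to/basketball.jar") // hypothetical jar location

  def main(args: Array[String]): Unit = {
    // Driver runs inside the YARN application master.
    val clusterMode = launcherFor("cluster")
    // Driver runs in the submitting client process.
    val clientMode = launcherFor("client")

    // Calling launch() or startApplication() on either launcher would submit
    // the job; that requires a Spark installation and a reachable YARN
    // cluster, so it is left out of this sketch.
    println(s"prepared launchers: $clusterMode, $clientMode")
  }
}
```

Note that a master set in code through `SparkConf.setMaster` takes precedence over the one given at submission, so an application that hard-codes `local[2]` would need that call removed before it can run on YARN.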
Use Case – Sports Analysis Using Spark Hadoop

Use Case
Figure: Kevin Durant, NBA MVP 2014
Figure: Stephen Curry, NBA MVP 2015 & 2016
Figure: Joe Hassett, Highest 3 Pt Normalized
Figure: LeBron James, NBA MVP '10, '12 & '13
Problem Statement
To build a sports analysis system using Spark Hadoop for predicting game results and player rankings for sports like Basketball, Football, Cricket, Soccer, etc. We will demonstrate this using Basketball for our use case.
Use Case – Flow Diagram
1. Huge amount of sports data
2. Data stored in HDFS
3. Using Spark processing for analysis
4. Queries:
   - Calculate Top Scorers Per Season
   - Predict the NBA Most Valuable Player (MVP)
   - Compare Teams to Predict Winners
Use Case – Dataset
Figure: Dataset from http://www.basketball-reference.com/leagues/NBA_2016_per_game.html
Use Case – Initializing Spark Packages
//Importing the necessary packages
import org.apache.spark.rdd._
import org.apache.spark.rdd.RDD
import org.apache.spark.util.IntParam
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import org.apache.spark.util.StatCounter
import org.apache.spark.sql.Row
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import scala.collection.mutable.ListBuffer
import org.apache.spark.SparkContext._
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import scala.io.Source
import scala.collection.mutable.HashMap
import java.io.File
Use Case – Reading Data From HDFS
//Creating an object basketball containing our main() method
//(the body of main() continues on the following slides)
object basketball {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("basketball").setMaster("local[2]")
    val sc = new SparkContext(sparkConf)
    //Tag every statistics line with its season and write it back to HDFS
    for (i <- 1980 to 2016) {
      println(i)
      val yearStats = sc.textFile(s"hdfs://localhost:9000/basketball/BasketballStats/leagues_NBA_$i*")
      yearStats.filter(x => x.contains(",")).map(x => (i,x)).saveAsTextFile(s"hdfs://localhost:9000/basketball/BasketballStatsWithYear/$i/")
    }
Use Case – Parsing Data And Broadcasting
//Read in all the statistics (the path must point at the directory written in the previous step)
val stats = sc.textFile("hdfs://localhost:9000/basketball/BasketballStatsWithYear4/*/*").repartition(sc.defaultParallelism)
//Filter out the junk rows and clean up data for errors
val filteredStats = stats.filter(line => !line.contains("FG%")).filter(line => line.contains(",")).map(line => line.replace("*","").replace(",,",",0,"))
filteredStats.cache()
//Parse statistics and save as Map
val txtStat = Array("FG","FGA","FG%","3P","3PA","3P%","2P","2PA","2P%","eFG%","FT","FTA","FT%","ORB","DRB","TRB","AST","STL","BLK","TOV","PF","PTS")
val aggStats = processStats(filteredStats,txtStat).collectAsMap
//Collect RDD into map and broadcast it into 'broadcastStats'
val broadcastStats = sc.broadcast(aggStats)
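The deck does not show the body of `processStats`. Later slides look up keys such as `"2016_3P_avg"` in the broadcast map, so the result is presumably keyed by season and statistic. The following plain-Scala sketch illustrates that kind of aggregation on an in-memory collection; `SeasonLine`, `yearlyAverages` and the sample numbers are all hypothetical, and this is not the deck's actual implementation.

```scala
object YearlyAveragesSketch {
  // Hypothetical simplified record: one player-season with a few per-game stats.
  final case class SeasonLine(year: Int, stats: Map[String, Double])

  // Returns a map keyed "<year>_<stat>_avg", mirroring the lookup key
  // format used on the later slides.
  def yearlyAverages(lines: Seq[SeasonLine], statNames: Seq[String]): Map[String, Double] = {
    val entries = for {
      (year, seasonLines) <- lines.groupBy(_.year).toSeq
      stat <- statNames
      values = seasonLines.flatMap(_.stats.get(stat))
      if values.nonEmpty
    } yield s"${year}_${stat}_avg" -> values.sum / values.size
    entries.toMap
  }

  def main(args: Array[String]): Unit = {
    // Made-up numbers, for illustration only.
    val lines = Seq(
      SeasonLine(2016, Map("3P" -> 5.0, "PTS" -> 30.0)),
      SeasonLine(2016, Map("3P" -> 1.0, "PTS" -> 10.0)),
      SeasonLine(1981, Map("3P" -> 0.5, "PTS" -> 20.0))
    )
    val avgs = yearlyAverages(lines, Seq("3P", "PTS"))
    println(avgs("2016_3P_avg")) // 3.0
    println(avgs("1981_3P_avg")) // 0.5
  }
}
```

In the Spark version the resulting map is small enough to collect on the driver, which is why it can be broadcast to every executor.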
Use Case – Player Statistics Transformations
//Parse stats and track weights
val txtStatZ = Array("FG","FT","3P","TRB","AST","STL","BLK","TOV","PTS")
val zStats = processStats(filteredStats,txtStatZ,broadcastStats.value).collectAsMap
//Collect RDD into Map and broadcast into 'zBroadcastStats'
val zBroadcastStats = sc.broadcast(zStats)
//Parse stats and normalize (uses both broadcast maps, so it must come after 'zBroadcastStats' is defined)
val nStats = filteredStats.map(x=>bbParse(x,broadcastStats.value,zBroadcastStats.value))
//Map RDD to RDD[Row] so that we can turn it into a DataFrame
val nPlayer = nStats.map(x => Row.fromSeq(Array(x.name,x.year,x.age,x.position,x.team,x.gp,x.gs,x.mp) ++ x.stats ++ x.statsZ ++ Array(x.valueZ) ++ x.statsN ++ Array(x.valueN)))
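The normalization inside `bbParse` is not shown in the deck. The `z` prefix on names such as `statsZ`, `z3p` and `zTot` suggests z-scores, that is, each statistic expressed as its distance from the season mean in units of standard deviation. The sketch below assumes exactly that; the helper names and sample values are hypothetical.

```scala
object ZScoreSketch {
  def mean(xs: Seq[Double]): Double = xs.sum / xs.size

  // Population standard deviation.
  def stdDev(xs: Seq[Double]): Double = {
    val m = mean(xs)
    math.sqrt(xs.map(x => (x - m) * (x - m)).sum / xs.size)
  }

  // z = (x - mean) / stdDev; all zeros when the values do not vary.
  def zScores(xs: Seq[Double]): Seq[Double] = {
    val m = mean(xs)
    val sd = stdDev(xs)
    if (sd == 0.0) xs.map(_ => 0.0) else xs.map(x => (x - m) / sd)
  }

  def main(args: Array[String]): Unit = {
    // Made-up per-game values for one statistic in one season.
    val values = Seq(2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0)
    println(mean(values))    // 5.0
    println(stdDev(values))  // 2.0
    println(zScores(values)) // List(-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0)
  }
}
```

Normalizing per season is what makes players from different eras comparable: a value is judged against the league average of its own year rather than against an all-time scale.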
Use Case – Querying through Spark SQL

Use Case – Getting All Player Statistics
//Create schema for the data frame
val schemaN = StructType(
  StructField("name", StringType, true) ::
  StructField("year", IntegerType, true) ::
  ...
  StructField("nTOT", DoubleType, true) :: Nil
)
//Create DataFrame 'dfPlayersT' and register as 'tPlayers'
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val dfPlayersT = sqlContext.createDataFrame(nPlayer,schemaN)
dfPlayersT.registerTempTable("tPlayers")
//Create DataFrame 'dfPlayers' and register as 'Players'
val dfPlayers = sqlContext.sql("select age-min_age as exp,tPlayers.* from tPlayers join (select name,min(age)as min_age from tPlayers group by name) as t1 on tPlayers.name=t1.name order by tPlayers.name, exp ")
dfPlayers.registerTempTable("Players")
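The join in the last query derives an experience column: each row's age minus the youngest age recorded for the same player. The same logic on a plain Scala collection might look like the sketch below; the case classes and sample rows are hypothetical and carry only the columns needed for the illustration.

```scala
object ExperienceSketch {
  final case class PlayerSeason(name: String, year: Int, age: Int)
  final case class WithExp(exp: Int, season: PlayerSeason)

  def withExperience(seasons: Seq[PlayerSeason]): Seq[WithExp] = {
    // Counterpart of the subquery: select name, min(age) ... group by name
    val minAge: Map[String, Int] =
      seasons.groupBy(_.name).map { case (name, rows) => name -> rows.map(_.age).min }

    // Counterpart of the join and of "order by tPlayers.name, exp"
    seasons
      .map(s => WithExp(s.age - minAge(s.name), s))
      .sortBy(w => (w.season.name, w.exp))
  }

  def main(args: Array[String]): Unit = {
    // Made-up rows, for illustration only.
    val seasons = Seq(
      PlayerSeason("Player A", 2016, 22),
      PlayerSeason("Player A", 2014, 20),
      PlayerSeason("Player B", 2016, 25),
      PlayerSeason("Player A", 2015, 21)
    )
    withExperience(seasons).foreach(w => println(s"${w.season.name} ${w.season.year} exp=${w.exp}"))
    // Player A 2014 exp=0
    // Player A 2015 exp=1
    // Player A 2016 exp=2
    // Player B 2016 exp=0
  }
}
```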
Use Case – Storing Best Players Into HDFS
//Calculate the best players of 2016
val mvp = sqlContext.sql("Select name, zTot from Players where year=2016 order by zTot desc").cache
mvp.show
//Storing the best players of 2016 into HDFS
mvp.write.format("csv").save("hdfs://localhost:9000/basketball/output.csv")
//Listing the full numbers of LeBron James
sqlContext.sql("Select * from Players where year=2016 and name='LeBron James'").collect.foreach(println)
//Ranking the top 10 players on the average 3 pointers scored per game in 2016
sqlContext.sql("select name, 3p, z3p from Players where year=2016 order by z3p desc").take(10).foreach(println)
Use Case – Storing Best Players Into HDFS
Figure: Best Player Of 2016
Figure: Most 3 Pointers In 2016
Figure: All Stats Of LeBron James
Use Case – Sample Result File in HDFS
Figure: Output file containing top NBA players of 2016
Figure: Output directory in HDFS file system (output directory path highlighted)
Use Case – Highest 3 Point Shooters
//All time 3 point shooting ranking
sqlContext.sql("select name, 3p, z3p from Players order by 3p desc").take(10).foreach(println)
//All time 3 point shooting ranking normalized to their leagues
sqlContext.sql("select name, 3p, z3p from Players order by z3p desc").take(10).foreach(println)
//Look up the average number of 3 pointers per game in 2016
broadcastStats.value("2016_3P_avg")
//Look up the average number of 3 pointers per game in 1981
broadcastStats.value("1981_3P_avg")
Use Case – Highest 3 Point Shooters
Figure: Best All Time 3 Point Shooter
Figure: Best All Time 3 Point Shooter Normalized To Their Season
Use Case – Prediction Analysis Results

Use Case – Who Will Be The 2016 NBA MVP?
Candidates: LeBron James, Stephen Curry, James Harden, Russell Westbrook, Kobe Bryant, Dwyane Wade
sqlContext.sql("select name, zTot from Players where year=2016 order by zTot desc").take(10).foreach(println)
Use Case – Predicting MVP 2016
Our model predicts that Stephen Curry is the NBA MVP of 2016. Hell yeah! This matches the actual NBA MVP of 2016.
Summary
- Spark Overview
- Hadoop Overview
- Spark vs Hadoop
- Why Spark Hadoop?
- YARN Spark Deployment
- Sport Analysis
Conclusion
Congrats! We have demonstrated the power of Spark Hadoop in prediction analytics. The hands-on examples should give you the confidence to work on future projects you encounter in Apache Spark and Hadoop.
Thank You … Questions/Queries/Feedback

Spark Hadoop Tutorial | Spark Hadoop Example on NBA | Apache Spark Training | Edureka