Apache Spark: Beyond Hadoop MapReduce
Presenter: Vishal
www.edureka.co/apache-spark-scala-training

What will you learn today?
• Strength of MapReduce
• Limitations of MapReduce
• How MapReduce limitations can be overcome
• How Spark fits the bill
• Other exciting features in Spark

Strength of MapReduce
• Simple: jobs can be written in several programming languages, such as Java, C++ or Python.
• Scalable: it can process petabytes of data stored in HDFS on one cluster.
• Fault tolerant: MapReduce handles failures by using the replicated copies of the data.
• Minimal data motion: the computation moves to the data, which minimizes data transfer across the network.

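
To make the programming model behind these strengths concrete, here is a toy single-process sketch of the map, shuffle and reduce steps for a word count. It is plain Python, not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample input are assumptions made for illustration, and a real cluster would run the map and reduce calls in parallel on the nodes that hold the data.

```python
from collections import defaultdict


def map_phase(line):
    # Emit a (word, 1) pair for every word in one input line.
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    # Group all values by key, as the framework does between map and reduce.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    # Combine all values for one key into a single result.
    return key, sum(values)


def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


# Hypothetical sample input, two "lines" of a file.
counts = word_count(["spark beyond mapreduce", "Spark in memory"])
```

Because each `map_phase` call only looks at its own line and each `reduce_phase` call only looks at one key, the work can be split across machines, which is where the scalability comes from.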
Limitations of MapReduce
• Real Time
• Complex Algorithm
• Re-reading and parsing Data
• Minimal Data Motion
• Graph Processing
• Iterative Tasks
• Random Access

Feature Comparison with Spark

Hadoop MapReduce        Spark
----------------        -----
Fast                    Up to 100x faster than MapReduce
Batch processing        Batch and real-time processing
Stores data on disk     Stores data in memory
Written in Java         Written in Scala

Source: Databricks

What are the MR limitations, and how does Spark overcome them?

Overcoming MR limitations
• Limitation: Real time
• How Spark overcomes it: by cutting down on the number of reads and writes to disk

Spark Cuts Down Read/Write I/O To Disk
Spark tries to keep data in the memory of its distributed workers, which allows significantly faster, lower-latency computations, whereas MapReduce keeps moving data to and from disk.

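
The effect can be illustrated with a toy sketch in plain Python (not Spark's API). The `CountingSource` class and both `iterate_*` functions are hypothetical names introduced here. The first loop re-reads and re-parses its input on every iteration, the way a chain of MapReduce jobs would, while the second loads the data once and reuses the in-memory copy, which is the idea behind Spark's in-memory datasets.

```python
import os
import tempfile


class CountingSource:
    """Hypothetical data source that counts how often the file is read from disk."""

    def __init__(self, path):
        self.path = path
        self.disk_reads = 0

    def load(self):
        # Each call opens the file and parses every line again.
        self.disk_reads += 1
        with open(self.path) as f:
            return [float(line) for line in f]


def iterate_from_disk(source, steps):
    # MapReduce-style: every iteration re-reads and re-parses the input.
    total = 0.0
    for _ in range(steps):
        total += sum(source.load())
    return total


def iterate_in_memory(source, steps):
    # Spark-style: load once, keep the parsed data in memory, reuse it.
    cached = source.load()
    total = 0.0
    for _ in range(steps):
        total += sum(cached)
    return total


# Write a small sample file to a temporary location.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("1.0\n2.0\n3.0\n")
    path = tmp.name

disk_source = CountingSource(path)
mem_source = CountingSource(path)
disk_total = iterate_from_disk(disk_source, steps=5)
mem_total = iterate_in_memory(mem_source, steps=5)
os.remove(path)
```

Both versions compute the same total, but `disk_source.disk_reads` grows with the number of iterations while `mem_source.disk_reads` stays at one.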

Overcoming MR limitations
• Limitations: Complex algorithm, Graph processing
• How Spark overcomes them: Libraries for Machine Learning & Streaming

Libraries For ML, Graph Programming …
• MLlib: machine learning library
• GraphX: graph programming
• Spark SQL: Spark interface for RDBMS lovers
• Spark Streaming: utility for continuous ingestion of data

Overcoming MR limitations
• Limitation: Random access
• How Spark overcomes it: Cyclic data flows

Cyclic Data Flows
• Every job in Spark consists of a series of operators and runs on a set of data.
• All the operators in a job are used to construct a DAG (Directed Acyclic Graph).
• The DAG is optimized by rearranging and combining operators where possible.

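
The idea of recording operators first and combining them before execution can be sketched in plain Python. This is a toy model, not Spark's scheduler: the `Pipeline` class is hypothetical, it only supports a linear chain of `map` and `filter` operators (the simplest possible DAG), and the only "optimization" it performs is fusing the whole chain into a single pass over the data.

```python
class Pipeline:
    """Toy lazy pipeline: operators are recorded, not run, until collect()."""

    def __init__(self, data, ops=()):
        self.data = data
        self.ops = tuple(ops)  # linear chain of ("map" | "filter", function)

    def map(self, fn):
        # Record the operator and return a new plan; nothing is executed yet.
        return Pipeline(self.data, self.ops + (("map", fn),))

    def filter(self, pred):
        return Pipeline(self.data, self.ops + (("filter", pred),))

    def collect(self):
        # Fuse all recorded operators into one pass over the data,
        # so no intermediate list is built between operators.
        out = []
        for item in self.data:
            keep = True
            for kind, fn in self.ops:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):
                    keep = False
                    break
            if keep:
                out.append(item)
        return out


calls = []


def traced_double(x):
    # Records each call so we can see when the operator actually runs.
    calls.append(x)
    return x * 2


plan = Pipeline(range(1, 6)).map(traced_double).filter(lambda x: x > 4).map(lambda x: x + 1)
calls_before_collect = len(calls)  # still 0: building the plan runs nothing
result = plan.collect()
```

Building `plan` only describes the work. The operators run when `collect()` is called, and each element passes through the whole chain in one step instead of producing an intermediate dataset after every operator.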
Spark's features make its architecture better than MR's
Other Spark Features In Demand
Spark Features/Modules In Demand
Source: Typesafe

New Features In 2015

Data Frames
• Similar API to data frames in R and Pandas
• Automatically optimised via Spark SQL
• Released in Spark 1.3

SparkR
• Released in Spark 1.4
• Exposes DataFrames, RDDs and MLlib in R

Machine Learning Pipelines
• High Level API
• Featurization
• Evaluation
• Model Tuning

External Data Sources
• Platform API to plug Data-Sources into Spark
• Pushes logic into sources

Source: Databricks

Get Certified in Spark from Edureka
Edureka's Spark and Scala course:
• Learn large-scale data processing by mastering the concepts of Scala, RDD, Traits, OOPS and Spark SQL
• Online Live Courses: 24 hours
• Assignments: 32 hours
• Project: 20 hours
• Lifetime Access + 24 X 7 Support
Go to www.edureka.co/apache-spark-scala-training
Batch starts from 10th October (Weekend Batch)

Thank You
Questions/Queries/Feedback/Survey
Recording and presentation will be made available to you within 24 hours
