Getting Started with Apache Spark
Presented by Manish Mishra and Pradyuman Pratap Singh
KnolX Etiquettes
Lack of etiquette and manners is a huge turn-off.
- Punctuality: Join the session 5 minutes prior to the session start time. We start on time and conclude on time!
- Feedback: Make sure to submit constructive feedback for all sessions, as it is very helpful for the presenter.
- Silent Mode: Keep your mobile devices in silent mode; feel free to step out of the session if you need to attend an urgent call.
- Avoid Disturbance: Avoid unwanted chit-chat during the session.
Agenda
1. Introduction to Big Data and Apache Spark
   - What is Big Data?
   - What is Apache Spark?
   - Features of Apache Spark
2. Overview of Spark Architecture
3. Spark Components
4. Spark Basics & Programming Model
   - Spark Context
   - Spark Session
   - RDD
   - DataFrame
   - RDD vs DataFrame
5. Advantages of Apache Spark
6. Disadvantages of Apache Spark
7. Demo
What is Big Data?
Big Data refers to data sets so large, fast-moving, and complex that traditional systems cannot store or process them effectively. It spans a wide variety of data types from many sources and is commonly characterized by the 5 Vs:
- Volume: massive amounts of data.
- Velocity: the speed at which data is generated and processed.
- Variety: different types of data (structured, semi-structured, unstructured).
- Veracity: the quality and accuracy of the data.
- Value: the usefulness the data provides.
What is Apache Spark?
- Apache Spark is an open-source analytics engine for large-scale distributed data processing and machine learning applications. It can handle both batch and real-time analytics and data processing workloads.
- It builds on the Hadoop MapReduce model, extending it to efficiently support more types of computation, including interactive queries and stream processing.
- The main feature of Spark is its in-memory computation, which greatly increases the processing speed of an application.
Features of Apache Spark
- In-Memory Computation
- Speed
- Different Cluster Managers
- Distributed Processing
- Fault Tolerance
- Lazy Evaluation
2. Apache Spark Architecture
Spark follows a driver/executor model: the driver program hosts the SparkContext, negotiates resources with a cluster manager (such as Standalone, YARN, Mesos, or Kubernetes), and distributes tasks to executors running on worker nodes.
3. Spark Components
Spark Engine:
- Spark Core
Libraries:
- Spark SQL
- Spark Streaming (real-time processing)
- MLlib (machine learning)
- GraphX (graph processing)
Supported Languages:
- Scala
- Java
- Python
- R
4. Spark Basics
1. Spark Context: SparkContext is the primary entry point to any Spark functionality. When you run a Spark application, a driver program starts; it contains the main function, and the SparkContext is initialized there. The driver program then runs operations inside executors on worker nodes.
2. Spark Session: SparkSession, introduced in Spark 2.0, is the unified entry point for Spark applications. It acts as a gateway to all of Spark's underlying functionality, including RDDs, DataFrames, and Datasets, providing a single interface for structured data processing.
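As a minimal sketch, this is how the two entry points are typically obtained in a Scala application; the application name and master URL here are placeholder values, not part of the original deck:

import org.apache.spark.sql.SparkSession

object EntryPoints {
  def main(args: Array[String]): Unit = {
    // SparkSession: the unified entry point since Spark 2.0.
    // "local[*]" runs Spark locally on all cores; a real deployment
    // would pass a cluster manager URL instead.
    val spark = SparkSession.builder()
      .appName("getting-started") // placeholder application name
      .master("local[*]")
      .getOrCreate()

    // The older SparkContext entry point is still available through the session.
    val sc = spark.sparkContext
    println(s"Running Spark ${spark.version} with parallelism ${sc.defaultParallelism}")

    spark.stop()
  }
}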
RDD
- Resilient Distributed Dataset (RDD) is the fundamental data structure of Spark: an immutable, distributed collection of objects. Each RDD is divided into logical partitions, which may be computed on different nodes of the cluster.
- There are two ways to create RDDs: by parallelizing an existing collection in your driver program, or by referencing a dataset in an external storage system, such as a shared file system, HDFS, HBase, or any data source offering a Hadoop InputFormat.
RDD operations come in two kinds:
- Transformations
- Actions
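A short sketch of both creation routes and the transformation/action split; it assumes the SparkSession from the previous sketch is in scope as `spark`, and the HDFS path is a placeholder:

val sc = spark.sparkContext

// 1. Parallelize an existing collection in the driver program.
val numbers = sc.parallelize(Seq(1, 2, 3, 4, 5))

// 2. Reference a dataset in external storage (placeholder path).
val lines = sc.textFile("hdfs:///data/input.txt")

// Transformations (map, filter, ...) are lazy: they only build a lineage graph.
val squares = numbers.map(n => n * n)

// Actions (reduce, count, collect, ...) trigger the actual computation.
println(squares.reduce(_ + _)) // 55
println(lines.count())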
DataFrame
- In Spark, DataFrames are distributed collections of data organized into rows and columns. Each column in a DataFrame has a name and an associated type. DataFrames resemble traditional database tables: structured and concise.
- You can think of a DataFrame as a relational table backed by better optimization techniques.
- Spark DataFrames can be created from various sources, such as Hive tables, log tables, external databases, or existing RDDs, and they support processing huge amounts of data.
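A minimal sketch of two of those creation routes, from a local collection and from an existing RDD; the column names and sample rows are illustrative, and a SparkSession named `spark` is assumed:

import spark.implicits._ // enables toDF on local collections and RDDs

// From a local collection.
val people = Seq(("Alice", 34), ("Bob", 45)).toDF("name", "age")

// From an existing RDD.
val rdd = spark.sparkContext.parallelize(Seq(("Carol", 29)))
val fromRdd = rdd.toDF("name", "age")

people.printSchema() // every column has a name and a type
people.filter($"age" > 40).show()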
RDD vs DataFrame

Data Format
- RDD: structured and unstructured data.
- DataFrame: structured and semi-structured data.

APIs
- RDD: a low-level API that requires more code to perform transformations and actions on data.
- DataFrame: a high-level API that makes transformations and actions easier to express.

Schema enforcement
- RDD: no explicit schema; often used for unstructured data.
- DataFrame: an explicit schema that describes the data and its types, enforced at runtime.

Optimization
- RDD: no built-in optimization engine.
- DataFrame: query plans are optimized by the Catalyst optimizer.
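To make the API contrast above concrete, here is the same filter-and-count written both ways; the data is illustrative and `spark` is again an assumed SparkSession:

// RDD: low-level, no schema, no Catalyst optimization.
val adultsRdd = spark.sparkContext
  .parallelize(Seq(("Alice", 34), ("Bob", 15)))
  .filter { case (_, age) => age >= 18 }
  .count()

// DataFrame: declarative, schema-aware, planned by the Catalyst optimizer.
import spark.implicits._
val adultsDf = Seq(("Alice", 34), ("Bob", 15))
  .toDF("name", "age")
  .filter($"age" >= 18)
  .count()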
5. Advantages of Apache Spark
- In-Memory Computation
- Speed
- Ease of Use
- Advanced Analytics
- Fault Tolerance
- Multi-Language Support
6. Disadvantages of Apache Spark
- Small files issue: many small files are handled inefficiently.
- No file management system of its own: Spark relies on external storage such as HDFS.
- No automatic optimization process: jobs must be tuned manually.
- Fewer algorithms: MLlib covers fewer algorithms than some dedicated machine learning libraries.
7. Demo: Getting Started with Apache Spark (Scala)
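The live demo itself is not captured in this transcript. As a stand-in, here is a minimal, self-contained Scala word count of the kind such a demo typically walks through; the input path and application name are placeholders:

import org.apache.spark.sql.SparkSession

object WordCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("wordcount-demo") // placeholder name
      .master("local[*]")
      .getOrCreate()

    // Read the file, split lines into words, and count occurrences of each word.
    val counts = spark.sparkContext
      .textFile("input.txt")     // placeholder path
      .flatMap(_.split("\\s+"))
      .filter(_.nonEmpty)
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.take(10).foreach(println)
    spark.stop()
  }
}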
