www.edureka.co/apache-spark-scala-training EDUREKA SPARK CERTIFICATION TRAINING

What to expect?
• Use Cases of Real Time Analytics
• Movie Recommendation System Using Spark
• What Is Spark?
• Getting Movie Dataset
• Spark Streaming
• Collaborative Filtering
• Spark MLlib
• Fetching Results
• Storing Results

Use Cases of Real Time Analytics: Government
• Government agencies perform real-time analysis mostly in the field of national security.
• Countries need to continuously keep track of all military and police agencies for updates regarding security threats.

Use Cases of Real Time Analytics: Healthcare
• The healthcare domain uses real-time analysis to continuously monitor the medical status of critical patients.
• Hospitals on the lookout for blood and organ transplants need to stay in real-time contact with each other during emergencies.
• Getting medical attention on time is a matter of life and death for patients.

Use Cases of Real Time Analytics: Telecommunications
• Companies whose services revolve around calls, video chats and streaming use real-time analysis to reduce customer churn and stay ahead of the competition.
• They also extract measurements of jitter and delay in mobile networks to improve the customer experience.

Use Cases of Real Time Analytics: Banking
• Banks handle transactions involving almost all of the world’s money.
• This makes it very important to ensure fault-tolerant transactions across the whole system.
• Real-time analytics makes fraud detection possible in banking.

Use Cases of Real Time Analytics: Stock Market
• Stock brokers use real-time analytics to predict the movement of stock portfolios.
• Companies rethink their business models after using real-time analytics to analyze the market demand for their brand.

Movie Recommendation System
Problem Statement: To build a Movie Recommendation System that recommends movies based on a user’s preferences using Apache Spark.
Our Requirements:
• Process huge amounts of data
• Input from multiple sources
• Easy to use
• Fast processing
Apache Spark meets all of these requirements, which makes it a good fit for our Movie Recommendation System.

Use Case – Flow Diagram
1. Huge amount of movie rating data
2. Data from streaming / HDFS
3. Getting input using Spark Streaming
4. Machine learning using MLlib: train the model on the data, evaluate ALS, generate recommendations
5. Fetching results using Spark SQL
6. Storing results in an RDBMS for websites

What is Spark?
• Apache Spark is an open-source cluster-computing framework for real-time processing. It was originally developed at UC Berkeley's AMPLab and is now maintained by the Apache Software Foundation.
• Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.
• It extends the MapReduce model to efficiently support more types of computations. Spark can run on Hadoop clusters, but it is a separate engine and does not depend on Hadoop MapReduce.
Figure: Data Parallelism in Spark (serial vs. parallel execution, showing the reduction in time)
Figure: Real Time Processing in Spark
Figure: Support for multiple source formats
Figure: Lazy Evaluation

Spark Features
• Speed: 100x faster than [logo missing in source] for large scale data processing
• Powerful Caching: Simple programming layer provides powerful caching and disk persistence capabilities
• Deployment: Can be deployed through Mesos, Hadoop via YARN, or Spark’s own cluster manager
• Polyglot: Can be programmed in Scala, Java, Python and R

Movie Dataset
Figure: User Ratings from BookMyShow

Movie Dataset
Figure: Movie Ratings in Our Dataset

Getting Dataset
• For our Movie Recommendation System, we can get user ratings from many popular websites such as IMDB, Rotten Tomatoes and Times Movie Ratings.
• This data is available in many formats, such as CSV files, text files and databases.
• We can either stream the data live from the websites or download it and store it in our local file system or HDFS.
Figure: Various File Formats
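
As a rough illustration of the "download and store" path, the sketch below parses a small ratings file in CSV form using only the Python standard library. The column names (userId, movieId, rating), the Rating type and the sample rows are assumptions made for this example; they are not taken from the dataset shown in the slides.

```python
import csv
import io
from typing import NamedTuple


class Rating(NamedTuple):
    """One user rating; the field layout is an assumption for this sketch."""
    user_id: int
    movie_id: int
    rating: float


def parse_ratings(lines):
    """Parse CSV text with a userId,movieId,rating header into Rating records."""
    reader = csv.DictReader(lines)
    return [
        Rating(int(row["userId"]), int(row["movieId"]), float(row["rating"]))
        for row in reader
    ]


# Hypothetical sample rows standing in for a downloaded ratings file.
sample = io.StringIO("userId,movieId,rating\n77,1,4.0\n77,2,3.5\n12,1,5.0\n")
parsed = parse_ratings(sample)
print(parsed)
```

The same function accepts an open file object, so a file stored on the local file system could be passed in place of the in-memory sample.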

Spark Streaming
• Spark Streaming is used for processing real-time streaming data.
• Spark Streaming enables high-throughput and fault-tolerant stream processing of live data streams.

Collaborative Filtering
• We will use Collaborative Filtering (CF) to predict users' ratings for particular movies based on their ratings for other movies.
• We then combine this with other users’ ratings for that particular movie.

Movie            Alice  Bob  Carol  Dave
Shutter Island   4      3    5      1
Fight Club       5      4    4      2
Dark Knight      5      3    4      ?
21               4      3    ?      5
Home Alone       4      4    5      5

Figure: Predicting the rating of Dave for Dark Knight and Carol for 21 using Collaborative Filtering
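
To make the idea concrete, here is a minimal sketch that fills in the two missing ratings from the table above. It uses a simple user-based approach (a similarity-weighted average of other users' ratings), which is an assumption chosen for brevity; it is not the ALS algorithm that the deck uses through Spark MLlib. The function names are hypothetical.

```python
from math import sqrt

# Ratings copied from the table above; the two "?" cells are simply absent.
ratings = {
    "Alice": {"Shutter Island": 4, "Fight Club": 5, "Dark Knight": 5, "21": 4, "Home Alone": 4},
    "Bob": {"Shutter Island": 3, "Fight Club": 4, "Dark Knight": 3, "21": 3, "Home Alone": 4},
    "Carol": {"Shutter Island": 5, "Fight Club": 4, "Dark Knight": 4, "Home Alone": 5},
    "Dave": {"Shutter Island": 1, "Fight Club": 2, "21": 5, "Home Alone": 5},
}


def cosine_similarity(a, b):
    """Cosine similarity computed over the movies both users have rated."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[m] * b[m] for m in shared)
    norm_a = sqrt(sum(a[m] ** 2 for m in shared))
    norm_b = sqrt(sum(b[m] ** 2 for m in shared))
    return dot / (norm_a * norm_b)


def predict(user, movie):
    """Similarity-weighted average of the other users' ratings for `movie`."""
    weighted_sum = 0.0
    weight_total = 0.0
    for other, other_ratings in ratings.items():
        if other == user or movie not in other_ratings:
            continue
        sim = cosine_similarity(ratings[user], other_ratings)
        weighted_sum += sim * other_ratings[movie]
        weight_total += sim
    return weighted_sum / weight_total if weight_total else None


dave_dark_knight = predict("Dave", "Dark Knight")
carol_21 = predict("Carol", "21")
print(f"Dave / Dark Knight: {dave_dark_knight:.2f}")
print(f"Carol / 21: {carol_21:.2f}")
```

Users whose past ratings look more like the target user's get more weight, so each prediction lands between the lowest and highest rating the other users gave that movie.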

Spark MLlib
• Spark MLlib (Machine Learning Library) is the component used to perform machine learning in Apache Spark.
Machine Learning Using Spark MLlib:
1. Train the model on the data using Alternating Least Squares (ALS)
2. Generate recommendations using Collaborative Filtering
Figure: Machine Learning Flow Diagram
Machine Learning Tools: ML Algorithms, Featurization, Pipelines, Persistence, Utilities
Figure: Machine Learning Tools

Fetching Results

Spark SQL for Fetching Results
Flow: Machine Learning Output -> Spark SQL -> Results
• To get the results from our machine learning step, we use Spark SQL’s DataFrame, Dataset and SQL service.
• The machine learning results need to be stored in an RDBMS so that our web application can display the recommendations to a particular user.

Ratings for Movies
Ratings of Movies for User 77
Figure: User 77’s ratings for different movies

Recommended Movies
Total Number of Recommendations for User 77
Top Movie Recommendations for User 77
Figure: Movies recommended for User 77

Storing Results
• The results of our Movie Recommendation System can be stored either locally or in external storage systems.
• We can store the recommended movies along with their ratings in a text file or a CSV file.
• We should prefer storing the results in an RDBMS so that our web application can access them directly and display recommendations and top movies.
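
The sketch below shows one way the RDBMS option could look, using Python's built-in sqlite3 module as a stand-in for whichever database the web application actually uses. The table name, column names, helper functions and the sample rows (including the movie titles and predicted ratings) are placeholders invented for this example; only the user ID 77 comes from the slides. Note that INSERT OR REPLACE is SQLite-specific syntax.

```python
import sqlite3


def store_recommendations(conn, rows):
    """Create the (hypothetical) recommendations table and upsert the rows."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS recommendations (
            user_id INTEGER NOT NULL,
            movie TEXT NOT NULL,
            predicted_rating REAL NOT NULL,
            PRIMARY KEY (user_id, movie)
        )
        """
    )
    conn.executemany(
        "INSERT OR REPLACE INTO recommendations VALUES (?, ?, ?)", rows
    )
    conn.commit()


def top_movies(conn, user_id, limit=5):
    """Return the highest-rated recommendations for one user."""
    cursor = conn.execute(
        "SELECT movie, predicted_rating FROM recommendations "
        "WHERE user_id = ? ORDER BY predicted_rating DESC LIMIT ?",
        (user_id, limit),
    )
    return cursor.fetchall()


# In-memory database and placeholder rows, for illustration only.
conn = sqlite3.connect(":memory:")
store_recommendations(
    conn,
    [(77, "Movie A", 4.6), (77, "Movie B", 3.9), (12, "Movie A", 2.5)],
)
print(top_movies(conn, 77))
```

A web application would then only need the query in top_movies to display a user's recommendations, without touching the Spark job that produced them.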

Spark Job Trends
• The following is the job trend of Apache Spark across the world.
• Spark has almost three times the average number of jobs of its competitors and has been the market leader since 2014.
Source: www.indeed.com

Summary
• Real Time Analytics
• Spark MLlib
• Spark Streaming
• Movie Recommendation System
• Spark Job Trends
• What is Spark?

Conclusion
Congrats! We have demonstrated the power of Apache Spark in real-time analysis. The hands-on examples will give you the confidence to work on future Apache Spark projects.

Thank You … Questions/Queries/Feedback

Apache Spark Training | Spark Tutorial For Beginners | Apache Spark Certification | Edureka