www.edureka.co/data-science Top 5 Algorithms Used in Data Science
Slide 2 What are we going to learn today?
At the end of the session you will be able to understand:
• What Data Science is
• What Data Scientists do
• The top 5 Data Science algorithms:
  • Decision Tree
  • Random Forest
  • Association Rule Mining
  • Linear Regression
  • K-Means Clustering
• A demo of the K-Means Clustering algorithm
Slide 3 Data Science
Slide 4 What is Data Science?
Data science is the practice of extracting meaningful and actionable knowledge from data.
Slide 5 Who are Data Scientists?
Data scientists are people who have a multitude of skills and who love working with data.
Slide 6 Data Science from 1000 feet
[Diagram: Data Science, with Visualization, Data Engineering, Statistics, Advanced Computing, and Domain Expertise]
Slide 7 Arsenal of a Data Scientist
Data Science:
• Data Architecture. Tool: Hadoop
• Machine Learning. Tools: Mahout, Weka, Spark MLlib
• Analytics. Tools: R, Python
Note that evaluating different machine learning algorithms is part of a data scientist's daily work, so it is very important for a data scientist to have a good grasp of various machine learning algorithms.
Slide 8 Machine Learning
Machine Learning is a method of teaching computers to make and improve predictions based on data.
Machine learning is a huge field, with hundreds of different algorithms for solving a myriad of different problems.
• Supervised Learning: the categories of the data are already known.
• Unsupervised Learning: the learning process attempts to find appropriate categories for the data.
Slide 9 Decision Tree
Slide 10 Decision Tree Example
Training Data
Slide 11 Decision Tree, Root: Student (Step-1)
[Tree diagram. Node: Student]
Slide 12 Decision Tree, Root: Student (Step-2)
[Tree diagram. Nodes: Student, Income, Income. Branch label shown: Medium]
Slide 13 Decision Tree, Root: Student (Step-3)
[Tree diagram. Nodes: Student, Income, Income. Leaves: Yes, Yes. Branch label shown: Medium]
Slide 14 Decision Tree, Root: Student (Step-4)
[Tree diagram. Nodes: Student, Income, Income, Age, CR, CR. Leaves: Yes, Yes. Branch labels shown: 31..40, Medium]
Slide 15 Decision Tree, Root: Student (Step-5)
[Tree diagram. Nodes: Student, Income, Income, Age, CR, CR. Leaves: No, Yes, Yes, Yes, Yes. Branch labels shown: 31..40, Medium]
Slide 16 Decision Tree, Root: Student (Step-6)
[Tree diagram, further expanded. Node labels: Student, Income, Age, CR. Leaves: Yes or No. Branch labels shown: 31..40, > 40, Fair, Medium]
Slide 17 Decision Tree, Root: Student
Classification rules:
1. student(no)^income(high)^age(<=30) => buys_computer(no)
2. student(no)^income(high)^age(31..40) => buys_computer(yes)
3. student(no)^income(medium)^CR(fair)^age(>40) => buys_computer(yes)
4. student(no)^income(medium)^CR(fair)^age(<=30) => buys_computer(no)
5. student(no)^income(medium)^CR(excellent)^age(>40) => buys_computer(no)
6. student(no)^income(medium)^CR(excellent)^age(31..40) => buys_computer(yes)
7. student(yes)^income(low)^CR(fair) => buys_computer(yes)
8. student(yes)^income(low)^CR(excellent)^age(31..40) => buys_computer(yes)
9. student(yes)^income(low)^CR(excellent)^age(>40) => buys_computer(no)
10. student(yes)^income(medium) => buys_computer(yes)
11. student(yes)^income(high) => buys_computer(yes)
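The classification rules read off the tree can be written directly as a lookup. The following is a minimal Python sketch, not part of the original slides: the function name `buys_computer`, the argument names, and the string encodings of the attribute values are assumptions made for illustration. Each rule is a set of conditions plus an outcome, and the first rule whose conditions all match decides the answer.

```python
# Each rule pairs a dict of required attribute values with the predicted outcome.
# The rules mirror the 11 classification rules listed on the slide.
RULES = [
    ({"student": "no", "income": "high", "age": "<=30"}, "no"),
    ({"student": "no", "income": "high", "age": "31..40"}, "yes"),
    ({"student": "no", "income": "medium", "cr": "fair", "age": ">40"}, "yes"),
    ({"student": "no", "income": "medium", "cr": "fair", "age": "<=30"}, "no"),
    ({"student": "no", "income": "medium", "cr": "excellent", "age": ">40"}, "no"),
    ({"student": "no", "income": "medium", "cr": "excellent", "age": "31..40"}, "yes"),
    ({"student": "yes", "income": "low", "cr": "fair"}, "yes"),
    ({"student": "yes", "income": "low", "cr": "excellent", "age": "31..40"}, "yes"),
    ({"student": "yes", "income": "low", "cr": "excellent", "age": ">40"}, "no"),
    ({"student": "yes", "income": "medium"}, "yes"),
    ({"student": "yes", "income": "high"}, "yes"),
]


def buys_computer(student, income, age, cr):
    """Return "yes" or "no" from the first matching rule, or None if no rule applies."""
    record = {"student": student, "income": income, "age": age, "cr": cr}
    for conditions, outcome in RULES:
        if all(record[key] == value for key, value in conditions.items()):
            return outcome
    return None  # the listed rules do not cover every combination


print(buys_computer("yes", "medium", "<=30", "fair"))
print(buys_computer("no", "high", "<=30", "fair"))
```

Combinations that no listed rule covers, such as a non-student with low income, return `None` in this sketch rather than a guess.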
Slide 18 Random Forest
Slide 19 Random Forest: Example
Suppose you are undecided about watching the movie "Edge of Tomorrow". You can do one of the following:
1. Ask your best friend whether you will like the movie.
2. Ask a group of friends.
Slide 20 Random Forest: Example
In order to answer, your best friend first needs to figure out what movies you like, so you give her a bunch of movies and tell her whether you liked each one or not (i.e., you give her a labelled training set).
Example: Do you like movies starring Emily Blunt?
[Decision tree diagram "Ask Best Friend". Questions: Is it based on a true incident? Does Emily Blunt star in it? Is she the main lead? Outcomes: "Yes, you will like the movie" or "No, you will not like the movie"]
Slide 21 Random Forest: Example
But your best friend might not always generalize your preferences very well (i.e., she overfits).
To get more accurate recommendations, you ask a group of friends, e.g. Friend #1, Friend #2, and Friend #3, and they vote on whether you will like a movie.
The majority of the votes decides the final outcome.
Slide 22 Random Forest: Example
[Diagram: decision trees for Friend 1, Friend 2, and Friend 3. Node text: You didn't like 'Far and Away'; You liked 'Oblivion'; You like action movies; You like Tom Cruise; You like his pairing with Emily Blunt; You did not like 'Top Gun'; You loved 'Godzilla'; You hate Tom Cruise. Outcomes: "Yes, you will like the movie" or "No, you will not like the movie"]
Slide 23 What is Random Forest?
Random Forest is an ensemble classifier built from many decision tree models.
What are ensemble models?
• Ensemble models combine the results from different models.
• The result from an ensemble model is usually better than the result from any one of the individual models.
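The voting idea from the movie example can be sketched in a few lines of Python. This is an illustration only: the three `friend_*` functions, their rules, and the movie record are hypothetical stand-ins for individually trained decision trees, not the trees drawn on the slides. A real random forest would also train each tree on a random sample of the data and a random subset of the features, which this sketch leaves out.

```python
from collections import Counter


# Hypothetical "friends": each one is a tiny hand-written decision rule
# standing in for a trained decision tree.
def friend_1(movie):
    return "yes" if movie["genre"] == "action" else "no"


def friend_2(movie):
    return "yes" if "Emily Blunt" in movie["cast"] else "no"


def friend_3(movie):
    return "no" if "Tom Cruise" in movie["cast"] else "yes"


def forest_predict(movie, friends):
    """Collect one vote per friend and return (majority answer, all votes)."""
    votes = [friend(movie) for friend in friends]
    majority = Counter(votes).most_common(1)[0][0]
    return majority, votes


# Hypothetical movie record used only for this example.
movie = {
    "title": "Edge of Tomorrow",
    "genre": "action",
    "cast": ["Tom Cruise", "Emily Blunt"],
}

answer, votes = forest_predict(movie, [friend_1, friend_2, friend_3])
print(votes, "->", answer)
```

Using an odd number of voters avoids ties in a two-class vote.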
Slide 24 Association Rule Mining
Slide 25 Association Rule Mining
Slide 26 Association Rule Mining
• Association Rule Mining is a popular and well-researched method for discovering interesting relations between variables in large data sets.
• For example, a rule found in the sales data of a supermarket might indicate that a customer who buys onions and potatoes together is likely to also buy hamburger meat.
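How strongly a rule such as {onions, potatoes} => {hamburger meat} holds is commonly measured with support and confidence. These two measures are standard in association rule mining but are not named on the slide, and the transactions below are made up for illustration. A minimal Python sketch:

```python
# Hypothetical shopping baskets, invented for this example.
transactions = [
    {"onions", "potatoes", "hamburger meat"},
    {"onions", "potatoes", "hamburger meat", "milk"},
    {"onions", "potatoes"},
    {"milk", "bread"},
    {"potatoes", "hamburger meat"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Of the baskets containing the antecedent, the fraction that also contain the consequent."""
    both = set(antecedent) | set(consequent)
    return support(both, baskets) / support(antecedent, baskets)


rule_if = {"onions", "potatoes"}
rule_then = {"hamburger meat"}
print("support:", support(rule_if | rule_then, transactions))
print("confidence:", confidence(rule_if, rule_then, transactions))
```

In this sketch `confidence` assumes the antecedent appears in at least one basket; otherwise the division would fail.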
Slide 27 Linear Regression
Slide 28 Regression Analysis: Linear Regression
Regression analysis helps us understand how the value of the dependent variable changes when any one of the independent variables changes while the other independent variables are held fixed.
Linear Regression is one of the most widely used algorithms for prediction and forecasting.
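As a concrete illustration, the simplest case has a single independent variable x and fits a line y = slope * x + intercept by ordinary least squares. The Python sketch below is a simplification of the multi-variable setting described above; the function name `fit_line` and the sample data are assumptions made for this example.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept for one variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (the common 1/n factor cancels).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data that lies exactly on the line y = 2x + 1.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]

slope, intercept = fit_line(xs, ys)
print("slope:", slope, "intercept:", intercept)
print("prediction for x = 6:", slope * 6 + intercept)
```

The slope is the change in the dependent variable for a one-unit change in x, which is the quantity regression analysis is meant to expose.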
Slide 29 K-Means Clustering
Slide 30 K-Means Clustering
Clustering is the process by which objects are classified into a number of groups so that objects in different groups are as dissimilar as possible, while objects within each group are as similar as possible.
• The objects in group 1 should be as similar as possible.
• There should be a large difference between objects in different groups.
• The attributes of the objects determine which objects should be grouped together.
[Diagram: Total population divided into Group 1, Group 2, Group 3, Group 4]
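The grouping described above can be sketched with the standard K-Means loop: assign every point to its nearest centroid, move each centroid to the mean of its points, and repeat until nothing changes. This Python sketch is not the demo from the session; the function name `kmeans`, the sample points, and the choice of passing the starting centroids explicitly (instead of picking them at random) are assumptions made to keep the example short and repeatable.

```python
import math


def kmeans(points, centroids, iterations=10):
    """Basic K-Means on tuples of numbers; returns (centroids, clusters)."""
    clusters = []
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: math.dist(point, centroids[i]),
            )
            clusters[nearest].append(point)

        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for centroid, members in zip(centroids, clusters):
            if members:
                mean = tuple(sum(values) / len(members) for values in zip(*members))
                new_centroids.append(mean)
            else:
                new_centroids.append(centroid)  # keep a centroid that got no points

        if new_centroids == centroids:
            break  # converged: assignments will not change any more
        centroids = new_centroids
    return centroids, clusters


# Hypothetical 2-D points forming two visibly separate groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(points, [(1, 1), (8, 8)])
for centroid, members in zip(centroids, clusters):
    print(centroid, members)
```

The number of groups K is fixed in advance by how many starting centroids are supplied; here K = 2.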
Slide 31 Hands-On Demo: K-Means Clustering
Slide 32 Course Url Thank You … Questions/Queries/Feedback Recording and presentation will be made available to you within 24 hours
