Top 5 algorithms used in Data Science

www.edureka.co/data-science Top 5 Algorithms Used in Data Science

What are we going to learn today?

At the end of the session you will be able to understand:
- What Data Science is
- What Data Scientists do
- Top 5 Data Science algorithms
  - Decision Tree
  - Random Forest
  - Association Rule Mining
  - Linear Regression
  - K-Means Clustering
- Demo on the K-Means Clustering algorithm

Data Science

What is Data Science?
Data science is the practice of extracting meaningful and actionable knowledge from data.

Who are Data Scientists?
Data scientists are people who have a multitude of skills and who love playing with data.

Data Science from 1000 feet
(Figure: Data Science shown together with the disciplines it draws on.)
- Visualization
- Data Engineering
- Statistics
- Advanced Computing
- Domain Expertise

Arsenal of a Data Scientist
- Data Architecture. Tool: Hadoop
- Machine Learning. Tools: Mahout, Weka, Spark MLlib
- Analytics. Tools: R, Python

Note that evaluating different machine learning algorithms is part of a data scientist's daily work, so it is important for a data scientist to have a good grip on a variety of machine learning algorithms.

Machine Learning
- Machine learning is a method of teaching computers to make and improve predictions based on data.
- Machine learning is a huge field, with hundreds of different algorithms for solving a myriad of different problems.
- Supervised learning: the categories of the data are already known.
- Unsupervised learning: the learning process attempts to find appropriate categories for the data.

Decision Tree

Decision Tree Example: Training Data

Decision Tree, Root: Student
(Figures, Step 1 to Step 6: the tree is grown one level at a time. Student is the root, each Student branch splits on Income, and deeper levels split on CR and Age until every leaf is labelled Yes or No.)

Decision Tree, Root: Student
Classification rules:
1. student(no) ^ income(high) ^ age(<=30) => buys_computer(no)
2. student(no) ^ income(high) ^ age(31..40) => buys_computer(yes)
3. student(no) ^ income(medium) ^ CR(fair) ^ age(>40) => buys_computer(yes)
4. student(no) ^ income(medium) ^ CR(fair) ^ age(<=30) => buys_computer(no)
5. student(no) ^ income(medium) ^ CR(excellent) ^ age(>40) => buys_computer(no)
6. student(no) ^ income(medium) ^ CR(excellent) ^ age(31..40) => buys_computer(yes)
7. student(yes) ^ income(low) ^ CR(fair) => buys_computer(yes)
8. student(yes) ^ income(low) ^ CR(excellent) ^ age(31..40) => buys_computer(yes)
9. student(yes) ^ income(low) ^ CR(excellent) ^ age(>40) => buys_computer(no)
10. student(yes) ^ income(medium) => buys_computer(yes)
11. student(yes) ^ income(high) => buys_computer(yes)
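The classification rules above can be read as nested if/else tests, starting at the root (Student). The sketch below is one possible hand-coded version in Python. The function name `buys_computer`, the string encodings of the attribute values, and the choice to return `None` for attribute combinations that no rule covers are assumptions made for illustration, not part of the slides.

```python
from typing import Optional


def buys_computer(student: str, income: str, cr: str, age: str) -> Optional[str]:
    """Apply the 11 rules from the slide; return "yes", "no", or None if no rule matches."""
    if student == "yes":
        if income in ("medium", "high"):
            return "yes"  # rules 10 and 11
        if income == "low":
            if cr == "fair":
                return "yes"  # rule 7
            if cr == "excellent":
                # rules 8 and 9; other ages are not covered by any rule
                return {"31..40": "yes", ">40": "no"}.get(age)
        return None
    if student == "no":
        if income == "high":
            # rules 1 and 2
            return {"<=30": "no", "31..40": "yes"}.get(age)
        if income == "medium":
            if cr == "fair":
                # rules 3 and 4
                return {">40": "yes", "<=30": "no"}.get(age)
            if cr == "excellent":
                # rules 5 and 6
                return {">40": "no", "31..40": "yes"}.get(age)
    return None  # combination not covered by the listed rules


if __name__ == "__main__":
    print(buys_computer("no", "high", "fair", "<=30"))
    print(buys_computer("yes", "medium", "excellent", ">40"))
```

In practice a tree like this is learned from the training data rather than written by hand; the hand-coded form only shows how a prediction follows one path from the root to a leaf.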

Random Forest

Random Forest: Example
Suppose you are undecided about watching the movie "Edge of Tomorrow". You can do one of the following:
1. Ask your best friend whether you will like the movie.
2. Ask a group of your friends.

Random Forest: Example
In order to answer, your best friend first needs to figure out what movies you like, so you give her a bunch of movies and tell her whether you liked each one or not (i.e., you give her a labelled training set).
Example: Do you like movies starring Emily Blunt?
(Figure: "Ask Best Friend" decision tree. Questions at the nodes: "Is it based on a true incident?", "Does Emily Blunt star in it?", "Is she the main lead?". Each Yes/No branch ends in either "Yes, you will like the movie" or "No, you will not like the movie".)

Random Forest: Example
- But your best friend might not always generalize your preferences very well (i.e., she overfits).
- To get more accurate recommendations, you ask a group of your friends, e.g. Friend #1, Friend #2 and Friend #3, and they vote on whether you will like the movie.
- The majority of the votes decides the final outcome.

Random Forest: Example
(Figure: Friend 1, Friend 2 and Friend 3 each reason from what they know about your taste and vote either "Yes, you will like the movie" or "No, you will not like the movie".)
Observations shown in the figure:
- You didn't like "Far and Away"
- You liked "Oblivion"
- You like action movies
- You like Tom Cruise
- You like his pairing with Emily Blunt
- You did not like "Top Gun"
- You loved "Godzilla"
- You hate Tom Cruise

What is Random Forest?
Random Forest is an ensemble classifier made using many decision tree models.

What are ensemble models?
- Ensemble models combine the results from different models.
- The result from an ensemble model is usually better than the result from any one of the individual models.
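The voting idea from the movie example can be sketched in a few lines of Python. This only illustrates how an ensemble combines individual predictions by majority vote; it is not a full Random Forest, which would also learn each tree from a random sample of the training data. The three `friend_*` functions, the feature names and their decision rules are hypothetical stand-ins for individual trees.

```python
from collections import Counter
from typing import Callable, Dict, List

Movie = Dict[str, bool]
Friend = Callable[[Movie], str]


# Hypothetical "friends": each one is a tiny hand-written rule, not a learned tree.
def friend_1(movie: Movie) -> str:
    return "yes" if movie["action"] else "no"


def friend_2(movie: Movie) -> str:
    return "yes" if movie["stars_emily_blunt"] and movie["action"] else "no"


def friend_3(movie: Movie) -> str:
    return "no" if movie["stars_tom_cruise"] else "yes"


def majority_vote(friends: List[Friend], movie: Movie) -> str:
    """Collect one vote per friend and return the most common answer."""
    votes = Counter(friend(movie) for friend in friends)
    return votes.most_common(1)[0][0]


# Illustrative feature values for the movie in the example.
edge_of_tomorrow = {"action": True, "stars_emily_blunt": True, "stars_tom_cruise": True}
verdict = majority_vote([friend_1, friend_2, friend_3], edge_of_tomorrow)

if __name__ == "__main__":
    print(verdict)
```

Using an odd number of voters avoids ties when there are only two possible answers.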

Association Rule Mining

Association Rule Mining
- Association Rule Mining is a popular and well-researched method for discovering interesting relations between variables in large data sets.
- For example, a rule found in the sales data of a supermarket might indicate that a customer who buys onions and potatoes together is also likely to buy hamburger meat.
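A rule such as {onions, potatoes} => {hamburger meat} is usually judged by two standard measures that the slide does not name: support (the fraction of baskets containing all items in the rule) and confidence (of the baskets containing the left-hand side, the fraction that also contain the right-hand side). The sketch below computes both in Python; the five baskets are made-up toy data, and the helper names `support` and `confidence` are chosen here for illustration.

```python
from typing import FrozenSet, List

Basket = FrozenSet[str]


def support(baskets: List[Basket], itemset: Basket) -> float:
    """Fraction of baskets that contain every item in `itemset`."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)


def confidence(baskets: List[Basket], lhs: Basket, rhs: Basket) -> float:
    """Estimate of P(rhs | lhs): support(lhs and rhs) / support(lhs)."""
    lhs_support = support(baskets, lhs)
    if lhs_support == 0:
        return 0.0  # the left-hand side never occurs, so the rule has no evidence
    return support(baskets, lhs | rhs) / lhs_support


# Toy transactions, invented for this example.
baskets = [
    frozenset({"onions", "potatoes", "hamburger meat"}),
    frozenset({"onions", "potatoes", "hamburger meat", "milk"}),
    frozenset({"onions", "potatoes"}),
    frozenset({"milk", "bread"}),
    frozenset({"potatoes", "bread"}),
]

lhs = frozenset({"onions", "potatoes"})
rhs = frozenset({"hamburger meat"})
rule_support = support(baskets, lhs | rhs)       # 2 of 5 baskets
rule_confidence = confidence(baskets, lhs, rhs)  # 2 of the 3 baskets with onions and potatoes

if __name__ == "__main__":
    print(rule_support, rule_confidence)
```

Algorithms such as Apriori search for all rules whose support and confidence exceed chosen thresholds; this sketch only scores a single, given rule.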

Linear Regression

Regression Analysis – Linear Regression
- Regression analysis helps us understand how the value of the dependent variable changes when any one of the independent variables changes while the other independent variables are held fixed.
- Linear Regression is one of the most popular algorithms used for prediction and forecasting.
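As a minimal sketch, the single-variable case can be fitted with ordinary least squares in plain Python. The slide describes the general case with several independent variables; restricting to one variable, the function name `fit_line` and the sample data are simplifying assumptions made here.

```python
from typing import List, Tuple


def fit_line(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Least-squares fit of y = slope * x + intercept for one independent variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y divided by the variance of x gives the slope.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up data lying exactly on y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(xs, ys)


def predict(x: float) -> float:
    """Forecast y for a new x using the fitted line."""
    return slope * x + intercept


if __name__ == "__main__":
    print(slope, intercept, predict(5.0))
```

The slope is the change in the dependent variable per unit change in the independent variable, which is the quantity the regression analysis above is meant to expose.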

K-Means Clustering

www.edureka.co/data-science K-Means Clustering The process by which objects are classified into a number of groups so that they are as much dissimilar as possible from one group to another group, but as much similar as possible within each group. The objects in group 1 should be as similar as possible. But there should be much difference between objects in different groups The attributes of the objects are allowed to determine which objects should be grouped together. Total population Group 1 Group 2 Group 3 Group 4

Hands-On Demo: K-Means Clustering

Thank You …
Questions/Queries/Feedback
Recording and presentation will be made available to you within 24 hours.
