Data Scientist April 2018
Agenda
• Data science and its applications
• Stages of a data science project and project roles
• Classification, decision trees, random forest
• Demo using R
Data Science Introduction: Concepts
• Data science is managing the process that can transform hypotheses and data into actionable predictions.
• Process: acquire data → manage data → choose modelling method → write code → verify result
Data Science Introduction: Applications
• Amazon's product recommendation systems
• LinkedIn's contact recommendation system
• Retail business: buying patterns, customer segments
• Twitter's trending topics
• Google's advertisement valuation systems
• Walmart's consumer demand projection systems
Data Science Domains
• Math and theory: statistics, linear algebra, optimization, time series, etc.
• Applied algorithms: machine learning, data structures, parallel algorithms, etc.
• Technologies: storage and computing platforms, statistical tools, etc.
• Domain expertise: finance, banking, health industry, agriculture
• Art: visualization, infographics
Data Science Introduction: Project Roles
• Project sponsor: represents the business interests
• Client: represents end users' interests
• Data scientist: sets and executes analytic strategy
• Data architect: manages data and data storage
• Operations: manages infrastructure
Data Science Introduction: Processes in a Data Science Project
• Define the goal
• Collect and manage data
• Build the model
• Evaluate the model
• Present results
• Deploy the model
Data Science – Modelling Methods
• Classification and Regression Trees (CART)
• k-Nearest Neighbors (kNN)
• Random Forest (RF)
• Support Vector Machines (SVM) with a linear kernel
• Linear Discriminant Analysis (LDA)
• Training, test and validation
Data Science – Modelling Method: Classification and Regression Trees (CART)
• Example: finding bad loan applications
• Input variables: loan amount, duration, age, salary, any other loan, address, income, education, background data, location, etc.
• 1000 applications exist, of which 300 have defaulted
• A decision tree is used to identify potential defaulters
Classification and Regression Trees
[Figure: decision tree with split nodes Duration > 50, Amount > 4 million, Amount > 1 mil, Amount < 5 mil and Duration > 120, and leaves Bad (0.68), Good (0.75), Good (0.56), Bad (0.25), Good (0.61) and Bad (0.88)]
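
The loan example starts from 1000 applications, 300 of which defaulted. CART typically chooses each split by how much it reduces Gini impurity. The sketch below, in Python rather than the R used in the demo, computes the impurity of that root node. The split counts are invented for illustration and are not taken from the slides.

```python
def gini(counts):
    """Gini impurity of a node, given the class counts inside it."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def split_gain(parent, left, right):
    """Reduction in size-weighted Gini impurity from a binary split."""
    n = sum(parent)
    weighted = (sum(left) / n) * gini(left) + (sum(right) / n) * gini(right)
    return gini(parent) - weighted


# Root node from the slide: (good, bad) = (700, 300).
root = (700, 300)
root_impurity = gini(root)

# HYPOTHETICAL split: these child counts are made up for the example.
left = (500, 100)
right = (200, 200)
gain = split_gain(root, left, right)

print(f"root impurity = {root_impurity:.2f}, gain of split = {gain:.4f}")
```

A tree builder would evaluate many candidate splits this way and keep the one with the largest gain, then repeat inside each child node.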
Data Science – Modelling Method: k-Nearest Neighbors (kNN)
• Example: male/female distribution
[Figure: scatter plot of hair length (cm, 0 to 60) against height (cm, 140 to 200)]
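
A minimal kNN sketch for the height and hair length example, written in Python rather than the R used in the demo. All data points below are hypothetical and were invented to mimic the scatter plot; the function name `knn_predict` is also an assumption.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Label `query` by majority vote among its k nearest training points."""
    by_distance = sorted(train, key=lambda row: math.dist(row[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# HYPOTHETICAL (height cm, hair length cm) observations, not real data.
train = [
    ((182, 5), "male"), ((175, 8), "male"),
    ((190, 3), "male"), ((178, 12), "male"),
    ((160, 40), "female"), ((165, 35), "female"),
    ((158, 50), "female"), ((170, 30), "female"),
]

prediction_a = knn_predict(train, (185, 6))
prediction_b = knn_predict(train, (162, 45))
print(prediction_a, prediction_b)
```

In practice the features are usually rescaled first, so that one axis does not dominate the distance.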
Data Science – Modelling Method: Random Forest (RF)
[Figure: three decision trees labelled Tree 1, Tree 2 and Tree 3]
Data Science – Modelling Method: Random Forest (RF)
[Figure: an input is passed to all trees; the predictions of Tree 1, Tree 2 and Tree 3 are combined into the random forest prediction]
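
The voting step of a random forest can be sketched as follows, in Python rather than the R used in the demo. The three "trees" are hypothetical one-rule stand-ins with invented thresholds; in a real forest each tree is trained on a bootstrap sample of the data with a random subset of features.

```python
from collections import Counter


# HYPOTHETICAL stand-ins for trained trees; the rules are invented.
def tree1(applicant):
    return "bad" if applicant["duration"] > 50 else "good"


def tree2(applicant):
    return "bad" if applicant["amount"] > 4_000_000 else "good"


def tree3(applicant):
    return "bad" if applicant["age"] < 25 else "good"


def forest_predict(trees, x):
    """Return the majority label across all trees, plus the individual votes."""
    votes = [tree(x) for tree in trees]
    return Counter(votes).most_common(1)[0][0], votes


applicant = {"duration": 60, "amount": 1_000_000, "age": 40}
prediction, votes = forest_predict([tree1, tree2, tree3], applicant)
print(votes, "->", prediction)
```

Here one tree votes "bad" and two vote "good", so the forest predicts "good".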
Data Science – Modelling Method: Random Forest (RF)
Applications where the random forest algorithm is widely used:
• Banking: identifying loyal customers and fraudulent customers
• Medicine: disease identification from patients' medical records
• Stock market: stock behavior, loss, profit
• E-commerce: finding similar customers, segmentation
Data Science – Modelling Method: Support Vector Machines (SVM)
• Example: male/female distribution
[Figure: scatter plot of hair length (cm, 0 to 60) against height (cm, 140 to 200)]
Data Science – Model Evaluation Process
• Training, test and validation
[Figure: flow diagram with the boxes Data, Test/Train Split, Training Data, Test Data, Training Process, Model and Predictions]
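
The split-train-predict flow can be sketched as below, in Python rather than the R used in the demo. The data set is hypothetical, and the "model" is a deliberately trivial majority-class baseline so that the focus stays on the evaluation process; the function names are assumptions.

```python
import random
from collections import Counter


def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of the rows and cut it into (train, test)."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def train_majority_model(train_rows):
    """'Training' memorises the most common label and always predicts it."""
    top = Counter(label for _, label in train_rows).most_common(1)[0][0]
    return lambda features: top


# HYPOTHETICAL data: 100 rows of (feature, label), 70 "good" and 30 "bad".
rows = [(i, "good" if i % 10 < 7 else "bad") for i in range(100)]

train_rows, test_rows = train_test_split(rows)
predict = train_majority_model(train_rows)

# Accuracy is measured only on rows the model never saw during training.
test_accuracy = sum(predict(x) == y for x, y in test_rows) / len(test_rows)
print(f"train={len(train_rows)} test={len(test_rows)} accuracy={test_accuracy:.2f}")
```

Any real model can replace the baseline; the point is that predictions are scored on the held-out test rows.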
Data Science Demo Example
Demo Explanation
• Data
• 3 species: setosa, versicolor, virginica
Demo Explanation
• Load the caret package and load the data
• Split the data into two parts: 80% is kept as the dataset and 20% is held back for validation
• Feed the dataset to 4 algorithms (CART, kNN, SVM, RF)
• Select the best algorithm
• Feed the validation data to the best algorithm
• Check the output
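
The steps above can be sketched as follows. This is not the demo itself: it is written in Python rather than R/caret, uses hypothetical synthetic data with made-up class labels, and substitutes three simple stand-in models for the four algorithms. The slides do not say how the algorithms are compared; cross-validation on the 80% portion is assumed here.

```python
import math
import random
from collections import Counter


def make_knn(k):
    """Return a fit function for a k-nearest-neighbours classifier."""
    def fit(train):
        def predict(x):
            nearest = sorted(train, key=lambda row: math.dist(row[0], x))[:k]
            return Counter(label for _, label in nearest).most_common(1)[0][0]
        return predict
    return fit


def fit_majority(train):
    """Baseline: always predict the most common training label."""
    top = Counter(label for _, label in train).most_common(1)[0][0]
    return lambda x: top


def accuracy(predict, rows):
    return sum(predict(x) == y for x, y in rows) / len(rows)


def cross_val_accuracy(fit, rows, folds=5):
    """Mean accuracy over `folds` held-out slices of the rows."""
    scores = []
    for i in range(folds):
        held_out = rows[i::folds]
        training = [r for j, r in enumerate(rows) if j % folds != i]
        scores.append(accuracy(fit(training), held_out))
    return sum(scores) / folds


# HYPOTHETICAL data: three well-separated classes, 50 points each.
rng = random.Random(0)
centres = {"a": (0.0, 0.0), "b": (5.0, 5.0), "c": (10.0, 0.0)}
rows = [((cx + rng.gauss(0, 1), cy + rng.gauss(0, 1)), label)
        for label, (cx, cy) in centres.items() for _ in range(50)]
rng.shuffle(rows)

# 80% is used to compare algorithms, 20% is held back for validation.
cut = int(0.8 * len(rows))
dataset, validation = rows[:cut], rows[cut:]

candidates = {"majority": fit_majority, "knn_k1": make_knn(1), "knn_k5": make_knn(5)}
cv_scores = {name: cross_val_accuracy(fit, dataset) for name, fit in candidates.items()}
best_name = max(cv_scores, key=cv_scores.get)

# Refit the winner on the full 80% and check it on the untouched 20%.
final_model = candidates[best_name](dataset)
validation_accuracy = accuracy(final_model, validation)
print(best_name, round(validation_accuracy, 2))
```

Keeping the validation rows out of the comparison step means the final accuracy is not inflated by the model selection itself.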
Data Science Demo
• Installing the R platform
• Loading the dataset
• Summarizing the dataset
• Visualizing the dataset
• Evaluating some algorithms
• Making some predictions
Other Practical Uses of Data Science
• https://towardsdatascience.com/examples-of-data-science-with-r-789c6996435
• Customer analysis and predictive analysis
• Association rules (medical diagnosis, biomedical, census data, fraud detection, CRM)
• HR analytics: finding valuable employees and retaining them
Data Science Resources
• Practical Data Science with R
• Demo commands
• R and RStudio installation files
• Resources are kept at the location below:
• gb-pb-dbm-v01Data_Science_Resources
Questions and Feedback?

Data Scientist Introduction: a brief overview of concepts
