©2018 dataiku, Inc. Applied Data Science Online Course 3rd Class: Getting dirty; data preparation and feature creation

Curriculum: Go from Small to Big Data in 4 weeks
● September 20th at 12PM ET: Learning the basics, concepts, & your first ML model
● September 27th at 12PM ET: The data science workflow, building a predictive model flow
● October 4th at 12PM ET: Getting dirty; data preparation and feature creation
● October 11th at 12PM ET: Understanding your model, and communicating about it

The Data Science Workflow
Most advanced version of the workflow:
• Determine business objectives
• Assess situation
• Determine data science goals
• Produce project plan
• Collect data
• Describe data
• Explore data
• Assess quality
• Select
• Clean
• Build
• Report
• Reformat
• Select modeling techniques
• Generate test design
• Build model
• Assess model
• Evaluate results
• Review process
• Determine next steps
• Check common risks
• Plan deployment
• Plan monitoring & performance
• Produce final report
• Review project

Data preparation: 70% of the work
Select, Clean, Build, Report, Reformat

Different types of data

Different types of data: definitions
Structured: data stored with clearly defined data types whose pattern makes them easily searchable and linkable, most often in a tabular format.
Examples:
• Database with columns for name, phone number, etc.
Unstructured: data that is not structured via predefined data models or schema.
Examples:
• Text files
• Websites and social media

Different types of storage: definitions
Relational databases: data is stored in different tables that are connected with unique identifiers.
Examples:
• SQL (Structured Query Language) databases
Unstructured data storage: systems that can store and process both structured and unstructured data.
Examples:
• Hadoop
• NoSQL

Different types of structured data
Data can be one of several categories.
Examples:
• Gender
• Nationality
• Hair color
Data is a number.
Examples:
• Age
• Weight
• Salary
Data is free-form text.
Examples:
• Tweets
• Documents
• Business name
+ semi-structured data: JSON

Different types of structured data: examples
• Eye Color (e.g. Blue)
• Height (e.g. 170 cm)
• Country of Birth (e.g. France)
• Postal Code (e.g. 75001)
• Date (e.g. Wednesday, 15 Jan 1976)
• Address (e.g. 10 Rue Saint Martin, Paris)
• Curriculum Vitae

Missing values

What to do with missing values?

Drop values
Delete all data from any participant with missing values.
A few warnings:
• Be sure your sample is large enough; you can then likely drop data without substantial loss of statistical power.
• Be sure the data is missing at random. If there is a pattern in the missing data that affects your primary dependent variables, dropping rows biases the results. For example, lower-income participants may be less likely to fill in the income column.
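
The two warnings above can be checked before dropping anything. The sketch below is only an illustration in plain Python: the records, the column names (`region`, `income`) and the helper functions are hypothetical and do not come from the course material. It drops incomplete rows, then compares the share of missing values across groups; a large gap between groups suggests the data is not missing at random.

```python
# Hypothetical survey records; None marks a missing answer.
records = [
    {"id": 1, "age": 34, "region": "north", "income": 72000},
    {"id": 2, "age": 51, "region": "south", "income": None},
    {"id": 3, "age": 29, "region": "south", "income": 18000},
    {"id": 4, "age": 45, "region": "north", "income": 65000},
    {"id": 5, "age": 38, "region": "south", "income": None},
]


def drop_incomplete(rows):
    """Keep only the rows that have no missing value in any column."""
    return [r for r in rows if all(v is not None for v in r.values())]


def missing_rate_by(rows, group_key, column):
    """Share of missing values in `column` for each value of `group_key`."""
    totals, missing = {}, {}
    for r in rows:
        group = r[group_key]
        totals[group] = totals.get(group, 0) + 1
        missing[group] = missing.get(group, 0) + int(r[column] is None)
    return {group: missing[group] / totals[group] for group in totals}


complete = drop_incomplete(records)
rates = missing_rate_by(records, "region", "income")
print(len(records), "rows before,", len(complete), "rows after dropping")
print(rates)
```

Here every missing income comes from one region, so dropping rows would remove that group disproportionately.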

Imputation
Replacing missing values with substitute values.
Method #1: Common value
For a number:
• Average
• Median
• Constant value
For a category:
• Treat like the category « Empty »
• Most frequent value
• A constant value
Method #2: Educated guess
Infer a missing value:
• If age is lower than 20, income is likely to be 0
• If living in a house in a rich city, income is likely to be higher than average
• Number of children is likely not 0 if age is high and situation is not married
Method #3: Sub-model
• Create a specific machine learning model to predict the missing values (regression, classification)
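
Methods #1 and #2 can be sketched in a few lines of plain Python. This is a minimal illustration, not the course's implementation: the function names, the strategy labels and the fallback value are assumptions made for the example.

```python
from statistics import mean, median, mode


def impute_number(values, strategy="median", constant=0):
    """Method #1 for numbers: fill None with the average, median or a constant."""
    known = [v for v in values if v is not None]
    if strategy == "mean":
        fill = mean(known)
    elif strategy == "median":
        fill = median(known)
    elif strategy == "constant":
        fill = constant
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in values]


def impute_category(values, strategy="empty", constant="Other"):
    """Method #1 for categories: 'Empty' category, most frequent value or a constant."""
    known = [v for v in values if v is not None]
    if strategy == "empty":
        fill = "Empty"
    elif strategy == "most_frequent":
        fill = mode(known)
    elif strategy == "constant":
        fill = constant
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in values]


def guess_income(row, fallback):
    """Method #2: an educated guess, here the 'age lower than 20' rule."""
    if row["income"] is not None:
        return row["income"]
    if row["age"] < 20:
        return 0
    return fallback  # hypothetical default when no rule applies


print(impute_number([10, None, 30, 50], strategy="mean"))
print(impute_category(["a", None, "b", "a"], strategy="most_frequent"))
print(guess_income({"age": 17, "income": None}, fallback=40000))
```

Method #3 follows the same pattern, with the fill value coming from a model trained on the rows where the column is present.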

Example
How to handle the missing data in this doc?
Not OK:
• Drop rows
• Average
Better:
• Educated guess from SPCategory
• Sub-prediction model

Grouping & Joining Data

Join operation
Similar to a VLOOKUP / INDEX MATCH
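
For readers who know VLOOKUP, a left join can be pictured as looking up each row's key in a second table and copying the matching columns across. The sketch below uses plain Python; the `customers` and `orders` tables and the `left_join` helper are hypothetical examples, not data from the course.

```python
# Lookup table keyed by a unique identifier (the VLOOKUP range).
customers = {
    "C1": {"name": "Ana", "city": "Paris"},
    "C2": {"name": "Ben", "city": "Lyon"},
}

# Table to enrich; "C3" has no match in the lookup table.
orders = [
    {"order_id": 1, "customer_id": "C1", "amount": 20.0},
    {"order_id": 2, "customer_id": "C2", "amount": 35.5},
    {"order_id": 3, "customer_id": "C3", "amount": 12.0},
]


def left_join(left_rows, lookup, key):
    """Keep every left row; add the lookup columns, or None when there is no match."""
    right_columns = sorted({col for rec in lookup.values() for col in rec})
    joined = []
    for row in left_rows:
        match = lookup.get(row[key], {})
        joined.append({**row, **{col: match.get(col) for col in right_columns}})
    return joined


joined = left_join(orders, customers, key="customer_id")
for row in joined:
    print(row)
```

Unmatched keys produce missing values, which is one common way missing data enters a dataset.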

Group by operations: definition

Aggregate data from an entity
How can you aggregate this dataset?
Group by options:
For a number:
• Average
• Sum
• Minimum and maximum
• Standard deviation
For a category and a number:
• Count of values
• Count of distinct values
• First and last value
• Most frequent
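
The group by options listed above can be sketched with the Python standard library. The `purchases` table, its column names and the two helper functions are hypothetical; the point is only to show one row per entity (here, per customer) with each aggregate computed from that entity's rows.

```python
from collections import defaultdict
from statistics import mean, mode, stdev

# Hypothetical transaction table: several rows per customer.
purchases = [
    {"customer": "C1", "product": "Soap", "amount": 4.0},
    {"customer": "C1", "product": "Candy", "amount": 2.0},
    {"customer": "C1", "product": "Soap", "amount": 6.0},
    {"customer": "C2", "product": "Batteries", "amount": 10.0},
]


def group_by(rows, key):
    """Collect the rows that share the same value of `key`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    return groups


def aggregate(rows):
    """Compute the numeric and categorical aggregates for one group."""
    amounts = [r["amount"] for r in rows]
    products = [r["product"] for r in rows]
    return {
        "amount_avg": mean(amounts),
        "amount_sum": sum(amounts),
        "amount_min": min(amounts),
        "amount_max": max(amounts),
        # Sample standard deviation needs at least two values.
        "amount_std": stdev(amounts) if len(amounts) > 1 else 0.0,
        "count": len(rows),
        "product_distinct": len(set(products)),
        "product_first": products[0],
        "product_last": products[-1],
        "product_most_frequent": mode(products),
    }


summary = {key: aggregate(rows) for key, rows in group_by(purchases, "customer").items()}
for customer, stats in summary.items():
    print(customer, stats)
```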

Result

Dummification & Rescaling

Dummification & Rescaling: the issue
The specificity of your data can create bias in your model.
● Some data is in a textual or numerical format but should be understood as a category by your model.
> Create a dummy variable that corresponds to these categories: 0-1 for a linear model
● This also allows you to use non-numerical data in a linear model.
● Your numerical data can be distributed in a way that will be misunderstood by your model.
> Change the values to rearrange them on a scale
$Y = a_1 x_1 + a_2 x_2 + a_3 x_3 + a_4 x_4 + a_5 x_5$
[Figure: Father's Height vs. Your Height]

Dummification for linear models
For categorical variables, your formula will then look like this:
$Y = a_{\text{batteries}} x_1 + a_{\text{Soap}} x_2 + a_{\text{Candy}} x_3 + \dots$
where $x_1, x_2, x_3, \dots$ are each either 0 or 1.
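
The 0/1 variables in the formula above can be built by creating one column per category. The following is a minimal sketch in plain Python; the `dummify` helper and the `is_` column prefix are naming assumptions made for this example.

```python
def dummify(values):
    """Turn a list of category labels into one 0/1 column per category."""
    categories = sorted(set(values))
    rows = [{f"is_{c}": int(v == c) for c in categories} for v in values]
    return categories, rows


products = ["Soap", "Candy", "Batteries", "Soap"]
categories, dummies = dummify(products)
print(categories)
for label, row in zip(products, dummies):
    print(label, row)
```

Each row has exactly one column set to 1, so the linear model learns one coefficient per category.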

Rescaling for numerical variables
Without rescaling, your formula will look like this:
$Y_{\text{rosalindrosenbaum}} = a_{\text{income}} \times 28114 + a_{\text{Child}} \times 1 + \dots$
$Y_{\text{BrettStamm}} = a_{\text{income}} \times 47901 + a_{\text{Child}} \times 3 + \dots$

Feature rescaling for linear models
Feature scaling is a method used to standardize the range of independent variables.
Example of rescaling: min-max rescaling
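
Min-max rescaling maps each value to the range 0 to 1 with $x' = (x - x_{\min}) / (x_{\max} - x_{\min})$. A minimal sketch in plain Python follows; the helper name is an assumption, and the third income value (35000) is hypothetical, added to the two incomes from the previous slide so that the result has an intermediate point.

```python
def min_max_rescale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        # A constant column carries no information; map it to 0 to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lowest) / (highest - lowest) for v in values]


incomes = [28114, 47901, 35000]  # 35000 is a hypothetical extra value
children = [1, 3, 2]

scaled_incomes = min_max_rescale(incomes)
scaled_children = min_max_rescale(children)
print(scaled_incomes)
print(scaled_children)
```

After rescaling, income and number of children are on the same 0 to 1 scale, so their coefficients in a linear model become comparable.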

Questions?

Hands-on

About Dataiku - Your Path to Enterprise AI

Applied Data Science Part 3: Getting dirty; data preparation and feature creation
