Applied Data Science Part 3: Getting dirty; data preparation and feature creation

©2018 dataiku, Inc.
Curriculum
Go from Small to Big Data in 4 weeks
● September 20th at 12PM ET: Learning the Basics, concepts, & your first ML model
● September 27th at 12PM ET: The data science workflow, building a predictive model flow
● October 4th at 12PM ET: Getting dirty; data preparation and feature creation
● October 11th at 12PM ET: Understanding your model - and communicating about it

The Data Science Workflow
Most advanced version of the workflow
● Determine business objectives
● Assess situation
● Determine data science goals
● Produce project plan
● Collect data
● Describe data
● Explore data
● Assess quality
● Select
● Clean
● Build
● Report
● Reformat
● Select modeling techniques
● Generate test design
● Build model
● Assess model
● Evaluate results
● Review process
● Determine next steps
● Check common risks
● Plan deployment
● Plan monitoring & performance
● Produce final report
● Review project

Different types of Data: Definitions
● Structured: Data stored with clearly defined data types whose pattern makes them easily searchable and linkable, most often in tabular format.
Examples:
• Database with columns for name, phone number, etc.
● Unstructured: Data that is not structured via pre-defined data models or schemas.
Examples:
• Text files
• Websites and social media

Different types of Storage: Definitions
● Relational Databases: Data is stored in different tables that are connected with unique identifiers.
Examples:
• SQL (Structured Query Language) databases
● Unstructured data storage: Systems that can store and process both structured and unstructured data.
Examples:
• Hadoop
• NoSQL

Different types of structured data
● Data can be one of several categories
Examples:
• Gender
• Nationality
• Hair color
● Data is a number
Examples:
• Age
• Weight
• Salary
● Data is free-form text
Examples:
• Tweets
• Documents
• Business name
+ semi-structured data: JSON

Different types of structured data: Examples
• Eye Color (e.g. Blue)
• Height (e.g. 170 cm)
• Country of Birth (e.g. France)
• Postal Code (e.g. 75001)
• Date (e.g. Thursday, 15 Jan 1976)
• Address (e.g. 10 Rue Saint Martin, Paris)
• Curriculum Vitae
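As an illustration of how a few of these example values might be turned into model-ready fields, here is a minimal sketch in plain Python (standard library only). The record, the field names, and the helper functions are hypothetical and not taken from the slides. It treats the height as a number, keeps the postal code as a category label rather than a quantity, and splits the date into parts.

```python
from datetime import datetime

# Hypothetical record built from the example values on the slide.
record = {"height": "170 cm", "postal_code": "75001", "date": "15 Jan 1976"}

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def parse_height_cm(text):
    """Extract the numeric part of a value such as '170 cm'."""
    return float(text.split()[0])


def date_features(text):
    """Split a 'day month-abbreviation year' date into separate fields."""
    parsed = datetime.strptime(text, "%d %b %Y")
    return {
        "year": parsed.year,
        "month": parsed.month,
        "day": parsed.day,
        "weekday": WEEKDAYS[parsed.weekday()],
    }


features = {
    "height_cm": parse_height_cm(record["height"]),
    # A postal code looks numeric but is a label, so it stays a string.
    "postal_code": record["postal_code"],
    **date_features(record["date"]),
}
print(features)
```

Keeping the postal code as text prevents a model from reading 75002 as "one more than" 75001.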

Drop values
Delete all data from any participant with missing values.
A few warnings:
• Be sure your sample is large enough that you can drop data without substantial loss of statistical power.
• Be sure the data is missing at random. If there is a pattern in the missing data that affects your primary dependent variables, dropping rows will bias the results. For example, lower-income participants are less likely to fill in the income column.
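Both warnings can be checked before anything is deleted. The sketch below, in plain Python, uses a made-up participant table: it drops incomplete rows, reports how much of the sample survives, and compares the share of missing income across groups. The `education` column is an assumed proxy, since income itself cannot be observed where it is missing.

```python
# Hypothetical survey rows; None marks a missing answer.
participants = [
    {"id": 1, "age": 34, "education": "high", "income": 52000},
    {"id": 2, "age": 51, "education": "low", "income": None},
    {"id": 3, "age": 29, "education": "high", "income": 61000},
    {"id": 4, "age": 45, "education": "low", "income": None},
    {"id": 5, "age": 38, "education": "low", "income": 23000},
    {"id": 6, "age": None, "education": "high", "income": 58000},
]


def drop_incomplete(rows):
    """Keep only the rows that have no missing value at all."""
    return [row for row in rows if all(v is not None for v in row.values())]


def missing_rate_by_group(rows, group_key, column):
    """Share of missing values in `column` for each value of `group_key`."""
    totals, missing = {}, {}
    for row in rows:
        group = row[group_key]
        totals[group] = totals.get(group, 0) + 1
        missing[group] = missing.get(group, 0) + (row[column] is None)
    return {group: missing[group] / totals[group] for group in totals}


complete = drop_incomplete(participants)
kept_share = len(complete) / len(participants)
income_missing = missing_rate_by_group(participants, "education", "income")

print(f"kept {len(complete)} of {len(participants)} rows ({kept_share:.0%})")
print(income_missing)
```

A large gap between groups, as in this toy data, suggests the values are not missing at random and that dropping rows would skew the sample.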

Imputation
Replacing missing values with substitute values.
● Method #1: Common Value
For Number:
• Average
• Median
• Constant value
For Category:
• Treat as a separate category « Empty »
• Most frequent value
• A constant value
● Method #2: Educated Guess
Infer a missing value:
• If age is lower than 20, income is likely to be 0
• If living in a house in a rich city, income is likely to be higher than average
• Number of children is likely not to be 0 if age is high and marital status is not married
● Method #3: Sub-Model
• Build a dedicated machine learning model (regression or classification) to predict the missing value
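A minimal sketch of Method #1 and one Method #2 rule, in plain Python with the standard `statistics` module. The sample values, the `-1` constant, and the fallback to the average income in `guess_income` are illustrative assumptions; only the "age lower than 20" rule comes from the slide.

```python
from statistics import mean, median, mode

# Hypothetical columns; None marks a missing value.
ages = [25, 31, None, 47, 31, None]
cities = ["Paris", None, "Lyon", "Paris", None, "Paris"]


def observed(values):
    """Values that are actually present."""
    return [v for v in values if v is not None]


def impute(values, fill):
    """Replace every missing value with `fill`."""
    return [fill if v is None else v for v in values]


# Method #1 for numbers: average, median, or a constant.
ages_mean = impute(ages, mean(observed(ages)))
ages_median = impute(ages, median(observed(ages)))
ages_constant = impute(ages, -1)  # -1 is an arbitrary sentinel value

# Method #1 for categories: an explicit "Empty" category or the most frequent value.
cities_empty = impute(cities, "Empty")
cities_frequent = impute(cities, mode(observed(cities)))


# Method #2: a hand-written rule based on another column.
def guess_income(person, average_income):
    if person["income"] is not None:
        return person["income"]
    if person["age"] < 20:
        return 0  # rule from the slide: under 20, income is likely 0
    return average_income  # assumed fallback when no rule applies


people = [
    {"age": 17, "income": None},
    {"age": 40, "income": 30000},
    {"age": 35, "income": None},
    {"age": 52, "income": 50000},
]
average_income = mean(p["income"] for p in people if p["income"] is not None)
incomes = [guess_income(p, average_income) for p in people]

print(ages_mean, ages_median, cities_frequent, incomes)
```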

Aggregate data from an entity
How can you aggregate this dataset?
Group By Options:
For Number:
• Average
• Sum
• Minimum and Maximum
• Standard Deviation
For Category and Number:
• Count of Values
• Count of Distinct Values
• First and Last Value
• Most Frequent
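The group-by options listed here can be sketched in plain Python as follows. The `orders` table and its column names are hypothetical, and `stdev` is the sample standard deviation, which is one of several possible conventions.

```python
from collections import Counter, defaultdict
from statistics import mean, stdev

# Hypothetical order lines: several rows per customer (the entity).
orders = [
    {"customer": "A", "amount": 10.0, "channel": "web"},
    {"customer": "A", "amount": 30.0, "channel": "store"},
    {"customer": "A", "amount": 20.0, "channel": "web"},
    {"customer": "B", "amount": 5.0, "channel": "store"},
    {"customer": "B", "amount": 15.0, "channel": "store"},
]


def group_by(rows, key):
    """Collect the rows that share the same value for `key`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    return groups


def aggregate(rows):
    """Compute one summary row for a group."""
    amounts = [r["amount"] for r in rows]
    channels = [r["channel"] for r in rows]
    return {
        # Options for numbers
        "amount_avg": mean(amounts),
        "amount_sum": sum(amounts),
        "amount_min": min(amounts),
        "amount_max": max(amounts),
        "amount_std": stdev(amounts) if len(amounts) > 1 else 0.0,
        # Options for categories and numbers
        "count": len(rows),
        "channel_distinct": len(set(channels)),
        "channel_first": channels[0],
        "channel_last": channels[-1],
        "channel_most_frequent": Counter(channels).most_common(1)[0][0],
    }


per_customer = {
    customer: aggregate(rows)
    for customer, rows in group_by(orders, "customer").items()
}
for customer, summary in per_customer.items():
    print(customer, summary)
```

The result has one row per customer, which is the shape a model typically needs when the prediction target is defined at the entity level.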

Dummification & Rescaling: the issue
The specificity of your data can create bias in your model
● Some data is in a textual or numerical format but should be understood as a category by your model
● This also allows you to use non-numerical data in a linear model
> Create a dummy variable that corresponds to these categories: 0-1 for linear model
Y = a1 x1 + a2 x2 + a3 x3 + a4 x4 + a5 x5
● Your numerical data can be distributed in a way that will be misunderstood by your model
> Change the values to rearrange them on a scale
[Figure: plot with axes Father's Height and Your Height]
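Both fixes can be sketched in a few lines of plain Python. The sample values and function names are hypothetical, and min-max scaling and standardization are shown as two common rescaling choices, not necessarily the ones used in the talk.

```python
from statistics import mean, pstdev


def dummify(values):
    """Create one 0/1 column per distinct category."""
    categories = sorted(set(values))
    return {
        f"is_{category}": [1 if v == category else 0 for v in values]
        for category in categories
    }


def min_max_scale(values):
    """Rescale values linearly so that they fall between 0 and 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Rescale values to mean 0 and (population) standard deviation 1."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


# Hypothetical columns.
eye_color = ["blue", "brown", "green", "brown"]
height_cm = [150.0, 160.0, 170.0, 180.0]

dummies = dummify(eye_color)
scaled = min_max_scale(height_cm)
z_scores = standardize(height_cm)

print(dummies)
print(scaled)
print(z_scores)
```

Each dummy column can then enter the linear model as its own term x with its own coefficient a.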
