1 DTS304TC: Machine Learning Lecture 5: Building Machine Learning System Dr Kang Dang D-5032, Taicang Campus Kang.Dang@xjtlu.edu.cn Tel: 88973341
2 Machine Learning Pipeline Machine learning involves a comprehensive workflow, not just training models.
3 Q & A In practical machine learning roles, what percentage of time do you think is typically spent on data preparation and feature engineering? (A) 20% (B) 40% (C) 60% (D) 80%
4 Data Preparation and Feature Engineering The features you use influence more than everything else the result. No algorithm alone, to my knowledge, can supplement the information gain given by correct feature engineering. — Luca Massaron
5 Q&A • How would you handle missing values in a table? Fill with zeros or use other methods? What issues might arise from filling with zeros?
6 Different types of missing values • 3 Main Types of Missing Data | Do THIS Before Handling Missing Values! – YouTube
7 Missing Value Imputation • MCAR (Missing Completely at Random): the missingness is random and does not depend on anything else. For example, survey answers are accidentally skipped, or a person simply chooses not to answer a question. Handle with mean/median/mode imputation or random-sample imputation. • MAR (Missing at Random): the missingness depends on other observed information. For example, people with higher incomes might be less likely to skip questions about financial spending than those with lower incomes. Handle with model-based methods such as MissForest to impute the missing entries. • MNAR (Missing Not at Random): the missingness is related to hidden, unobserved factors. For example, people who have cheated might avoid answering a survey question about cheating. Almost impossible to handle with standard imputation.
8 Mean/Median/Mode Imputation • Missing Data Nature: Confirmed as Missing Completely at Random (MCAR). • Extent of Missing Data: Limited to a maximum of 5% per variable. • Imputation Technique for Categorical Variables: Utilize mode imputation for the most frequent category. • Imputation Data Source: Calculate mean, median, or mode exclusively from the training dataset to prevent data leakage and maintain validation/test set integrity.
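A minimal sketch of mean/mode imputation with scikit-learn (not part of the slides); the column names and values are invented for illustration. The key point from the slide is that the imputer is fitted on the training data only, so validation/test sets never leak statistics:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical data: one numeric and one categorical column with missing entries.
train = pd.DataFrame({"age": [25, 30, np.nan, 40], "pet": ["dog", "cat", "dog", np.nan]})
test = pd.DataFrame({"age": [np.nan, 35], "pet": [np.nan, "cat"]})

num_imputer = SimpleImputer(strategy="mean")            # or "median"
cat_imputer = SimpleImputer(strategy="most_frequent")   # mode for categoricals

# Fit on the training data only, then apply the learned statistics everywhere.
train[["age"]] = num_imputer.fit_transform(train[["age"]])
test[["age"]] = num_imputer.transform(test[["age"]])
train[["pet"]] = cat_imputer.fit_transform(train[["pet"]])
test[["pet"]] = cat_imputer.transform(test[["pet"]])
```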
9 Regression Imputation – MissForest • Another great application of Random Forest! • Assumes data is Missing at Random (MAR). • Uses the entire dataset's information for imputation, giving more accurate imputed values than simple mean/median/mode imputation.
10 Regression Imputation – MissForest Iterative Approach: 1. First, fill missing values with a simple method (e.g., the mean). 2. Pick one column with missing data, use the available data to train a Random Forest model, and predict the missing values. 3. Move to the next column and repeat the process. 4. Continue this cycle until the missing values stop changing significantly, or after 5-6 rounds.
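scikit-learn does not ship MissForest itself, but its experimental IterativeImputer with a random-forest estimator follows the same iterative idea. A rough sketch under that assumption (numeric features only, parameters chosen arbitrarily):

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (activates IterativeImputer)
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

X = np.array([[1.0, 2.0], [2.0, np.nan], [np.nan, 6.0], [4.0, 8.0]])

# Start from mean imputation, then repeatedly re-predict each column with missing
# values from the other columns using a random forest, until the values stabilize.
imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=100, random_state=0),
    initial_strategy="mean",
    max_iter=5,
    random_state=0,
)
X_filled = imputer.fit_transform(X)
print(X_filled)
```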
11 MissForest vs Zero or Mean Imputation • If computational resources are not a limitation, prefer MissForest over simple imputations like zero or mean, which can distort the dataset's original distribution
12 Q & A Suppose I train a KNN classifier without scaling the features. For instance, one feature ranges from -1000 to 1000, while another ranges from -0.001 to 0.001. What potential issues could arise?
13 Feature Scaling Examples - KNN Without normalization, the nearest neighbors are biased toward the feature with the larger range (x2), leading to incorrect classification.
14 Feature Scaling Examples - KNN Feature scaling can lead to a completely different model in terms of the decision boundary.
15 Feature Scaling • Use when different numeric features have different scales (different ranges of values) • Features with much larger values may overpower the others • Goal: bring them all within the same range • Especially important for the following models: • KNN: distances are dominated by the features with larger values • SVMs: (kernelized) dot products are also based on distances • Linear models: feature scale affects regularization, and scaled features help optimization converge faster!
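A small illustration of the point above, assuming made-up data where one feature spans ±1000 and the other ±0.001 and only the tiny-range feature carries the label. Wrapping the scaler and KNN in a scikit-learn Pipeline keeps the scaling statistics inside the training fold:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 200
X = np.c_[rng.uniform(-1000, 1000, n), rng.uniform(-0.001, 0.001, n)]
y = (X[:, 1] > 0).astype(int)  # the label depends only on the tiny-range feature

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

raw = KNeighborsClassifier().fit(X_tr, y_tr)
scaled = make_pipeline(StandardScaler(), KNeighborsClassifier()).fit(X_tr, y_tr)

print("without scaling:", raw.score(X_te, y_te))     # typically near chance
print("with scaling:   ", scaled.score(X_te, y_te))  # typically much higher
```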
16 Feature Scaling Standard Scaler: standardizes features to zero mean and unit standard deviation (a "standard Gaussian" scale). Formula: x_scaled = (x – mean) / std_dev. Use when the data distribution is assumed to be roughly normal. Min-Max Scaler: scales features to a given range, often [0, 1], transforming all data points proportionally within that range. Formula: x_scaled = (x – x_min) / (x_max – x_min). Use for scaling within a bounded range.
17 But how to handle feature scaling with outliers? Question: What is the median? What is the 75th percentile? Robust Scaler: reduces the influence of outliers on scaling. • Centers using the median and scales using the IQR. • x_scaled = (x – median) / IQR • Use when outliers are present and need to be mitigated. • IQR Calculation: IQR = Q3 – Q1 (the difference between the 75th percentile (Q3) and the 25th percentile (Q1) in a dataset)
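A quick sketch contrasting the three scalers on a toy column that contains one large outlier (values invented for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])  # 100 is an outlier

for scaler in (StandardScaler(), MinMaxScaler(), RobustScaler()):
    print(type(scaler).__name__, scaler.fit_transform(x).ravel().round(2))

# StandardScaler and MinMaxScaler squeeze the normal points together because the
# outlier inflates the std / max, while RobustScaler (median- and IQR-based) keeps
# the bulk of the data spread out and pushes only the outlier far away.
```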
18 Q & A • Suppose you have a dataset with categorical features, such as 'dog' and 'cat'. Logistic regression, however, cannot directly handle categorical features. • To make these features compatible with the model, we might encode 'dog' as '0' and 'cat' as '1'. Is this a good approach? Why or why not?
19 Categorical Feature Encoding • Ordinal encoding • For example, “Jan, Feb, Mar, Apr” • Simply assigns an integer value to each category in the order they are encountered • Only really useful if there exists a natural order in the categories • The model will consider one category to be ‘higher’ or ‘closer’ to another
20 Categorical Feature Encoding – One-Hot Encoding • One-hot encoding (dummy encoding) • For example, “Cat, Dog, …” • Simply adds a new 0/1 feature for every category, set to 1 (hot) if the sample has that category • Can explode if a feature has many values, causing high-dimensionality issues • What if the test set contains a new category not seen in the training data? • Either ignore it (use all 0’s in the row) or handle it manually (e.g., imputation)
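A minimal one-hot encoding sketch with scikit-learn (categories invented); handle_unknown="ignore" implements the "all zeros for an unseen category" option mentioned above:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

train = np.array([["cat"], ["dog"], ["dog"]])
test = np.array([["dog"], ["rabbit"]])  # "rabbit" never appeared in training

# Note: sparse_output requires scikit-learn >= 1.2; older versions use sparse=False.
enc = OneHotEncoder(handle_unknown="ignore", sparse_output=False)
enc.fit(train)

print(enc.categories_)      # [array(['cat', 'dog'], dtype=...)]
print(enc.transform(test))  # unseen "rabbit" row becomes all zeros: [[0. 1.] [0. 0.]]
```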
21 Model Validation Scheme • Always evaluate models as if they are predicting future data • We do not have access to future data, so we pretend that some data is hidden • Simplest way: the holdout (simple train-val-test split), if the dataset is sufficiently large • Randomly split the data (and corresponding labels) into training, validation, and test sets (e.g. 60%-20%-20%) • Train (fit) a model on the training data, tweak it on the validation data, then score on the test data
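One way to realize the 60%-20%-20% holdout is to call train_test_split twice; a sketch on synthetic data (the dataset and split ratios are placeholders):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)

# First carve off 20% as the test set, then split the remaining 80% into
# train/validation; 0.25 of the remaining 80% equals 20% of the full data.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, stratify=y_trainval, random_state=0)

print(len(X_train), len(X_val), len(X_test))  # 600 200 200
```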
22 Q & A • What are issues with simple train-val-test split, when dataset is really small?
23 K-Fold Cross Validation • Each random split can yield very different models (and scores) • e.g. all easy (or hard) examples could end up in the test set • Split the data into k equal-sized parts, called folds • Create k splits, each time using a different fold as the test set • Compute k evaluation scores and aggregate afterwards (e.g. take the mean) • Examine the score variance to see how sensitive (unstable) the models are • Large k gives better estimates (more training data), but is expensive
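A short k-fold sketch on a toy dataset (model and k chosen arbitrarily); cross_val_score handles the splitting, and the mean and standard deviation of the fold scores give the estimate and its stability:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, KFold
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, random_state=0)

cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)

print(scores)                        # one score per fold
print(scores.mean(), scores.std())   # aggregate estimate and its variability
```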
24 K-Fold Cross Validation for Hyperparameter Tuning • After obtaining the best hyperparameters (models) using cross-validation, we can evaluate them on a separate test set • In our coursework we use a simple train-val-test split for simplicity, but you can also try this as an additional technique
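A sketch of the "tune with cross-validation, then score once on held-out test data" recipe, using GridSearchCV over an arbitrary small grid (model and parameters are illustrative, not prescribed by the slides):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# 5-fold cross-validation on the training portion only.
search = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10]}, cv=5)
search.fit(X_train, y_train)

print(search.best_params_)
print(search.score(X_test, y_test))  # final estimate on the untouched test set
```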
25 K-Fold Cross Validation for Model Ensembling • We can create a model ensemble using K-Fold Cross Validation • One of the most commonly used tricks on Kaggle
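One common version of this trick (a sketch, not the only way to do it): train one model per fold, keep all k models, and average their predicted probabilities on new data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, train_test_split
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=600, random_state=0)
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Train one model per fold (each sees a different 80% of the train/val data).
models = []
for train_idx, _ in KFold(n_splits=5, shuffle=True, random_state=0).split(X_trainval):
    m = GradientBoostingClassifier(random_state=0)
    m.fit(X_trainval[train_idx], y_trainval[train_idx])
    models.append(m)

# Ensemble prediction: average the 5 models' probabilities, then threshold.
proba = np.mean([m.predict_proba(X_test)[:, 1] for m in models], axis=0)
pred = (proba >= 0.5).astype(int)
print("ensemble accuracy:", (pred == y_test).mean())
```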
26 Model Evaluation • We have a positive and a negative class • 2 different kinds of errors: • False Positive: the model predicts positive while the true label is negative • False Negative: the model predicts negative while the true label is positive
27 Q&A • Suppose someone has cancer but was not diagnosed (missed detection). • Suppose someone was healthy but was diagnosed with cancer (false detection). • What are the consequences? Which situation is more serious?
28 Binary Model Evaluation – Confusion Matrix • We can represent all predictions (correct and incorrect) in a confusion matrix • n by n array (n is the number of classes) • Rows correspond to true classes, columns to predicted classes • Count how often samples belonging to a class C are classified as C or any other class. • For binary classification, we label these true negative (TN), true positive (TP), false negative (FN), false positive (FP)
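A tiny confusion-matrix example with made-up labels; scikit-learn follows the same convention as the slide (rows = true class, columns = predicted class), and for the binary case ravel() returns TN, FP, FN, TP:

```python
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]

cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()
print(cm)
print("TN", tn, "FP", fp, "FN", fn, "TP", tp)
```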
29 Binary Model Evaluation – Precision, Recall and F1 • Precision: use when the goal is to limit FPs • Clinical trials: you only want to test drugs that really work • Search engines: you want to avoid bad search results • Recall: use when the goal is to limit FNs • Cancer diagnosis: you don’t want to miss a serious disease • Search engines: you don’t want to omit important hits • F1-score: trades off precision and recall: F1 = 2 × (precision × recall) / (precision + recall)
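Continuing the same toy labels, a minimal sketch of the three metrics and how F1 combines precision and recall:

```python
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]

p = precision_score(y_true, y_pred)  # TP / (TP + FP): how many flagged positives are real
r = recall_score(y_true, y_pred)     # TP / (TP + FN): how many real positives were found
f1 = f1_score(y_true, y_pred)        # harmonic mean: 2 * p * r / (p + r)
print(p, r, f1)
```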
30 Multi-class Evaluations • Train models per class: one class viewed as positive, the other(s) as negative, then calculate metrics per class to get a per-class evaluation score • Micro-averaging: count total TP, FP, TN, FN (every sample equally important) • Macro-averaging: average of the scores obtained on each class • Preferable for imbalanced classes (if all classes are equally important) • Macro-averaged recall is also called balanced accuracy • Weighted averaging: weight each class’s score by its number of samples (support)
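A minimal multi-class sketch of the averaging options (labels invented); it also checks the statement above that macro-averaged recall equals balanced accuracy:

```python
from sklearn.metrics import f1_score, recall_score, balanced_accuracy_score

y_true = [0, 0, 0, 1, 1, 2, 2, 2, 2, 2]
y_pred = [0, 0, 1, 1, 2, 2, 2, 2, 0, 2]

print(f1_score(y_true, y_pred, average=None))        # per-class scores
print(f1_score(y_true, y_pred, average="micro"))     # pool all TP/FP/FN over samples
print(f1_score(y_true, y_pred, average="macro"))     # unweighted mean over classes
print(f1_score(y_true, y_pred, average="weighted"))  # weight classes by support
print(recall_score(y_true, y_pred, average="macro"),
      balanced_accuracy_score(y_true, y_pred))       # these two values are equal
```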
31 Summary • We discussed various feature engineering techniques, including feature scaling, missing value imputation, outlier handling, and categorical feature encoding • We discussed the model selection and evaluation procedure, specifically cross-validation and evaluation metrics.


Editor's Notes

  • #2 In today’s lecture, we'll explore the workflow of a typical machine learning system: preprocessing of raw data, feature scaling, encoding, discretization, label imbalance correction, feature selection, and dimensionality reduction, followed by learning and evaluation, which includes acknowledging algorithm biases, model selection, data splitting, and ultimately prediction. Together these steps form an integrated pipeline that prepares data for learning and optimizes model performance.
  • #4 What is a feature, and why do we need to engineer it? Basically, all machine learning algorithms use some input data to create outputs. This input data comprises features, which are usually in the form of structured columns. Algorithms require features with specific characteristics to work properly; this is where the need for feature engineering arises. Feature engineering efforts mainly have two goals: preparing an input dataset compatible with the machine learning algorithm's requirements, and improving the performance of machine learning models. According to a survey in Forbes, data scientists spend 80% of their time on data preparation.
  • #7 In data analysis, dealing with missing data is a common challenge, and the approach to imputation often depends on the nature of the missingness. With Missing Completely at Random (MCAR), the absence of data is independent of both observed and unobserved data, akin to survey responses being accidentally skipped during data entry or respondents choosing to leave a question blank without any systematic bias. Simple methods like imputing the mean, median, or mode, or using a random sample, can be effective for MCAR since the missingness does not introduce a systemic distortion. In contrast, Missing at Random (MAR) occurs when the propensity for missing data is related to other, observed data. For instance, a pattern where one gender omits answers to questions about parental leave more frequently than the other. In such cases, more sophisticated techniques like MissForest can be employed, which predict missing values using patterns found in other variables. However, when dealing with Missing Not at Random (MNAR), the problem becomes more complex, as the missing data is related to factors not captured in the dataset. For example, individuals who have been unfaithful may avoid questions about fidelity. Addressing MNAR effectively is particularly challenging because the very nature of the missing data is obscured by unobserved influences, resisting standard imputation methods.
  • #9 Assuming data is Missing at Random (MAR), one can utilize more sophisticated imputation methods that leverage the entire dataset, potentially resulting in greater predictive accuracy than would be achieved with simple mean, median, or mode imputation.
  • #10 An iterative approach begins by filling in missing values with a basic imputation, such as the mean of the observed values. The dataset is then split: a portion with complete data is used for training, while the subset with previously imputed values is treated as the target for prediction. A Random Forest algorithm, known for its robustness, is applied to predict the missing values, which are then updated with these predictions. This cycle is repeated, progressively refining the quality of the data with each iteration, until the imputations stabilize and no significant changes occur, or until a predetermined number of iterations is reached. This iterative process ensures that each round of imputation benefits from the enhanced patterns and relationships uncovered in the data from the previous round.
  • #13 Without normalization of feature scales, machine learning algorithms that rely on distance calculations, such as K-Nearest Neighbors (KNN), can be significantly biased. When one feature has a much smaller range than the others, the nearest neighbors tend to be determined mostly by the larger-range feature (x2 in the example). This misalignment occurs because the larger-scaled features overpower the smaller ones, causing the distance metric to be skewed in favor of the larger ranges. Consequently, this leads to incorrect classification or prediction results, as the model essentially overlooks the contributions of features with smaller ranges. Normalization ensures that each feature contributes equally to the distance computations, thereby preventing the axis with the larger range from disproportionately determining the nearest neighbors.
  • #15 When your data has numeric features that vary widely in scale, some features with higher numerical values might dominate over others during the modeling process, skewing the results. The aim is to level the playing field by bringing every feature into the same range of values. This step is particularly crucial for models like K-Nearest Neighbors (KNN), where the calculation of distances can be heavily biased towards the feature with the larger scale. Similarly, Support Vector Machines (SVMs) rely on dot products when kernelized, which again depend on the distances between data points. Even in linear models, the scale of features can influence how regularization is applied. Normalizing or standardizing these features ensures that each one contributes equally to the model's performance, allowing for a more accurate and fair analysis.
  • #16 In data preprocessing, the Standard Scaler standardizes features so that each has mean 0 and standard deviation 1, using the transformation x_scaled = (x – mean) / std_dev. This is particularly effective when the data is assumed to follow a normal distribution. On the other hand, the Min-Max Scaler adjusts features to fall within a specific range, typically between 0 and 1, according to the formula x_scaled = (x – x_min) / (x_max – x_min), ensuring all values are proportionately adjusted within this bounded interval, which is ideal when scaling needs to adhere to a predefined range.
  • #17 The Robust Scaler, however, is designed to be insensitive to outliers by using the median and the interquartile range (IQR) for centering and scaling, respectively, as expressed by x_scaled = (x – median) / IQR, where the IQR is the range between the 75th and 25th percentiles. This scaler is useful in datasets where outliers could skew the scaling process. The median is the middle number when you put all your numbers in order; the 25th percentile is the value below which one-quarter of the numbers fall.
  • #19 Ordinal encoding and one-hot encoding are two methods for converting categorical data into numerical form for machine learning models. Ordinal encoding assigns an integer to each category based on the order they are encountered, which is beneficial if the categories have a natural ranking—for instance, the months "Jan, Feb, Mar, Apr" could be encoded as 1, 2, 3, 4, respectively. However, this technique implies a hierarchy where some categories are considered 'higher' or 'closer' to others, which may not always be appropriate. On the other hand, one-hot encoding creates a new binary feature for each category, which is set to 1 (hot) if the sample belongs to that category, and 0 otherwise, as seen with categories like "Cat, Dog." While this method avoids implying any order, it can result in a high number of features, especially if the categorical variable has many unique values, leading to high dimensionality problems. Additionally, if new categories appear in the test set that weren't present in the training data, they must be either ignored or handled manually, such as through imputation, to ensure the model can process them.