Special Topics of Machine Learning in Cyber Security
Lecture 05: Machine Learning Basics
Arslan Ali Khan (arslan.ali@riphah.edu.pk)
Department of Cyber-Security and Data Science, Riphah Institute of Systems Engineering (RISE), Riphah International University, Islamabad, Pakistan.
Feature Engineering
• Dealing with Missing Data
Missing values are data points that are absent for a specific
variable in a dataset. They can be represented in various ways,
such as blank cells, null values, or special symbols like “NA” or
“unknown.” These missing data points pose a significant
challenge in data analysis and can lead to inaccurate or biased
results.
Specifically, missing values can:
• Reduce the sample size: This can decrease the accuracy and
reliability of your analysis.
• Introduce bias: If the missing data is not handled properly, it
can bias the results of your analysis.
• Make it difficult to perform certain analyses: Some statistical
techniques require complete data for all variables, making
them inapplicable when missing values are present.
Using estimated values (imputation):
• Replacing missing values with estimated values.
• Preserves sample size: Doesn’t reduce data points.
• Can introduce bias: Estimated values might not be accurate.
Use of Mean, Median, and Mode:
• Replace missing values with the mean, median, or mode of the relevant variable.
• Simple and efficient: Easy to implement.
• Can be inaccurate: Doesn’t consider the relationships between variables.
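The mean/median/mode strategies above can be sketched in a few lines. This is a minimal pure-Python illustration; the `data` list and the use of `None` as the missing-value marker are assumptions for the example.

```python
# Mean/median imputation sketch (pure Python; `data` is a hypothetical column,
# with None marking a missing value).
from statistics import mean, median

data = [4.0, None, 6.0, None, 5.0]
observed = [v for v in data if v is not None]   # drop missing values first

mean_filled = [v if v is not None else mean(observed) for v in data]
median_filled = [v if v is not None else median(observed) for v in data]
# statistics.mode() would be used the same way for a categorical column.
```

Note that both fills use a single per-column statistic, which is why this method ignores relationships between variables.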
• Handling Categorical Data
Categorical data is data that can be divided into groups or
categories, such as gender, hair color, or product type.
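Most models need categorical values converted to numbers first. A common approach is one-hot encoding; here is a minimal pure-Python sketch, where the `colors` column is a made-up example.

```python
# One-hot encoding sketch: each category becomes its own binary column
# (`colors` is a hypothetical categorical feature).
colors = ["red", "green", "red", "blue"]
categories = sorted(set(colors))   # fixed column order: blue, green, red
encoded = [[1 if value == cat else 0 for cat in categories] for value in colors]
# Each row is now a binary vector with exactly one 1.
```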
• Normalizing Data
Normalization in machine learning is the process of translating
data into the range [0, 1] (or any other range).
• Feature Construction or Generation
Feature Generation (also known as feature construction, feature
extraction or feature engineering) is the process of transforming
features into new features that better relate to the target. This
can involve mapping a feature into a new feature using a
function like log, or creating a new feature from one or multiple
features using multiplication or addition.
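The two constructions mentioned above, mapping through a function like log and combining multiple features, can be sketched as follows; the column names (`income`, `debt`) are hypothetical.

```python
# Feature construction sketch: derive new features from existing ones
# (rows and column names are made up for illustration).
import math

rows = [{"income": 1000.0, "debt": 250.0},
        {"income": 20000.0, "debt": 500.0}]

for r in rows:
    r["log_income"] = math.log(r["income"])     # log transform of one feature
    r["debt_ratio"] = r["debt"] / r["income"]   # new feature from two features
```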
Feature Scaling
A technique often applied as part of data preparation for machine learning.
Goal: change the values of numeric columns in the dataset to a common scale, without distorting differences in the ranges of values.
Normalization
Min-max normalization: guarantees all features will have the exact same scale, but does not handle outliers well.
Z-score standardization: handles outliers, but does not produce normalized data with the exact same scale.
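Both techniques are one-line transformations. This is a minimal sketch on a made-up `values` column, using the population standard deviation for the z-score.

```python
# Min-max normalization and z-score standardization sketches (pure Python;
# `values` is a hypothetical numeric column).
from statistics import mean, pstdev

values = [2.0, 4.0, 6.0, 8.0]

lo, hi = min(values), max(values)
min_max = [(v - lo) / (hi - lo) for v in values]   # rescaled into [0, 1]

mu, sigma = mean(values), pstdev(values)
z_scores = [(v - mu) / sigma for v in values]      # mean 0, std 1
```

A single extreme value stretches the min-max range and squashes everything else toward 0, which is why min-max handles outliers poorly.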
Training, Testing and Validation Sets
K-Fold Cross Validation
K-fold cross-validation is a technique for evaluating predictive models. The dataset is divided into k subsets or folds. The model is trained and evaluated k times, using a different fold as the validation set each time. Performance metrics from each fold are averaged to estimate the model's generalization performance.
Under-fitting and Over-fitting
Overfitting:
• Occurs when the model fits the training data too well and does not generalize, so it performs badly on the test data.
• It is the result of an excessively complicated model.
Underfitting:
• Occurs when the model does not fit the data well enough.
• It is the result of an excessively simple model.
• Both overfitting and underfitting lead to poor predictions on new datasets.
• A learning model that overfits or underfits does not generalize well.
Supervised vs. Unsupervised Learning
• Supervised learning (classification)
Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations.
New data is classified based on the training set.
• Unsupervised learning (clustering)
The class labels of the training data are unknown.
Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data.
Machine Learning
• Supervised: We are given input samples (X) and output samples (y) of a function y = f(X). We would like to "learn" f, and evaluate it on new data. Types:
Classification: y is discrete (class labels).
Regression: y is continuous, e.g. linear regression.
• Unsupervised: Given only samples X of the data, we compute a function f such that y = f(X) is "simpler".
Clustering: y is discrete.
y is continuous: matrix factorization, Kalman filtering, unsupervised neural networks.
Techniques
• Supervised Learning:
Linear Regression
Logistic Regression
Decision Tree
Naïve Bayes
Random Forests
• Unsupervised Learning:
Clustering
Factor analysis
Topic Models
Regression
Regression Task
Linear Regression vs. Logistic Regression
Linear Regression
y = mx + c
Linear Regression Example
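Fitting the line y = mx + c to data points can be done with the closed-form least-squares formulas for the slope and intercept. This is a minimal pure-Python sketch; the sample points are made up so that the line y = 2x + 1 fits them exactly.

```python
# Least-squares fit of y = m*x + c (pure Python; the data points are
# a hypothetical example lying exactly on y = 2x + 1).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Slope: covariance of x and y divided by variance of x.
m = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
    sum((x - mean_x) ** 2 for x in xs)
c = mean_y - m * mean_x   # intercept: the line passes through the mean point

# For this data: m == 2.0 and c == 1.0.
```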