Chapter Five
By: Teshome M.
 Why Machine Learning?
• Write a C++ program to recognize a face from the provided
 image.
What is learning?
 Basic terms of Machine Learning
• Dataset
 • A machine learning dataset is a collection of data used to
 train the model. The dataset supplies the examples from which
 the machine learning algorithm learns how to make predictions.
• Training data vs. Test data
 • Training data is the subset of the original data that is used to
 train the machine learning model (typically 70 to 80
 percent of the dataset).
 • Test data is the remaining subset, held back to evaluate the
 trained model on examples it has not seen.
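The split between training and test data can be sketched in a few lines of Python. This is only an illustration: the helper name `train_test_split`, the 80 percent fraction, and the toy dataset are assumptions, not part of the original slides.

```python
import random


def train_test_split(rows, train_fraction=0.8, seed=0):
    """Shuffle a copy of rows and split it into (train, test) lists."""
    shuffled = list(rows)
    # A fixed seed makes the shuffle repeatable between runs.
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical dataset: ten (input, output) pairs.
data = [(x, 2 * x + 3) for x in range(10)]
train, test = train_test_split(data, train_fraction=0.8)
print(len(train), len(test))  # 8 2
```

Shuffling before cutting avoids a split where the test set only contains the last rows of an ordered file.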
 What is machine learning
• Machine learning is a subset of AI, which enables the
 machine to automatically learn from data, improve
 performance from past experiences, and make predictions.
 Machine learning contains a set of algorithms that work on
 a huge amount of data. Data is fed to these algorithms to
 train them, and on the basis of training, they build the
 model & perform a specific task.
• These ML algorithms help to solve different business
 problems like Regression, Classification, Forecasting,
 Clustering, and Associations, etc.
 How does Machine Learning work
• A machine learning system builds prediction models,
 learns from previous data, and predicts the output for
 new data whenever it receives it. A larger amount of data
 generally helps to build a better model, which in turn
 improves the accuracy of the predicted output.
• Let's say we have a complex problem in which we
 need to make predictions. Instead of writing the logic
 by hand, we feed the data to generic algorithms, which
 build the logic based on the data and predict the
 output. Machine learning has changed how we approach
 such problems. The operation of a machine learning
 algorithm is depicted in the following
 block diagram:
 Features of Machine Learning:
• Machine learning uses data to detect various
 patterns in a given dataset.
• It can learn from past data and improve
 automatically.
• It is a data-driven technology.
• Machine learning is similar to data mining in that
 it also deals with huge amounts of data.
 Importance of Machine Learning:
• Rapid growth in the production of data
• Solving complex problems that are difficult for a
 human
• Decision making in various sectors, including finance
• Finding hidden patterns and extracting useful
 information from data.
 Types of machine learning
• Based on the methods and way of learning,
 machine learning is divided into mainly four types,
 which are:
• Supervised Machine Learning
• Unsupervised Machine Learning
• Semi-Supervised Machine Learning
• Reinforcement Learning
 1. Supervised Machine Learning
• As its name suggests, Supervised machine
 learning is based on supervision. It means in the
 supervised learning technique, we train the
 machines using the "labelled" dataset, and based on
 the training, the machine predicts the output.
 Here, labelled data means that the
 inputs are already mapped to the correct output. More
 precisely, we first train the machine
 with inputs and their corresponding outputs, and then
 we ask the machine to predict the output for the
 test dataset.
• Let's understand supervised learning with an example.
 Suppose we have an input dataset of cat and dog
 images. First, we train the
 machine to understand the images using features such as the shape and
 size of the tail, the shape of the eyes, colour, and
 height (dogs are typically taller, cats smaller). After
 training is complete, we input the picture of a cat
 and ask the machine to identify the object and predict
 the output.
• The machine is now trained, so it checks all
 the features of the object, such as height, shape,
 colour, eyes, ears, and tail, and finds that it is a cat. So,
 it puts it in the Cat category. This is
 how the machine identifies objects in supervised
 learning.
• The main goal of the supervised learning technique
 is to map the input variable (x) to the output
 variable (y). Some real-world applications of
 supervised learning are risk assessment, fraud
 detection, and spam filtering.
Categories of Supervised Machine Learning
• Supervised machine learning can be classified into
 two types of problems, which are given below:
A. Classification
B. Regression
 a) Classification
• Classification algorithms are used to solve
 classification problems, in which the output variable is
 categorical, such as Yes or No, Male or Female, Red
 or Blue, etc. The classification algorithms predict the
 categories present in the dataset. Some real-world
 examples of classification are spam
 detection and email filtering.
• Some popular classification algorithms are given
 below:
• Random Forest Algorithm
• Decision Tree Algorithm
• Logistic Regression Algorithm
• Support Vector Machine Algorithm
 b) Regression
• Regression algorithms are used to solve regression
 problems, in which the output variable is continuous and
 is modelled as a function of the input variables (a linear
 function in the simplest case). They are
 used to predict continuous output variables, such
 as market trends, weather prediction, etc.
• Some popular Regression algorithms are given
 below:
• Simple Linear Regression Algorithm
• Multivariate Regression Algorithm
• Decision Tree Algorithm
• Lasso Regression
Advantages and Disadvantages of Supervised Learning
Applications of Supervised Learning
• Some common applications of Supervised Learning are given
 below:
• Image Segmentation - Supervised Learning algorithms are used in image segmentation. In
 this process, image classification is performed on different image
 data with pre-defined labels.
• Fraud Detection - Supervised Learning classification algorithms are
 used for identifying fraudulent transactions, fraudulent customers, etc. This is
 done by using historical data to identify the patterns that can indicate
 possible fraud.
• Spam Detection - Classification algorithms are used in spam detection and
 filtering. These algorithms classify an email as spam or
 not spam. The spam emails are sent to the spam folder.
• Speech Recognition - Supervised learning algorithms are also used in
 speech recognition. The algorithm is trained with voice data, and
 various identification tasks can be built on it, such as
 voice-activated passwords and voice commands.
 2. Unsupervised Machine Learning
• Unsupervised learning is different from the
 Supervised learning technique; as its name suggests,
 there is no need for supervision. It means, in
 unsupervised machine learning, the machine is
 trained using the unlabeled dataset, and the
 machine predicts the output without any
 supervision.
• In unsupervised learning, the models are trained
 with the data that is neither classified nor labelled,
 and the model acts on that data without any
 supervision.
• The main aim of an unsupervised learning algorithm is
 to group or categorize the unsorted dataset according
 to similarities, patterns, and differences. The machine
 is instructed to find the hidden patterns in the
 input dataset.
• Let's take an example to understand it more precisely.
 Suppose there is a basket of fruit images, and we input
 it into the machine learning model. The images are
 totally unknown to the model, and the task of the
 machine is to find the patterns and categories of the
 objects.
• The machine then discovers patterns and
 differences on its own, such as differences in colour and shape,
 and predicts the output when it is tested with the test
 dataset.
Categories of Unsupervised Machine Learning
• Unsupervised Learning can be further classified into
 two types, which are given below:
• Clustering
• Association
1) Clustering
The clustering technique is used when we want to find
the inherent groups from the data. It is a way to group
the objects into a cluster such that the objects with the
most similarities remain in one group and have fewer or
no similarities with the objects of other groups. An
example of the clustering algorithm is grouping the
customers by their purchasing behaviour.
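• Some popular clustering algorithms are
 given below:
• K-Means Clustering algorithm
• Mean-shift algorithm
• DBSCAN Algorithm
• Principal Component Analysis (a dimensionality-reduction technique often applied before clustering)
• Independent Component Analysis (a signal-separation technique often applied before clustering)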
 2) Association
• Association rule learning is an unsupervised
 learning technique, which finds interesting relations
 among variables within a large dataset. The main
 aim of this learning algorithm is to find the
 dependency of one data item on another data
 item and map those variables accordingly so that it
 can generate maximum profit. This algorithm is
 mainly applied in Market Basket analysis, Web
 usage mining, etc.
• Some popular association rule learning
 algorithms are:
1. Apriori algorithm
2. Eclat
3. FP-growth algorithm
Applications of Unsupervised Learning
• Network Analysis: Unsupervised learning is used in
 document network analysis of scholarly articles, for
 example to identify plagiarism and copyright issues.
• Recommendation Systems: Recommendation systems
 widely use unsupervised learning techniques for
 building recommendation applications for different
 web applications and e-commerce websites.
• Anomaly Detection: Anomaly detection is a popular
 application of unsupervised learning, which can
 identify unusual data points within the dataset. It is
 used to discover fraudulent transactions.
 3. Semi-Supervised Learning
• Semi-Supervised learning is a type of Machine Learning
 algorithm that lies between Supervised and
 Unsupervised machine learning. It represents the
 intermediate ground between Supervised (With
 Labelled training data) and Unsupervised learning
 (with no labelled training data) algorithms and uses
 the combination of labelled and unlabeled datasets
 during the training period.
• Although semi-supervised learning is the middle
 ground between supervised and unsupervised learning
 and operates on data that contains a few labels,
 the data is mostly unlabeled. Labels are
 costly to obtain, so in practice an organization may have only a few
 labelled examples.
• To overcome the drawbacks of supervised learning
 and unsupervised learning algorithms, the concept
 of Semi-supervised learning is introduced. The
 main aim of semi-supervised learning is to
 effectively use all the available data, rather than
 only labelled data like in supervised learning.
 Initially, similar data is clustered using an
 unsupervised learning algorithm, and the clusters then
 help to assign labels to the unlabeled
 data. This matters because labelled data is comparatively
 more expensive to acquire than unlabeled data.
 Advantages and disadvantages of Semi-supervised
 Learning
Advantages:
• It is simple and easy to understand the algorithm.
• It is highly efficient.
• It is used to solve drawbacks of Supervised and
 Unsupervised Learning algorithms.
Disadvantages:
• Iteration results may not be stable.
• Accuracy can be low.
 4. Reinforcement Learning
• Reinforcement learning works on a feedback-based
 process, in which an AI agent (a software
 component) automatically explores its surroundings
 by trial and error: taking actions, learning from
 experience, and improving its performance. The agent
 is rewarded for each good action and
 punished for each bad action; hence the goal of a
 reinforcement learning agent is to maximize the
 rewards.
• In reinforcement learning, there is no labelled data
 like supervised learning, and agents learn from
 their experiences only.
• The reinforcement learning process is similar to how a
 human being learns; for example, a child learns various
 things through experience in day-to-day life. An
 example of reinforcement learning is playing a
 game, where the game is the environment, the moves
 of the agent at each step define states, and the goal
 of the agent is to get a high score. The agent receives
 feedback in terms of punishments and rewards.
• Due to its way of working, reinforcement learning
 is employed in different fields such as Game
 theory, Operation Research, Information theory,
 multi-agent systems.
• A reinforcement learning problem can be
 formalized as a Markov Decision
 Process (MDP). In an MDP, the agent constantly
 interacts with the environment and performs
 actions; after each action, the environment responds
 and generates a new state.
 Categories of Reinforcement Learning
• Reinforcement learning is categorized mainly into two
 types of methods/algorithms:
 Supervised machine learning algorithms
1. Linear Regression:
Linear regression is a method used to define a
relationship between a dependent variable (Y) and an
independent variable (X), which is simply written as
Y = mX + b, where m is the slope and b is the intercept.
• Linear regression predicts the relationship between
 two variables by assuming a linear connection
 between the independent and dependent
 variables.
• It seeks the optimal line that minimizes the sum of
 squared differences between predicted and actual
 values. Applied in various domains like economics
 and finance, this method analyzes and forecasts
 data trends. It can extend to multiple linear
 regression involving several independent variables
 and logistic regression, suitable for binary
 classification problems.
• Linear regression is one of the simplest
 statistical regression methods used for predictive
 analysis in machine learning. It
 shows the linear relationship between the
 independent (predictor) variable, plotted on the X-axis, and the
 dependent (output) variable, plotted on the Y-axis. If there is a single input
 variable X (independent variable), the
 regression is called simple linear regression.
• The graph above presents the linear relationship
 between the output(y) and predictor(X)
 variables. The blue line is referred to as the best-
 fit straight line. Based on the given data points, we
 attempt to plot a line that fits the points the best.
Mean square error of data points
Slope at a particular point (derivative)
Computing m and b in each iteration by using the following
 formula
The learning rate is the step size used while searching for the global minimum.
It is common to start with a learning rate of 0.001 or 0.0001 in
machine learning.
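The formulas for these three slides are not reproduced in this text. Assuming the standard mean-squared-error formulation for the line Y = mX + b, with n data points, learning rate α, and the gradient names md and bd used in the worked example, they can be written as:

```latex
\begin{aligned}
\mathrm{MSE} &= \frac{1}{n}\sum_{i=1}^{n}\bigl(y_i - (m x_i + b)\bigr)^2 \\
md = \frac{\partial\,\mathrm{MSE}}{\partial m} &= -\frac{2}{n}\sum_{i=1}^{n} x_i\bigl(y_i - (m x_i + b)\bigr) \\
bd = \frac{\partial\,\mathrm{MSE}}{\partial b} &= -\frac{2}{n}\sum_{i=1}^{n} \bigl(y_i - (m x_i + b)\bigr) \\
m &\leftarrow m - \alpha \cdot md, \qquad b \leftarrow b - \alpha \cdot bd
\end{aligned}
```

Each iteration evaluates md and bd at the current m and b, then moves both parameters a step of size α against the gradient.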
Illustration of how m and b change while finding the global minimum
 Example
• Given the following dataset of x values
 and their corresponding y values, construct the best
 regression model using up to 10,000 iterations.
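 Solution (iteration 0)
m_current = 0
b_current = 0
Y_predicted = [0, 0, 0, 0, 0]
Mean square error = 89
md = -62
bd = -18
Updated m_current = 4.96
Updated b_current = 1.44
(The updates imply a learning rate of 0.08: 0 - 0.08 × (-62) = 4.96 and 0 - 0.08 × (-18) = 1.44.)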
 Iteration 1
m_current = 4.96
b_current = 1.44
Y_predicted = [6.4, 11.36, 16.32, 21.28, 26.24]
Mean square error = 71.1056
md = 55.76
bd = 14.64
Updated m_current = 0.4992
Updated b_current = 0.2688
 Iteration 2
m_current = 0.4992
b_current = 0.2688
Y_predicted = [0.768, 1.2672, 1.7664, 2.2656, 2.7648]
Mean square error = 56.8298
md = -49.4048
bd = -14.4672
Updated m_current = 4.451584
Updated b_current = 1.426176
 Iteration 3 and all later iterations continue with the same process.
 Therefore the final model becomes
Y = mX + b, at iteration 10,000:
m = 2.000000000000001,
b = 2.9999999999999947
So the best model is
Y = 2.000000000000001 X + 2.9999999999999947
Check: assume X = 5.
Then Y = 12.999999999999998 ≈ 13, which is correct.
Implementing by program
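The program shown on the original slide is not reproduced in this text. A minimal Python sketch of the gradient descent procedure from the worked example might look like the following. The dataset x = [1, 2, 3, 4, 5], y = [5, 7, 9, 11, 13] and the learning rate 0.08 are assumptions: they are inferred from the iteration values reported in the example (MSE = 89, md = -62, bd = -18, updated m = 4.96), not copied from the slide. The function name `gradient_descent` is illustrative.

```python
def gradient_descent(x, y, learning_rate, iterations):
    """Fit y = m*x + b by batch gradient descent on the mean squared error."""
    n = len(x)
    m = b = 0.0
    history = []  # one (m, b, mse, md, bd) tuple per iteration
    for _ in range(iterations):
        y_pred = [m * xi + b for xi in x]
        mse = sum((yi - pi) ** 2 for yi, pi in zip(y, y_pred)) / n
        # Partial derivatives of the MSE with respect to m and b.
        md = -(2 / n) * sum(xi * (yi - pi) for xi, yi, pi in zip(x, y, y_pred))
        bd = -(2 / n) * sum(yi - pi for yi, pi in zip(y, y_pred))
        history.append((m, b, mse, md, bd))
        # Step against the gradient.
        m -= learning_rate * md
        b -= learning_rate * bd
    return m, b, history


# Assumed data, inferred from the reported iteration values.
x = [1, 2, 3, 4, 5]
y = [5, 7, 9, 11, 13]
m, b, history = gradient_descent(x, y, learning_rate=0.08, iterations=10000)
print(history[0])                # values for iteration 0
print(round(m, 6), round(b, 6))  # final slope and intercept
```

With these assumed inputs the first history entries match the iteration values in the example, and the final parameters approach m = 2 and b = 3.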
Constructing linear regression mathematically
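The slide's own derivation is not reproduced here. One standard way to construct the line mathematically is the closed-form least-squares solution, m = Σ(x - x̄)(y - ȳ) / Σ(x - x̄)² and b = ȳ - m·x̄. The sketch below applies it to an assumed dataset (the same values inferred from the gradient descent example); the function name `least_squares_line` is illustrative.

```python
def least_squares_line(x, y):
    """Return (m, b) of the least-squares line y = m*x + b."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Covariance-like and variance-like sums around the means.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    m = sxy / sxx
    b = mean_y - m * mean_x
    return m, b


# Assumed data, not taken from the slide.
x = [1, 2, 3, 4, 5]
y = [5, 7, 9, 11, 13]
m, b = least_squares_line(x, y)
print(m, b)  # 2.0 3.0
```

Unlike gradient descent, this needs no learning rate or iteration count, which is why it is a useful cross-check on the iterative result.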
Estimated model mathematically
Linear regression with built-in module
Constructed linear regression model
• The line constructed by the mathematical
 derivation is almost identical to the one produced
 by the built-in linear regression model.
• It is therefore easier to use the built-in model than
 to carry out the mathematical operations by hand.
Visualization of gradient descent converging to the best model
 Given dataset
Linear regression model of the given dataset
2. What Is Logistic Regression?
• Logistic regression is a supervised machine learning
 algorithm that accomplishes binary classification tasks
 by predicting the probability of an outcome, event, or
 observation. The model delivers a binary or
 dichotomous outcome limited to two possible
 outcomes: yes/no, 0/1, or true/false.
• Ordinal logistic regression
 is used to predict the probability of an outcome
 that falls into a predetermined order, such as the
 level of customer satisfaction, the severity of a
 disease, or the stage of cancer.
Why do we use Logistic Regression rather than Linear Regression?
Here the threshold value is 0.5, which means if the value of h(x) is greater
than 0.5 then we predict malignant tumor (1) and if it is less than 0.5 then we
predict benign tumor (0). Everything seems okay.
Sigmoid function
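The sigmoid function is σ(z) = 1 / (1 + e^(-z)); it squashes any real number into the range (0, 1), so its output can be read as a probability and compared with the 0.5 threshold. A minimal sketch follows. The slope m = 1.0, intercept b = -3.0, and the function names are hypothetical values chosen for illustration, and ties at exactly 0.5 are assigned to class 1 here.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function 1 / (1 + e^-z)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z, rewrite the formula to avoid overflow in exp(-z).
    e = math.exp(z)
    return e / (1.0 + e)


def predict_class(x, m, b, threshold=0.5):
    """Return 1 if sigmoid(m*x + b) reaches the threshold, else 0."""
    return 1 if sigmoid(m * x + b) >= threshold else 0


print(sigmoid(0))                        # 0.5
print(predict_class(5, m=1.0, b=-3.0))   # 1
print(predict_class(1, m=1.0, b=-3.0))   # 0
```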
Linear regression vs logistic regression
1. Output: Linear regression predicts continuous values, for example
 house prices. Logistic regression predicts binary
 or categorical values, often expressed as probabilities, for example
 whether an email is spam or not (1 or 0).
2. Cost Function: Linear regression uses Mean Squared Error (MSE) as
 the cost function. Logistic regression uses Log Loss (Cross-Entropy)
 as the cost function.
3. Assumptions: Linear regression assumes a linear relationship
 between the features and the target variable. Logistic regression
 assumes a linear relationship between the features and the log-odds
 of the target variable.
4. Decision Boundary: Linear regression has no concept of a
 decision boundary. Logistic regression separates classes by applying a
 threshold to the sigmoid output, which creates a decision boundary.
5. Evaluation Metrics: Linear regression is evaluated using metrics like
 Mean Squared Error (MSE) and R-squared. Logistic regression is
 evaluated using metrics like Accuracy, Precision, Recall, and F1-score.
Transforming linear to logistic regression
 Reading Assignment
• Work through the basic mathematics behind logistic
 regression by yourself.
 3. Decision Tree algorithm
• Decision Tree is a Supervised learning
 technique that can be used for both classification
 and Regression problems, but mostly it is preferred
 for solving Classification problems.
• In a decision tree, there are two types of nodes:
 the Decision Node and the Leaf Node. Decision nodes
 are used to make a decision and have multiple
 branches, whereas leaf nodes are the output of
 those decisions and do not contain any further
 branches.
 Decision Tree Terminologies
• Root Node - The topmost node in the tree, which
 represents the complete dataset.
• Decision/Internal Node - A node that
 splits the data into multiple segments;
 the main goal is for the child nodes to have maximum
 homogeneity or purity.
• Leaf/Terminal Node - A node that represents a data section
 with the highest homogeneity (all instances of the same kind).
• Entropy - A measure of the impurity or uncertainty
 present in the data, used to evaluate the quality
 of a split. When entropy is zero the sample is completely
 homogeneous, meaning that every instance belongs to the
 same class; for two classes, entropy is one when the sample is equally
 divided between them.
 Entropy
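The entropy formula from this slide is not reproduced in the text. Assuming the standard Shannon definition, for a sample S whose instances fall into c classes with proportions p_1, ..., p_c:

```latex
\mathrm{Entropy}(S) = -\sum_{i=1}^{c} p_i \log_2 p_i
\qquad\text{and, for two classes,}\qquad
\mathrm{Entropy}(S) = -p_{+}\log_2 p_{+} - p_{-}\log_2 p_{-}
```

For example, a sample with 9 positive and 5 negative instances has entropy of about 0.940, while a sample containing only one class has entropy 0.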
 Information Gain (IG)
• The information gain is based on the decrease in
 entropy after a dataset is split on an attribute.
• Which attribute creates the most homogeneous
 branches?
• First the entropy of the total dataset is calculated.
• The dataset is then split on the different attributes.
• The entropy for each branch is calculated. Then it is
 added proportionally, to get total entropy for the
 split.
• The resulting entropy is subtracted from the entropy
 before the split.
• The attribute that yields the largest IG is chosen for the
 decision node.
 IG formula
Information Gain = Entropy(S) -
 [(Weighted Avg) * Entropy(each feature)]
Or
The information gain of a parent node is
the entropy of the parent node minus the weighted average
of the entropies of its child nodes.
Or
Gain(S, A) = Entropy(S) - Σ_v (|S_v| / |S|) * Entropy(S_v),
where S_v is the subset of S for which attribute A has value v.
• A branch set with entropy of 0 is a leaf node.
• Otherwise, the branch needs further splitting to
 classify its dataset.
• The ID3 algorithm is run recursively on the non-
 leaf branches, until all data is classified.
 What is ID3?
• A mathematical algorithm for building the decision
 tree.
• Invented by J. Ross Quinlan in 1979.
• Uses Information Theory invented by Shannon in
 1948.
• Builds the tree from the top down, with no
 backtracking.
• Information Gain is used to select the most useful
 attribute for classification.
• Let's illustrate this with the help of an example. Assume we
 want to play tennis on a particular day, say Saturday.
 How will you decide whether to play or not? Let's say you
 go out and check whether it is hot or cold, check the speed of the
 wind and the humidity, and check the weather, i.e. whether it is sunny,
 cloudy, or rainy. You take all these factors into account to
 decide if you want to play or not.
• So, you record all these factors for the last 14 days and
 form a lookup table like the one below.
• Now, you may use this table to decide whether to play or
 not. But what if the weather pattern on Saturday does not
 match any of the rows in the table? This may be a
 problem. A decision tree is a good way to
 represent data like this because it takes into account all the
 possible paths that can lead to the final decision by
 following a tree-like structure.
Given a dataset
 Question
1. Construct the decision tree using the ID3 algorithm.
2. Given the following conditions for day 15, decide whether to play:
 • Weather: Sunny
 • Temperature: Hot
 • Humidity: Normal
 • Wind: Weak
 • Play: ?
 Answer
• Entropy(S) = 0.94
• Entropy(Sunny) = 0.971
• Entropy(Overcast) = 0
• Entropy(Rain) = 0.971
• Gain(S, Outlook) = 0.2464
• Gain(S, Temperature) = 0.0289
• Gain(S, Humidity) = 0.1516
• Outlook is selected as the root because it has the highest information gain.
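The entropy and gain values above can be checked with a short Python sketch. The dataset table is not reproduced in this text, so the [yes, no] counts below are an assumption: they come from the widely used 14-day play-tennis table (9 yes, 5 no overall), which is consistent with the entropies reported above. The function names are illustrative.

```python
import math


def entropy(counts):
    """Shannon entropy (base 2) of a list of class counts."""
    total = sum(counts)
    return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(parent_counts, child_counts):
    """Parent entropy minus the size-weighted entropy of the children."""
    total = sum(parent_counts)
    weighted = sum(sum(child) / total * entropy(child) for child in child_counts)
    return entropy(parent_counts) - weighted


# Assumed [yes, no] counts for the whole table and for each Outlook value.
play = [9, 5]
outlook = {"sunny": [2, 3], "overcast": [4, 0], "rain": [3, 2]}

print(round(entropy(play), 3))  # 0.94
for value, counts in outlook.items():
    print(value, round(entropy(counts), 3))
print(round(information_gain(play, list(outlook.values())), 3))  # 0.247
```

The computed gain is about 0.2467; the value 0.2464 listed above results from rounding the entropies to 0.94 and 0.971 before subtracting.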
• Gain(S_sunny, Temperature) = 0.570
• Gain(S_sunny, Humidity) = 0.97
• Gain(S_sunny, Wind) = 0.0192
• Humidity is selected for the Sunny branch because it has the highest
 information gain.
• Gain(S_rain, Temperature) = 0.0192
• Gain(S_rain, Humidity) = 0.0192
• Gain(S_rain, Wind) = 0.97
• Wind is selected for the Rain branch because it has the highest information
 gain.
The final tree becomes
 Answer to question 2
• On day 15, can the player play or not?
• Yes, the player can play on day 15: the Sunny branch is split on
 Humidity, and Humidity is Normal.
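The tree figure is not reproduced in this text, but its structure follows from the gains in the answer: Outlook at the root, Humidity under Sunny, Wind under Rain, and a pure leaf under Overcast. A minimal sketch of that tree as nested conditions is shown below. The leaf labels (High humidity gives No, Strong wind gives No) are assumptions taken from the standard play-tennis table, and the function name `predict_play` is illustrative.

```python
def predict_play(outlook, humidity, wind):
    """Walk the assumed ID3 tree and return 'yes' or 'no'."""
    outlook, humidity, wind = outlook.lower(), humidity.lower(), wind.lower()
    if outlook == "overcast":
        return "yes"  # entropy 0: every overcast day is a 'yes'
    if outlook == "sunny":
        return "yes" if humidity == "normal" else "no"
    if outlook == "rain":
        return "yes" if wind == "weak" else "no"
    raise ValueError(f"unknown outlook: {outlook}")


# Day 15: Sunny, Hot, Normal humidity, Weak wind.
# Temperature is not needed because the tree never tests it.
print(predict_play("Sunny", "Normal", "Weak"))  # yes
```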
 Advantages of the Decision Tree
• It is simple to understand, as it follows the same
 process that a human follows while making a
 decision in real life.
• It can be very useful for solving decision-related
 problems.
• It helps to think about all the possible outcomes
 for a problem.
• There is less requirement of data cleaning
 compared to other algorithms.
 Disadvantages of the Decision Tree
• A decision tree can contain many layers, which
 makes it complex.
• It may have an overfitting issue, which can be
 resolved using the Random Forest algorithm.
• For more class labels, the computational
 complexity of the decision tree may increase.
 4. Random Forest
• The Random Forest algorithm's widespread popularity
 stems from its user-friendly nature and adaptability,
 which enable it to tackle both classification and
 regression problems effectively.
• One of the most important features of the
 Random Forest algorithm is that it can handle
 datasets containing continuous variables, as in the
 case of regression, and categorical variables, as in
 the case of classification. It performs well on
 both classification and regression tasks.
• Random Forest is a classifier that builds a
 number of decision trees on various subsets of the
 given dataset and combines their outputs to improve the
 predictive accuracy.
• Instead of relying on one decision tree, the
 random forest takes the prediction from each tree
 and predicts the final output based on the majority vote
 of those predictions (or their average, for regression).
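The voting step can be sketched as follows. This only illustrates how per-tree predictions are combined; it does not train trees on random subsets of the data as a real random forest does. The three rule-based "trees", the feature names, and the function names are hypothetical.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label that occurs most often among the predictions."""
    return Counter(predictions).most_common(1)[0][0]


def forest_predict(trees, sample):
    """Ask every tree for a prediction and combine them by majority vote."""
    return majority_vote([tree(sample) for tree in trees])


# Three hypothetical "trees", each a simple rule on one feature.
trees = [
    lambda s: "spam" if s["links"] > 3 else "ham",
    lambda s: "spam" if s["caps_ratio"] > 0.5 else "ham",
    lambda s: "spam" if s["length"] < 20 else "ham",
]
sample = {"links": 5, "caps_ratio": 0.7, "length": 120}
print(forest_predict(trees, sample))  # spam
```

Two of the three rules vote "spam", so the combined prediction is "spam" even though one tree disagrees.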
The diagram below explains the working of the
 Random Forest algorithm
Decision tree vs random forest
 Advantages of Random Forest:
• Robustness: Random Forest is highly robust to noisy
 data and outliers.
• Feature Importance: Random Forest provides a
 measure of feature importance, indicating which
 features contribute the most to the predictions. This
 information can be valuable for feature selection and
 understanding the underlying relationships in the data.
• Non-linearity Handling: Random Forest can effectively
 model complex, non-linear relationships between
 input features and output variables.
• Scalability: Random Forest can handle large datasets
 with a high number of features, and its training process
 can be parallelized efficiently.
 Disadvantages of the Random Forest algorithm
1. Random forests can be computationally expensive and
time-consuming, especially when dealing with a large
number of trees or features.
2. Random forests may not perform well on imbalanced
datasets, where one class is significantly more prevalent
than others, as they tend to favor majority classes.
3. Hyperparameter Tuning: Random Forest has several
hyperparameters, such as the number of trees, the depth
of trees, and the number of features considered at each
split. Proper tuning of these parameters is essential to
optimize the model's performance, which may require
additional computational resources.
 5. Support Vector Machine Algorithm
• Support Vector Machine or SVM is one of the most
 popular Supervised Learning algorithms, which is
 used for Classification as well as Regression
 problems. However, primarily, it is used for
 Classification problems in Machine Learning.
• The goal of the SVM algorithm is to create the best
 line or decision boundary that can segregate n-
 dimensional space into classes so that we can easily
 put the new data point in the correct category in
 the future. This best decision boundary is called a
 hyperplane.
• SVM chooses the extreme points/vectors that help
 in creating the hyperplane. These extreme cases are
 called support vectors, and hence the algorithm is
 termed Support Vector Machine.
• Consider the diagram below, in which there are
 two different categories that are classified using a
 decision boundary or hyperplane:
 Support Vector Machines
[Figure: one possible solution, the separating hyperplane B1.]
[Figure: two candidate hyperplanes, B1 with margin boundaries b11 and b12, and B2 with margin boundaries b21 and b22.]
• Find the hyperplane that maximizes the
 margin => B1 is
 better than B2
 The best hyperplane is the one that has the maximum distance from both
 classes, and finding it is the main aim of SVM.
Linear SVM
• If the data points are not linearly separable, SVM
 can still be used by applying a technique called the
 kernel trick. This involves mapping the data points
 into a higher-dimensional space where they
 become linearly separable.
• Common kernel functions used for this purpose
 include polynomial kernels, Gaussian radial basis
 function (RBF) kernels, and sigmoid kernels. By
 using these kernels, SVM can effectively classify
 non-linearly separable data points.
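Two of the kernels named above can be written directly from their usual definitions: the Gaussian RBF kernel K(a, b) = exp(-γ‖a - b‖²) and the polynomial kernel K(a, b) = (a·b + c)^d. The sketch below is illustrative only; the parameter defaults (gamma = 1.0, degree = 2, coef0 = 1.0) and function names are assumptions, not values from the slides.

```python
import math


def rbf_kernel(a, b, gamma=1.0):
    """Gaussian RBF kernel: exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-gamma * squared_distance)


def polynomial_kernel(a, b, degree=2, coef0=1.0):
    """Polynomial kernel: (dot(a, b) + coef0) ** degree."""
    dot = sum(ai * bi for ai, bi in zip(a, b))
    return (dot + coef0) ** degree


print(rbf_kernel([1, 2], [1, 2]))         # 1.0 (identical points)
print(polynomial_kernel([1, 2], [3, 4]))  # 144.0
```

A kernel returns the similarity of two points as if they had been mapped to the higher-dimensional space, without ever computing that mapping explicitly.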
 Unsupervised Machine Learning
• Unsupervised learning is a machine learning technique in which models
 are not supervised using a labelled training dataset. Instead, the models themselves find the
 hidden patterns and insights in the given data.
Why use Unsupervised Learning?
• Unsupervised learning is helpful for finding useful insights
 from the data.
• Unsupervised learning is similar to how a human learns
 to think from their own experiences, which makes it closer
 to real AI.
• Unsupervised learning works on unlabeled and
 uncategorized data, which makes it
 widely applicable.
• In the real world, we do not always have input data with
 the corresponding output, so to solve such cases we need
 unsupervised learning.
 Unsupervised Learning algorithms:
• K-means clustering
• Hierarchical clustering
• Anomaly detection
• Neural Networks
• Principal Component Analysis
• Independent Component Analysis
• Apriori algorithm
Difference between Supervised and Unsupervised Learning
 K-Means Clustering Algorithm
• K-Means Clustering is an unsupervised learning
 algorithm that is used to solve
 clustering problems in machine learning and data science.
• What is K-Means Algorithm?
• K-Means groups the unlabeled dataset into
 different clusters. Here K defines the number of
 pre-defined clusters that need to be created in the
 process: if K=2, there will be two clusters,
 for K=3, there will be three clusters, and so on.
How does the K-Means Algorithm Work
The working of the K-Means algorithm is explained in the
steps below:
Step-1: Select the number K to decide the number of clusters.
Step-2: Select K random points as centroids. (They need not be
points from the input dataset.)
Step-3: Assign each data point to its closest centroid, which
will form the predefined K clusters.
Step-4: Compute the new centroid of each cluster as the mean of
its assigned points.
Step-5: Repeat the third step, which means reassign each
data point to the new closest centroid.
Step-6: If any reassignment occurs, go to Step-4; otherwise go
to FINISH.
Step-7: The model is ready.
 The k-means clustering algorithm
 mainly performs two tasks:
• Determines the best positions for the K center points or
 centroids by an iterative process.
• Assigns each data point to its closest k-center.
 The data points that are nearest to a particular
 k-center form a cluster.
 Example
Given the following data points, compute the K-means clustering, assuming K=3 and
the given initial centroids.
 Answer
Cluster 1 = [(2,10), (5,8), (4,9)]
Cluster 2 = [(8,4), (7,5), (6,4)]
Cluster 3 = [(2,5), (1,2)]
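The K-Means steps can be sketched in plain Python and run on the eight points that appear in the answer. The initial centroids are not reproduced in this text; the sketch assumes (2,10), (5,8), and (1,2), which is a common choice for this exercise and leads to the clusters listed in the answer. Euclidean distance and the function name `kmeans` are also assumptions.

```python
def kmeans(points, centroids, max_iterations=100):
    """Cluster 2-D points around the given initial centroids."""
    centroids = list(centroids)
    assignment = None
    for _ in range(max_iterations):
        # Step 3/5: assign every point to its closest centroid.
        new_assignment = [
            min(range(len(centroids)),
                key=lambda k: (p[0] - centroids[k][0]) ** 2
                            + (p[1] - centroids[k][1]) ** 2)
            for p in points
        ]
        # Step 6: stop when no point changes cluster.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 4: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:
                centroids[k] = (sum(p[0] for p in members) / len(members),
                                sum(p[1] for p in members) / len(members))
    clusters = [[p for p, a in zip(points, assignment) if a == k]
                for k in range(len(centroids))]
    return clusters, centroids


points = [(2, 10), (2, 5), (8, 4), (5, 8), (7, 5), (6, 4), (1, 2), (4, 9)]
initial = [(2, 10), (5, 8), (1, 2)]  # assumed initial centroids
clusters, centroids = kmeans(points, initial)
for cluster in clusters:
    print(cluster)
```

With these assumed starting centroids the assignments stop changing after a few passes, and the three printed clusters contain the same points as the answer.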
 How to choose the value of K (the number of clusters)
Computing Mean Square Error
Visualizing Elbow
 Reading assignment
• Apriori algorithm for association rule learning
 problems.
• LDA for topic modeling of text passages, i.e.,
 discover and associate keywords to text.
Model Evaluation Techniques in Machine Learning
• Overfitting: Occurs when the model is so
 closely aligned to the training data that it does not
 know how to respond to new data.
• Underfitting: Occurs when the model cannot
 adequately capture the underlying structure of the
 data.
• Right Fit: Occurs when both the training data error
 and the test data error are minimal.
 Evaluation Metrics
• There are different metrics for the tasks of
 classification, regression, ranking, clustering, topic
 modeling, etc. Some of the metrics are as follows:
• Classification Metrics (accuracy, precision, recall,
 F1-score, ROC, AUC, …)
• Regression Metrics (MSE, MAE, R²)
• Ranking Metrics (MRR, DCG)
• Statistical Metrics (Correlation)
• Computer Vision Metrics (PSNR, SSIM, IoU)
• NLP Metrics (Perplexity, BLEU score)
 1. Classification Metrics
Confusion Matrix: A confusion matrix is a table that
is often used to describe the performance of a
classification model (or "classifier") on a set of test
data for which the true values are known.
 It presents a summary of the predicted and actual
 classifications of a model, showing the number of
 true positive, true negative, false positive, and false
 negative predictions. This allows for a detailed
 analysis of the model's performance in terms of
 accuracy, precision, recall, and other metrics.
Visualization of CM
 Accuracy
• Accuracy is one metric for evaluating classification
 models. Formally, accuracy is defined as the ratio of the
 number of correct predictions to the total number of
 predictions.
 Precision
• Precision in machine learning is a metric that
 measures the accuracy of the positive predictions
 made by a model.
• It is calculated as the number of true positive
 predictions divided by the total number of positive
 predictions made by the model.
• A high precision value indicates that the model is
 making accurate positive predictions, while a low
 precision value indicates that many of the model's
 positive predictions are incorrect (false positives).
 Recall
• Recall is a metric used to evaluate the performance
 of a classification model. It measures the ability of
 the model to correctly identify all relevant
 instances, or the proportion of true positive
 instances that were correctly predicted by the
 model out of all actual positive instances.
• A high recall value indicates that the model is good
 at identifying all positive instances, while a low
 recall value suggests that the model is missing a
 significant number of positive instances.
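The definitions of precision and recall above, together with the F1-score (the harmonic mean of the two), can be sketched directly from the confusion-matrix counts. The counts used here are made-up numbers, and the function names are only illustrative.

```python
def precision(tp, fp):
    """True positives divided by all positive predictions."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """True positives divided by all actual positives."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Hypothetical counts: 8 true positives, 2 false positives, 4 false negatives.
tp, fp, fn = 8, 2, 4
print("precision:", round(precision(tp, fp), 4))
print("recall:   ", round(recall(tp, fn), 4))
print("F1-score: ", round(f1_score(tp, fp, fn), 4))
```

A model can trade one metric for the other, which is why the F1-score is often reported alongside them as a single summary value.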
 134
F1-score
 135
 Text Feature Extraction
• The following are the most common text feature
 extraction techniques in machine learning:
1. BoW (Bag of Words)
2. TF-IDF (Term Frequency-Inverse Document
 Frequency)
3. Word2vec
4. FastText
5. Embedding layer for neural networks
 136
 What is a Bag-of-Words?
• A bag-of-words model, or BoW for short, is a way of
 extracting features from text for use in modeling, such
 as with machine learning algorithms.
• The approach is very simple and flexible, and can be
 used in a myriad of ways for extracting features from
 documents.
• A bag-of-words is a representation of text that
 describes the occurrence of words within a document.
 It involves two things:
• A vocabulary of known words.
• A measure of the presence of known words.
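A small sketch of the two ingredients just listed: a vocabulary of known words and a count of how often each known word occurs in a document. The two example sentences and the helper names (`tokenize`, `build_vocabulary`, `vectorize`) are assumptions made for this illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def build_vocabulary(documents):
    """Sorted list of every distinct word seen in the documents."""
    return sorted({token for doc in documents for token in tokenize(doc)})


def vectorize(text, vocabulary):
    """Count how many times each vocabulary word occurs in the text."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]


# Hypothetical documents.
docs = ["The cat sat on the mat", "The dog sat"]
vocab = build_vocabulary(docs)
vectors = [vectorize(doc, vocab) for doc in docs]

print(vocab)
for doc, vec in zip(docs, vectors):
    print(vec, "<-", doc)
```

Every document is mapped to a vector of the same length as the vocabulary, and the order of the words in the sentence plays no role in the result.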
 137
Example of the Bag-of-Words Model
 138
 Step 3: Create Document Vectors
One drawback of Bag of Words (BoW) in feature extraction is that it does not
consider the semantic meaning of words or the context in which they are used. BoW
treats each word as a separate entity and disregards the order and relationships
between words in a sentence, which can lead to a loss of important information.
Additionally, BoW can result in high-dimensional sparse vectors, which may require
significant computational resources and can be challenging to process efficiently.
 139
 Limitations of Bag of Words
• No Word Order: It doesn’t care about the order of
 words, missing out on how words work together.
• Ignores Context: It doesn’t understand the meaning of
 words based on the words around them.
• Always Same Length: Every text becomes a vector as
 long as the vocabulary, however short or long the text
 is, which can be limiting for different types of text.
• Lots of Words: It needs an entry for every word it
 may encounter, which can be a huge list to handle.
• No Meanings: It doesn’t understand what words
 mean, only how often they appear, so it can’t grasp
 synonyms or different word forms.
 140
2. TF-IDF (Term Frequency-Inverse Document Frequency)
 141
 Term Frequency (TF)
• Let’s first understand Term Frequency (TF). It is a
 measure of how frequently a term, t, appears in a
 document, d:
• TF(t, d) = (number of times t appears in d) /
 (total number of terms in d)
 142
Consider the following reviews about a
particular horror movie
 143
We can calculate the term frequencies for all the terms and all the reviews
 144
 Inverse Document Frequency (IDF)
• IDF is a measure of how important a term is. We
 need the IDF value because computing just the TF
 alone is not sufficient to understand the
 importance of words:
 145
Cont.
 146
The IDF values for the entire vocabulary
 would be
 147
 Cont…
• Hence, we see that words like "is", "this", "and",
 etc., are reduced to 0 and have little importance;
 while words like "scary", "long", "good", etc. are
 words with more importance and thus have a
 higher value.
• We can now compute the TF-IDF score for each
 word in the corpus. Words with a higher score are
 more important, and those with a lower score are
 less important:
 148
We can calculate the TF-IDF scores for all the words
 149
 Cont.
• We have now obtained the TF-IDF scores for our
 vocabulary. TF-IDF gives larger values to less
 frequent words and is high when both the IDF and TF
 values are high, i.e., the word is rare across all the
 documents combined but frequent in a single
 document.
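The TF, IDF and TF-IDF calculations can be sketched as follows. Several things here are assumptions rather than the slides' own material: the three short reviews are stand-ins, TF is taken as the term count divided by the document length, and IDF is taken as log10(N / df), where N is the number of documents and df is the number of documents containing the term. Under that IDF, a word that appears in every document gets the value 0.

```python
import math


def tf(term, tokens):
    """Term frequency: occurrences of the term divided by document length."""
    return tokens.count(term) / len(tokens)


def idf(term, corpus):
    """Inverse document frequency, assumed here to be log10(N / df).

    Only call this for terms that occur in at least one document.
    """
    df = sum(1 for tokens in corpus if term in tokens)
    return math.log10(len(corpus) / df)


# Hypothetical reviews used only for illustration.
reviews = [
    "this movie is very scary and long",
    "this movie is not scary and is slow",
    "this movie is spooky and good",
]
corpus = [review.split() for review in reviews]
vocab = sorted({term for tokens in corpus for term in tokens})

idf_values = {term: idf(term, corpus) for term in vocab}
tfidf = [
    {term: tf(term, tokens) * idf_values[term] for term in set(tokens)}
    for tokens in corpus
]

for term in ("this", "scary", "long"):
    print(term, "IDF =", round(idf_values[term], 3))
print("TF-IDF of 'long' in review 1:", round(tfidf[0]["long"], 3))
```

Words shared by all reviews score 0, while a word that is frequent in one review and absent from the others gets the highest score.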
 150
 Limitations of TF-IDF
1. Lack of semantic understanding: TF-IDF does not
consider the semantic meaning of words, which can lead to
inaccuracies in representing the importance of terms.
2. Sensitivity to document length: TF-IDF weights terms
based on their frequency in a document, which can be
biased towards longer documents.
3. Difficulty with rare terms: TF-IDF may not effectively
capture the importance of rare terms that are present in
only a few documents.
4. Inability to handle synonyms and polysemous words:
TF-IDF cannot recognize different words for the same
concept or tell apart different meanings of one word.
 151
 Word2vec Feature Extraction
• Word2Vec is a widely used method in natural
 language processing (NLP) that represents words
 as vectors in a continuous vector space.
 Developed by researchers at Google, it maps words
 to real-valued vectors that capture the semantic
 relationships between words.
• Words with similar meanings should have similar
 vector representations, according to the main
 principle of Word2Vec. Word2Vec utilizes two
 architectures:
 152
 1. CBOW (Continuous Bag of Words):
• The CBOW model predicts the current word given
 context words within a specific window.
• The input layer contains the context words and the
 output layer contains the current word.
 153
 2. Skip-gram
• Skip-gram predicts the surrounding context words
 within a specific window given the current word.
• The input layer contains the current word and the
 output layer contains the context words.
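To make the difference between the two architectures concrete, the sketch below only builds the training examples each one would see; it does not train a network. The function names, the example sentence and the window size are assumptions for illustration.

```python
def cbow_pairs(tokens, window=2):
    """CBOW examples: (list of context words) -> current word."""
    pairs = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        if context:
            pairs.append((context, target))
    return pairs


def skipgram_pairs(tokens, window=2):
    """Skip-gram examples: current word -> one context word at a time."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


sentence = "the quick brown fox jumps".split()

for context, target in cbow_pairs(sentence, window=1):
    print("CBOW      input:", context, "-> output:", target)
for center, context_word in skipgram_pairs(sentence, window=1):
    print("Skip-gram input:", center, "-> output:", context_word)
```

With a window of 1, this five-word sentence gives five CBOW examples but eight Skip-gram examples, which hints at why Skip-gram takes longer to train.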
 154
 CBOW vs Skip-gram
1. Architecture:
• CBOW: Predicts the target word from the context words surrounding it. The
 context words are the input and the target word is the output.
• Skip-gram: Does the opposite of CBOW. It predicts the context words given
 the target word.
2. Training Efficiency:
• CBOW: Faster to train than Skip-gram because it aggregates the context
 words into a single prediction of the target word.
• Skip-gram: Slower to train than CBOW because it makes a separate prediction
 for each context word of each target word.
3. Performance:
• CBOW: Works well with frequent words and benefits from larger datasets.
• Skip-gram: Works well with infrequent words and copes well with smaller
 datasets.
4. Usage:
• CBOW: Used when the context words contain enough information to predict
 the target word accurately.
• Skip-gram: Used when the target word contains enough information to predict
 the context words accurately.
 155
 Artificial Neural Network
• Artificial Neural Networks (ANNs) are algorithms
 inspired by brain function and are used to model
 complicated patterns and solve prediction problems.
 The ANN is a deep learning method that arose from
 the concept of the biological neural networks of the
 human brain.
• The development of ANNs was the result of an attempt
 to replicate the workings of the human brain.
• The workings of an ANN resemble those of
 biological neural networks, although they are not
 identical. An ANN accepts only numeric and
 structured data.
 156
Artificial Neural Networks Architecture
1. There are three types of layers in the network
 architecture: the input layer, one or more hidden
 layers, and the output layer.
 157
 Cont.
• 2. It is possible to think of the hidden layer as a
 "distillation layer," which extracts some of the
 most relevant patterns from the inputs and sends
 them on to the next layer for further analysis.
• It accelerates and improves the efficiency of the
 network by recognizing just the most important
 information from the inputs and discarding the
 redundant information.
 158
 Cont..
3. The activation function is important for two
reasons:
• It lets the model capture non-linear
 relationships between the inputs.
• It contributes to the conversion of the input into a
 more usable output.
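As an illustration of the two points above, here are three activation functions that are commonly used in practice. Which functions the course itself covers is not stated in the text, so the choice of sigmoid, tanh and ReLU is an assumption.

```python
import math


def sigmoid(z):
    """Squashes any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def tanh(z):
    """Squashes any real number into the range (-1, 1)."""
    return math.tanh(z)


def relu(z):
    """Passes positive values through and replaces negatives with 0."""
    return max(0.0, z)


# Apply each activation to the same weighted sums.
for z in (-2.0, 0.0, 3.0):
    print(f"z={z:+.1f}  sigmoid={sigmoid(z):.3f}  "
          f"tanh={tanh(z):+.3f}  relu={relu(z):.1f}")
```

None of these is a straight line, which is what allows stacked layers to represent non-linear relationships. The bounded outputs of sigmoid and tanh are also easier to interpret, for example as a probability-like score.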
 159
Activation function
 160
 Cont.
4. Finding the "optimal values of W (weights)"
that minimize prediction error is critical to building a
successful model. The "backpropagation algorithm"
does this: it turns the ANN into a learning
algorithm by letting it learn from its mistakes.
5. The optimization approach uses a "gradient
descent" technique to reduce prediction errors. To
find the optimum value for W, small adjustments in
W are tried, and the impact on prediction errors is
examined.
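The idea of trying small adjustments in W and examining the effect on the prediction error can be sketched with a one-weight model. Everything here is an illustrative assumption: the model is simply prediction = w * x, the error is the mean squared error, and the slope of the error is estimated by nudging w slightly up and down. Real networks compute the same slope analytically with backpropagation.

```python
def mse(w, data):
    """Mean squared prediction error of the model y = w * x."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)


def numeric_gradient(w, data, eps=1e-6):
    """Estimate how the error changes when w is adjusted slightly."""
    return (mse(w + eps, data) - mse(w - eps, data)) / (2 * eps)


def gradient_descent(data, w=0.0, learning_rate=0.05, steps=200):
    """Repeatedly move w a small step in the direction that lowers the error."""
    for _ in range(steps):
        w -= learning_rate * numeric_gradient(w, data)
    return w


# Hypothetical data generated by y = 3x, so the best weight is 3.
data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]
w = gradient_descent(data)
print("learned w:", round(w, 4))
print("remaining error:", round(mse(w, data), 8))
```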
 161
Different types of Neural Network
(i) ANN
• ANN stands for artificial neural network. It is a
 feed-forward neural network because the inputs are
 sent in the forward direction only. It can contain
 hidden layers, which make the model deeper.
 The inputs have a fixed length as specified by the
 programmer.
(ii) CNN
• Convolutional Neural Networks are mainly used for
 image data and computer vision. One real-life
 application is object detection in autonomous
 vehicles. A CNN combines convolutional layers with
 ordinary neuron layers. On image tasks it is usually
 more effective than a plain ANN or an RNN.
 162
 Cont.
(iii) RNN
• RNN stands for Recurrent Neural Network. It
 is used to process and interpret sequential data such
 as time series. In this type of model, the output from
 a processing node is fed back into nodes in the same
 or previous layers. The best-known type of RNN
 is the LSTM (Long Short-Term Memory) network.
 163
How Convolutional Neural Networks Work
• A Convolutional Neural Network, also known as
 CNN or ConvNet, is a class of neural
 networks that specializes in processing data that
 has a grid-like topology, such as an image. A digital
 image is a binary representation of visual data. It
 contains a series of pixels arranged in a grid-like
 fashion that contains pixel values to denote how
 bright and what color each pixel should be.
 164
 Cont.
• How does it work?
• Before we go to the working of CNNs, let’s cover
 the basics, such as what an image is and how it is
 represented. An RGB image is nothing but a matrix
 of pixel values with three planes, whereas a
 grayscale image is the same but has a single
 plane.
 165
Take a look at this image to understand more.
 166
 Kernel (filter, feature detector )
• A kernel is the filter that is used to extract features
 from the images.
• Output_size = input_size - K + 1 (for a K×K kernel
 with stride 1 and no padding)
 167
 Cont.
• For simplicity, let’s stick with grayscale images as
 we try to understand how CNNs work.
 168
 Cont.
• The above image shows what a convolution is. We
 take a filter/kernel(3×3 matrix) and apply it to the
 input image to get the convolved feature. This
 convolved feature is passed on to the next layer.
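A minimal sketch of the sliding-window operation described above, for a grayscale image stored as a list of rows. It assumes stride 1 and no padding, so a 5×5 input with a 3×3 kernel gives a 3×3 output (5 - 3 + 1 = 3). The pixel and kernel values are made up, and, as is usual in CNNs, the kernel is applied without flipping.

```python
def convolve2d(image, kernel):
    """Slide a square kernel over the image (stride 1, no padding)."""
    k = len(kernel)
    out_rows = len(image) - k + 1
    out_cols = len(image[0]) - k + 1
    output = [[0] * out_cols for _ in range(out_rows)]
    for r in range(out_rows):
        for c in range(out_cols):
            # Multiply the kernel with the patch under it and add everything up.
            output[r][c] = sum(
                image[r + i][c + j] * kernel[i][j]
                for i in range(k)
                for j in range(k)
            )
    return output


# Hypothetical 5x5 grayscale image and 3x3 kernel.
image = [
    [1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
]
kernel = [
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
]

feature = convolve2d(image, kernel)
for row in feature:
    print(row)
```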
 169
Cont.
 170
 What Is a Pooling Layer?
 171
Pooling technique in CNN
 172
 Cont. ..
• In Max Pooling, we take the maximum value from
 the portion of the image covered by the kernel.
 Max Pooling also acts as a noise suppressant: it
 discards the noisy activations altogether,
 performing de-noising along with
 dimensionality reduction.
• On the other hand, Average Pooling returns
 the average of all the values from the portion of the
 image covered by the kernel. Average Pooling only
 performs dimensionality reduction as its noise-
 suppressing mechanism. Hence, Max Pooling
 often performs better than Average Pooling.
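Both pooling variants can be sketched with one small function. The 4×4 feature map, the 2×2 window and the function name `pool2d` are assumptions for illustration; the window moves in non-overlapping steps equal to its size.

```python
def pool2d(feature_map, size=2, mode="max"):
    """Reduce each non-overlapping size x size window to a single value."""
    if mode == "max":
        reduce_window = max
    else:
        reduce_window = lambda values: sum(values) / len(values)

    rows, cols = len(feature_map), len(feature_map[0])
    pooled = []
    for r in range(0, rows - size + 1, size):
        pooled_row = []
        for c in range(0, cols - size + 1, size):
            window = [
                feature_map[r + i][c + j]
                for i in range(size)
                for j in range(size)
            ]
            pooled_row.append(reduce_window(window))
        pooled.append(pooled_row)
    return pooled


# Hypothetical 4x4 feature map.
feature_map = [
    [1, 3, 2, 1],
    [4, 6, 5, 0],
    [7, 2, 9, 8],
    [1, 0, 3, 4],
]

print("max pooling:    ", pool2d(feature_map, mode="max"))
print("average pooling:", pool2d(feature_map, mode="average"))
```

Either way the 4×4 map shrinks to 2×2. Max pooling keeps only the strongest activation in each window, whereas average pooling blends all four values.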
 173
Cont.
 174
 RNN
• Recurrent Neural Networks (RNNs) are a deep
 learning approach for modeling sequential data.
 RNNs were the standard choice for working with
 sequential data before the advent of attention
 models. A deep feed-forward model may require
 separate parameters for each element of the
 sequence, and it may be unable to
 generalize to variable-length sequences.
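A toy sketch of the recurrent idea: the same few parameters are reused at every position of the sequence, and a hidden state carries information forward, so sequences of any length can be processed. The scalar weights `w_x`, `w_h`, `b` and their values are arbitrary assumptions. A real RNN uses weight matrices and learns them from data.

```python
import math


def rnn_forward(sequence, w_x=0.5, w_h=0.8, b=0.0):
    """Run a one-unit recurrent cell over a sequence of numbers.

    The same three parameters are applied at every time step.
    """
    h = 0.0          # initial hidden state
    states = []
    for x in sequence:
        # New state depends on the current input and the previous state.
        h = math.tanh(w_x * x + w_h * h + b)
        states.append(h)
    return states


short_states = rnn_forward([1.0, 2.0, 3.0])
long_states = rnn_forward([0.5] * 7)

print("3-step sequence:", [round(h, 3) for h in short_states])
print("7-step sequence:", [round(h, 3) for h in long_states])
```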
 175
 What is a Recurrent Neural Network?
• Training a typical neural network involves the following
 steps:
1. Input an example from a dataset.
2. The network will take that example and apply some
 complex computations to it using randomly initialised
 variables (called weights and biases).
3. A predicted result will be produced.
4. Comparing that result to the expected value will give us
 an error.
5. Propagating the error back through the same path will
 adjust the variables.
6. Steps 1–5 are repeated until we are confident to say that
 our variables are well-defined.
7. A prediction is made by applying these variables to a
 new unseen input.
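The seven steps above can be sketched with the smallest possible network: a single neuron with two weights and a bias, trained to reproduce the logical AND of two inputs. The dataset, learning rate, number of epochs and random seed are all assumptions chosen for illustration. For a single neuron, "propagating the error back" reduces to the simple update shown in the loop.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, bias, x1, x2):
    """Step 3 and step 7: produce a prediction from the current variables."""
    return sigmoid(weights[0] * x1 + weights[1] * x2 + bias)


def train(dataset, epochs=5000, learning_rate=0.5, seed=0):
    rng = random.Random(seed)
    # Step 2: randomly initialised weights and bias.
    weights = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
    bias = rng.uniform(-1, 1)
    for _ in range(epochs):                      # Step 6: repeat
        for (x1, x2), target in dataset:         # Step 1: input an example
            output = predict(weights, bias, x1, x2)
            error = output - target              # Step 4: compare with target
            # Step 5: adjust each variable in proportion to its share of the error.
            weights[0] -= learning_rate * error * x1
            weights[1] -= learning_rate * error * x2
            bias -= learning_rate * error
    return weights, bias


# Hypothetical dataset: logical AND of two binary inputs.
and_gate = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train(and_gate)

for (x1, x2), target in and_gate:
    print((x1, x2), "target:", target,
          "prediction:", round(predict(weights, bias, x1, x2), 3))
```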
 176
Machine learning vs Deep learning
 177
 Thank You!
 178