Class Notes ML 1
UNIT-1
Introduction to Machine Learning
Machine learning (ML) is a subset of artificial intelligence (AI) that focuses on enabling
systems to learn from data and make decisions or predictions without being explicitly
programmed. Rather than following predetermined rules, ML algorithms allow computers to
recognize patterns, make inferences, and improve their performance over time through
experience.
Machine learning refers to the field of study that gives computers the ability to learn from
data, identify patterns, and make decisions with minimal human intervention. This involves
using algorithms to analyze and model data to make predictions or decisions.
The main types of machine learning are:
1. Supervised Learning:
o In supervised learning, the model is trained on labeled data (data that has both
input and corresponding output). The algorithm learns by example and
generalizes patterns to make predictions on new, unseen data.
o Example: Classifying emails as "spam" or "not spam."
2. Unsupervised Learning:
o In unsupervised learning, the model is given data without labels. The goal is to
find hidden patterns or structures within the data.
o Example: Customer segmentation, where a model groups customers based on
their purchasing behavior without predefined categories.
3. Reinforcement Learning:
o In reinforcement learning, an agent learns to make decisions by performing
actions in an environment to maximize a cumulative reward. The agent
receives feedback in the form of rewards or penalties based on its actions.
o Example: Training a robot to navigate through a maze.
4. Semi-supervised Learning (Optional category):
o This is a hybrid between supervised and unsupervised learning. It uses a small
amount of labeled data and a large amount of unlabeled data.
o Example: Image recognition where only a few images are labeled, but the
model can learn from a large amount of unlabeled images.
Commonly used machine learning algorithms include:
1. Linear Regression:
o Used for predicting continuous values, such as house prices based on features
like size, location, etc.
2. Decision Trees:
o A model that makes decisions based on answering questions (features) and
traversing a tree structure to reach a prediction.
3. Random Forest:
o An ensemble method that uses multiple decision trees to improve performance
and reduce overfitting.
4. K-Nearest Neighbors (KNN):
o A classification algorithm that makes predictions based on the majority class
of the nearest data points in the feature space.
5. Support Vector Machines (SVM):
o A powerful classifier that tries to find the optimal hyperplane separating
classes in a high-dimensional space.
6. Neural Networks:
o Inspired by the human brain, these networks consist of layers of
interconnected nodes (neurons) and are used in deep learning applications like
image and speech recognition.
7. K-means Clustering:
o A method for unsupervised learning where the algorithm groups data into k
clusters based on feature similarity.
The typical workflow of a machine learning project involves several key stages:
1. Data Collection: Gathering the relevant data for the problem you're solving.
2. Data Preprocessing: Cleaning and preparing the data for modeling, including
handling missing values, normalizing, and encoding categorical variables.
3. Model Selection: Choosing the appropriate machine learning algorithm.
4. Training the Model: Feeding the training data into the model to allow it to learn
from the data.
5. Evaluation: Assessing the model's performance using metrics like accuracy,
precision, recall, or mean squared error (for regression).
6. Hyperparameter Tuning: Adjusting the model parameters for optimal performance.
7. Deployment: Integrating the trained model into a real-world application.
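As an illustration only, the stages above can be sketched with scikit-learn. The dataset (Iris), the classifier (logistic regression), and the hyperparameter grid below are arbitrary choices made for this sketch, not part of the notes.
Python
# Minimal sketch of the ML workflow stages using scikit-learn (illustrative choices only)
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)                                   # 1. Data collection
X_train, X_test, y_train, y_test = train_test_split(                # 2. Data preprocessing: hold out a test set
    X, y, test_size=0.3, random_state=0)

pipe = Pipeline([("scale", StandardScaler()),                        # 2. Data preprocessing: scaling
                 ("clf", LogisticRegression(max_iter=1000))])        # 3. Model selection

grid = GridSearchCV(pipe, {"clf__C": [0.1, 1.0, 10.0]}, cv=5)        # 6. Hyperparameter tuning
grid.fit(X_train, y_train)                                           # 4. Training the model
print("Test accuracy:", accuracy_score(y_test, grid.predict(X_test)))  # 5. Evaluation
# 7. Deployment would persist the fitted model, e.g. joblib.dump(grid, "model.joblib")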
The evolution of machine learning (ML) can be traced through various stages, each
representing a significant leap in our understanding and capability to make machines
"intelligent." From its early theoretical foundations to the modern-day applications powered
by deep learning, machine learning has undergone a profound transformation. Below is an
overview of the key milestones in the evolution of machine learning:
The roots of machine learning can be traced back to the early days of computing and artificial
intelligence (AI):
Turing's Work (1936-1937): The British mathematician Alan Turing laid the
groundwork for theoretical computing with the concept of the Turing machine.
Neural Networks and Perceptrons (1950s): In the 1950s, early work on neural
networks began with the creation of the Perceptron by Frank Rosenblatt.
Rule-based AI: During the 1950s to 1970s, AI research was dominated by symbolic
approaches.
Early ML Algorithms: Researchers began exploring algorithms like decision trees and
clustering methods, though the field was still in its infancy.
Despite early successes, progress in AI and machine learning slowed significantly during this
period due to limited computational resources and overly optimistic expectations:
Challenges in Data and Computing: The limitations of computers at the time, both
in terms of memory and processing power, constrained the development of more
advanced ML algorithms. Additionally, AI and machine learning models struggled to
perform well in real-world, noisy data scenarios.
AI Winter: This term refers to a period of reduced funding and interest in AI research
during the late 1970s to early 1990s, as results from early ML models did not live up
to expectations.
The 2010s saw significant breakthroughs in machine learning, particularly in the area of deep
learning:
Big Data: The explosion of digital data from the web, sensors, and mobile devices
provided the large datasets needed to train more powerful models.
Rise of Deep Learning: Deep learning, driven by big data, GPU computing, and
improved training techniques, enabled multi-layer neural networks to outperform
earlier approaches:
o ImageNet Breakthrough (2012): The ImageNet competition marked a
pivotal moment when deep learning models, especially convolutional neural
networks (CNNs), drastically outperformed traditional machine learning
algorithms in image classification tasks. This achievement sparked widespread
interest in deep learning.
Natural Language Processing (NLP) and Transformer Models: In the field of
NLP, algorithms like Word2Vec and later transformers (such as BERT and GPT)
revolutionized language understanding and generation, allowing machines to achieve
human-level performance on tasks like translation, question answering, and text
generation.
Reinforcement Learning Advancements: Reinforcement learning, notably through
deep Q-learning (DeepMind's AlphaGo playing Go), reached new heights, solving
complex decision-making problems.
Machine learning (ML) can be broadly categorized into different paradigms based on how the
algorithms learn from data and the nature of the problem they aim to solve. Each paradigm
has distinct approaches, techniques, and applications. The most common paradigms are
supervised learning, unsupervised learning, reinforcement learning, and semi-
supervised learning. Below is a detailed explanation of each paradigm:
1. Supervised Learning
Definition: In supervised learning, the model is trained using labeled data. The algorithm
learns the relationship between input data (features) and their corresponding outputs (labels)
during training, and then generalizes this knowledge to make predictions on new, unseen
data.
How it Works:
o Training Data: Consists of input-output pairs where the output (label) is
known.
o Objective: The goal is to learn a mapping function f(x) that maps
inputs x to outputs y, so the model can predict the output for new, unseen
inputs.
Applications:
o Image classification (e.g., identifying cats and dogs in pictures).
o Email spam detection.
o Sentiment analysis (classifying reviews as positive or negative).
2. Unsupervised Learning
Definition: In unsupervised learning, the model is provided with data that has no labels. The
goal is to identify underlying patterns or structures in the data without predefined outputs.
How it Works:
o Training Data: Contains only input data, and the algorithm must find hidden
patterns or structures within this data.
o Objective: The model aims to explore the data and organize it into clusters,
dimensions, or other structures.
Applications:
o Customer segmentation (grouping customers based on purchasing behavior).
o Anomaly detection (detecting fraud or network intrusions).
o Topic modeling (grouping documents into topics based on word patterns).
3. Reinforcement Learning
Definition: In reinforcement learning, an agent learns to make decisions by interacting with an
environment, receiving rewards or penalties for its actions, and maximizing the cumulative
reward over time.
How it Works:
o Agent: The decision-making entity that interacts with the environment.
o Environment: The world in which the agent operates.
o State: The current situation or configuration of the environment.
o Action: The choices the agent can make in the environment.
o Reward: The feedback the agent receives after taking an action. The objective
is to maximize the total reward over time.
o Policy: A strategy the agent uses to determine actions based on states.
o Value Function: A function that estimates the expected return (reward) for
each state.
Types of Reinforcement:
1. Positive: Positive reinforcement occurs when an event, triggered by a
particular behavior, increases the strength and the frequency of that
behavior. In other words, it has a positive effect on behavior.
Advantages of positive reinforcement:
Maximizes performance.
Sustains change for a long period of time.
Disadvantage: too much reinforcement can lead to an overload of states, which
can diminish the results.
2. Negative: Negative reinforcement is defined as the strengthening of a behavior
because a negative condition is stopped or avoided.
Advantages of negative reinforcement:
Increases behavior.
Helps enforce a minimum standard of performance.
Disadvantage: it only provides enough to meet the minimum behavior.
Elements of Reinforcement Learning
i) Policy: Defines the agent’s behavior at a given time.
ii) Reward Function: Defines the goal of the RL problem by providing
feedback.
iii) Value Function: Estimates long-term rewards from a state.
iv) Model of the Environment: Helps in predicting future states and rewards
for planning.
Applications:
o Game playing (e.g., AlphaGo, chess, and video games).
o Robotics (e.g., teaching robots to walk or pick objects).
o Autonomous vehicles (e.g., self-driving cars making navigation decisions).
o Healthcare (e.g., personalized treatment recommendations).
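The agent-environment loop described above can be made concrete with tabular Q-learning. The tiny corridor environment, its reward of +1 at the goal, and the hyperparameters below are invented purely for illustration.
Python
import numpy as np

# Tabular Q-learning on a toy 5-cell corridor: start in cell 0, +1 reward for reaching cell 4.
n_states, n_actions = 5, 2              # actions: 0 = move left, 1 = move right
Q = np.zeros((n_states, n_actions))     # value estimates for each (state, action) pair
alpha, gamma, epsilon = 0.1, 0.9, 0.1   # learning rate, discount factor, exploration rate

rng = np.random.default_rng(0)
for episode in range(500):
    s = 0
    while s != n_states - 1:
        # epsilon-greedy policy derived from the current Q-table
        a = rng.integers(n_actions) if rng.random() < epsilon else int(np.argmax(Q[s]))
        s_next = max(0, s - 1) if a == 0 else min(n_states - 1, s + 1)
        r = 1.0 if s_next == n_states - 1 else 0.0
        # Q-learning update: move Q(s, a) toward r + gamma * max_a' Q(s', a')
        Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
        s = s_next

print("Greedy action per state:", np.argmax(Q, axis=1))  # should prefer moving right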
4. Semi-Supervised Learning
Definition: Semi-supervised learning is a paradigm that falls between supervised and
unsupervised learning. The model is trained with a small amount of labeled data and a large
amount of unlabeled data. The goal is to leverage the unlabeled data to improve the learning
process.
How it Works:
o Training Data: Consists of both labeled and unlabeled data.
o Objective: The algorithm attempts to use the small amount of labeled data to
make sense of the unlabeled data and make more accurate predictions or
classifications.
Key Algorithms:
o Semi-Supervised Support Vector Machines (S3VM): An extension of SVM
that works with both labeled and unlabeled data.
o Self-training: An iterative approach where the model initially trains on the
labeled data, and then iteratively labels the unlabeled data to improve
performance.
o Graph-based Methods: Use relationships (edges) between labeled and
unlabeled data points to propagate labels.
Applications:
o Image recognition: Labeled images may be scarce, but large unlabeled
datasets can help improve accuracy.
o Speech recognition: Labeled audio data may be limited, but using large
amounts of unlabeled audio can enhance training.
o Medical diagnostics: Labeling medical images can be time-consuming and
expensive, so semi-supervised techniques help utilize the available data
effectively.
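A minimal sketch of the self-training idea using scikit-learn's SelfTrainingClassifier. Masking most of the Iris labels with -1 (scikit-learn's marker for unlabeled points) is an invented setup used only to illustrate the paradigm.
Python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Pretend roughly 80% of the labels are unknown by replacing them with -1.
rng = np.random.default_rng(0)
y_partial = y.copy()
y_partial[rng.random(len(y)) < 0.8] = -1

# Self-training: the base classifier repeatedly labels its most confident unlabeled points.
model = SelfTrainingClassifier(SVC(probability=True, gamma="auto"))
model.fit(X, y_partial)
print("Accuracy against the full labels:", model.score(X, y))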
5. Self-Supervised Learning
Definition: Self-supervised learning is a technique where a model generates its own labels
from the data itself. It creates a pretext task (an auxiliary task) to learn useful representations
from unlabeled data, which can then be fine-tuned for specific downstream tasks.
How it Works:
o Pretext Task: A task that is designed so the model learns useful features from
unlabeled data. For instance, predicting missing parts of an image, filling in
missing words in a sentence, or predicting the next frame in a video.
o Transfer Learning: The representations learned from the pretext task are
transferred to solve other tasks that require supervised learning.
Key Algorithms:
o Contrastive Learning: A method where the model learns by contrasting
positive and negative samples (e.g., SimCLR, MoCo).
o Masked Language Models (MLMs): In NLP, models like BERT are pre-
trained on tasks like predicting masked words to understand language
representations.
Applications:
o Natural Language Processing: Pretraining language models (e.g., GPT,
BERT) using self-supervised learning.
o Computer Vision: Pretraining models for tasks like object detection using
unlabeled images.
o Robotics: Learning representations from raw sensory data without explicit
supervision.
3. Learning by Rote:
Learning by rote, also known as rote learning, is a method of learning that involves
memorization without understanding the underlying meaning or principles of the material. It
typically relies on repetition to commit information to memory.
Disadvantages of rote learning:
Limited Application: Knowledge may not transfer well to new situations or problem-
solving tasks.
Lack of Engagement: Can be monotonous and less engaging, leading to decreased
motivation.
Shallow Learning: Students may forget material quickly if it's not linked to
understanding.
4. Learning by Induction:
Learning by induction (inductive learning) infers general rules or patterns from specific
examples; it is the basis of most machine learning. Key aspects:
1. Training Data: The model is trained on labeled examples (in supervised learning) or
unlabeled examples (in unsupervised learning) to discover patterns.
2. Generalization: The goal is to enable the model to apply the rules or relationships it
has inferred from the training data to new, unseen instances.
3. Inductive Bias: Assumptions the model makes about the underlying data distribution,
guiding its learning process and influencing its generalization.
Induction in ML Paradigms:
1. Supervised Learning:
o Uses labeled datasets (input-output pairs).
o Example: Inferring the relationship y = f(x), where x is the
input and y is the output.
o Models: Decision trees, neural networks, support vector machines.
2. Unsupervised Learning:
o Identifies patterns or structures in unlabeled data.
o Example: Clustering data points based on similarities.
o Models: K-means, hierarchical clustering, PCA.
3. Reinforcement Learning:
o Learns from rewards or penalties by interacting with an environment.
o Example: Inferring an optimal strategy in a game.
Advantages of inductive learning:
Flexibility: Works well across various domains with diverse types of data.
Scalability: Can handle large datasets with appropriate algorithms.
Automation: Reduces the need for manual rule-setting by discovering patterns from
data.
Example: A spam filter trained on a set of labeled emails induces a general rule (e.g.,
messages containing certain words are likely spam) and applies it to new, unseen emails.
5. Reinforcement Learning:
Reinforcement learning (the agent, environment, state, action, reward, policy, and value
function; positive and negative reinforcement; and its applications) is described in detail in
the reinforcement learning paradigm section earlier in this unit.
6. Types of Data
Data in machine learning are broadly categorized into two types − numerical
(quantitative) and categorical (qualitative) data. The numerical data can be
measured, counted or given a numerical value, for example, age, height,
income, etc. The categorical data is non-numeric data that can be arranged
in categories with or without meaningful order, for example, gender, blood
group, etc.
The numerical (quantitative) data is data that can be measured, counted or given a numerical
value. The examples of numerical data are age, height, income, number of students in class,
number of books in a shelf, shoe size, etc.
The numerical data can be categorized into the following two types −
Discrete Data
Continuous Data
1. Discrete Data
The discrete data is numerical data that is countable, finite, and can only take certain values,
usually whole numbers. Examples of discrete data are number of students in class, number of
books in a shelf, shoe size, number of ducks in a pond, etc.
2. Continuous Data
The continuous data is numerical data that can take any value within a specified range
including fractions and decimals. Examples of continuous data are age, height, weight,
income, time, temperature, etc.
What is true zero?
True zero represents the absence of the quantity being measured. For example, height,
weight, age, and temperature in Kelvin are examples of data with a true zero: a height of
0 cm represents the complete absence of height, and 0 K represents the complete absence of
heat. In contrast, temperature in Celsius (or Fahrenheit) is an example of data with a false zero.
We can categorize the numerical data into the following two types on the basis of true zero −
Interval data − quantitative data with equal intervals between data points but no true zero.
Examples are temperature (Fahrenheit), temperature (Celsius), pH, SAT score (200-800),
credit score (300-850), etc.
Ratio data − same as interval data but with a true zero. Examples are weight in kg, number of
students, income, speed, etc.
The categorical data can be divided into the following two types −
Nominal Data
Ordinal Data
1. Nominal Data
The nominal data is categorical data that can not be arranged in an order or rank. The
examples of nominal data are gender, blood group, hair color, nationality, etc.
2. Ordinal Data
The ordinal data is categorical data that can be ordered or ranked with a specific attribute. The
examples of ordinal data are the school grades, level of education, range of
income, ratings, etc.
7 Matching:
a. Exact Matching
Finds entities that are identical on the chosen attributes.
Example: Matching two records that share the same email address or ID.
b. Approximate Matching
Finds entities that are not exactly the same but are similar based on some criteria.
Example: Matching names with slight spelling differences (e.g., "John" and "Jon").
2. Matching Techniques
a. String Matching
b. Feature-Based Matching
c. Entity Matching
d. Image Matching
e. Graph-Based Matching
3. Applications of Matching in ML
a. Recommendation Systems
Matching users to items or items to items based on preferences or features.
b. Data Deduplication
c. Entity Resolution
d. Fraud Detection
4. Challenges in Matching
(OR)
Machine Learning (ML) involves several key stages, from problem formulation to deploying
a model. Here's a breakdown of the typical stages in an ML workflow:
1. Problem Definition
Objective: Clearly define the problem you want to solve and the desired outcomes.
Key Steps:
o Identify the type of ML task (e.g., classification, regression, clustering,
recommendation).
o Define success metrics (e.g., accuracy, F1-score, mean squared error).
2. Data Collection
Gather data relevant to the problem. Data collection is the process of collecting and
evaluating information or data from multiple sources to find answers to research
problems, answer questions, evaluate outcomes, and forecast trends and
probabilities. It is an essential phase in all types of research, analysis, and decision-
making, including that done in the social sciences, business, and healthcare.
Key Steps:
o Identify data sources (e.g., databases, APIs, sensors).
o Scrape or retrieve data from multiple channels.
o Ensure sufficient quantity and variety of data.
3. Data Preprocessing
Objective: Clean and prepare data for analysis. Data preprocessing is a process of
preparing the raw data and making it suitable for a machine learning model. It is the first
and crucial step while creating a machine learning model.
When creating a machine learning project, it is not always a case that we come across
the clean and formatted data. And while doing any operation with data, it is mandatory to
clean it and put in a formatted way. So for this, we use data preprocessing task.
Key Steps:
o Handle missing values (e.g., imputation, deletion).
o Remove duplicates and outliers.
o Normalize or standardize features.
o Convert data types (e.g., categorical to numerical encoding).
o Split data into training, validation, and testing sets.
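A small pandas/scikit-learn sketch of the preprocessing steps listed above. The toy DataFrame, its column names, and the mean-imputation choice are invented for illustration.
Python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy data with a missing value, a duplicate row, and a categorical column.
df = pd.DataFrame({
    "age":    [25, 32, None, 40, 40],
    "city":   ["Delhi", "Pune", "Delhi", "Pune", "Pune"],
    "target": [0, 1, 0, 1, 1],
})

df = df.drop_duplicates()                        # remove duplicates
df["age"] = df["age"].fillna(df["age"].mean())   # handle missing values by mean imputation
df = pd.get_dummies(df, columns=["city"])        # encode the categorical variable

X, y = df.drop(columns="target"), df["target"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

scaler = StandardScaler().fit(X_train)           # fit the scaler on training data only
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)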
5. Feature Engineering
Objective: Create and select features that enhance model performance.
Feature engineering is the process of transforming raw data into features that
are suitable for machine learning models. In other words, it is the
process of selecting, extracting, and transforming the most relevant
features from the available data to build more accurate and efficient
machine learning models.
Key Steps:
o Generate new features from existing data (e.g., date-time features).
o Encode categorical variables (e.g., one-hot encoding).
o Select relevant features using techniques like correlation analysis or PCA.
o Reduce dimensionality if needed.
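A short pandas sketch of the feature engineering steps above: creating new date-time features and one-hot encoding a categorical column. The order data is invented for illustration.
Python
import pandas as pd

orders = pd.DataFrame({
    "order_time": pd.to_datetime(["2024-01-05 09:30", "2024-01-06 18:45", "2024-01-07 12:10"]),
    "category":   ["books", "electronics", "books"],
    "amount":     [250.0, 1200.0, 300.0],
})

# Generate new features from an existing date-time column.
orders["hour"] = orders["order_time"].dt.hour
orders["day_of_week"] = orders["order_time"].dt.dayofweek

# One-hot encode the categorical variable.
orders = pd.get_dummies(orders, columns=["category"])
print(orders.head())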
6. Model Selection
Objective: Choose the algorithm best suited to the problem, the data, and the available
resources (model selection is discussed in detail later in this unit).
7. Model Training
Train the selected model on the training dataset. A training model is a dataset that is used to
train an ML algorithm. It consists of the sample output data and the corresponding sets of
input data that have an influence on the output. The training model is used to run the input
data through the algorithm to correlate the processed output against the sample output. The
result from this correlation is used to modify the model.
This iterative process is called “model fitting”. The accuracy of the training dataset or the
validation dataset is critical for the precision of the model.
Model training in machine learning is the process of feeding an ML algorithm with data to
help it identify and learn good values for all attributes involved. There are several types of
machine learning models, of which the most common ones are supervised and unsupervised
learning.
Key Steps:
o Optimize model parameters using training data.
o Use libraries like scikit-learn, TensorFlow, or PyTorch.
8. Model Evaluation
Model evaluation is the process that uses some metrics which help us to
analyze the performance of the model. As we all know that model
development is a multi-step process and a check should be kept on how
well the model generalizes future predictions. Therefore evaluating a
model plays a vital role so that we can judge the performance of our
model. The evaluation also helps to analyze a model’s key weaknesses.
There are many metrics like Accuracy, Precision, Recall, F1 score, Area
under Curve, Confusion Matrix, and Mean Square Error. Cross
Validation is one technique that is followed during the training phase
and it is a model evaluation technique as well.
Cross Validation and Holdout
Cross Validation is a method in which we do not use the whole dataset
for training. In this technique, some part of the dataset is reserved for
testing the model. There are many types of Cross-Validation out of
which K Fold Cross Validation is mostly used. In K Fold Cross
Validation the original dataset is divided into k subsets. The subsets are
known as folds. This is repeated k times where 1 fold is used for testing
purposes. Rest k-1 folds are used for training the model. So each data
point acts as a test subject for the model as well as acts as the training
subject. It is seen that this technique generalizes the model well and
reduces the error rate
Key Steps:
o Use validation and test datasets.
o Evaluate using appropriate metrics (e.g., accuracy, precision, recall, RMSE).
o Perform cross-validation for robustness.
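The K-fold procedure described above can be run directly with scikit-learn's cross_val_score; the decision tree and the Iris data are arbitrary choices for this sketch.
Python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, KFold
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
model = DecisionTreeClassifier(random_state=0)

# 5-fold cross-validation: each fold is used once for testing and k-1 times for training.
scores = cross_val_score(model, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0))
print("Fold accuracies:", scores)
print("Mean accuracy:", scores.mean())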
9. Hyperparameter Tuning
Objective: Adjust model hyperparameters for optimal performance, using techniques such as
grid search or random search.
10. Model Monitoring and Maintenance
Objective: Ensure the model remains accurate and relevant over time.
Key Steps:
o Monitor for data drift or concept drift.
o Retrain the model with updated data as needed.
o Log errors and feedback for continuous improvement.
12. Model Improvement and Iteration
9. Data Acquisition:
Data acquisition refers to the process of gathering data from various sources, such as sensors,
databases, or web scraping. The process of collecting and storing data for
machine learning from a variety of sources is known as data acquisition
(DAQ).
The procedure entails gathering, examining, and using crucial data to guarantee
precise measurements, instantaneous observation, and knowledgeable decision-
making. Sensors, measuring devices, and a computer work together in DAQ
systems to transform physical parameters into electrical signals, condition and
amplify those signals, and then store them for analysis.
Feature Engineering is a crucial step in the machine learning pipeline. It involves creating,
transforming, or selecting features (variables) from raw data to improve the performance and
accuracy of a machine learning model. The goal is to provide the model with the most
relevant and informative features.
a. Feature Selection
Selecting the most important features and removing irrelevant or redundant ones.
Techniques:
o Filter Methods: Correlation, mutual information, chi-square.
o Wrapper Methods: Recursive Feature Elimination (RFE).
o Embedded Methods: Lasso regularization, decision tree feature importance.
b. Feature Transformation
Transforming raw features into a format that makes them more usable.
Techniques:
o Scaling: Normalize or standardize features (e.g., Min-Max Scaling, Z-Score
Normalization).
o Encoding: Convert categorical variables into numerical (e.g., One-Hot Encoding,
Label Encoding).
o Log Transformation: Reduces skewness in data.
o Polynomial Features: Adds interaction terms or higher-order terms for non-linear
relationships.
c. Feature Creation
d. Feature Encoding
e. Dimensionality Reduction
Python Libraries:
o scikit-learn: Feature selection and preprocessing tools.
o pandas: Data manipulation and creation of new features.
o NumPy: Mathematical operations for feature transformations.
o Featuretools: Automated feature engineering.
R Libraries: dplyr, caret.
a. Numerical Features
b. Categorical Features
One-Hot Encoding: Convert "City" with values [New York, London, Paris] into binary
columns.
Frequency Encoding: Replace categories with their occurrence frequencies.
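A tiny pandas sketch of the two encodings above, using an invented "City" column.
Python
import pandas as pd

df = pd.DataFrame({"City": ["New York", "London", "Paris", "London", "London"]})

# One-hot encoding: one binary column per city.
one_hot = pd.get_dummies(df["City"], prefix="City")

# Frequency encoding: replace each category with how often it occurs.
df["City_freq"] = df["City"].map(df["City"].value_counts())

print(pd.concat([df, one_hot], axis=1))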
d. Text Data
Definition: Data representation refers to how data is structured and stored so that
machine learning models can interpret it effectively.
Types:
o Numerical Representation: Using numbers to represent data, like integers or
floating-point numbers.
o Categorical Representation: Using categories or labels, often transformed
into numerical form using techniques like one-hot encoding.
o Vector Representation: Representing data in vectorized form, often used in
natural language processing or image processing.
Data Representation in Machine Learning (ML) refers to how raw data is transformed,
structured, and encoded into a format that machine learning algorithms can process. Choosing
the right representation of data is critical to building effective models.
a. Numerical Representation
b. Categorical Representation
c. Text Representation
d. Time-Series Representation
f. Graph Representation
g. Audio Representation
Model Selection in Machine Learning (ML) is the process of choosing the most
appropriate algorithm or model for a given problem. The goal is to identify the
model that best balances bias and variance to make accurate predictions. The
choice of model depends on various factors, including the type of data, the
problem at hand, the computational resources, and the performance
requirements.
Problem formulation: Clearly express the issue at hand, including the kind of
predictions or task that you'd like the model to carry out (for example,
classification, regression, or clustering).
Candidate model selection: Pick a group of models that are appropriate for
the issue at hand. These models can include straightforward methods like
decision trees or linear regression as well as more sophisticated ones like deep
neural networks, random forests, or support vector machines.
Performance evaluation: Establish measures for measuring how well each
model performs. Common measurements include area under the receiver's
operating characteristic curve (AUC-ROC), recall, F1-score, mean squared
error, and accuracy, precision, and recall. The type of problem and the
particular requirements will determine which metrics are used.
Training and evaluation: Each candidate model should be trained using a
subset of the available data (the training set), and its performance should be
assessed using a different subset (the validation set or via cross-validation). The
established evaluation measures are used to gauge the model's effectiveness.
Model comparison: Evaluate the performance of various models and
determine which one performs best on the validation set. Take into account
elements like data handling capabilities, interpretability, computational
difficulty, and accuracy.
Hyperparameter tuning: Before training, many models require that certain
hyperparameters, such as the learning rate, regularisation strength, or the
number of layers that are hidden in a neural network, be configured. Use
methods like grid search, random search, and Bayesian optimization to identify
these hyperparameters' ideal values.
Final model selection: After the models have been analyzed and fine-tuned,
pick the model that performs the best. Then, this model can be used to make
predictions based on fresh, unforeseen data.
3. Model Evaluation
a. Classification Metrics
b. Regression Metrics
a. Understand the Problem: Identify whether the task is classification, regression,
or clustering, and which evaluation metric matters most.
b. Consider Data Type and Size: Select models based on data type (e.g.,
numerical, categorical) and volume (small vs. large datasets).
c. Evaluate Simpler Models First: Start with simpler models (e.g., linear
regression, decision trees) to set a baseline.
d. Experiment with Complex Models: Try more sophisticated models (e.g.,
neural networks, gradient boosting) if simpler models do not perform well.
e. Hyperparameter Tuning: Use techniques like Grid Search, Random Search,
or Bayesian Optimization to fine-tune hyperparameters.
f. Cross-Validation: Use k-fold cross-validation to avoid overfitting and get a
more reliable estimate of model performance.
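A minimal grid-search sketch for the hyperparameter tuning and cross-validation steps above; the random forest, its parameter grid, and the Iris data are arbitrary choices.
Python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# Search a small, illustrative hyperparameter grid with 5-fold cross-validation.
param_grid = {"n_estimators": [50, 100], "max_depth": [None, 3, 5]}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Best cross-validated accuracy:", search.best_score_)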
Definition: Model learning refers to the process of training a machine learning model
by feeding it data so that it can learn patterns or relationships in the data.
Model learning in machine learning (ML) refers to the process by which a model is
trained to make predictions or take actions based on data. Here's a breakdown of the
model learning process:
Types of Model Learning
1. Supervised Learning: The model learns from labeled data, where the correct output is
provided for each input.
2. Unsupervised Learning: The model learns from unlabeled data, identifying patterns and
relationships on its own.
3. Reinforcement Learning: The model learns through trial and error, receiving rewards or
penalties for its actions.
4. Model Evaluation: Assessing the model's performance using metrics such as accuracy,
precision, and recall.
1. Overfitting: When a model is too complex and performs well on training data but poorly
on new data.
2. Underfitting: When a model is too simple and fails to capture the underlying patterns in the
data.
5. Early Stopping: Stopping the training process when the model's performance on the
validation set starts to degrade.
Popular Machine Learning Algorithms
4. Support Vector Machines (SVMs): A linear or non-linear model for classification and
regression tasks.
5. Neural Networks: A complex model inspired by the human brain, used for a wide range of
tasks.
Accuracy = (TP+TN)/(TP+TN+FP+FN)
Output:
Accuracy: 0.9333333333333333
Output:
Precision: 0.9435897435897436
Recall: 0.9333333333333333
F1 score
The F1 score is the harmonic mean of precision and recall. It is seen that
during the precision-recall trade-off if we increase the precision, recall
decreases and vice versa. The goal of the F1 score is to combine
precision and recall.
F1 score = (2×Precision×Recall)/(Precision+Recall)
# calculating f1 score
print("F1 score:", metrics.f1_score(y_test, y_pred, average="weighted"))
Output:
F1 score: 0.9327777777777778
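The metric values printed above were presumably produced with scikit-learn's metrics module on a held-out test set. A minimal sketch of such a computation is shown below; the KNN classifier and the Iris split are assumptions, so the exact numbers may differ. The metrics, y_test and y_pred objects defined here are also what the confusion-matrix snippet below expects.
Python
from sklearn import metrics
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# Fit a classifier and predict on the held-out test set.
y_pred = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train).predict(X_test)

print("Accuracy:", metrics.accuracy_score(y_test, y_pred))
print("Precision:", metrics.precision_score(y_test, y_pred, average="weighted"))
print("Recall:", metrics.recall_score(y_test, y_pred, average="weighted"))
print("F1 score:", metrics.f1_score(y_test, y_pred, average="weighted"))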
Confusion Matrix
A confusion matrix is an N x N matrix where N is the number of target
classes. It represents the number of actual outputs and the predicted
outputs. Some terminologies in the matrix are as follows:
True Positives: It is also known as TP. It is the output in which the
actual and the predicted values are YES.
True Negatives: It is also known as TN. It is the output in which the
actual and the predicted values are NO.
False Positives: It is also known as FP. It is the output in which the
actual value is NO but the predicted value is YES.
False Negatives: It is also known as FN. It is the output in which the
actual value is YES but the predicted value is NO.
Python
# Assumes metrics, y_test and y_pred are already defined (e.g., as in the sketch above),
# plus matplotlib for plotting.
import matplotlib.pyplot as plt

confusion_matrix = metrics.confusion_matrix(y_test, y_pred)
cm_display = metrics.ConfusionMatrixDisplay(
    confusion_matrix=confusion_matrix,
    display_labels=[0, 1, 2])
cm_display.plot()
plt.show()
Output: a 3×3 confusion-matrix plot for the three classes.
Definition: Model prediction refers to the process of using a trained model to make
predictions on new, unseen data.
Model prediction in machine learning is the process of using a trained model to make
predictions or forecasts on new, unseen data. Here's a breakdown of
the model prediction process:
1. Classification Prediction: The model predicts a categorical label or class for a new input.
2. Regression Prediction: The model predicts a continuous value or outcome for a new
input.
2. Model Input: Feed the preprocessed input data into the trained model.
3. Model Prediction: The model generates a prediction or forecast based on the input data.
2. Mean Squared Error (MSE): The average squared difference between predicted and
actual values for regression tasks.
3. Mean Absolute Error (MAE): The average absolute difference between predicted and
actual values for regression tasks.
4. Precision: The proportion of true positives among all positive predictions for classification
tasks.
5. Recall: The proportion of true positives among all actual positive instances for
classification tasks.
2. Early Stopping: Stopping the training process when the model's performance on the
validation set starts to degrade.
Search: Involves exploring a problem space to find the best solution. It can be applied
in algorithms like search trees or optimization tasks.
o Types:
Depth-first Search (DFS): Explores as far as possible down one branch before
backtracking.
Breadth-first Search (BFS): Explores all nodes at the present depth level
before moving on to nodes at the next depth level.
Search in machine learning refers to the process of finding the optimal solution
or configuration that maximizes the performance of a machine learning model.
Here are some key aspects of search in machine learning:
Types of Search
2. Model Search: Selecting the best machine learning model from a set of candidate models,
such as decision trees, random forests, or neural networks.
3. Feature Search: Identifying the most relevant features or variables that contribute to the
performance of a machine learning model.
Search Algorithms
Search Strategies
2. Heuristic Search: Using heuristics or rules of thumb to guide the search process.
Learning: Machine learning algorithms improve through experience, refining their models
and predictions over time.
A Dataset is a set of data grouped into a collection with which developers can
work to meet their goals. In a dataset, the rows represent the number of data
points and the columns represent the features of the Dataset. They are mostly
used in fields like machine learning, business, and government to gain insights,
make informed decisions, or train algorithms. Datasets may vary in size and
complexity and they mostly require cleaning and preprocessing to ensure data
quality and suitability for analysis or modeling.
Types of Datasets
There are various types of datasets available out there. They are:
Numerical Dataset: They include numerical data points that can be
solved with equations. These include temperature, humidity, marks and
so on.
Categorical Dataset: These include categories such as colour,
gender, occupation, games, sports and so on.
Web Dataset: These include datasets created by calling APIs using
HTTP requests and populating them with values for data analysis.
These are mostly stored in JSON (JavaScript Object Notation) formats.
Time series Dataset: These include datasets between a period, for
example, changes in geographical terrain over time.
Image Dataset: It includes a dataset consisting of images. This is
mostly used to differentiate the types of diseases, heart conditions and
so on.
Ordered Dataset: These datasets contain data that are ordered in
ranks, for example, customer reviews, movie ratings and so on.
Partitioned Dataset: These datasets have data points segregated into
different members or different partitions.
File-Based Datasets: These datasets are stored in files, in Excel
as .csv, or .xlsx files.
Bivariate Dataset: In this dataset, 2 classes or features are directly
correlated to each other. For example, height and weight in a dataset
are directly related to each other.
Multivariate Dataset: In these types of datasets, as the name
suggests, more than two features are correlated with each other. For
example, attendance and assignment grades are directly correlated to
a student’s overall grade.
UNIT-2
Nearest Neighbor-Based Models in Machine Learning
Nearest neighbor-based models, especially the k-nearest neighbors (k-NN) algorithm, are a
fundamental class of machine learning techniques that rely on the idea of proximity or
similarity between data points for making predictions. These models are used in both
classification and regression tasks and are based on the idea that similar points are likely to
have similar labels or values. Below is a detailed breakdown of the key concepts related to
nearest neighbor-based models.
Nearest Neighbor methods rely on the idea that similar instances have similar outputs.
Predictions are based on the proximity of a query point to data points in the training set.
Common distance metrics:
o Euclidean Distance: d(x, x') = \sqrt{\sum_{i=1}^n (x_i - x'_i)^2}
o Manhattan Distance: d(x, x') = \sum_{i=1}^n |x_i - x'_i|
o Cosine Similarity: Measures the cosine of the angle between two vectors.
K-NN is a simple, instance-based algorithm that classifies a data point based on the majority
label of its k-nearest neighbors.
Algorithm:
1. Choose a value for k (the number of neighbors).
2. Compute the distances between the query point and all points in the training set.
3. Identify the k closest points.
4. Assign the class label most common among these k points.
Key Considerations:
1. A larger k smoothens the decision boundary (reduces variance but may increase
bias).
2. A small k may lead to overfitting.
Predicts the value of a query point as the average (or weighted average) of the values of its
k-nearest neighbors:
\hat{y} = \frac{1}{k} \sum_{i=1}^k y_i
o Weighted K-NN: Weights closer points more heavily:
\hat{y} = \frac{\sum_{i=1}^k w_i y_i}{\sum_{i=1}^k w_i}, \quad w_i = \frac{1}{d(x, x_i)}
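A short scikit-learn sketch of k-NN classification as described above; the Iris dataset and k = 5 are arbitrary choices.
Python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# k controls the bias-variance trade-off noted above; 5 is an arbitrary starting point.
knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean")
knn.fit(X_train, y_train)
print("Test accuracy:", knn.score(X_test, y_test))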
4. Distance Metrics
Strengths:
Simple to implement, requires no explicit training phase, and naturally handles multi-class
problems.
Weaknesses:
High Computation Cost: Requires computing distances to all training points at inference.
Memory Intensive: Stores the entire training set.
Curse of Dimensionality: Performance deteriorates in high-dimensional spaces.
Sensitive to Noise: Outliers can affect predictions.
7. Improving K-NN
Scaling Features: Normalize or standardize features so that no single feature dominates the
distance computation.
Dimensionality Reduction: Techniques like PCA (Principal Component Analysis) reduce noise
and improve performance.
Proximity measures refer to techniques that quantify the similarity or closeness between two
data points. The idea is that if two data points are close to each other in the feature space,
they are more likely to belong to the same class (in classification) or have similar values (in
regression). These proximity measures are essential for nearest neighbor algorithms.
Distance Measures
Distance measures are specific types of proximity measures that calculate the physical
distance between data points in a multi-dimensional feature space. The choice of distance
measure can have a significant impact on the performance of the nearest neighbor algorithm.
1. Euclidean Distance:
o Formula: d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}
o The most common and widely used distance measure.
o It calculates the straight-line distance between two points in Euclidean space.
2. Manhattan Distance (also known as L1 distance or Taxicab Distance):
o Formula: d(\mathbf{x}, \mathbf{y}) = \sum_{i=1}^{n} |x_i - y_i|
o The sum of the absolute differences between corresponding components.
3. Minkowski Distance:
o A generalization of Euclidean and Manhattan distance.
o Formula: d(\mathbf{x}, \mathbf{y}) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}
o When p = 2, it is the Euclidean distance; when p = 1, it is the
Manhattan distance.
4. Cosine Similarity:
o A measure of similarity between two non-zero vectors of an inner product
space.
o Formula: \text{cosine similarity} = \frac{\mathbf{x} \cdot \mathbf{y}}{\|\mathbf{x}\| \, \|\mathbf{y}\|}
o It measures the angle between the vectors, often used in text analysis or when
comparing documents.
5. Hamming Distance:
o Used for categorical data, particularly binary data.
o It counts the number of positions at which the corresponding elements of two
strings are different.
For binary patterns (data with values of 0 or 1), the distance or similarity can be computed
using measures that specifically handle binary data:
Hamming Distance: The number of differing bits between two binary patterns.
Jaccard Similarity: Works well for comparing binary sets, focusing on the
intersection over the union of two sets.
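The distance and similarity measures above can be computed with SciPy; the two example vectors and binary patterns are invented for illustration.
Python
import numpy as np
from scipy.spatial import distance

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 1.0])

print("Euclidean:", distance.euclidean(x, y))            # sqrt of the sum of squared differences
print("Manhattan:", distance.cityblock(x, y))            # sum of absolute differences
print("Minkowski (p=3):", distance.minkowski(x, y, p=3))
print("Cosine similarity:", 1 - distance.cosine(x, y))   # SciPy returns cosine *distance*
print("Hamming:", distance.hamming([1, 0, 1, 1], [1, 1, 1, 0]))  # fraction of differing positions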
The k-Nearest Neighbors (k-NN) algorithm is one of the most straightforward and widely
used classification algorithms. It classifies new data points based on the majority class among
its k-nearest neighbors.
Algorithm Steps
Input:
Procedure:
1. Compute Distances:
o Calculate the distance d(x_q, x_i) between the query point x_q
and each data point x_i in the training set.
2. Sort Neighbors:
o Sort the training points by their distance to x_q.
3. Select Nearest Neighbors:
o Select the k-nearest neighbors (smallest distances).
4. Majority Vote:
o Determine the majority class label among the k-nearest neighbors.
5. Output:
o Assign the query point x_q to the majority class.
3. Distance Metrics
4. Choosing k
Small k:
o Captures local patterns but may overfit (sensitive to noise).
Large k:
o Smoothens predictions but may underfit (ignores local structures).
The optimal k is usually determined via cross-validation.
Example
Dataset:
Training data:
D = {(1, A), (2, A), (4, B), (5, B), (6, A)}
Query point: x_q = 3, with k = 3.
Steps:
1. Compute Distances:
o d(1, 3) = 2, d(2, 3) = 1, d(4, 3) = 1, d(5, 3) = 2, d(6, 3) = 3.
2. Sort Neighbors:
o Sorted by distance: (2, A), (4, B), (1, A), (5, B), (6, A).
3. Select the k = 3 Nearest Neighbors:
o Neighbors: (2, A), (4, B), (1, A).
4. Majority Vote:
o Class labels: A, B, A.
o Majority class: A.
5. Prediction:
o Assign x_q = 3 to class A.
The Radius Nearest Neighbor (RNN) algorithm is an extension of the k-NN approach but
with a focus on the radius rather than the number of neighbors.
Algorithm Steps
Input:
Procedure:
1. Compute Distances:
o Calculate the distance d(x_i, x_q) between the query point x_q
and each data point x_i in the training set.
2. Identify Neighbors:
o Select all points x_i such that d(x_i, x_q) ≤ r.
3. Prediction:
o Classification:
Use a majority vote among the labels of the selected neighbors.
If no neighbors are found within r, a default class can be assigned, or a
larger radius can be used.
o Regression:
Compute the mean (or weighted mean) of the target values y_i of the
selected neighbors.
Output:
3. Pseudocode
Algorithm RNN:
Input: Training data D = {(x1, y1), (x2, y2), ..., (xn, yn)}, query point x_q, radius r, distance metric d.
1. For each (x_i, y_i) in D: compute dist[i] = d(x_i, x_q)
2. N = { (x_i, y_i) : dist[i] <= r }
3. If N is empty: return a default class (or repeat with a larger radius)
4. Classification: return the majority label in N
   Regression: return the mean (or weighted mean) of the y_i in N
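A minimal scikit-learn sketch of the radius-based approach using RadiusNeighborsClassifier; the toy 1-D training points, the radius, and the outlier label are invented for illustration.
Python
import numpy as np
from sklearn.neighbors import RadiusNeighborsClassifier

# Toy 1-D training data in the style of the k-NN example above.
X_train = np.array([[1.0], [2.0], [4.0], [5.0], [6.0]])
y_train = np.array(["A", "A", "B", "B", "A"])

# All neighbors within radius r = 2.0 vote; outlier_label is the default class used when
# no neighbor falls inside the radius.
rnn = RadiusNeighborsClassifier(radius=2.0, outlier_label="A")
rnn.fit(X_train, y_train)
print(rnn.predict([[5.5]]))  # neighbors 4.0 (B), 5.0 (B), 6.0 (A) vote, so class B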
KNN Regression
In k-Nearest Neighbors Regression (k-NN regression), the output is continuous rather than
categorical. The prediction is made based on the average (or weighted average) of the target
values of the k-nearest neighbors.
Procedure:
1. Compute Distances:
o Calculate the distance d(x_q, x_i) between the query point x_q
and each training point x_i.
2. Find the k-Nearest Neighbors:
o Identify the k-nearest neighbors by selecting the k points with the smallest
distances to x_q.
3. Aggregate Neighbor Target Values:
o Simple Average:
Predict the target value as the mean of the target values of the k-nearest
neighbors: \hat{y}_q = \frac{1}{k} \sum_{i=1}^k y_i
o Weighted Average:
Assign weights inversely proportional to the distances:
\hat{y}_q = \frac{\sum_{i=1}^k w_i y_i}{\sum_{i=1}^k w_i}, \quad w_i = \frac{1}{d(x_q, x_i) + \epsilon}
Here, \epsilon is a small constant to prevent division by zero.
Output:
Pseudocode
Algorithm KNN-Regression:
Input: Training data D = {(x1, y1), ..., (xn, yn)}, query point x_q, number of neighbors k, distance metric d.
1. Compute distances:
   For each (x_i, y_i) in D:
       dist[i] = d(x_i, x_q)
2. Sort the points by dist and select the k nearest neighbors.
3. Return the mean (or distance-weighted mean) of the target values y_i of these k neighbors.
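A short scikit-learn sketch of k-NN regression with distance weighting, matching the weighted-average formula above; the toy 1-D data is invented.
Python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Toy regression data: y roughly follows 2 * x.
X_train = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y_train = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# weights="distance" gives closer neighbors a larger influence on the average.
knn_reg = KNeighborsRegressor(n_neighbors=3, weights="distance")
knn_reg.fit(X_train, y_train)
print(knn_reg.predict([[2.5]]))  # a weighted mean of the 3 nearest target values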
1. Choice of k: Too small a k can lead to noisy predictions (overfitting), while too
large a k can make the model too smooth (underfitting).
2. Distance Measure: The choice of distance measure can heavily affect the accuracy of
the classifier, especially when features are on different scales.
3. Dimensionality: As the number of features (dimensions) increases, the model can
struggle due to the curse of dimensionality.
Evaluating the performance of regression algorithms is essential to understand how well the model
predicts continuous target variables. The choice of metric depends on the problem's context and the
specific goals, such as minimizing large errors, understanding variability, or achieving interpretable
results.
Mean Absolute Error (MAE)
Measures the average magnitude of absolute errors between predicted values (\hat{y}_i)
and actual values (y_i).
Formula: \text{MAE} = \frac{1}{n} \sum_{i=1}^n |y_i - \hat{y}_i|
Advantages:
o Simple to interpret.
o Less sensitive to outliers than Mean Squared Error.
Disadvantages:
o Doesn't penalize large errors as heavily as other metrics.
Mean Squared Error (MSE)
Measures the average of the squared differences between predicted and actual values.
Formula: \text{MSE} = \frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2
Advantages:
o Penalizes large errors more than small ones.
Disadvantages:
o Can be heavily influenced by outliers.
o Units are squared, making interpretation less intuitive.
Root Mean Squared Error (RMSE)
The square root of MSE; expresses error in the same units as the target variable.
Formula: \text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2}
Advantages:
o Penalizes large errors more than MAE.
o Easier to interpret than MSE.
Disadvantages:
o Still sensitive to outliers.
Coefficient of Determination (R²)
Measures the proportion of variance in the target variable explained by the model.
Formula: R^2 = 1 - \frac{\sum_{i=1}^n (y_i - \hat{y}_i)^2}{\sum_{i=1}^n (y_i - \bar{y})^2},
where \bar{y} is the mean of the actual values.
Range:
o R^2 = 1: Perfect fit.
o R^2 = 0: Model predicts as well as the mean.
o R^2 < 0: Model performs worse than the mean prediction.
Advantages:
o Intuitive measure of goodness-of-fit.
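The four metrics above can be computed with NumPy and scikit-learn; the actual and predicted values below are invented just to show the calls.
Python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([3.0, 5.0, 7.5, 10.0])
y_pred = np.array([2.5, 5.5, 7.0, 11.0])

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)                 # RMSE is simply the square root of MSE
r2 = r2_score(y_true, y_pred)

print(f"MAE={mae:.3f}, MSE={mse:.3f}, RMSE={rmse:.3f}, R^2={r2:.3f}")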
UNIT-3
Models Based on Decision Trees
Decision Trees are supervised learning models used for classification and regression tasks.
Working Principle:
o They split the data into subsets based on the feature that provides the most
homogeneous subsets (purity of classes).
o Each internal node represents a test on a feature, branches represent the outcome
of the test, and leaves represent class labels.
2. Impurity Measures
Impurity measures quantify how "mixed" the data at a node is in terms of target classes:
Gini Index: \text{Gini} = 1 - \sum_i p_i^2, where p_i is the probability of class i. A lower
Gini indicates a purer node.
Entropy: \text{Entropy} = -\sum_i p_i \log_2 p_i, which is zero for a pure node and maximal
when the classes are evenly mixed.
Advantages:
o Easy to interpret and visualize.
o Handles both numerical and categorical data.
o Non-parametric (does not assume a distribution of the data).
Disadvantages:
o Prone to overfitting, especially with deep trees.
o Sensitive to small changes in data (instability).
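A minimal decision-tree sketch with scikit-learn; the Iris data, the Gini criterion, and the depth limit (one way to counter the overfitting noted above) are illustrative choices.
Python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.3, random_state=0)

tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)
tree.fit(X_train, y_train)

print("Test accuracy:", tree.score(X_test, y_test))
print(export_text(tree, feature_names=iris.feature_names))  # readable view of the learned splits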
5. Bias–Variance Trade-off
The Bayes classifier is based on Bayes' theorem and predicts the class that maximizes the
posterior probability:
\text{Class} = \arg\max_{k} P(C_k \mid X)
The Bayes classifier is optimal in minimizing the classification error if the true probabilities
P(C_k \mid X) are known.
In practice, these probabilities are often estimated from the data.
4. Multi-Class Classification
Extends the Bayes classifier to handle multiple classes by computing posterior probabilities
for each class and choosing the one with the highest value.
Class Conditional Independence: Assumes that the features X_1, X_2, \ldots, X_n are
independent given the class C:
P(X \mid C) = \prod_{i=1}^n P(X_i \mid C)
Naive Bayes Classifier:
o Based on the independence assumption, it calculates posterior probabilities
efficiently.
o Works well for text classification and spam filtering, despite its simplicity.
o Advantages: Fast, scalable, and effective with high-dimensional data.
o Disadvantages: Assumption of feature independence may not hold in practice.
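A small Gaussian naive Bayes sketch with scikit-learn, which applies Bayes' theorem under the class-conditional independence assumption described above; the Iris data is an arbitrary choice.
Python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Each feature is modelled with a per-class Gaussian; features are treated as independent given the class.
nb = GaussianNB()
nb.fit(X_train, y_train)
print("Test accuracy:", nb.score(X_test, y_test))
print("Posterior probabilities for one sample:", nb.predict_proba(X_test[:1]))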
UNIT-4
1. Introduction to Linear Discriminants
3. Perceptron Classifier
Goal: Adjust the weights w and bias b to correctly classify all training data.
Algorithm:
1. Initialize w = 0, b = 0.
2. For each misclassified point x_i:
Update w: w = w + \eta y_i x_i
Update b: b = b + \eta y_i
3. Repeat until there are no misclassifications. \eta is the learning rate.
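A minimal NumPy implementation of the perceptron update rule above, run on a tiny linearly separable dataset invented for illustration (labels in {-1, +1}).
Python
import numpy as np

X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])

w, b, eta = np.zeros(2), 0.0, 1.0
for epoch in range(100):
    errors = 0
    for xi, yi in zip(X, y):
        if yi * (np.dot(w, xi) + b) <= 0:   # misclassified (or on the boundary)
            w += eta * yi * xi              # w = w + eta * y_i * x_i
            b += eta * yi                   # b = b + eta * y_i
            errors += 1
    if errors == 0:                         # repeat until no misclassifications
        break

print("Learned weights:", w, "bias:", b)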
SVMs find the hyperplane that maximizes the margin (distance between the
hyperplane and the nearest points of each class).
Objective: \text{Minimize } \frac{1}{2} \|w\|^2 \quad \text{subject to } y_i (w^T x_i + b) \geq 1 \quad \forall i
Results in better generalization compared to perceptrons.
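A short linear SVM sketch with scikit-learn's SVC; the breast cancer dataset, the feature scaling, and C = 1.0 are illustrative choices.
Python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# A linear-kernel SVC searches for the maximum-margin hyperplane; C trades margin width
# against training errors (soft margin).
svm = make_pipeline(StandardScaler(), SVC(kernel="linear", C=1.0))
svm.fit(X_train, y_train)
print("Test accuracy:", svm.score(X_test, y_test))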
7. Non-Linear SVM
9. Logistic Regression
Logistic regression models the probability of a binary outcome using the sigmoid
function:
P(y = 1 \mid x) = \frac{1}{1 + \exp(-w^T x - b)}
Decision boundary: w^T x + b = 0.
Optimized using maximum likelihood estimation.
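A minimal logistic regression sketch with scikit-learn; the breast cancer dataset is an arbitrary binary-classification example, and predict_proba returns the sigmoid probability P(y = 1 | x) described above.
Python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

logreg = LogisticRegression(max_iter=5000)
logreg.fit(X_train, y_train)

print("P(y=1 | x) for one test sample:", logreg.predict_proba(X_test[:1])[0, 1])
print("Test accuracy:", logreg.score(X_test, y_test))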
For example, data points may form three roughly circular clusters on the basis of distance.
However, it is not necessary that the clusters formed are circular in shape; the shape of
clusters can be arbitrary, and there are many algorithms that work well at detecting
arbitrarily shaped clusters.
Types of Clustering Algorithms
At the surface level, clustering helps in the analysis of unstructured data. Graphing, the
shortest distance, and the density of the data points are a few of the elements that influence
cluster formation. Clustering is the process of determining how related the objects are based
on a metric called the similarity measure. Similarity measures are easier to define for smaller
sets of features, and it gets harder to create meaningful similarity measures as the number of
features increases. Depending on the type of clustering algorithm being utilized in data mining,
several techniques are employed to group the data from the datasets. In this part, the
clustering techniques are described. Various types of clustering algorithms are:
1. Centroid-based Clustering (Partitioning methods)
2. Density-based Clustering (Model-based methods)
3. Connectivity-based Clustering (Hierarchical clustering)
4. Distribution-based Clustering
We will be going through each of these types in brief.
1. Centroid-based Clustering (Partitioning methods)
Partitioning methods are the simplest clustering algorithms. They group data points on
the basis of their closeness. Generally, the similarity measure chosen for these algorithms
is Euclidean distance, Manhattan distance, or Minkowski distance. The dataset is
separated into a predetermined number of clusters, and each cluster is referenced by a
vector of values (its centre). Each input data point is assigned to the cluster whose
representative vector it is closest to.
The primary drawback for these algorithms is the requirement that we establish the number
of clusters, “k,” either intuitively or scientifically (using the Elbow Method) before any
clustering machine learning system starts allocating the data points. Despite this, it is still
the most popular type of clustering. K-means and K-medoids clustering are some examples
of this type of clustering.
2. Density-based Clustering (Model-based methods)
Density-based clustering, a model-based method, finds groups based on the density of data
points. Contrary to centroid-based clustering, which requires that the number of clusters be
predefined and is sensitive to initialization, density-based clustering determines the number
of clusters automatically and is less susceptible to beginning positions. They are great at
handling clusters of different sizes and forms, making them ideally suited for datasets with
irregularly shaped or overlapping clusters. These methods manage both dense and sparse
data regions by focusing on local density and can distinguish clusters with a variety of
morphologies.
2. Partitioning of Data
Input:
K (the number of clusters) and a dataset D containing n objects.
Output:
A dataset of K clusters
Method:
1. Randomly choose K objects from the dataset (D) as the initial cluster centres (C).
2. (Re)assign each object to the cluster whose centre it is most similar to, based upon
the mean values of the clusters.
3. Update cluster means, i.e., recalculate the mean of each cluster with the
updated assignments.
4. Repeat steps 2-3 until no change occurs.
Figure – K-means Clustering
Example: Suppose we want to group the visitors to a website using just their age
as follows:
16, 16, 17, 20, 20, 21, 21, 22, 23, 29, 36, 41, 42, 43, 44, 45, 61, 62, 66
Initial Cluster:
K=2
Centroid(C1) = 16 [16]
Centroid(C2) = 22 [22]
Note: These two points are chosen randomly from the dataset.
Iteration-1:
C1 = 16.33 [16, 16, 17]
C2 = 37.25 [20, 20, 21, 21, 22, 23, 29, 36, 41, 42, 43, 44, 45, 61, 62, 66]
Iteration-2:
C1 = 19.55 [16, 16, 17, 20, 20, 21, 21, 22, 23]
C2 = 46.90 [29, 36, 41, 42, 43, 44, 45, 61, 62, 66]
Iteration-3:
C1 = 20.50 [16, 16, 17, 20, 20, 21, 21, 22, 23, 29]
C2 = 48.89 [36, 41, 42, 43, 44, 45, 61, 62, 66]
Iteration-4:
C1 = 20.50 [16, 16, 17, 20, 20, 21, 21, 22, 23, 29]
C2 = 48.89 [36, 41, 42, 43, 44, 45, 61, 62, 66]
There is no change between iterations 3 and 4, so we stop. Therefore we get two
clusters, (16-29) and (36-66), using the K-means algorithm.
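The same grouping can be reproduced with scikit-learn's KMeans; because scikit-learn uses k-means++ initialization rather than the two hand-picked starting centres, the resulting centres may differ slightly from the hand computation above.
Python
import numpy as np
from sklearn.cluster import KMeans

ages = np.array([16, 16, 17, 20, 20, 21, 21, 22, 23, 29,
                 36, 41, 42, 43, 44, 45, 61, 62, 66]).reshape(-1, 1)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(ages)
print("Cluster centres:", kmeans.cluster_centers_.ravel())
print("Cluster labels:", kmeans.labels_)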
3. Matrix Factorization
Matrix factorization is a simple embedding model. Given the feedback matrix A ∈ ℝ^(m×n),
where m is the number of users (or queries) and n is the number of items, the model learns:
A user embedding matrix U ∈ ℝ^(m×d), where row i is the embedding for user i.
An item embedding matrix V ∈ ℝ^(n×d), where row j is the embedding for item j.
The embeddings are learned such that the product UVᵀ is a good approximation of the
feedback matrix A. Observe that the (i, j) entry of UVᵀ is simply the dot product ⟨Uᵢ, Vⱼ⟩ of
the embeddings of user i and item j, which you want to be close to Aᵢⱼ.
Note: Matrix factorization typically gives a more compact representation than learning the
full matrix. The full matrix has O(nm) entries, while the embedding
matrices U, V have O((n+m)d) entries, where the embedding dimension d is typically much
smaller than m and n. As a result, matrix factorization finds latent structure in the data,
assuming that observations lie close to a low-dimensional subspace. In the preceding
example, the values of n, m, and d are so low that the advantage is negligible. In real-world
recommendation systems, however, matrix factorization can be significantly more compact
than learning the full matrix.
Choosing the objective function
One intuitive objective function is the squared distance. To do this, minimize the sum of
squared errors over all pairs of observed entries:
min over U ∈ ℝ^(m×d), V ∈ ℝ^(n×d) of ∑_{(i,j) ∈ obs} (Aᵢⱼ − ⟨Uᵢ, Vⱼ⟩)².
In this objective function, you only sum over observed pairs (i, j), that is, over non-zero
values in the feedback matrix. However, only summing over the observed entries is not a
good idea: a model that predicts 1 everywhere (a matrix of all ones) would achieve minimal
loss, yet it cannot make effective recommendations and generalizes poorly.
In contrast, Weighted Matrix Factorization decomposes the objective into the following
two sums: the squared error over the observed entries, ∑_{(i,j) ∈ obs} (Aᵢⱼ − ⟨Uᵢ, Vⱼ⟩)², plus a
term over the unobserved entries (treated as zeros), weighted by a hyperparameter w₀:
w₀ ∑_{(i,j) ∉ obs} (⟨Uᵢ, Vⱼ⟩)².
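As a rough illustration of the unweighted objective above (not the weighted variant), the sketch below factorizes a small feedback matrix by gradient descent on the observed entries only; the matrix values, the embedding dimension d and the learning rate are made-up example choices:
Python
import numpy as np

# Toy feedback matrix (0 = unobserved); the values are made up for illustration
A = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [0, 1, 5, 4]], dtype=float)
observed = A > 0

m, n, d = A.shape[0], A.shape[1], 2          # d = embedding dimension
rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(m, d))       # user embeddings
V = rng.normal(scale=0.1, size=(n, d))       # item embeddings

lr = 0.01
for step in range(5000):
    # Error only on the observed entries, as in the objective above
    E = (A - U @ V.T) * observed
    # Gradient-descent updates for the squared-error objective
    U += lr * E @ V
    V += lr * E.T @ U

print(np.round((U @ V.T) * observed, 2))     # reconstruction of the observed entries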
4. Clustering of Patterns
Identifies groups (clusters) in data where patterns within the same cluster are similar, and
patterns in different clusters are dissimilar.
Uses distance metrics like Euclidean, Manhattan, or cosine similarity.
5. Divisive Clustering
Divisive Clustering is the technique that starts with all data points in a single
cluster and recursively splits the clusters into smaller sub-clusters based on their
dissimilarity. It is also known as, “top-down” clustering. It starts with all data
points in a single cluster, and then recursively splits the clusters into smaller sub-
clusters based on their dissimilarity.
Unlike agglomerative clustering, which starts with each data point as its own
cluster and iteratively merges the most similar pairs of clusters, divisive
clustering is a “divide and conquer” approach that breaks a large cluster into
smaller sub-clusters
Example 1:
Python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

# Generate 20 random 2-D points and build the linkage matrix
np.random.seed(0)
x = np.random.randn(20, 2)
z = linkage(x, method='complete')

# Plot the dendrogram
plt.figure(figsize=(10, 5))
plt.xlabel('Sample index')
plt.ylabel('Distance')
dendrogram(z)
plt.show()
Output:
Figure – Dendrogram of the hierarchical (agglomerative) clustering
Note that there are other methods of linkage that can be used, such as single,
average, and ward. The method used will affect the resulting dendrogram and
clustering.
Example 2:
Python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

np.random.seed(0)
X = np.random.randn(20, 2)

# Build the linkage matrix (the linkage method is not shown in the notes;
# 'average' is used here as an assumed choice)
Z = linkage(X, method='average')

# Plot dendrogram
plt.figure(figsize=(10, 5))
plt.xlabel('Sample index')
plt.ylabel('Distance')
dendrogram(Z)
plt.show()
Output:
Figure – Dendrogram of the agglomerative clustering
The resulting plot will show the hierarchical clustering structure with each data
point as a leaf node in the dendrogram and the clusters at different levels of the
hierarchy.
6. Agglomerative Clustering
We will use the iris dataset for demonstration. The first step is to import the
necessary libraries and load the dataset.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage

iris = load_iris()
X = iris.data
y = iris.target
The next step is to create a linkage matrix that contains the distances
between each pair of clusters. We can use the linkage function from the
scipy.cluster.hierarchy module to create the linkage matrix.
Z = linkage(X, 'ward')
The 'ward' method is used to calculate the distances between the clusters. It
minimizes the variance of the distances between the clusters being merged.
We can visualize the dendrogram using the dendrogram function from the
same module.
plt.figure(figsize=(7.5, 3.5))
plt.title("Iris Dendrogram") dendrogram(Z)
plt.show()
The resulting dendrogram (see the following plot) shows the hierarchical
relationship between the clusters. We can see that the algorithm has merged
the closest clusters first, and the distance between the clusters increases as
we move up the tree.
The final step is to apply the clustering algorithm and extract the cluster
labels. We can use the AgglomerativeClustering class from the sklearn.cluster
module to apply the algorithm.
model = AgglomerativeClustering(n_clusters=3)
model.fit(X)
labels = model.labels_
The n_clusters parameter specifies the number of clusters to be extracted
from the data. In this case, we have specified n_clusters=3 because we know
that the iris dataset has three classes.
plt.figure(figsize=(7.5, 3.5))
plt.scatter(X[:, 0], X[:, 1], c=labels)
plt.xlabel("Sepal length")
plt.ylabel("Sepal width")
plt.title("Agglomerative Clustering Results
plt.show()
The resulting plot shows the three clusters identified by the algorithm. We
can see that the algorithm has successfully separated the data points into
their respective classes.
Example
Here is the complete implementation of Agglomerative Clustering in Python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Build the linkage matrix
Z = linkage(X, 'ward')

# Plot the dendrogram
plt.figure(figsize=(7.5, 3.5))
plt.title("Iris Dendrogram")
dendrogram(Z)
plt.show()

# Create an instance of the AgglomerativeClustering class and fit the model
model = AgglomerativeClustering(n_clusters=3)
model.fit(X)
labels = model.labels_

# Plot the results
plt.figure(figsize=(7.5, 3.5))
plt.scatter(X[:, 0], X[:, 1], c=labels)
plt.xlabel("Sepal length")
plt.ylabel("Sepal width")
plt.title("Agglomerative Clustering Results")
plt.show()
Advantages of Agglomerative Clustering
o It produces a dendrogram, which gives an informative picture of the hierarchy of clusters.
o The number of clusters does not have to be fixed in advance; it can be chosen later by
cutting the dendrogram at a suitable level.
o It can be used with many different distance metrics and linkage criteria.
7. Partitional Clustering
Divides the dataset into non-overlapping clusters, with each data point
belonging to exactly one cluster.
Example algorithms: K-Means, K-Medoids.
Focus is on optimizing a predefined objective function (e.g., minimizing
intra-cluster distance).
K-Means Algorithm (A centroid-based
technique): It is one of the most commonly used algorithms for
partitioning a given data set into a set of k groups (i.e. k clusters), where k
represents the number of groups. It classifies objects in multiple groups
(i.e., clusters), such that objects within the same cluster are as similar as
possible (i.e., high intra-class similarity), whereas objects from different
clusters are as dissimilar as possible (i.e., low inter-class similarity). In k-
means clustering, each cluster is represented by its center (i.e., centroid),
which corresponds to the mean of the points assigned to the cluster. The basic
idea behind k-means clustering consists of defining clusters so that the
total intra-cluster variation (known as total within-cluster variation) is
minimized.
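For clarity (added here, in the same notation style used elsewhere in these notes), the total within-cluster variation that k-means minimizes can be written as:
W = ∑_{j=1..k} ∑_{xi ∈ Cj} ‖xi − μj‖², where Cj is the j-th cluster and μj is the mean (centroid) of the points assigned to Cj.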
Steps involved in K-Means Clustering :
The first step when using k-means clustering is to indicate the number of
clusters (k) that will be generated in the final solution.
The algorithm starts by randomly selecting k objects from the data set to
serve as the initial centers for the clusters. The selected objects are also
known as cluster means or centroids.
Next, each of the remaining objects is assigned to its closest centroid,
where closest is defined using the Euclidean distance between the object
and the cluster mean. This step is called “cluster assignment step”.
After the assignment step, the algorithm computes the new mean value of
each cluster. The term cluster “centroid update” is used to define this step.
Now that the centers have been recalculated, every observation is checked
again to see if it might be closer to a different cluster. All the objects are
reassigned again using the updated cluster means.
The cluster assignment and centroid update steps are iteratively repeated
until the cluster assignments stop changing (i.e until convergence is
achieved). That is, the clusters formed in the current iteration are the same
as those obtained in the previous iteration.
K-Medoids Algorithm (Partitioning Around Medoid) :
1. A medoid can be defined as the point in the cluster whose average similarity to
all the other points in the cluster is maximal.
2. In k-medoids clustering, each cluster is represented by one of the data points
in the cluster. These points are named cluster medoids. The term medoid
refers to an object within a cluster for which the average dissimilarity between it
and all the other members of the cluster is minimal. It corresponds to the
most centrally located point in the cluster.
3. These objects (one per cluster) can be considered as a representative
example of the members of that cluster which may be useful in some
situations. Recall that, in k-means clustering, the center of a given cluster is
calculated as the mean value of all the data points in the cluster.
4. K-medoids is a robust alternative to k-means clustering. The algorithm is
less sensitive to noise and outliers than k-means because it uses medoids
as cluster centers instead of means (as used in k-means).
Steps involved in K-Medoids Clustering :
1. First, k representative objects (medoids) are selected from the dataset, either
randomly or using a heuristic.
2. Each remaining (non-medoid) object is assigned to the cluster of its nearest
medoid.
3. Next, each selected medoid m and each non-medoid data point are swapped
and the objective function is computed. The objective function corresponds
to the sum of the dissimilarities of all objects to their nearest medoid.
4. A swap is kept only if it reduces the objective function; the process is
repeated until no swap improves the objective.
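A minimal k-medoids sketch, assuming the optional scikit-learn-extra package is installed (its KMedoids class is not part of core scikit-learn); the toy data, outliers and parameters are illustrative:
Python
# Requires: pip install scikit-learn-extra   (assumption, not core scikit-learn)
import numpy as np
from sklearn_extra.cluster import KMedoids

rng = np.random.default_rng(0)
# Two small blobs plus two extreme outliers to illustrate robustness
X = np.vstack([rng.normal(0, 1, (50, 2)),
               rng.normal(6, 1, (50, 2)),
               [[30, 30], [-30, -30]]])

kmed = KMedoids(n_clusters=2, random_state=0).fit(X)

# The cluster centers are actual data points (medoids), not means, so the
# outliers pull them far less than they would pull k-means centroids
print(kmed.cluster_centers_)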
The algorithm takes the unlabeled dataset as input, divides the dataset into k
clusters, and repeats the process until it finds the best clusters. The value
of k should be predetermined in this algorithm.
The below diagram explains the working of the K-means Clustering Algorithm:
How does the K-Means Algorithm Work?
The working of the K-Means algorithm is explained in the below steps:
Step-1: Select the number K to decide the number of clusters.
Step-2: Select K random points or centroids. (They can be points other than those in the
input dataset.)
Step-3: Assign each data point to its closest centroid, which will form the
predefined K clusters.
Step-4: Calculate the variance and place a new centroid for each cluster.
Step-5: Repeat the third step, i.e., reassign each data point to the new
closest centroid of each cluster.
Step-6: If any reassignment occurs, go back to Step-4; otherwise the model is ready.
Suppose we have two variables M1 and M2. The x-y axis scatter plot of these
two variables is given below:
o Let's take number k of clusters, i.e., K=2, to identify the dataset and to put
them into different clusters. It means here we will try to group these
datasets into two different clusters.
o We need to choose some random k points or centroid to form the cluster.
These points can be either the points from the dataset or any other point.
So, here we are selecting the below two points as k points, which are not
the part of our dataset. Consider the below image:
o Now we will assign each data point of the scatter plot to its closest K-
point or centroid. We will compute it by applying some mathematics that
we have studied to calculate the distance between two points. So, we will
draw a median between both the centroids. Consider the below image:
From the above image, it is clear that the points on the left side of the line are nearer to the K1
or blue centroid, and the points to the right of the line are closer to the yellow
centroid. Let's colour them blue and yellow for clear visualization.
o As we need to find the closest cluster, so we will repeat the process by
choosing a new centroid. To choose the new centroids, we will compute
the center of gravity of these centroids, and will find new centroids as
below:
o Next, we will reassign each datapoint to the new centroid. For this, we
will repeat the same process of finding a median line. The median will be
like below image:
From the above image, we can see that one yellow point is on the left side of the
line, and two blue points are to the right of the line. So, these three points will be
assigned to the new centroids.
o As we got the new centroids so again will draw the median line and
reassign the data points. So, the image will be:
o We can see in the above image that there are no dissimilar data points on
either side of the line, which means our model is formed. Consider the
below image:
As our model is ready, we can now remove the assumed centroids, and the
two final clusters will be as shown in the below image:
Elbow Method
The Elbow method is one of the most popular ways to find the optimal number
of clusters. This method uses the concept of WCSS value. WCSS stands
for Within Cluster Sum of Squares, which defines the total variations within a
cluster. The formula to calculate the value of WCSS (for 3 clusters) is given
below:
WCSS = ∑_{Pi in Cluster1} distance(Pi, C1)² + ∑_{Pi in Cluster2} distance(Pi, C2)² + ∑_{Pi in Cluster3} distance(Pi, C3)²
∑_{Pi in Cluster1} distance(Pi, C1)² is the sum of the squares of the distances between
each data point and its centroid within Cluster1, and the same holds for the other two
terms.
To measure the distance between data points and centroid, we can use any
method such as Euclidean distance or Manhattan distance.
To find the optimal value of clusters, the elbow method follows the below steps:
o It executes K-means clustering on the given dataset for different values of K (for example, K from 1 to 10).
o For each value of K, it calculates the WCSS value.
o It plots a curve between the calculated WCSS values and the number of clusters K.
o The sharp point of bend (the “elbow” of the curve) is considered the best value of K.
In the given dataset, we have Customer_Id, Gender, Age, Annual Income ($),
and Spending Score (a calculated value of how much a customer has spent in
the mall; the higher the value, the more the customer has spent). From this
dataset, we need to discover some patterns; since this is an unsupervised method,
we do not know in advance exactly what to compute.
o Data Pre-processing
o Finding the optimal number of clusters using the elbow method
o Training the K-means algorithm on the training dataset
o Visualizing the clusters
The first step will be the data pre-processing, as we did in our earlier topics of
Regression and Classification. But for the clustering problem, it will be
different from other models. Let's discuss it:
o Importing Libraries
As we did in previous topics, firstly, we will import the libraries for our
model, which is part of data pre-processing. The code is given below:
# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd
In the above code, numpy has been imported for performing mathematical
calculations, matplotlib.pyplot for plotting the graph, and pandas for
managing the dataset.
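The notes do not show the dataset-loading step here; below is a sketch consistent with the columns described above, assuming the file is named Mall_Customers.csv and that Annual Income and Spending Score (columns 3 and 4) are used as the feature matrix x expected by the later code:
# Assumed file name; adjust the path to the actual dataset
dataset = pd.read_csv('Mall_Customers.csv')

# Using Annual Income and Spending Score (columns 3 and 4) as the features
x = dataset.iloc[:, [3, 4]].values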
Step-2: Finding the optimal number of clusters using the elbow method
In the second step, we will try to find the optimal number of clusters for our
clustering problem. So, as discussed above, here we are going to use the elbow
method for this purpose.
As we know, the elbow method uses the WCSS concept to draw the plot by
plotting WCSS values on the Y-axis and the number of clusters on the X-axis.
So we are going to calculate the value for WCSS for different k values ranging
from 1 to 10. Below is the code for it:
#finding the optimal number of clusters using the elbow method
from sklearn.cluster import KMeans
wcss_list = []  #Initializing the list for the values of WCSS

#Using a for loop for iterations from 1 to 10
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42)
    kmeans.fit(x)
    wcss_list.append(kmeans.inertia_)
mtp.plot(range(1, 11), wcss_list)
mtp.title('The Elbow Method Graph')
mtp.xlabel('Number of clusters (k)')
mtp.ylabel('wcss_list')
mtp.show()
As we can see in the above code, we have used the KMeans class of the
sklearn.cluster library to form the clusters.
Next, we have created the wcss_list variable, initialized as an empty list, which is
used to hold the WCSS values computed for the different values of k ranging
from 1 to 10.
After that, we have initialized the for loop to iterate over the different values
of k from 1 to 10; since Python's range() excludes the upper bound, 11 is used
so that the 10th value is included.
The rest of the code is similar to what we did in earlier topics: we fit the
model on the matrix of features and then plot the graph between the
number of clusters and WCSS.
Output: After executing the above code, we will get the below output:
From the above plot, we can see the elbow point is at 5. So the number of
clusters here will be 5.
Step- 3: Training the K-means algorithm on the training dataset
As we have got the number of clusters, so we can now train the model on the
dataset.
To train the model, we will use the same two lines of code as we have used in
the above section, but here instead of using i, we will use 5, as we know there
are 5 clusters that need to be formed. The code is given below:
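A sketch of those two lines, consistent with the description that follows (5 clusters, the same k-means++ initialization as in the elbow-method code, and a y_predict variable holding each customer's cluster index):
# Training the K-means model on the feature matrix x with the chosen k = 5
kmeans = KMeans(n_clusters=5, init='k-means++', random_state=42)

# The second line creates y_predict: the cluster assigned to each data point
y_predict = kmeans.fit_predict(x)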
In the second line of code, we have created the dependent variable y_predict to
train the model.
By executing the above lines of code, we will get the y_predict variable. We can
check it under the variable explorer option in the Spyder IDE. We can now
compare the values of y_predict with our original dataset. Consider the below
image:
From the above image, we can now see that CustomerID 1 belongs to cluster 3
(since the index starts from 0, the value 2 corresponds to cluster 3), CustomerID 2
belongs to cluster 4, and so on.
The last step is to visualize the clusters. As we have 5 clusters for our model,
we will visualize each cluster one by one.
To visualize the clusters, we will draw a scatter plot using the mtp.scatter() function of
matplotlib.
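A sketch of that visualization, plotting each of the five clusters and the centroids in a different colour; the particular colours and labels are arbitrary choices:
# Scatter plot of the five clusters (colours and labels are illustrative)
mtp.scatter(x[y_predict == 0, 0], x[y_predict == 0, 1], s=50, c='blue', label='Cluster 1')
mtp.scatter(x[y_predict == 1, 0], x[y_predict == 1, 1], s=50, c='green', label='Cluster 2')
mtp.scatter(x[y_predict == 2, 0], x[y_predict == 2, 1], s=50, c='red', label='Cluster 3')
mtp.scatter(x[y_predict == 3, 0], x[y_predict == 3, 1], s=50, c='cyan', label='Cluster 4')
mtp.scatter(x[y_predict == 4, 0], x[y_predict == 4, 1], s=50, c='magenta', label='Cluster 5')

# Plotting the centroids of the clusters
mtp.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            s=200, c='yellow', label='Centroids')
mtp.title('Clusters of customers')
mtp.xlabel('Annual Income ($)')
mtp.ylabel('Spending Score (1-100)')
mtp.legend()
mtp.show()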
Output:
The output image clearly shows the five different clusters in different
colors. The clusters are formed between two parameters of the dataset: Annual
Income of the customer and Spending Score. We can change the colors and labels as per
requirement or choice. We can also observe some points from the above
patterns, which are given below:
o Cluster1 shows the customers with average salary and average spending,
so we can categorize these customers as standard.
o Cluster2 shows the customer has a high income but low spending, so we
can categorize them as careful.
o Cluster3 shows the low income and also low spending so they can be
categorized as sensible.
o Cluster4 shows the customers with low income with very high spending
so they can be categorized as careless.
o Cluster5 shows the customers with high income and high spending so
they can be categorized as target, and these customers can be the most
profitable customers for the mall owner.
In NLP, hard clustering and soft clustering refer to different methods of grouping
text data based on similarity.
Hard clustering, also known as crisp clustering or traditional clustering, is a
technique that assigns each data point, in this case, a document or word, to
exactly one cluster. The assignment is based on a similarity measure, such as
cosine similarity or Euclidean distance, which calculates the similarity between
the data points. Hard clustering algorithms like k-means or hierarchical
clustering partition the data into distinct, non-overlapping clusters. Each
document or word belongs exclusively to a single cluster, and there is no
ambiguity in the assignment.
On the other hand, soft clustering, also called fuzzy clustering or probabilistic
clustering, allows for more flexible assignments. Instead of assigning a data
point to a single cluster, soft clustering assigns a probability or degree of
membership to each cluster. This means that a document or word can belong to
multiple clusters simultaneously, with varying degrees of membership. Soft
clustering algorithms, such as fuzzy c-means or Gaussian mixture models,
calculate the likelihood of a data point belonging to each cluster based on
probabilistic models.
Hard clustering: Assigns each data point to a single, exclusive cluster based on
similarity. There is no ambiguity in the assignment.
Soft clustering: Assigns a degree of membership or probability to each cluster
for a data point. Allows for overlapping membership in multiple clusters.
Both hard and soft clustering approaches have their benefits and applications in
NLP.
Soft clustering, on the other hand, provides more nuanced results, allowing for
overlapping clusters and capturing uncertainty in the data. It can be helpful when
documents or words can have multiple interpretations or belong to multiple
topics simultaneously.
One of the most commonly used methods for hard clustering is the k-means
algorithm. The k-means algorithm is an iterative clustering algorithm that aims
to partition a given dataset into k distinct clusters.
The k-means algorithm is widely used due to its simplicity and efficiency.
However, it is important to keep in mind that the algorithm’s performance can be
sensitive to the initial centroid selection, and it assumes that clusters have a
spherical shape and similar sizes. Various techniques, such as multiple
initializations and more advanced initialization methods like k-means++, can be
employed to mitigate these limitations.
k-means algorithm:
1. Initialization: Select the number of clusters (k) and initialize the cluster
centroids, for example by choosing k data points at random.
2. Assignment: Assign each data point to the cluster whose centroid is closest,
based on a distance measure such as Euclidean distance.
3. Update: Recalculate the centroid of each cluster by taking the mean of all
data points assigned to that cluster.
4. Repeat Steps 2 and 3 until convergence: Iterate the assignment and update
steps until the centroids stabilize or a predefined stopping criterion is met.
Convergence is typically assessed based on the change in the centroids’
positions or the overall objective function.
5. Final Result: Once the algorithm converges, each data point will be
assigned to a single cluster, resulting in a hard clustering solution.
Other commonly used hard clustering methods include hierarchical clustering
algorithms (e.g., agglomerative clustering, divisive clustering) that build a
hierarchical structure of clusters, as well as density-based clustering algorithms
like DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
that can discover clusters of arbitrary shapes based on density connectivity.
One of the commonly used methods for soft clustering is the fuzzy c-means
(FCM) algorithm. FCM is a popular soft clustering algorithm that allows for
probabilistic or fuzzy assignments of data points to clusters.
1. Initialization: Select the number of clusters (k) and initialize the cluster
centers and membership degrees randomly or using a specific
initialization method.
2. Membership update: Compute the degree of membership of each data point
to each cluster, based on its distance to the cluster centers and the
fuzzifier parameter m.
3. Center update: Recalculate each cluster center as the weighted mean of all
data points, using the membership degrees (raised to the power m) as weights.
4. Repeat Steps 2 and 3 until convergence, i.e., until the membership degrees
or the cluster centers change by less than a chosen tolerance.
5. Final Result: Once the algorithm converges, each data point will have
membership degrees across all clusters, indicating the degree of its
association with each cluster. This provides a soft clustering solution with
data points having varying degrees of membership to multiple clusters.
The fuzzy c-means algorithm allows for more flexible assignments compared to
hard clustering algorithms. It captures the uncertainty or fuzziness in data point
assignments and provides a probabilistic representation of cluster membership.
The algorithm can accommodate overlapping clusters and is suitable for
scenarios where data points may have ambiguous or mixed characteristics.
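A compact NumPy sketch of the fuzzy c-means update rules described above, on toy 1-D data (not from the notes); the number of clusters c, the fuzzifier m, the iteration count and the initial centers are illustrative assumptions:
Python
import numpy as np

rng = np.random.default_rng(0)
# Toy 1-D data drawn from two overlapping groups (illustrative)
X = np.concatenate([rng.normal(2, 1, 100), rng.normal(7, 1, 100)])

c, m, n_iter = 2, 2.0, 50                  # clusters, fuzzifier, iterations
centers = np.array([0.0, 10.0])            # assumed initial centers

for _ in range(n_iter):
    # Distances of every point to every center (tiny epsilon avoids division by zero)
    d = np.abs(X[None, :] - centers[:, None]) + 1e-9          # shape (c, N)
    # Membership update: u_ik = 1 / sum_j (d_ik / d_jk)^(2/(m-1))
    u = 1.0 / np.sum((d[:, None, :] / d[None, :, :]) ** (2 / (m - 1)), axis=1)
    # Center update: weighted mean of the data with weights u^m
    centers = (u ** m @ X) / np.sum(u ** m, axis=1)

print("Cluster centers:", np.round(centers, 2))
print("Memberships of the first point:", np.round(u[:, 0], 3))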
Apart from fuzzy c-means, other soft clustering methods include Gaussian
mixture models (GMM), where data points are modeled as a mixture of
Gaussian distributions, and probabilistic latent semantic analysis (pLSA) or
latent Dirichlet allocation (LDA) for document clustering, which use
probabilistic models to assign documents to multiple topics with different
probabilities.
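A short scikit-learn sketch of soft clustering with a Gaussian mixture model, where predict_proba returns each point's degree of membership in every cluster; the toy data and the number of components are illustrative:
Python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two overlapping 2-D groups, so some points are genuinely ambiguous
X = np.vstack([rng.normal([0.0, 0.0], 1.0, (100, 2)),
               rng.normal([2.5, 2.5], 1.0, (100, 2))])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

# Soft assignments: one probability per cluster, each row sums to 1
probs = gmm.predict_proba(X)
print(np.round(probs[:5], 3))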
Let us compare these two powerful algorithms to get a clear idea of where the
fuzzy c-means algorithm fits in.
We should realize that K-means is in fact a special case of fuzzy c-means: it is
obtained when the membership function is simply 1 if the data point is closest to a
centroid and 0 otherwise.
Most of the measures used to evaluate the clusters under other clustering
algorithms can be used for Fuzzy C-Means. Even though these methods rely on
the subject matter of the expert, I am listing below some popular measures used
to evaluate the clusters so formed:
Pros
1. Gives good results for overlapped data sets and performs comparatively better than the k-
means algorithm.
2. Unlike k-means, where a data point must exclusively belong to one cluster
center, here a data point is assigned a membership degree to each cluster center, as a
result of which a data point may belong to more than one cluster center.
Cons
Rough Clustering is a clustering technique in machine learning that integrates the concepts
of rough set theory into traditional clustering methods. Unlike conventional clustering
approaches, where an object strictly belongs to one cluster, rough clustering allows for
uncertainty and ambiguity by associating each object with two boundary regions: a lower
approximation and an upper approximation.
1. Lower Approximation:
o The set of objects that definitely belong to a cluster.
o These objects have a high certainty of membership based on similarity measures.
2. Upper Approximation:
o The set of objects that possibly belong to a cluster.
o These objects have some level of similarity with the cluster but could also belong to
other clusters.
3. Boundary Region:
o The difference between the upper and lower approximations.
o It contains objects whose cluster memberships are uncertain or overlap multiple
clusters.
Characteristics:
Handles Uncertainty: Rough clustering is particularly useful when dealing with noisy data or
overlapping clusters.
Dual Membership: Objects can partially belong to multiple clusters through the boundary
region.
Flexibility: By allowing overlapping memberships, rough clustering provides a more realistic
representation of data in scenarios where rigid partitions are unsuitable.
Applications:
1. Rough K-Means:
o An extension of the K-Means algorithm that incorporates rough set theory.
o Each data point is classified into lower and upper approximations based on its
distance from cluster centroids.
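A simplified sketch of the rough assignment idea only (not a full Rough K-Means implementation): a point goes into the lower approximation of its nearest cluster when the second-nearest centroid is clearly farther away; otherwise it goes into the upper approximations (boundary region) of both clusters. The threshold value and the toy data are illustrative assumptions:
Python
import numpy as np

def rough_assign(X, centroids, threshold=1.3):
    # A point is in the lower approximation of its nearest cluster only if the
    # second-nearest centroid is at least `threshold` times farther away;
    # otherwise it is placed in the upper approximations of both clusters.
    lower = {i: [] for i in range(len(centroids))}
    upper = {i: [] for i in range(len(centroids))}
    for idx, x in enumerate(X):
        d = np.linalg.norm(centroids - x, axis=1)
        nearest, second = np.argsort(d)[:2]
        if d[second] / max(d[nearest], 1e-12) >= threshold:
            lower[nearest].append(idx)        # certain membership
            upper[nearest].append(idx)        # lower approximation lies inside the upper one
        else:
            upper[nearest].append(idx)        # boundary region: both clusters
            upper[second].append(idx)
    return lower, upper

X = np.array([[0, 0], [0.2, 0.1], [5, 5], [5.2, 4.9], [2.5, 2.5]])
lower, upper = rough_assign(X, centroids=np.array([[0.0, 0.0], [5.0, 5.0]]))
print("Lower approximations:", lower)
print("Upper approximations:", upper)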
Advantages:
Limitations:
Rough clustering is especially valuable in domains where data boundaries are not well-
defined or when partial membership is an inherent characteristic of the dataset.
12. Rough K-Means Clustering Algorithm
1. Initialization:
Randomly select k initial centroids from the dataset. These centroids represent the
centers of the k clusters.
2. Assignment Step:
Assign each data point to the cluster whose centroid is closest (based on a distance
metric, typically Euclidean distance).
3. Update Step:
Compute the new centroid for each cluster by averaging the positions of all points
assigned to that cluster.
4. Repeat:
Alternate between the assignment and update steps until the centroids stabilize (i.e.,
the changes in centroids between iterations are below a threshold) or a maximum
number of iterations is reached.
5. Output:
The algorithm outputs the final cluster assignments and their corresponding centroids.
Rough Pseudocode
python
function k_means(data, k, max_iters):
    # Step 1: Initialize centroids randomly
    centroids = initialize_random_centroids(data, k)
    for iteration in range(max_iters):
        # Step 2: Assign each data point to its nearest centroid
        clusters = assign_points_to_nearest_centroids(data, centroids)
        # Step 3: Recompute each centroid as the mean of its cluster
        new_centroids = compute_cluster_means(clusters)
        # Step 4: Stop when the centroids no longer change
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return clusters, centroids
Key Features
1. Input Parameters:
o k: Number of clusters.
o Dataset: A set of n data points.
o Maximum iterations: To ensure the algorithm terminates.
2. Distance Metric:
Euclidean distance is the default, but other metrics like Manhattan or cosine distance
can also be used.
3. Stopping Criteria:
The algorithm stops when:
o Centroids do not change significantly between iterations.
o A maximum number of iterations is reached.
4. Output:
o Cluster assignments for all data points.
o Final centroid locations.
Limitations
Expectation-Maximization in EM Algorithm
Figure – EM Algorithm Flowchart
1. Initialization:
Initially, a set of initial values of the parameters are considered. A set of
incomplete observed data is given to the system with the assumption
that the observed data comes from a specific model.
2. E-Step (Expectation Step): In this step, we use the observed data in order
to estimate or guess the values of the missing or incomplete data. It is
basically used to update the variables.
3. M-Step (Maximization Step): In this step, the complete data generated in
the E-step is used to re-estimate (update) the values of the model parameters.
4. Convergence: The E-step and M-step are repeated until the parameter
estimates stop changing significantly.
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Generate a dataset with two Gaussian components
# (the parameters of the second component are assumed here, since they are
# missing from the notes)
np.random.seed(0)
mu1, sigma1 = 2, 1
mu2, sigma2 = 5, 1.5
X1 = np.random.normal(mu1, sigma1, 500)
X2 = np.random.normal(mu2, sigma2, 500)
X = np.concatenate([X1, X2])

# Plot the density of the combined data
sns.kdeplot(X)
plt.xlabel('X')
plt.ylabel('Density')
plt.show()
Output:
Density Plot
Initialize parameters
Python
# Initialize parameters (the starting guesses below are assumed values)
mu1_hat, sigma1_hat = 1.0, 1.0
mu2_hat, sigma2_hat = 4.0, 1.0
pi1_hat, pi2_hat = 0.5, 0.5
Perform EM algorithm
Iterates for the specified number of epochs (20 in this case).
In each epoch, the E-step calculates the responsibilities (gamma values) by
evaluating the Gaussian probability densities for each component and
weighting them by the corresponding proportions.
The M-step updates the parameters by computing the weighted mean and
standard deviation for each component
Python
from scipy.stats import norm

num_epochs = 20
log_likelihoods = []

for epoch in range(num_epochs):
    # E-step: responsibilities (gamma values) of each component for every point
    gamma1 = pi1_hat * norm.pdf(X, mu1_hat, sigma1_hat)
    gamma2 = pi2_hat * norm.pdf(X, mu2_hat, sigma2_hat)
    total = gamma1 + gamma2
    gamma1 /= total
    gamma2 /= total

    # M-step: update means, standard deviations and mixing proportions
    mu1_hat = np.sum(gamma1 * X) / np.sum(gamma1)
    mu2_hat = np.sum(gamma2 * X) / np.sum(gamma2)
    sigma1_hat = np.sqrt(np.sum(gamma1 * (X - mu1_hat) ** 2) / np.sum(gamma1))
    sigma2_hat = np.sqrt(np.sum(gamma2 * (X - mu2_hat) ** 2) / np.sum(gamma2))
    pi1_hat = np.mean(gamma1)
    pi2_hat = np.mean(gamma2)

    # Compute log-likelihood
    log_likelihood = np.sum(np.log(pi1_hat * norm.pdf(X, mu1_hat, sigma1_hat)
                                   + pi2_hat * norm.pdf(X, mu2_hat, sigma2_hat)))
    log_likelihoods.append(log_likelihood)

# Plot the log-likelihood over the epochs
plt.plot(range(1, num_epochs + 1), log_likelihoods)
plt.xlabel('Epoch')
plt.ylabel('Log-Likelihood')
plt.show()
Output:
Epoch vs Log-likelihood
# Plot the estimated mixture density over the data
X_sorted = np.sort(X)
density_estimation = (pi1_hat * norm.pdf(X_sorted, mu1_hat, sigma1_hat)
                      + pi2_hat * norm.pdf(X_sorted, mu2_hat, sigma2_hat))

sns.kdeplot(X, label='Actual density')
plt.plot(X_sorted, density_estimation, label='Estimated density')
plt.xlabel('X')
plt.ylabel('Density')
plt.legend()
plt.show()
Output:
Estimated density
Building the Similarity Graph Of The Data: This step builds the Similarity
Graph in the form of an adjacency matrix which is represented by A. The
adjacency matrix can be built in the following manners:
Epsilon-neighbourhood Graph: A parameter epsilon is fixed beforehand.
Then, each point is connected to all the points which lie in its epsilon-
radius. If all the distances between any two points are on a similar scale, then
the weights of the edges (i.e., the distances between the two points) are typically
not stored, since they do not provide any additional information. Thus, in
this case, the graph built is an undirected and unweighted graph.
K-Nearest Neighbours Graph: A parameter k is fixed beforehand. Then, for two
vertices u and v, an edge is directed from u to v only if v is among the k-
nearest neighbours of u. Note that this leads to the formation of a weighted
and directed graph because it is not always the case that for each u having v
as one of the k-nearest neighbours, it will be the same case for v having u
among its k-nearest neighbours. To make this graph undirected, one of the
following approaches is followed:-
1. Direct an edge from u to v and from v to u if either v is among the k-
nearest neighbours of u OR u is among the k-nearest neighbours of v.
2. Direct an edge from u to v and from v to u if v is among the k-nearest
neighbours of u AND u is among the k-nearest neighbours of v.
Fully-Connected Graph: To build this graph, each point is connected
to every other point by an undirected edge weighted by the distance between
the two points. Since this approach is used to model the
local neighbourhood relationships, the Gaussian similarity
metric is typically used to calculate the distance.
Projecting the data onto a lower Dimensional Space: This step is done to
account for the possibility that members of the same cluster may be far away
in the given dimensional space. Thus the dimensional space is reduced so that
those points are closer in the reduced dimensional space and thus can be
clustered together by a traditional clustering algorithm. It is done by
computing the Graph Laplacian Matrix.
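A small NumPy sketch (added for illustration) that builds a fully-connected similarity graph with the Gaussian similarity metric and computes the unnormalized graph Laplacian L = D − A; sigma and the toy points are assumed values:
Python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))        # 6 toy points in 2-D
sigma = 1.0                        # assumed width of the Gaussian similarity

# Pairwise squared distances and Gaussian (RBF) similarities -> adjacency matrix A
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
A = np.exp(-sq_dists / (2 * sigma ** 2))
np.fill_diagonal(A, 0)             # no self-loops

# Degree matrix D and unnormalized graph Laplacian L = D - A
D = np.diag(A.sum(axis=1))
L = D - A

# The eigenvectors belonging to the smallest eigenvalues of L give the
# low-dimensional embedding on which a standard algorithm (e.g. k-means) is run
eigvals, eigvecs = np.linalg.eigh(L)
embedding = eigvecs[:, :2]
print(np.round(eigvals, 3))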
Python3
# Importing the required libraries
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import SpectralClustering
from sklearn.preprocessing import StandardScaler, normalize
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score

# Loading the data (run from the folder containing the dataset,
# e.g. "C:\Users\Dev\Desktop\Kaggle\Credit_Card")
X = pd.read_csv('CC_GENERAL.csv')
X = X.drop('CUST_ID', axis=1)
X = X.ffill()   # fill missing values
X.head()
Output:
# Scaling the data so that all features contribute equally
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Normalizing the data (scaling each sample to unit norm)
X_normalized = normalize(X_scaled)
X_normalized = pd.DataFrame(X_normalized)

# Reducing the data to 2 principal components for visualisation
pca = PCA(n_components=2)
X_principal = pca.fit_transform(X_normalized)
X_principal = pd.DataFrame(X_principal, columns=['P1', 'P2'])
X_principal.head()

# a) affinity = 'rbf': building and fitting the spectral clustering model
spectral_model_rbf = SpectralClustering(n_clusters=2, affinity='rbf')
labels_rbf = spectral_model_rbf.fit_predict(X_principal)
Python3
colours = {}
colours[0] = 'b'
colours[1] = 'y'
# Colour each point by its rbf-affinity cluster label and plot
cvec = [colours[label] for label in labels_rbf]
plt.scatter(X_principal['P1'], X_principal['P2'], c=cvec)
plt.show()
Output:
b) affinity = ‘nearest_neighbors’
Python3
# Building and fitting the model with the nearest-neighbours affinity
spectral_model_nn = SpectralClustering(n_clusters=2, affinity='nearest_neighbors')
labels_nn = spectral_model_nn.fit_predict(X_principal)
Output:
Step 5: Evaluating the performances
Python3
# The two affinity settings that were compared
affinity = ['rbf', 'nearest_neighbors']

# Computing the silhouette score for each clustering result
s_scores = []
s_scores.append(silhouette_score(X, labels_rbf))
s_scores.append(silhouette_score(X, labels_nn))
print(s_scores)

# Comparing the two scores with a bar plot
plt.bar(affinity, s_scores)
plt.xlabel('Affinity')
plt.ylabel('Silhouette Score')
plt.show()
Output:
Advantages
Does not make strong assumptions about the shape of the clusters, so it can
recover non-convex and irregularly shaped clusters.
Works on a similarity graph, so it can be applied whenever a meaningful
pairwise similarity between data points can be defined.
Disadvantages
Requires choosing parameters like the number of clusters k, the
similarity measure, and its parameters (e.g., σ in the Gaussian
similarity).
Computationally expensive for large datasets due to eigenvalue
decomposition.
Applications
Image segmentation.
Social network analysis.
Document clustering.
Bioinformatics (e.g., clustering genes or proteins).