
MACHINE LEARNING M23

UNIT-1
Introduction to Machine Learning

Machine learning (ML) is a subset of artificial intelligence (AI) that focuses on enabling
systems to learn from data and make decisions or predictions without being explicitly
programmed. Rather than following predetermined rules, ML algorithms allow computers to
recognize patterns, make inferences, and improve their performance over time through
experience.

1. What is Machine Learning?

Machine learning refers to the field of study that gives computers the ability to learn from
data, identify patterns, and make decisions with minimal human intervention. This involves
using algorithms to analyze and model data to make predictions or decisions.

2. Types of Machine Learning

Machine learning can be categorized into the following main types:

1. Supervised Learning:
o In supervised learning, the model is trained on labeled data (data that has both
input and corresponding output). The algorithm learns by example and
generalizes patterns to make predictions on new, unseen data.
o Example: Classifying emails as "spam" or "not spam."
2. Unsupervised Learning:
o In unsupervised learning, the model is given data without labels. The goal is to
find hidden patterns or structures within the data.
o Example: Customer segmentation, where a model groups customers based on
their purchasing behavior without predefined categories.
3. Reinforcement Learning:
o In reinforcement learning, an agent learns to make decisions by performing
actions in an environment to maximize a cumulative reward. The agent
receives feedback in the form of rewards or penalties based on its actions.
o Example: Training a robot to navigate through a maze.
4. Semi-supervised Learning (Optional category):
o This is a hybrid between supervised and unsupervised learning. It uses a small
amount of labeled data and a large amount of unlabeled data.
o Example: Image recognition where only a few images are labeled, but the
model can learn from a large amount of unlabeled images.

3. Common Algorithms in Machine Learning

Some well-known machine learning algorithms include:

1. Linear Regression:
o Used for predicting continuous values, such as house prices based on features
like size, location, etc.
2. Decision Trees:
o A model that makes decisions based on answering questions (features) and
traversing a tree structure to reach a prediction.
3. Random Forest:
o An ensemble method that uses multiple decision trees to improve performance
and reduce overfitting.
4. K-Nearest Neighbors (KNN):
o A classification algorithm that makes predictions based on the majority class
of the nearest data points in the feature space.
5. Support Vector Machines (SVM):
o A powerful classifier that tries to find the optimal hyperplane separating
classes in a high-dimensional space.
6. Neural Networks:
o Inspired by the human brain, these networks consist of layers of
interconnected nodes (neurons) and are used in deep learning applications like
image and speech recognition.
7. K-means Clustering:
o A method for unsupervised learning where the algorithm groups data into k
clusters based on feature similarity.
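
As a quick illustration of how these algorithms are used in practice, the following minimal sketch trains two of them with scikit-learn. The Iris dataset and the specific models are assumptions chosen purely for demonstration.

# Minimal sketch: training two of the algorithms above on the Iris dataset
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

for model in (DecisionTreeClassifier(random_state=42), KNeighborsClassifier(n_neighbors=5)):
    model.fit(X_train, y_train)                                # learn patterns from labeled data
    print(type(model).__name__, model.score(X_test, y_test))   # accuracy on unseen data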

4. Steps in Machine Learning

The typical workflow of a machine learning project involves several key stages:

1. Data Collection: Gathering the relevant data for the problem you're solving.
2. Data Preprocessing: Cleaning and preparing the data for modeling, including
handling missing values, normalizing, and encoding categorical variables.
3. Model Selection: Choosing the appropriate machine learning algorithm.
4. Training the Model: Feeding the training data into the model to allow it to learn
from the data.
5. Evaluation: Assessing the model's performance using metrics like accuracy,
precision, recall, or mean squared error (for regression).
6. Hyperparameter Tuning: Adjusting the model parameters for optimal performance.
7. Deployment: Integrating the trained model into a real-world application.
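
The stages above map naturally onto a scikit-learn pipeline. The sketch below is one possible end-to-end illustration; the breast-cancer dataset and the logistic regression model are stand-ins, not part of the original notes.

# Sketch of the workflow: load data, preprocess, train, and evaluate
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

X, y = load_breast_cancer(return_X_y=True)                     # data collection (here: a built-in dataset)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))  # preprocessing + model selection
model.fit(X_train, y_train)                                    # training
print(classification_report(y_test, model.predict(X_test)))    # evaluation: precision, recall, F1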

5. Applications of Machine Learning

Machine learning is already transforming various industries, including:

 Healthcare: Diagnosing diseases, personalized medicine, drug discovery.


 Finance: Fraud detection, algorithmic trading, credit scoring.
 Marketing: Customer segmentation, recommendation systems, targeted advertising.
 Transportation: Autonomous vehicles, traffic prediction.
 Retail: Inventory management, demand forecasting, chatbots.

6. Challenges in Machine Learning

Some challenges that arise in ML include:

 Data Quality: Poor or biased data can lead to inaccurate models.


 Overfitting and Underfitting: A model that fits the training data too well may not
generalize well to new data, and a model that underfits may miss important patterns.
 Interpretability: Some models, particularly deep learning models, can be "black
boxes," making it difficult to understand how they make decisions.
 Computational Resources: Large datasets and complex models require substantial
computing power.

7. Tools and Libraries

Popular libraries and frameworks for machine learning include:


 Scikit-learn: A Python library for traditional machine learning algorithms.
 TensorFlow and Keras: Libraries for building neural networks and deep learning
models.
 PyTorch: A popular deep learning library for research and production.
 XGBoost: A library for gradient boosting algorithms, widely used in ML competitions.

1. Evolution of Machine Learning

The evolution of machine learning (ML) can be traced through various stages, each
representing a significant leap in our understanding and capability to make machines
"intelligent." From its early theoretical foundations to the modern-day applications powered
by deep learning, machine learning has undergone a profound transformation. Below is an
overview of the key milestones in the evolution of machine learning:

1. Early Foundations (1940s–1950s)

The roots of machine learning can be traced back to the early days of computing and artificial
intelligence (AI):

 Turing's Work (1936-1937): The British mathematician Alan Turing laid the groundwork for theoretical computing with the concept of the Turing machine.
 Neural Networks and Perceptrons (1950s): In the 1950s, early work on neural
networks began with the creation of the Perceptron by Frank Rosenblatt.

2. Symbolic AI and Rule-Based Systems (1950s–1970s)

 Rule-based AI: During the 1950s to 1970s, AI research was dominated by symbolic approaches.
 Early ML Algorithms: Researchers began exploring algorithms like decision trees and clustering methods, though the field was still in its infancy.

3. The AI Winter (1970s–1990s)

Despite early successes, progress in AI and machine learning slowed significantly during this
period due to limited computational resources and overly optimistic expectations:

 Challenges in Data and Computing: The limitations of computers at the time, both
in terms of memory and processing power, constrained the development of more
advanced ML algorithms. Additionally, AI and machine learning models struggled to
perform well in real-world, noisy data scenarios.
 AI Winter: This term refers to a period of reduced funding and interest in AI research
during the late 1970s to early 1990s, as results from early ML models did not live up
to expectations.

4. Revival and Statistical Learning (1990s–2000s)


The 1990s saw a resurgence in machine learning, driven by the development of statistical
methods, the increase in computational power, and the availability of larger datasets:

 Introduction of Support Vector Machines (SVMs): In the 1990s, algorithms like SVMs were developed, offering powerful methods for classification tasks.
 Bayesian Networks and Probabilistic Models: Researchers developed new approaches based on probabilistic reasoning.
 Neural Networks and Backpropagation: While neural networks had been explored
earlier, the backpropagation algorithm in the 1980s (further developed in the 1990s)
enabled multi-layer networks to learn more complex patterns and drove interest in
deep learning.
 Reinforcement Learning: The concept of learning by interacting with an environment and maximizing cumulative rewards was formalized during this period.

5. Data-Driven Approaches and Deep Learning (2010s–Present)

The 2010s saw significant breakthroughs in machine learning, particularly in the area of deep
learning:

 Big Data: The explosion of available digital data, together with cheaper storage and compute, made it practical to train far larger models.
 Rise of Deep Learning: Deep learning, built on neural networks with many layers, became the dominant approach:
o ImageNet Breakthrough (2012): The ImageNet competition marked a
pivotal moment when deep learning models, especially convolutional neural
networks (CNNs), drastically outperformed traditional machine learning
algorithms in image classification tasks. This achievement sparked widespread
interest in deep learning.
 Natural Language Processing (NLP) and Transformer Models: In the field of
NLP, algorithms like Word2Vec and later transformers (such as BERT and GPT)
revolutionized language understanding and generation, allowing machines to achieve
human-level performance on tasks like translation, question answering, and text
generation.
 Reinforcement Learning Advancements: Reinforcement learning, notably through
deep Q-learning (DeepMind's AlphaGo playing Go), reached new heights, solving
complex decision-making problems.

2. Paradigms or Types of Machine Learning

Machine learning (ML) can be broadly categorized into different paradigms based on how the
algorithms learn from data and the nature of the problem they aim to solve. Each paradigm
has distinct approaches, techniques, and applications. The most common paradigms are
supervised learning, unsupervised learning, reinforcement learning, and semi-
supervised learning. Below is a detailed explanation of each paradigm:
1. Supervised Learning

Definition: In supervised learning, the model is trained using labeled data. The algorithm
learns the relationship between input data (features) and their corresponding outputs (labels)
during training, and then generalizes this knowledge to make predictions on new, unseen
data.

 How it Works:
o Training Data: Consists of input-output pairs where the output (label) is
known.
o Objective: The goal is to learn a mapping function f(x) that maps inputs x to outputs y, so the model can predict the output for new, unseen inputs.

Types of Supervised Learning


Supervised learning is classified into two categories of algorithms:
 Regression: A regression problem is when the output variable is a real
value, such as “dollars” or “weight”.
 Classification : A classification problem is when the output variable is a
category, such as “Red” or “blue” , “disease” or “no disease”.
Supervised learning deals with or learns with “labeled” data. This implies that
some data is already tagged with the correct answer.
1- Regression
Regression is a type of supervised learning that is used to predict continuous
values, such as house prices, stock prices, or customer churn. Regression
algorithms learn a function that maps from the input features to the output
value.
Some common regression algorithms include:
 Linear Regression
 Polynomial Regression
 Support Vector Machine Regression
 Decision Tree Regression
 Random Forest Regression
2- Classification
Classification is a type of supervised learning that is used to predict
categorical values, such as whether a customer will churn or not, whether an
email is spam or not, or whether a medical image shows a tumor or not.
Classification algorithms learn a function that maps from the input features to
a probability distribution over the output classes.
Some common classification algorithms include:
 Logistic Regression
 Support Vector Machines
 Decision Trees
 Random Forests
 Naive Bayes
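
To make the regression/classification distinction concrete, here is a small sketch on synthetic data; the house-size and spam-keyword features are invented for illustration only.

# Regression predicts a continuous value; classification predicts a category
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(0)
size = rng.uniform(500, 3000, 200)                  # house size in square feet (synthetic)
price = 50 * size + rng.normal(0, 20000, 200)       # continuous target
reg = LinearRegression().fit(size.reshape(-1, 1), price)
print("Predicted price for 1500 sqft:", reg.predict([[1500]])[0])

spam_words = rng.integers(0, 10, (200, 1))          # count of spammy keywords per email (synthetic)
is_spam = (spam_words[:, 0] > 4).astype(int)        # categorical target: spam / not spam
clf = LogisticRegression().fit(spam_words, is_spam)
print("Spam probability for 7 keywords:", clf.predict_proba([[7]])[0, 1])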

 Applications:
o Image classification (e.g., identifying cats and dogs in pictures).
o Email spam detection.
o Sentiment analysis (classifying reviews as positive or negative).

2. Unsupervised Learning

Definition: In unsupervised learning, the model is provided with data that has no labels. The
goal is to identify underlying patterns or structures in the data without predefined outputs.
 How it Works:
o Training Data: Contains only input data, and the algorithm must find hidden
patterns or structures within this data.
o Objective: The model aims to explore the data and organize it into clusters,
dimensions, or other structures.

Types of Unsupervised Learning

Unsupervised learning is classified into two categories of algorithms:


 Clustering: A clustering problem is where you want to discover the inherent groupings
in the data, such as grouping customers by purchasing behavior.
 Association: An association rule learning problem is where you want to discover rules
that describe large portions of your data, such as people that buy X also tend to buy Y.
Clustering
Clustering is a type of unsupervised learning that is used to group similar data points
together, so that points within a cluster are more similar to each other than to points in
other clusters. Clustering approaches include:
1. Exclusive (partitioning)
2. Agglomerative
3. Overlapping
4. Probabilistic
Common clustering algorithms (see the sketch below) include:
1. Hierarchical clustering
2. K-means clustering
3. Gaussian Mixture Models (GMMs)
4. Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
Related unsupervised techniques such as Principal Component Analysis, Singular Value
Decomposition, and Independent Component Analysis are used for dimensionality
reduction rather than clustering.
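
The sketch below shows K-means clustering with scikit-learn; the blob data is synthetic and chosen only to illustrate the idea of grouping unlabeled points.

# K-means: group unlabeled points into k clusters by feature similarity
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)   # unlabeled data
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
print("Cluster labels for the first 10 points:", kmeans.labels_[:10])
print("Cluster centers:\n", kmeans.cluster_centers_)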
Association rule learning
Association rule learning is a type of unsupervised learning that is used to identify patterns
in data. Association rule learning algorithms work by finding relationships between
different items in a dataset.
Some common association rule learning algorithms include:
 Apriori Algorithm
 Eclat Algorithm
 FP-Growth Algorithm

 Applications:
o Customer segmentation (grouping customers based on purchasing behavior).
o Anomaly detection (detecting fraud or network intrusions).
o Topic modeling (grouping documents into topics based on word patterns).

3. Reinforcement Learning

Definition: Reinforcement learning (RL) is a paradigm where an agent learns to make
decisions by interacting with an environment to maximize a cumulative reward. The agent
does not know the right actions initially but learns from feedback after each action it takes.

 How it Works:
o Agent: The decision-making entity that interacts with the environment.
o Environment: The world in which the agent operates.
o State: The current situation or configuration of the environment.
o Action: The choices the agent can make in the environment.
o Reward: The feedback the agent receives after taking an action. The objective
is to maximize the total reward over time.
o Policy: A strategy the agent uses to determine actions based on states.
o Value Function: A function that estimates the expected return (reward) for
each state.

Types of Reinforcement:
1. Positive: Positive reinforcement occurs when an event, following a particular
behavior, increases the strength and the frequency of that behavior. In other
words, it has a positive effect on behavior.
Advantages:

 Maximizes performance
 Sustains change for a long period of time
Drawback: too much reinforcement can lead to an overload of states, which can
diminish the results.
2. Negative: Negative reinforcement is the strengthening of a behavior because a
negative condition is stopped or avoided.
Advantages:

 Increases behavior
 Establishes a minimum standard of performance
Drawback: it only provides enough to meet the minimum behavior.
Elements of Reinforcement Learning
i) Policy: Defines the agent’s behavior at a given time.
ii) Reward Function: Defines the goal of the RL problem by providing
feedback.
iii) Value Function: Estimates long-term rewards from a state.
iv) Model of the Environment: Helps in predicting future states and rewards
for planning.
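
To tie these elements together, here is a minimal tabular Q-learning sketch on a toy five-state corridor. The environment, reward values, and hyperparameters are invented for illustration and are not from the notes.

# Tabular Q-learning: reach state 4 of a 5-state corridor to earn reward +1
import numpy as np

n_states, n_actions = 5, 2             # actions: 0 = move left, 1 = move right
Q = np.zeros((n_states, n_actions))    # value estimates for each state-action pair
alpha, gamma, epsilon = 0.1, 0.9, 0.2  # learning rate, discount factor, exploration rate

rng = np.random.default_rng(0)
for episode in range(500):
    state = 0
    while state != 4:                                  # episode ends at the goal state
        if rng.random() < epsilon:                     # explore
            action = int(rng.integers(n_actions))
        else:                                          # exploit the current policy
            action = int(np.argmax(Q[state]))
        next_state = min(state + 1, 4) if action == 1 else max(state - 1, 0)
        reward = 1.0 if next_state == 4 else 0.0
        # Q-learning update: move Q toward reward + discounted best future value
        Q[state, action] += alpha * (reward + gamma * Q[next_state].max() - Q[state, action])
        state = next_state

print("Learned greedy policy (0 = left, 1 = right):", np.argmax(Q, axis=1))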

 Applications:
o Game playing (e.g., AlphaGo, chess, and video games).
o Robotics (e.g., teaching robots to walk or pick objects).
o Autonomous vehicles (e.g., self-driving cars making navigation decisions).
o Healthcare (e.g., personalized treatment recommendations).

4. Semi-Supervised Learning
Definition: Semi-supervised learning is a paradigm that falls between supervised and
unsupervised learning. The model is trained with a small amount of labeled data and a large
amount of unlabeled data. The goal is to leverage the unlabeled data to improve the learning
process.

 How it Works:
o Training Data: Consists of both labeled and unlabeled data.
o Objective: The algorithm attempts to use the small amount of labeled data to
make sense of the unlabeled data and make more accurate predictions or
classifications.
 Key Algorithms:
o Semi-Supervised Support Vector Machines (S3VM): An extension of SVM
that works with both labeled and unlabeled data.
o Self-training: An iterative approach where the model initially trains on the
labeled data, and then iteratively labels the unlabeled data to improve
performance.
o Graph-based Methods: Use relationships (edges) between labeled and
unlabeled data points to propagate labels.
 Applications:
o Image recognition: Labeled images may be scarce, but large unlabeled
datasets can help improve accuracy.
o Speech recognition: Labeled audio data may be limited, but using large
amounts of unlabeled audio can enhance training.
o Medical diagnostics: Labeling medical images can be time-consuming and
expensive, so semi-supervised techniques help utilize the available data
effectively.
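
scikit-learn provides a wrapper that follows the self-training idea described above. The sketch below hides about 80% of the Iris labels to simulate a semi-supervised setting; the masking rate and the base classifier are assumptions.

# Self-training: unlabeled samples are marked with the label -1
import numpy as np
from sklearn.datasets import load_iris
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(42)
y_semi = y.copy()
y_semi[rng.random(len(y)) < 0.8] = -1              # hide ~80% of the labels

base = SVC(probability=True, gamma="auto")          # base estimator must expose predict_proba
model = SelfTrainingClassifier(base).fit(X, y_semi)
print("Accuracy against the full ground truth:", model.score(X, y))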

5. Self-Supervised Learning (Emerging Paradigm)

Definition: Self-supervised learning is a technique where a model generates its own labels
from the data itself. It creates a pretext task (an auxiliary task) to learn useful representations
from unlabeled data, which can then be fine-tuned for specific downstream tasks.

 How it Works:
o Pretext Task: A task that is designed so the model learns useful features from
unlabeled data. For instance, predicting missing parts of an image, filling in
missing words in a sentence, or predicting the next frame in a video.
o Transfer Learning: The representations learned from the pretext task are
transferred to solve other tasks that require supervised learning.
 Key Algorithms:
o Contrastive Learning: A method where the model learns by contrasting
positive and negative samples (e.g., SimCLR, MoCo).
o Masked Language Models (MLMs): In NLP, models like BERT are pre-
trained on tasks like predicting masked words to understand language
representations.
 Applications:
o Natural Language Processing: Pretraining language models (e.g., GPT,
BERT) using self-supervised learning.
o Computer Vision: Pretraining models for tasks like object detection using
unlabeled images.
o Robotics: Learning representations from raw sensory data without explicit
supervision.

3. Learning by Rote:
Learning by rote, also known as rote learning, is a method of learning that involves
memorization without understanding the underlying meaning or principles of the material. It
typically relies on repetition to commit information to memory.

Characteristics of Rote Learning:

1. Memorization Over Understanding: Focuses on repeating facts, figures, or procedures rather than understanding concepts.
2. Mechanical Process: Often involves repetitive drills, such as reciting multiplication
tables or vocabulary lists.
3. Lack of Critical Thinking: Little emphasis is placed on applying knowledge to
different contexts or analyzing its implications.

Advantages of Rote Learning:

 Quick Recall: Useful for memorizing foundational information like formulas, definitions, or historical dates.
 Efficiency in Specific Situations: Essential in areas where exact recall is necessary,
such as medical terminology or procedural steps.
 Basis for Higher Learning: Sometimes foundational memorization is needed before
deeper understanding can occur.

Disadvantages of Rote Learning:

 Limited Application: Knowledge may not transfer well to new situations or problem-
solving tasks.
 Lack of Engagement: Can be monotonous and less engaging, leading to decreased
motivation.
 Shallow Learning: Students may forget material quickly if it's not linked to
understanding.

Alternatives to Rote Learning:

 Conceptual Learning: Focuses on understanding the principles and ideas behind facts.
 Active Learning: Encourages hands-on activities, discussions, and problem-solving
to deepen understanding.
 Mnemonics and Associations: Creates connections between new information and
existing knowledge for better retention.

4. Learning by Induction:

Learning by induction in machine learning (ML) refers to a process where a model
generalizes patterns from specific examples to broader rules or principles. The model learns
to infer relationships and make predictions about unseen data based on patterns it has
identified in the training dataset.

Key Concepts in Learning by Induction:

1. Training Data: The model is trained on labeled examples (in supervised learning) or
unlabeled examples (in unsupervised learning) to discover patterns.
2. Generalization: The goal is to enable the model to apply the rules or relationships it
has inferred from the training data to new, unseen instances.
3. Inductive Bias: Assumptions the model makes about the underlying data distribution,
guiding its learning process and influencing its generalization.

Induction in ML Paradigms:

1. Supervised Learning:
o Uses labeled datasets (input-output pairs).
o Example: Inferring the relationship y = f(x), where x is the input and y is the output.
o Models: Decision trees, neural networks, support vector machines.
2. Unsupervised Learning:
o Identifies patterns or structures in unlabeled data.
o Example: Clustering data points based on similarities.
o Models: K-means, hierarchical clustering, PCA.
3. Reinforcement Learning:
o Learns from rewards or penalties by interacting with an environment.
o Example: Inferring an optimal strategy in a game.

Steps in Learning by Induction:

1. Hypothesis Formation: Generate hypotheses or rules based on the training data.


o Example: A decision tree splits data into subsets based on feature values.
2. Testing and Refinement: Evaluate hypotheses against validation data and adjust
them to improve performance.
3. Generalization: Use the refined hypothesis to make predictions about unseen data.

Advantages of Inductive Learning:

 Flexibility: Works well across various domains with diverse types of data.
 Scalability: Can handle large datasets with appropriate algorithms.
 Automation: Reduces the need for manual rule-setting by discovering patterns from
data.

Example:

 Classification Task: Training a model to classify emails as spam or not spam.


o Training data: Emails labeled as "spam" or "not spam."
o Hypothesis: Rules or patterns, such as frequent occurrences of certain
keywords, that define spam.
o Generalization: Classifying new emails based on the learned patterns.
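
A small sketch of inductive learning with a decision tree: the tree induces if-then rules (the hypothesis) from labeled examples and then generalizes them to an unseen input. The keyword-count features below are invented for illustration.

# Inducing classification rules from labeled examples
from sklearn.tree import DecisionTreeClassifier, export_text

# Each row: [number of spammy keywords, number of exclamation marks]
X_train = [[5, 3], [7, 4], [0, 0], [1, 1], [6, 2], [0, 1]]
y_train = ["spam", "spam", "not spam", "not spam", "spam", "not spam"]

tree = DecisionTreeClassifier(max_depth=2).fit(X_train, y_train)
print(export_text(tree, feature_names=["keywords", "exclamations"]))  # the induced hypothesis
print(tree.predict([[4, 5]]))                                         # generalization to a new email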

5. Reinforcement Learning:

Reinforcement learning was covered in detail under Section 2, "Paradigms or Types of
Machine Learning," above: an agent interacts with an environment through states, actions,
and rewards, guided by a policy and a value function, and receives positive or negative
reinforcement as feedback. Typical applications include game playing (e.g., AlphaGo),
robotics, autonomous vehicles, and healthcare.

6. Types of Data
Data in machine learning are broadly categorized into two types − numerical
(quantitative) and categorical (qualitative) data. The numerical data can be
measured, counted or given a numerical value, for example, age, height,
income, etc. The categorical data is non-numeric data that can be arranged
in categories with or without meaningful order, for example, gender, blood
group, etc.

Numerical (Quantitative) Data

The numerical (quantitative) data is data that can be measured, counted or given a numerical
value. The examples of numerical data are age, height, income, number of students in class,
number of books in a shelf, shoe size, etc.

The numerical data can be categorized into the following two types −

 Discrete Data
 Continuous Data

1. Discrete Data

The discrete data is numerical data that is countable, finite, and can only take certain values,
usually whole numbers. Examples of discrete data are number of students in class, number of
books in a shelf, shoe size, number of ducks in a pond, etc.

2. Continuous Data

The continuous data is numerical data that can take any value within a specified range
including fractions and decimals. Examples of continuous data are age, height, weight,
income, time, temperature, etc.
What is true zero?

True zero represents the absence of the quantity being measured. For example, height,
weight, age, and temperature in Kelvin are data with a true zero: a height of 0 cm represents
the complete absence of height, and a temperature of 0 K represents the absence of heat.
Temperature in Celsius (or Fahrenheit), by contrast, is an example of data with a false zero,
since 0 °C does not mean the absence of temperature.

We can categorize the numerical data into the following two types on basis of true zero −

 Interval data − quantitative data with equal intervals between data points but no true zero. Examples are
temperature (Fahrenheit), temperature (Celsius), pH, SAT score (200-800), credit score (300-
850), etc.
 Ratio data − same as interval data but with a true zero. Examples are weight in kg, number of
students, income, speed, etc.

Categorical (Qualitative) Data


The categorical (qualitative) data can be categorized with or without a meaningful order. For
example, gender, blood group, hair color, nationality, the school grades, level of education,
range of income, ratings, etc.

The categorical data can be divided into the following two types −

 Nominal Data
 Ordinal Data

1. Nominal Data

The nominal data is categorical data that can not be arranged in an order or rank. The
examples of nominal data are gender, blood group, hair color, nationality, etc.

2. Ordinal Data

The ordinal data is categorical data that can be ordered or ranked with a specific attribute. The
examples of ordinal data are the school grades, level of education, range of
income, ratings, etc.
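
A short sketch of how these data types are typically encoded with pandas; the example values are made up.

# Nominal data -> one-hot encoding; ordinal data -> ordered integer codes
import pandas as pd

df = pd.DataFrame({
    "blood_group": ["A", "B", "O", "AB"],                              # nominal: no order
    "education":   ["High School", "Bachelor", "Master", "Bachelor"],  # ordinal: ordered categories
    "age":         [21, 35, 42, 29],                                   # continuous ratio data (true zero)
})

df = pd.get_dummies(df, columns=["blood_group"])                       # one-hot for nominal data
order = {"High School": 0, "Bachelor": 1, "Master": 2}
df["education"] = df["education"].map(order)                           # keep the order for ordinal data
print(df)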

7. Matching:

 Definition: Matching in machine learning refers to the task of comparing different data points to identify similarities or differences.
 Examples:
o Pattern Matching: Identifying patterns in text, such as finding similar words
or phrases.
o Image Matching: Comparing features of images to find similarities or
objects.

In machine learning, matching refers to the process of finding similar, related, or
corresponding entities between two datasets or within a dataset. Matching is widely used in
various applications, and the techniques vary depending on the type of data and the problem
at hand. Here's an overview of matching in ML:
1. Types of Matching

a. Exact Matching

 Involves finding identical values or entities.


 Example: Comparing user IDs or email addresses between two datasets.

b. Approximate Matching

 Finds entities that are not exactly the same but are similar based on some criteria.
 Example: Matching names with slight spelling differences (e.g., "John" and "Jon").

2. Matching Techniques

a. String Matching

 Used for text data, such as names, addresses, or descriptions.


 Common techniques:
o Levenshtein Distance (edit distance)
o Jaccard Similarity
o Cosine Similarity
o TF-IDF (for document matching)
o Soundex (phonetic matching)
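
Two of the measures above can be implemented directly in a few lines, so no external library is needed; the example strings are illustrative.

# Approximate string matching: edit (Levenshtein) distance and Jaccard similarity
def levenshtein(a, b):
    prev = list(range(len(b) + 1))                    # distances for the empty prefix of a
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def jaccard(a, b):
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

print(levenshtein("John", "Jon"))                            # 1 edit apart
print(jaccard("data science rocks", "data science is fun"))  # word-level overlap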

b. Feature-Based Matching

 Matches entities based on a set of features or attributes.


 Techniques:
o Euclidean Distance (for numerical features)
o Mahalanobis Distance
o K-Nearest Neighbors (KNN)
o Cluster-Based Matching (using clustering algorithms like K-Means)

c. Entity Matching

 Matches entities (e.g., customers, products) across different datasets.


 Often involves multiple attributes and requires data cleaning.
 Tools and libraries: Dedupe, Record Linkage Toolkit.

d. Image Matching

 Matches or finds similar images.


 Techniques:
o Feature Extraction: SIFT, SURF, ORB.
o Deep Learning: Convolutional Neural Networks (CNNs) for embeddings.
o Perceptual Hashing.

e. Graph-Based Matching

 Matches nodes or entities in graphs.


 Example: Matching users in social network graphs.
 Techniques: Graph similarity algorithms, node embeddings (e.g., Node2Vec).

3. Applications of Matching in ML

a. Recommendation Systems
 Matching users to items or items to items based on preferences or features.

b. Data Deduplication

 Identifying and merging duplicate records in datasets.

c. Entity Resolution

 Matching and reconciling entities across different sources.


 Example: Identifying if "John Smith" in Dataset A is the same as "J. Smith" in Dataset B.

d. Fraud Detection

 Matching patterns of fraudulent activity.

e. Image and Video Search

 Matching visual content for retrieval or identification.

f. Medical Record Linking

 Matching patient records across hospitals or healthcare providers.

4. Challenges in Matching

 Data Quality: Inconsistent or noisy data can complicate matching.


 Scalability: Matching large datasets requires efficient algorithms.
 Ambiguity: Similar entities might not always match (e.g., common names).
 Domain-Specific Knowledge: Requires understanding of the context to set appropriate
criteria.

5. Tools and Frameworks

 Dedupe (Python): Library for entity matching and deduplication.


 FuzzyWuzzy: Python library for fuzzy string matching.
 scikit-learn: Offers distance metrics and clustering algorithms for matching.
 TensorFlow/PyTorch: For deep learning-based matching tasks.
 ELKI: Specialized for distance-based data mining and matching.

8. Stages in Machine Learning:

Machine learning typically involves several stages:

 Data Collection: Gathering raw data that is relevant to the problem.


 Data Preprocessing: Cleaning and transforming data into a usable format.
 Feature Engineering: Selecting and creating features from raw data that will be used
by the model.
 Model Selection: Choosing the appropriate algorithm or model for the task.
 Model Training: Using training data to teach the model.
 Model Evaluation: Assessing the model’s performance on unseen data
(validation/test sets).
 Model Prediction: Using the trained model to make predictions on new data.

(OR)
Machine Learning (ML) involves several key stages, from problem formulation to deploying
a model. Here's a breakdown of the typical stages in an ML workflow:

1. Problem Definition

 Objective: Clearly define the problem you want to solve and the desired outcomes.
 Key Steps:
o Identify the type of ML task (e.g., classification, regression, clustering,
recommendation).
o Define success metrics (e.g., accuracy, F1-score, mean squared error).

2. Data Collection

Gather data relevant to the problem. Data collection is the process of collecting and
evaluating information or data from multiple sources to find answers to research
problems, answer questions, evaluate outcomes, and forecast trends and
probabilities. It is an essential phase in all types of research, analysis, and decision-
making, including that done in the social sciences, business, and healthcare.

 Key Steps:
o Identify data sources (e.g., databases, APIs, sensors).
o Scrape or retrieve data from multiple channels.
o Ensure sufficient quantity and variety of data.

3. Data Preprocessing
Objective: Clean and prepare data for analysis. Data preprocessing is the process of
preparing raw data and making it suitable for a machine learning model; it is the first and
most crucial step when building one.

When creating a machine learning project, the data we encounter is rarely clean and well
formatted. Before performing any operation on the data, it must be cleaned and put into a
consistent format, which is exactly what the data preprocessing step does.

 Key Steps:
o Handle missing values (e.g., imputation, deletion).
o Remove duplicates and outliers.
o Normalize or standardize features.
o Convert data types (e.g., categorical to numerical encoding).
o Split data into training, validation, and testing sets.
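
A hedged sketch of these preprocessing steps on a tiny invented table; the column names and values are assumptions.

# Preprocessing: impute missing values, encode categoricals, scale, and split
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({"age": [25, None, 47, 35],
                   "city": ["Delhi", "Pune", "Delhi", "Chennai"],
                   "bought": [1, 0, 1, 0]})

df["age"] = df["age"].fillna(df["age"].mean())               # handle missing values
df = pd.get_dummies(df, columns=["city"])                    # categorical -> numerical encoding
X, y = df.drop(columns="bought"), df["bought"]
X[["age"]] = StandardScaler().fit_transform(X[["age"]])      # standardize the numeric feature
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
print(X_train)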

4. Exploratory Data Analysis (EDA)

 Objective: Understand the data and uncover patterns. Exploratory data analysis is a
process that involves examining and summarizing datasets to establish their main
characteristics, frequently using statistical graphics and other data visualization techniques.
EDA's main objective is to identify patterns, trends, relationships, and anomalies in the data,
offering insights that can be used to guide further analysis or hypothesis formation.
 Key Steps:
o Visualize data distributions and relationships.
o Identify trends, correlations, and anomalies.
o Use tools like histograms, scatter plots, heatmaps.

5. Feature Engineering

 Objective: Create and select features that enhance model performance. Feature
engineering is the process of transforming raw data into features that are suitable for
machine learning models. In other words, it is the process of selecting, extracting, and
transforming the most relevant features from the available data to build more accurate and
efficient machine learning models.
 Key Steps:
o Generate new features from existing data (e.g., date-time features).
o Encode categorical variables (e.g., one-hot encoding).
o Select relevant features using techniques like correlation analysis or PCA.
o Reduce dimensionality if needed.

6. Model Selection

 Objective: Choose an appropriate ML algorithm for the task. Model selection is an
essential phase in developing powerful and precise predictive models in machine learning. It
is the process of deciding which algorithm and model architecture is best suited for a
particular task or dataset: comparing various models, assessing their performance, and
choosing the one that most effectively addresses the problem at hand.
 Key Steps:
o Consider the type of problem (e.g., linear regression for continuous outputs,
decision trees for classification).
o Evaluate algorithm assumptions and requirements.
o Consider computational efficiency and scalability.

7. Model Training
Train the selected model on the training dataset. A training dataset is the data used to train an
ML algorithm. It consists of sample output data and the corresponding sets of input data that
have an influence on the output. The training data is run through the algorithm so that the
processed output can be correlated against the sample output, and the result of this
correlation is used to modify the model.

This iterative process is called “model fitting”. The accuracy of the training dataset or the
validation dataset is critical for the precision of the model.

Model training in machine learning is the process of feeding an ML algorithm with data to
help it identify and learn good values for all attributes involved. There are several types of
machine learning models, of which the most common ones are supervised and unsupervised
learning.

 Key Steps:
o Optimize model parameters using training data.
o Use libraries like scikit-learn, TensorFlow, or PyTorch.

8. Model Evaluation
 Model evaluation is the process of using metrics to analyze the performance of a
model. Model development is a multi-step process, and a check should be kept on
how well the model generalizes to future predictions. Evaluating a model therefore
plays a vital role in judging its performance, and the evaluation also helps to reveal
a model's key weaknesses. Common metrics include Accuracy, Precision, Recall,
F1 score, Area under the Curve, the Confusion Matrix, and Mean Squared Error.
Cross-validation is one technique that is applied during the training phase and
serves as a model evaluation technique as well.
 Cross Validation and Holdout
 Cross-validation is a method in which we do not use the whole dataset for
training; some part of the dataset is reserved for testing the model. There are
many types of cross-validation, of which K-Fold Cross Validation is the most
widely used. In K-Fold Cross Validation the original dataset is divided into k
subsets, known as folds. Training is repeated k times: in each round one fold is
held out for testing and the remaining k−1 folds are used for training, so every
data point serves once as test data and k−1 times as training data. This technique
generalizes the model well and reduces the error rate.
 Key Steps:
o Use validation and test datasets.
o Evaluate using appropriate metrics (e.g., accuracy, precision, recall, RMSE).
o Perform cross-validation for robustness.
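
A minimal sketch of k-fold cross-validation as described above, with k = 5; the dataset and model are stand-ins.

# 5-fold cross-validation: each fold is used exactly once for testing
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5, scoring="accuracy")
print("Per-fold accuracy:", scores)
print("Mean accuracy:", scores.mean())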

9. Hyperparameter Tuning

 Hyperparameter tuning is the process of selecting the optimal values for a
machine learning model's hyperparameters. Hyperparameters are settings that
control the learning process of the model, such as the learning rate, the number of
neurons in a neural network, or the kernel size in a support vector machine. The
goal of hyperparameter tuning is to find the values that lead to the best
performance on a given task.
 Key Steps:
o Use techniques like grid search, random search, or Bayesian optimization.
o Evaluate the impact of hyperparameters on performance.
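
A sketch of grid search over two SVM hyperparameters; the grid values are arbitrary assumptions.

# Grid search: try every combination in the grid, scored by cross-validation
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}
search = GridSearchCV(SVC(), param_grid, cv=5).fit(X, y)
print("Best hyperparameters:", search.best_params_)
print("Best cross-validated accuracy:", search.best_score_)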

10. Model Deployment

 Objective: Integrate the model into a production environment.


 Key Steps:
o Deploy the model as an API, microservice, or embedded system.
o Monitor the model’s performance in real-time.
o Use tools like Flask, FastAPI, or ML platforms (e.g., AWS Sagemaker).
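
A minimal sketch of serving a trained model as a Flask API; the file name app.py, the /predict endpoint, and the pickled model.pkl file are hypothetical choices for illustration.

# app.py - hypothetical prediction endpoint for a previously saved scikit-learn model
import pickle
from flask import Flask, request, jsonify

app = Flask(__name__)
with open("model.pkl", "rb") as f:                  # assumes the trained model was pickled earlier
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    features = request.get_json()["features"]       # e.g. {"features": [[5.1, 3.5, 1.4, 0.2]]}
    prediction = model.predict(features).tolist()
    return jsonify({"prediction": prediction})

if __name__ == "__main__":
    app.run(port=5000)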

11. Monitoring and Maintenance

 Objective: Ensure the model remains accurate and relevant over time.
 Key Steps:
o Monitor for data drift or concept drift.
o Retrain the model with updated data as needed.
o Log errors and feedback for continuous improvement.
12. Model Improvement and Iteration

 Objective: Refine the model based on feedback and new insights.


 Key Steps:
o Collect new data and update the training pipeline.
o Experiment with alternative algorithms or architectures.
o Test new features or preprocessing methods.

9. Data Acquisition:

Data acquisition (DAQ) refers to the process of gathering and storing data for machine
learning from a variety of sources, such as sensors, databases, or web scraping. The
procedure entails collecting, examining, and using the relevant data to ensure precise
measurements, real-time observation, and informed decision-making. In a DAQ system,
sensors, measuring devices, and a computer work together to transform physical parameters
into electrical signals, condition and amplify those signals, and then store them for analysis.

Components of Data Acquisition System


To understand how data is selected and processed, a data acquisition system consists of the
following key components: sensors, measuring instruments, and a computer.
1. Sensors: Sensors are devices that quantify and translate physical parameters
like voltage, pressure, or temperature into electrical impulses. Later, these signals
are sent to the measuring devices for additional analysis.
2. Signal Conditioner: Signal conditioning is the process of improving raw
sensor signals so they can be reliably understood. To make sure that the signals
are dependable, clear, and compatible with the rest of the system, signal
conditioning procedures include isolation, amplification, and filtering.
 Amplification: It helps in improving accuracy by maximizing the signal
strength
 Filtering: Filters extra and unwanted noise from the signal
 Isolation: Helps in separating sensor from DAQ system.
3. Analog-to-digital Converter: After the signals are conditioned, they must be
translated into a digital format that computers can comprehend using an analog-
to-digital converter (ADC). The continuous analog signals are transformed into
discrete digital values so that the system can process and store them.
4. Data Logger: The data logger serves as the operation's central nervous system. It is a
device or software program responsible for managing incoming data, controlling the
acquisition process, and storing it for subsequent analysis.
5. Data Processing Unit: After receiving data from the ADC, the system uses a dedicated
card to process the signals (sampling, buffering, and data transfer).
6. Data Storage : Acquired data is stored in the computer’s memory for real-time
monitoring.
Challenges in Data Acquisition

 Data Availability: Difficulty finding suitable data sources.


 Data Bias: Uneven representation leading to biased model predictions.
 Data Completeness: Missing data points or incomplete datasets.
 Data Integration: Combining data from multiple sources with different formats.
 Cost: High expense in obtaining proprietary data or infrastructure.

10. Feature Engineering:

 Definition: Feature engineering is the process of selecting, modifying, or creating


new features from raw data to improve the performance of machine learning models.
 Techniques:
o Feature Selection: Choosing the most relevant features.
o Feature Transformation: Scaling, normalizing, or encoding features.
o Feature Extraction: Creating new features from existing ones, such as
extracting text sentiment or converting time into day parts.

Feature Engineering is a crucial step in the machine learning pipeline. It involves creating,
transforming, or selecting features (variables) from raw data to improve the performance and
accuracy of a machine learning model. The goal is to provide the model with the most
relevant and informative features.

1. Why Feature Engineering is Important

 Improves Model Performance: Better features often lead to better predictions.


 Reduces Complexity: Removes irrelevant or redundant features.
 Handles Data Limitations: Creates meaningful features from limited or noisy data.
 Enhances Interpretability: Makes the model easier to understand by using domain-specific
features.

2. Key Steps in Feature Engineering

a. Feature Selection

 Selecting the most important features and removing irrelevant or redundant ones.
 Techniques:
o Filter Methods: Correlation, mutual information, chi-square.
o Wrapper Methods: Recursive Feature Elimination (RFE).
o Embedded Methods: Lasso regularization, decision tree feature importance.

b. Feature Transformation

 Transforming raw features into a format that makes them more usable.
 Techniques:
o Scaling: Normalize or standardize features (e.g., Min-Max Scaling, Z-Score
Normalization).
o Encoding: Convert categorical variables into numerical (e.g., One-Hot Encoding,
Label Encoding).
o Log Transformation: Reduces skewness in data.
o Polynomial Features: Adds interaction terms or higher-order terms for non-linear
relationships.
c. Feature Creation

 Creating new features from existing ones.


 Techniques:
o Aggregation: Compute statistics like mean, median, or sum.
o Time-Based Features: Extract features like day, month, or time intervals.
o Text Features: Use techniques like TF-IDF, bag of words, or embeddings.
o Domain Knowledge: Incorporate features derived from expertise in the subject
matter.

d. Feature Encoding

 Converting non-numeric data into numeric data.


 Common methods:
o Binary Encoding
o Frequency Encoding
o Hashing

e. Dimensionality Reduction

 Reducing the number of features while retaining most of the information.


 Techniques:
o Principal Component Analysis (PCA)
o t-SNE
o Autoencoders
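
A brief sketch of dimensionality reduction with PCA; the dataset and the choice of two components are assumptions.

# PCA: project 30-dimensional features onto 2 principal components
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_breast_cancer(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)        # PCA is sensitive to feature scale
X_2d = PCA(n_components=2).fit_transform(X_scaled)
print("Original shape:", X.shape, "-> reduced shape:", X_2d.shape)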

3. Tools for Feature Engineering

 Python Libraries:
o scikit-learn: Feature selection and preprocessing tools.
o pandas: Data manipulation and creation of new features.
o NumPy: Mathematical operations for feature transformations.
o Featuretools: Automated feature engineering.
 R Libraries: dplyr, caret.

4. Examples of Feature Engineering

a. Numerical Features

 Normalization: Convert raw income values to a range between 0 and 1.


 Bin Features: Divide ages into groups (e.g., 0-18, 19-35, 36-60).

b. Categorical Features

 One-Hot Encoding: Convert "City" with values [New York, London, Paris] into binary
columns.
 Frequency Encoding: Replace categories with their occurrence frequencies.

c. Time Series Features

 Extract "Day of Week," "Hour of Day," or "Time Since Last Event."

d. Text Data

 Use word embeddings (e.g., Word2Vec, GloVe) to represent text numerically.


11. Data Representation:

 Definition: Data representation refers to how data is structured and stored so that
machine learning models can interpret it effectively.
 Types:
o Numerical Representation: Using numbers to represent data, like integers or
floating-point numbers.
o Categorical Representation: Using categories or labels, often transformed
into numerical form using techniques like one-hot encoding.
o Vector Representation: Representing data in vectorized form, often used in
natural language processing or image processing.

Data Representation in Machine Learning (ML) refers to how raw data is transformed,
structured, and encoded into a format that machine learning algorithms can process. Choosing
the right representation of data is critical to building effective models.

1. Importance of Data Representation

 Compatibility: Machine learning algorithms require data in numerical format.


 Interpretability: Good representation makes it easier to understand the relationships within
data.
 Efficiency: Proper representation improves the performance of models in terms of speed
and accuracy

2. Types of Data Representation

a. Numerical Representation

 Raw Numbers: Used for features like age, income, or temperature.


 Normalized Values: Scaled to a specific range (e.g., 0 to 1) to prevent dominance of large
values.

b. Categorical Representation

 One-Hot Encoding: Converts categories into binary vectors. Example:


o "Red, Green, Blue" → [1, 0, 0], [0, 1, 0], [0, 0, 1].
 Ordinal Encoding: Assigns numeric values to ordered categories. Example:
o "Low, Medium, High" → 0, 1, 2.
 Binary Encoding: Encodes categories into binary numbers.

c. Text Representation

 Bag of Words (BoW): Represents text as a frequency vector of words.


 TF-IDF (Term Frequency-Inverse Document Frequency): Adjusts word frequency based on
importance across documents.
 Word Embeddings: Dense vector representations capturing semantic relationships.
Examples:
o Word2Vec, GloVe, FastText.
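
A short sketch of the bag-of-words and TF-IDF representations with scikit-learn; the two sentences are made up.

# Turning text into numeric vectors
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["machine learning is fun", "deep learning is a subset of machine learning"]

bow = CountVectorizer().fit(docs)
print("Vocabulary:", bow.vocabulary_)
print("Bag-of-words matrix:\n", bow.transform(docs).toarray())

tfidf = TfidfVectorizer().fit_transform(docs)
print("TF-IDF matrix:\n", tfidf.toarray())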

d. Time-Series Representation

 Sequences: Represented as a series of values over time. Example:


o [23, 25, 28, 30].
 Feature Extraction: Mean, variance, trend, seasonality, etc.
e. Image Representation

 Pixel Values: Represent images as matrices of pixel intensity values.


 Feature Maps: Extracted using convolutional layers in deep learning models.

f. Graph Representation

 Adjacency Matrices: Represents connections between nodes.


 Node Embeddings: Captures relationships between nodes in lower dimensions (e.g.,
Node2Vec).

g. Audio Representation

 Waveforms: Raw audio signals.


 Spectrograms: Visual representation of frequency changes over time.
 MFCCs (Mel Frequency Cepstral Coefficients): Extracted features for audio analysis.

3. Transformations for Data Representation

 Scaling and Normalization:


o StandardScaler: Standardizes data to have zero mean and unit variance.
o MinMaxScaler: Scales data to a fixed range (e.g., [0, 1]).
 Encoding Techniques:
o Convert non-numeric data into numeric formats.
 Dimensionality Reduction:
o PCA, t-SNE, or Autoencoders to reduce feature space

4. Factors Influencing Representation

 Type of Data: Numerical, categorical, text, image, etc.


 Algorithm Requirements: Linear models, neural networks, decision trees, etc., have
different data needs.
 Model Interpretability: Some representations are easier for humans to interpret.
 Computational Resources: High-dimensional data requires more storage and processing
power

12. Model Selection:

Model Selection in Machine Learning (ML) is the process of choosing the most
appropriate algorithm or model for a given problem. The goal is to identify the
model that best balances bias and variance to make accurate predictions. The
choice of model depends on various factors, including the type of data, the
problem at hand, the computational resources, and the performance
requirements.
 Problem formulation: Clearly express the issue at hand, including the kind of
predictions or task that you'd like the model to carry out (for example,
classification, regression, or clustering).
 Candidate model selection: Pick a group of models that are appropriate for
the issue at hand. These models can include straightforward methods like
decision trees or linear regression as well as more sophisticated ones like deep
neural networks, random forests, or support vector machines.
 Performance evaluation: Establish measures for measuring how well each
model performs. Common measurements include area under the receiver's
operating characteristic curve (AUC-ROC), recall, F1-score, mean squared
error, and accuracy, precision, and recall. The type of problem and the
particular requirements will determine which metrics are used.
 Training and evaluation: Each candidate model should be trained using a
subset of the available data (the training set), and its performance should be
assessed using a different subset (the validation set or via cross-validation). The
established evaluation measures are used to gauge the model's effectiveness.
 Model comparison: Evaluate the performance of various models and
determine which one performs best on the validation set. Take into account
elements like data handling capabilities, interpretability, computational
difficulty, and accuracy.
 Hyperparameter tuning: Before training, many models require that certain
hyperparameters, such as the learning rate, regularisation strength, or the
number of layers that are hidden in a neural network, be configured. Use
methods like grid search, random search, and Bayesian optimization to identify
these hyperparameters' ideal values.
 Final model selection: After the models have been analyzed and fine-tuned,
pick the model that performs the best. Then, this model can be used to make
predictions based on fresh, unforeseen data.

3. Model Evaluation

After selecting a model, it's crucial to evaluate its performance using appropriate metrics:

a. Classification Metrics

 Accuracy: Proportion of correct predictions.


 Precision, Recall, F1-Score: Used to evaluate performance in
imbalanced datasets.
 Confusion Matrix: Visualizes true positives, false positives, true
negatives, and false negatives.
 ROC-AUC: Evaluates model performance across different thresholds.

b. Regression Metrics

 Mean Absolute Error (MAE): Average of absolute errors.


 Mean Squared Error (MSE): Average of squared errors.
 R-squared: Proportion of variance in the target variable explained by the
model.

Model Selection Process

a. Define the Problem: Clearly understand the task (e.g., classification, regression,
clustering).
b. Consider Data Type and Size: Select models based on data type (e.g., numerical,
categorical) and volume (small vs. large datasets).
c. Evaluate Simpler Models First: Start with simpler models (e.g., linear
regression, decision trees) to set a baseline.
d. Experiment with Complex Models: Try more sophisticated models (e.g.,
neural networks, gradient boosting) if simpler models do not perform well.
e. Hyperparameter Tuning: Use techniques like Grid Search, Random Search,
or Bayesian Optimization to fine-tune hyperparameters.
f. Cross-Validation: Use k-fold cross-validation to avoid overfitting and get a
more reliable estimate of model performance.

5. Challenges in Model Selection

 Overfitting vs. Underfitting: Choosing a model with the right complexity.
 Hyperparameter Tuning: Finding the best configuration for a given
model.
 Computational Cost: Some models require more resources (e.g., deep
learning models).
 Interpretability: Balancing model performance with ease of
interpretation

13. Model Learning:

 Definition: Model learning refers to the process of training a machine learning model
by feeding it data so that it can learn patterns or relationships in the data.

Model learning in machine learning (ML) refers to the process by which a model is
trained to make predictions or take actions based on data. Here's a breakdown of the
model learning process:
Types of Model Learning

1. Supervised Learning: The model learns from labeled data, where the correct output is
provided for each input.

2. Unsupervised Learning: The model learns from unlabeled data, identifying patterns and
relationships on its own.

3. Semi-supervised Learning: A combination of supervised and unsupervised learning,
where some data is labeled and some is not.

4. Reinforcement Learning: The model learns through trial and error, receiving rewards or
penalties for its actions.

Steps in Model Learning

1. Data Preparation: Collecting, cleaning, and preprocessing the data.

2. Model Selection: Choosing a suitable algorithm and model architecture.

3. Model Training: Training the model using the prepared data.

4. Model Evaluation: Assessing the model's performance using metrics such as accuracy,
precision, and recall.

5. Model Tuning: Adjusting hyperparameters to optimize the model's performance.

6. Model Deployment: Deploying the trained model in a production environment.

Key Concepts in Model Learning

1. Overfitting: When a model is too complex and performs well on training data but poorly
on new data.

2. Underfitting: When a model is too simple and fails to capture the underlying patterns in the
data.

3. Bias-Variance Tradeoff: The balance between model complexity and error.

4. Regularization: Techniques such as L1 and L2 regularization to prevent overfitting.

5. Early Stopping: Stopping the training process when the model's performance on the
validation set starts to degrade.
Popular Machine Learning Algorithms

1. Linear Regression: A linear model for predicting continuous outcomes.

2. Decision Trees: A tree-based model for classification and regression tasks.

3. Random Forests: An ensemble model combining multiple decision trees.

4. Support Vector Machines (SVMs): A linear or non-linear model for classification and
regression tasks.

5. Neural Networks: A complex model inspired by the human brain, used for a wide range of
tasks.

14. Model Evaluation:

 Definition: Model evaluation is the process of assessing the performance of a model using various metrics and validation techniques.
 Methods:
 Methods:
 Accuracy: Accuracy is defined as the ratio of the number of correct
predictions to the total number of predictions. This is the most
fundamental metric used to evaluate the model. The formula is
given by

Accuracy = (TP+TN)/(TP+TN+FP+FN)

EXAMPLE: print("Accuracy:", accuracy_score(y_test,y_pred))

Output:
Accuracy: 0.9333333333333333

Precision and Recall


Precision is the ratio of true positives to the summation of true positives
and false positives. It basically analyses the positive predictions.
Precision = TP/(TP+FP)
The drawback of Precision is that it does not consider the True Negatives
and False Negatives.
Recall is the ratio of true positives to the summation of true positives and
false negatives. It basically analyses the number of correct positive
samples.
Recall = TP/(TP+FN)
The drawback of Recall is that often it leads to a higher false positive rate.
print("Precision:", precision_ score(y_test,y_pred, average="weighted"))
print("Recall:", recall_ score(y_test,y_pred, average="weighted"))

Output:
Precision: 0.9435897435897436
Recall: 0.9333333333333333

F1 score
The F1 score is the harmonic mean of precision and recall. It is seen that
during the precision-recall trade-off if we increase the precision, recall
decreases and vice versa. The goal of the F1 score is to combine
precision and recall.
F1 score = (2×Precision×Recall)/(Precision+Recall)

# calculating f1 score
print('F1 score:', f1_score(y_test, y_pred, average="weighted"))

Output:
F1 score: 0.9327777777777778

Confusion Matrix
A confusion matrix is an N x N matrix where N is the number of target
classes. It represents the number of actual outputs and the predicted
outputs. Some terminologies in the matrix are as follows:
 True Positives: It is also known as TP. It is the output in which the
actual and the predicted values are YES.
 True Negatives: It is also known as TN. It is the output in which the
actual and the predicted values are NO.
 False Positives: It is also known as FP. It is the output in which the
actual value is NO but the predicted value is YES.
 False Negatives: It is also known as FN. It is the output in which the
actual value is YES but the predicted value is NO.

# Requires the sklearn.metrics module and matplotlib
from sklearn import metrics
import matplotlib.pyplot as plt

confusion_matrix = metrics.confusion_matrix(y_test, y_pred)

cm_display = metrics.ConfusionMatrixDisplay(
    confusion_matrix=confusion_matrix,
    display_labels=[0, 1, 2])

cm_display.plot()
plt.show()

Output: a plot of the confusion matrix for the three classes.
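
The metric snippets above assume that y_test and y_pred already exist. The following end-to-end sketch shows one way they could be produced; the Iris data and the k-NN classifier are assumptions made only for illustration.

# End-to-end sketch: train a classifier, then compute the evaluation metrics above
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

clf = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
y_pred = clf.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred, average="weighted"))
print("Recall:", recall_score(y_test, y_pred, average="weighted"))
print("F1 score:", f1_score(y_test, y_pred, average="weighted"))
print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))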

15. Model Prediction:

 Definition: Model prediction refers to the process of using a trained model to make
predictions on new, unseen data.

In other words, model prediction applies a trained model to make predictions or forecasts on new, unseen data. Here's a breakdown of
the model prediction process:

Types of Model Predictions

1. Classification Prediction: The model predicts a categorical label or class for a new input.

2. Regression Prediction: The model predicts a continuous value or outcome for a new
input.

3. Probability Prediction: The model predicts a probability distribution over possible outcomes.

Steps in Model Prediction


1. Data Preparation: Preprocess the new input data to match the format and structure of the
training data.

2. Model Input: Feed the preprocessed input data into the trained model.

3. Model Prediction: The model generates a prediction or forecast based on the input data.

4. Post-processing: Optionally, perform additional processing on the predicted output, such as rounding or thresholding.

Evaluation Metrics for Model Predictions

1. Accuracy: The proportion of correct predictions for classification tasks.

2. Mean Squared Error (MSE): The average squared difference between predicted and
actual values for regression tasks.

3. Mean Absolute Error (MAE): The average absolute difference between predicted and
actual values for regression tasks.

4. Precision: The proportion of true positives among all positive predictions for classification
tasks.

5. Recall: The proportion of true positives among all actual positive instances for
classification tasks.

Techniques for Improving Model Predictions

1. Regularization: Adding a penalty term to the loss function to prevent overfitting.

2. Early Stopping: Stopping the training process when the model's performance on the
validation set starts to degrade.

3. Ensemble Methods: Combining the predictions of multiple models to improve overall performance.

4. Hyperparameter Tuning: Adjusting the model's hyperparameters to optimize its performance.

16. Search and Learning:

 Search: Involves exploring a problem space to find the best solution. It can be applied
in algorithms like search trees or optimization tasks.
o Types:
 Depth-first Search (DFS): Explores as far as possible down one branch before
backtracking.
 Breadth-first Search (BFS): Explores all nodes at the present depth level
before moving on to nodes at the next depth level.

Search in machine learning refers to the process of finding the optimal solution
or configuration that maximizes the performance of a machine learning model.
Here are some key aspects of search in machine learning:
Types of Search

1. Hyperparameter Search: Finding the optimal hyperparameters for a machine learning model, such as learning rate, regularization strength, or number of hidden layers.

2. Model Search: Selecting the best machine learning model from a set of candidate models,
such as decision trees, random forests, or neural networks.

3. Feature Search: Identifying the most relevant features or variables that contribute to the
performance of a machine learning model.

Search Algorithms

1. Grid Search: Exhaustively searching through a predefined grid of hyperparameters to find the optimal combination.

2. Random Search: Randomly sampling hyperparameters from a predefined distribution to find the optimal combination.

3. Bayesian Optimization: Using Bayesian inference to search for the optimal hyperparameters by modeling the objective function as a probabilistic distribution.

4. Gradient-Based Optimization: Using gradient-based methods, such as gradient descent, to search for the optimal hyperparameters.
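
A hedged sketch of grid search and random search with scikit-learn follows; the k-NN estimator and its search space are illustrative assumptions, not the only possible choices.

# Sketch: exhaustive grid search vs. random search over k-NN hyperparameters
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

param_grid = {"n_neighbors": [1, 3, 5, 7, 9], "weights": ["uniform", "distance"]}

grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)   # tries every combination
grid.fit(X, y)
print("Grid search best params:", grid.best_params_)

rand = RandomizedSearchCV(KNeighborsClassifier(), param_grid,
                          n_iter=5, cv=5, random_state=0)       # samples 5 combinations
rand.fit(X, y)
print("Random search best params:", rand.best_params_)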

Search Strategies

1. Exhaustive Search: Searching through all possible combinations of hyperparameters or models.

2. Heuristic Search: Using heuristics or rules of thumb to guide the search process.

3. Metaheuristic Search: Using metaheuristics, such as genetic algorithms or particle swarm optimization, to search for the optimal solution.

Learning: Machine learning algorithms improve through experience, refining their models
and predictions over time.

17. Data Sets:

A Dataset is a set of data grouped into a collection with which developers can
work to meet their goals. In a dataset, the rows represent the number of data
points and the columns represent the features of the Dataset. They are mostly
used in fields like machine learning, business, and government to gain insights,
make informed decisions, or train algorithms. Datasets may vary in size and
complexity and they mostly require cleaning and preprocessing to ensure data
quality and suitability for analysis or modeling.

Consider the Iris dataset as an example. Since this is a dataset with which we build models, there
are input features and output features. Here:
 The input features are Sepal Length, Sepal Width, Petal Length, and Petal
Width.
 Species is the output feature.
Datasets can be stored in multiple formats. The most common ones are CSV,
Excel, JSON, and zip files for large datasets such as image datasets.
Datasets are used to train and test AI models, analyze trends, and gain insights
from data. They provide the raw material for computers to learn patterns and
make predictions.
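
As a small illustration, the sketch below loads the Iris dataset into a pandas DataFrame so that rows correspond to data points and columns to features. It assumes a reasonably recent scikit-learn (for as_frame=True), and the CSV path in the comment is hypothetical.

# Sketch: a dataset as rows (data points) and columns (features)
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
df = iris.frame                  # 150 rows (data points), 4 feature columns + target
print(df.shape)                  # (150, 5)
print(df.head())

# A file-based dataset could be read similarly, for example:
# df = pd.read_csv("iris.csv")   # hypothetical file path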

Types of Datasets
There are various types of datasets available out there. They are:
 Numerical Dataset: They include numerical data points that can be
solved with equations. These include temperature, humidity, marks and
so on.
 Categorical Dataset: These include categories such as colour,
gender, occupation, games, sports and so on.
 Web Dataset: These include datasets created by calling APIs using
HTTP requests and populating them with values for data analysis.
These are mostly stored in JSON (JavaScript Object Notation) formats.
 Time series Dataset: These include datasets between a period, for
example, changes in geographical terrain over time.
 Image Dataset: It includes a dataset consisting of images. This is
mostly used to differentiate the types of diseases, heart conditions and
so on.
 Ordered Dataset: These datasets contain data that are ordered in
ranks, for example, customer reviews, movie ratings and so on.
 Partitioned Dataset: These datasets have data points segregated into
different members or different partitions.
 File-Based Datasets: These datasets are stored in files, in Excel
as .csv, or .xlsx files.
 Bivariate Dataset: In this dataset, 2 classes or features are directly
correlated to each other. For example, height and weight in a dataset
are directly related to each other.
 Multivariate Dataset: In these types of datasets, as the name
suggests 2 or more classes are directly correlated to each other. For
example, attendance, and assignment grades are directly correlated to
a student’s overall grade.

UNIT-2
Nearest Neighbor-Based Models in Machine Learning

Nearest neighbor-based models, especially the k-nearest neighbors (k-NN) algorithm, are a
fundamental class of machine learning techniques that rely on the idea of proximity or
similarity between data points for making predictions. These models are used in both
classification and regression tasks and are based on the idea that similar points are likely to
have similar labels or values. Below is a detailed breakdown of the key concepts related to
nearest neighbor-based models.

Nearest Neighbor-Based Models are non-parametric machine learning techniques that classify data points or predict values based on the similarity (or distance) to nearby data points. Here’s a detailed explanation of key concepts related to these models:

1. Introduction to Nearest Neighbor Models

 Nearest Neighbor methods rely on the idea that similar instances have similar outputs.
 Predictions are based on the proximity of a query point to data points in the training set.
 Common distance metrics:
o Euclidean Distance: $d(x, x') = \sqrt{\sum_{i=1}^n (x_i - x'_i)^2}$
o Manhattan Distance: $d(x, x') = \sum_{i=1}^n |x_i - x'_i|$
o Cosine Similarity: Measures the cosine of the angle between two vectors.

2. K-Nearest Neighbors (K-NN) for Classification

 K-NN is a simple, instance-based algorithm that classifies a data point based on the majority
label of its k-nearest neighbors.
 Algorithm:
1. Choose a value for k (number of neighbors).
2. Compute the distances between the query point and all points in the training set.
3. Identify the k closest points.
4. Assign the class label most common among these k points.
 Key Considerations:
1. Larger k smoothens the decision boundary (reduces variance but may increase bias).
2. Small k may lead to overfitting.

3. K-NN for Regression

 Predicts the value of a query point as the average (or weighted average) of the values of its
k-nearest neighbors: $\hat{y} = \frac{1}{k} \sum_{i=1}^k y_i$
o Weighted K-NN: Weights closer points more heavily:
$\hat{y} = \frac{\sum_{i=1}^k w_i y_i}{\sum_{i=1}^k w_i}, \quad w_i = \frac{1}{d(x, x_i)}$

4. Distance Metrics

 The choice of distance metric can significantly impact model performance:


o Euclidean: Best for continuous, normalized data.
o Manhattan: Works well for high-dimensional or grid-like data.

o Minkowski: Generalized distance metric: $d(x, x') = \left( \sum_{i=1}^n |x_i - x'_i|^p \right)^{1/p}$.
Special cases: $p = 2$ (Euclidean), $p = 1$ (Manhattan).
o Hamming: Suitable for categorical data.

5. Strengths and Weaknesses of K-NN

Strengths:

 Simple and Intuitive: No training phase (lazy learning).


 Versatile: Can handle classification and regression tasks.
 Adaptable: Works with different distance metrics.

Weaknesses:

 High Computation Cost: Requires computing distances to all training points at inference.
 Memory Intensive: Stores the entire training set.
 Curse of Dimensionality: Performance deteriorates in high-dimensional spaces.
 Sensitive to Noise: Outliers can affect predictions.

6. Applications of Nearest Neighbor Models

 Pattern Recognition: Handwriting and facial recognition.


 Recommender Systems: Collaborative filtering.
 Anomaly Detection: Identifying outliers.
 Medical Diagnosis: Classifying symptoms.

7. Improving K-NN

Scaling Features:

 Normalize or standardize features to ensure fair distance calculations.


Dimensionality Reduction:

 Techniques like PCA (Principal Component Analysis) reduce noise and improve performance.
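
Both improvements can be combined in a single scikit-learn pipeline, as in the sketch below; the pipeline components and parameter values are illustrative assumptions.

# Sketch: scaling + dimensionality reduction before k-NN
from sklearn.datasets import load_iris
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

pipe = make_pipeline(
    StandardScaler(),                    # scale features so distances are fair
    PCA(n_components=2),                 # reduce dimensionality / noise
    KNeighborsClassifier(n_neighbors=5),
)
print("CV accuracy:", cross_val_score(pipe, X, y, cv=5).mean())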

Introduction to Proximity Measures

Proximity measures refer to techniques that quantify the similarity or closeness between two
data points. The idea is that if two data points are close to each other in the feature space,
they are more likely to belong to the same class (in classification) or have similar values (in
regression). These proximity measures are essential for nearest neighbor algorithms.

Distance Measures

Distance measures are specific types of proximity measures that calculate the physical
distance between data points in a multi-dimensional feature space. The choice of distance
measure can have a significant impact on the performance of the nearest neighbor algorithm.

1. Euclidean Distance:
o Formula: $d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$
o The most common and widely used distance measure.
o It calculates the straight-line distance between two points in Euclidean space.
2. Manhattan Distance (also known as L1 distance or Taxicab Distance):
o Formula: $d(\mathbf{x}, \mathbf{y}) = \sum_{i=1}^{n} |x_i - y_i|$
o The sum of the absolute differences between corresponding components.
3. Minkowski Distance:
o A generalization of Euclidean and Manhattan distance.
o Formula: $d(\mathbf{x}, \mathbf{y}) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}$
o When $p = 2$, it's Euclidean distance; when $p = 1$, it's Manhattan distance.
4. Cosine Similarity:
o A measure of similarity between two non-zero vectors of an inner product
space.
o Formula: $\text{cosine similarity} = \frac{\mathbf{x} \cdot \mathbf{y}}{\|\mathbf{x}\| \, \|\mathbf{y}\|}$
o It measures the angle between the vectors, often used in text analysis or when
comparing documents.
5. Hamming Distance:
o Used for categorical data, particularly binary data.
o It counts the number of positions at which the corresponding elements of two
strings are different.
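
For reference, the short sketch below computes the measures above with SciPy's distance module; the example vectors are arbitrary.

# Sketch: computing common distance measures with SciPy
import numpy as np
from scipy.spatial import distance

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])

print("Euclidean:", distance.euclidean(x, y))
print("Manhattan:", distance.cityblock(x, y))
print("Minkowski (p=3):", distance.minkowski(x, y, p=3))
print("Cosine similarity:", 1 - distance.cosine(x, y))   # SciPy returns cosine distance

a = np.array([1, 0, 1, 1])
b = np.array([1, 1, 0, 1])
print("Hamming:", distance.hamming(a, b) * len(a))        # number of differing positions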

Non-Metric Similarity Functions


Not all similarity measures correspond to a "distance" in the strict mathematical sense. These
are non-metric similarity functions, often used in cases where the data is not naturally metric
(e.g., categorical or nominal data).

 Jaccard Similarity: Measures similarity between finite sample sets.

$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$

o Used in binary and categorical data (e.g., comparing sets).


 Pearson Correlation Coefficient: Measures linear correlation between two variables.

$\rho = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}}$

o Used in continuous data to measure similarity between variables.

Proximity Between Binary Patterns

For binary patterns (data with values of 0 or 1), the distance or similarity can be computed
using measures that specifically handle binary data:

 Hamming Distance: The number of differing bits between two binary patterns.
 Jaccard Similarity: Works well for comparing binary sets, focusing on the
intersection over the union of two sets.

Different Classification Algorithms Based on the Distance Measures

Several classification algorithms rely on proximity measures to classify data points:

1. k-Nearest Neighbors Classifier (k-NN):


o This is the most common classification algorithm based on distance measures.
o Algorithm:
1. Choose the number of neighbors k.
2. For a new data point, calculate the distance to all training data points.
3. Identify the k-nearest neighbors.
4. Assign the class label based on the majority vote of the k neighbors.
2. Radius Nearest Neighbor (RNN):
o Similar to k-NN but instead of selecting a fixed number of nearest neighbors, a
radius r is specified.
o Algorithm:
1. Choose a radius r.
2. For a new data point, find all training points within radius r.
3. Assign the class label based on the majority vote of all points within
the radius.
o This method is useful when data is sparse and you want to avoid having a
fixed k.
3. Weighted k-NN:
o A variation where closer neighbors have more influence on the decision.
o For example, use inverse distance weighting to weight closer neighbors
higher.
K-Nearest Neighbor Classifier

The k-Nearest Neighbors (k-NN) algorithm is one of the most straightforward and widely
used classification algorithms. It classifies new data points based on the majority class among
its kkk-nearest neighbors.

A worked numerical example is given after the algorithm steps below.

Algorithm Steps

Input:

 Training dataset: $D = \{(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)\}$
 Query point: $x_q$
 Number of neighbors: k
 Distance metric: $d(x_i, x_j)$ (e.g., Euclidean, Manhattan).

Procedure:

1. Compute Distances:
o Calculate the distance $d(x_q, x_i)$ between the query point $x_q$ and each data point $x_i$ in the training set.
2. Sort Neighbors:
o Sort the training points by their distance to $x_q$.
3. Select Nearest Neighbors:
o Select the k-nearest neighbors (smallest distances).
4. Majority Vote:
o Determine the majority class label among the k-nearest neighbors.
5. Output:
o Assign the query point $x_q$ to the majority class.

3. Distance Metrics

1. Euclidean Distance (default for continuous features): $d(x_i, x_j) = \sqrt{\sum_{k=1}^n (x_{ik} - x_{jk})^2}$
2. Manhattan Distance: $d(x_i, x_j) = \sum_{k=1}^n |x_{ik} - x_{jk}|$
3. Cosine Similarity (for text or high-dimensional data): $\text{Cosine similarity} = \frac{\vec{x}_i \cdot \vec{x}_j}{\|\vec{x}_i\| \, \|\vec{x}_j\|}$. Lower similarity corresponds to greater distance.
4. Hamming Distance (for categorical data): $d(x_i, x_j) = \text{number of differing attributes between } x_i \text{ and } x_j$

4. Choosing k

 Small k:
o Captures local patterns but may overfit (sensitive to noise).
 Large k:
o Smoothens predictions but may underfit (ignores local structures).
 Optimal k is usually determined via cross-validation.

Example

Dataset:

Training data:

$D = \{(1, A), (2, A), (4, B), (5, B), (6, A)\}$

Query point: $x_q = 3$, k = 3.

Steps:

1. Compute Distances:
o $d(1,3)=2$, $d(2,3)=1$, $d(4,3)=1$, $d(5,3)=2$, $d(6,3)=3$.
2. Sort Neighbors:
o Sorted by distance: (2, A), (4, B), (1, A), (5, B), (6, A).
3. Select k = 3 Nearest Neighbors:
o Neighbors: (2, A), (4, B), (1, A).
4. Majority Vote:
o Class labels: A, B, A.
o Majority class: A.
5. Prediction:
o Assign $x_q = 3$ to class A.
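
The same example can be reproduced with scikit-learn, as sketched below. Note that two training points (1 and 5) are tied at distance 2 from the query, so a library's tie-breaking could in principle pick a different third neighbour than the hand-worked example.

# Sketch: the worked 1-D example with scikit-learn's KNeighborsClassifier
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X_train = np.array([[1], [2], [4], [5], [6]])
y_train = np.array(["A", "A", "B", "B", "A"])

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
# the hand-worked example above yields class 'A'; a tie at distance 2 means
# library tie-breaking decides which of points 1 and 5 is the third neighbor
print(knn.predict([[3]]))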

Radius Distance Nearest Neighbor Algorithm

The Radius Nearest Neighbor (RNN) algorithm is an extension of the k-NN approach but
with a focus on the radius rather than the number of neighbors.

Algorithm Steps
Input:

 Training data: $D = \{(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)\}$
 Query point: $x_q$
 Radius: r
 Distance metric: $d(x_i, x_q)$ (e.g., Euclidean, Manhattan).

Procedure:

1. Compute Distances:
o Calculate the distance $d(x_i, x_q)$ between the query point $x_q$
and each data point $x_i$ in the training set.
2. Identify Neighbors:
o Select all points $x_i$ such that $d(x_i, x_q) \leq r$.
3. Prediction:
o Classification:
 Use a majority vote among the labels of the selected neighbors.
 If no neighbors are found within r, a default class can be assigned, or a
larger radius can be used.
o Regression:
 Compute the mean (or weighted mean) of the target values $y_i$ of the
selected neighbors.

Output:

 Predicted class or value for $x_q$.

3. Pseudocode
Algorithm RDNN:
Input: Training data D = {(x1, y1), (x2, y2), ..., (xn, yn)}, query point
x_q, radius r, distance metric d.

1. Initialize an empty list of neighbors: Neighbors = []


2. For each (x_i, y_i) in D:
Compute distance: dist = d(x_i, x_q)
If dist ≤ r:
Add (x_i, y_i) to Neighbors
3. If Neighbors is empty:
Return default prediction (or use larger radius)
4. If classification:
Return majority label among Neighbors
Else if regression:
Return mean(target values of Neighbors)
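
A hedged sketch of radius-based classification with scikit-learn's RadiusNeighborsClassifier follows; the training data, the radius and the outlier label are illustrative assumptions.

# Sketch: radius-based nearest neighbor classification
import numpy as np
from sklearn.neighbors import RadiusNeighborsClassifier

X_train = np.array([[1], [2], [4], [5], [6]])
y_train = np.array(["A", "A", "B", "B", "A"])

# outlier_label is the default prediction when no neighbor falls inside the radius
rnn = RadiusNeighborsClassifier(radius=1.0, outlier_label="A")
rnn.fit(X_train, y_train)

print(rnn.predict([[1.5]]))   # points 1 and 2 lie within the radius, both labelled 'A'
print(rnn.predict([[10.0]]))  # no neighbors within the radius -> outlier_label is used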

KNN Regression

In k-Nearest Neighbors Regression (k-NN regression), the output is continuous rather than
categorical. The prediction is made based on the average (or weighted average) of the target
values of the k-nearest neighbors.

Steps of the Algorithm


Input:

1. Training dataset $D = \{(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)\}$:
o $x_i$ represents the feature vector of the $i$-th data point.
o $y_i$ represents the target value of the $i$-th data point.
2. Query point $x_q$.
3. Number of neighbors k (a hyperparameter).
4. A distance metric (e.g., Euclidean distance, Manhattan distance).

Procedure:

1. Compute Distances:
o Calculate the distance $d(x_q, x_i)$ between the query point $x_q$ and each training point $x_i$.
2. Find k-Nearest Neighbors:
o Identify the k-nearest neighbors by selecting the k points with the smallest distances to $x_q$.
3. Aggregate Neighbor Target Values:
o Simple Average:
 Predict the target value as the mean of the target values of the k-nearest
neighbors: $\hat{y}_q = \frac{1}{k} \sum_{i=1}^k y_i$
o Weighted Average:
 Assign weights inversely proportional to the distances:
$\hat{y}_q = \frac{\sum_{i=1}^k w_i y_i}{\sum_{i=1}^k w_i}, \quad w_i = \frac{1}{d(x_q, x_i) + \epsilon}$
Here, $\epsilon$ is a small constant to prevent division by zero.

Output:

 Predicted target value $\hat{y}_q$ for the query point $x_q$.

Pseudocode
Algorithm KNN-Regression:
Input: Training data D = {(x1, y1), ..., (xn, yn)}, query point x_q, number
of neighbors k, distance metric d.

1. Compute distances:
For each (x_i, y_i) in D:
dist[i] = d(x_i, x_q)

2. Sort training data by distances:


Sort D based on ascending values of dist.

3. Select k-nearest neighbors:


Neighbors = first k points from sorted D.

4. Aggregate target values:


If using simple average:
Predicted value = Mean(target values of Neighbors)
If using weighted average:
Predicted value = Weighted Mean(target values of Neighbors)

5. Return Predicted value.
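
A minimal sketch of k-NN regression with scikit-learn, covering both the simple average and the distance-weighted variant, is given below; the toy data are made up for illustration.

# Sketch: k-NN regression with uniform and distance-based weights
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

X_train = np.array([[1], [2], [4], [5], [6]])
y_train = np.array([1.0, 2.0, 4.0, 5.0, 6.0])

plain = KNeighborsRegressor(n_neighbors=3, weights="uniform").fit(X_train, y_train)
weighted = KNeighborsRegressor(n_neighbors=3, weights="distance").fit(X_train, y_train)

print(plain.predict([[3]]))     # mean of the 3 nearest target values
print(weighted.predict([[3]]))  # closer neighbors receive larger weights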


Performance of Classifiers

The performance of nearest neighbor classifiers depends on several factors:

1. Choice of k: Too small a k can lead to noisy predictions (overfitting), while too
large a k can make the model too smooth (underfitting).
2. Distance Measure: The choice of distance measure can heavily affect the accuracy of
the classifier, especially when features are on different scales.
3. Dimensionality: As the number of features (dimensions) increases, the model can
struggle due to the curse of dimensionality.

Performance of Regression Algorithms

Evaluating the performance of regression algorithms is essential to understand how well the model
predicts continuous target variables. The choice of metric depends on the problem's context and the
specific goals, such as minimizing large errors, understanding variability, or achieving interpretable
results.

1. Common Regression Metrics

1.1 Mean Absolute Error (MAE)

 Measures the average magnitude of absolute errors between predicted values ($\hat{y}_i$) and actual values ($y_i$).
 Formula: $\text{MAE} = \frac{1}{n} \sum_{i=1}^n |y_i - \hat{y}_i|$
 Advantages:
o Simple to interpret.
o Less sensitive to outliers than Mean Squared Error.
 Disadvantages:
o Doesn't penalize large errors as heavily as other metrics.

1.2 Mean Squared Error (MSE)

 Measures the average of the squared differences between predicted and actual values.
 Formula: $\text{MSE} = \frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2$
 Advantages:
o Penalizes large errors more than small ones.
 Disadvantages:
o Can be heavily influenced by outliers.
o Units are squared, making interpretation less intuitive.

1.3 Root Mean Squared Error (RMSE)

 The square root of MSE; expresses error in the same units as the target variable.
 Formula: $\text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2}$
 Advantages:
o Penalizes large errors more than MAE.
o Easier to interpret than MSE.
 Disadvantages:
o Still sensitive to outliers.

1.4 R-Squared ($R^2$) or Coefficient of Determination

 Measures the proportion of variance in the target variable explained by the model.
 Formula: $R^2 = 1 - \frac{\sum_{i=1}^n (y_i - \hat{y}_i)^2}{\sum_{i=1}^n (y_i - \bar{y})^2}$, where $\bar{y}$ is the
mean of the actual values.
 Range:
o $R^2 = 1$: Perfect fit.
o $R^2 = 0$: Model predicts as well as the mean.
o $R^2 < 0$: Model performs worse than the mean prediction.
 Advantages:
o Intuitive measure of goodness-of-fit.
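
The sketch below computes these four metrics with scikit-learn; the arrays of actual and predicted values are made up purely for illustration.

# Sketch: MAE, MSE, RMSE and R^2 with scikit-learn
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5,  0.0, 2.0, 8.0])

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)                      # same units as the target variable
r2 = r2_score(y_true, y_pred)

print(f"MAE={mae:.3f}, MSE={mse:.3f}, RMSE={rmse:.3f}, R^2={r2:.3f}")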


UNIT-3
Models Based on Decision Trees

1. Decision Trees for Classification

 Decision Trees are supervised learning models used for classification and regression tasks.
 Working Principle:
o They split the data into subsets based on the feature that provides the most
homogeneous subsets (purity of classes).
o Each internal node represents a test on a feature, branches represent the outcome
of the test, and leaves represent class labels.

2. Impurity Measures

Impurity measures quantify how "mixed" the data is in terms of target classes:

 Gini Index: Measures the likelihood of incorrect classification.

$G = 1 - \sum_{i=1}^k p_i^2$

where $p_i$ is the probability of class i. Lower Gini indicates purer nodes.

 Entropy: Measures the amount of information needed to classify a data point.

$H = -\sum_{i=1}^k p_i \log_2(p_i)$

Nodes with lower entropy are preferred.
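
A small sketch computing both impurity measures for a given class distribution follows; the example probabilities are arbitrary.

# Sketch: Gini index and entropy of a node's class distribution
import numpy as np

def gini(p):
    p = np.asarray(p, dtype=float)
    return 1.0 - np.sum(p ** 2)

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                       # avoid log2(0)
    return -np.sum(p * np.log2(p))

probs = [0.5, 0.5]                     # maximally mixed two-class node
print("Gini:", gini(probs))            # 0.5
print("Entropy:", entropy(probs))      # 1.0

probs = [0.9, 0.1]                     # nearly pure node
print("Gini:", gini(probs))            # 0.18
print("Entropy:", round(entropy(probs), 3))  # about 0.469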


3. Properties of Decision Trees

 Advantages:
o Easy to interpret and visualize.
o Handles both numerical and categorical data.
o Non-parametric (does not assume a distribution of the data).

 Disadvantages:
o Prone to overfitting, especially with deep trees.
o Sensitive to small changes in data (instability).

4. Regression Based on Decision Trees

 In regression, decision trees predict continuous values.


 Splits minimize the mean squared error (MSE) or mean absolute error (MAE) to achieve
homogeneous subsets.

5. Bias–Variance Trade-off

 Bias: Error due to overly simplistic models (underfitting).


 Variance: Error due to model sensitivity to small changes in training data (overfitting).
 Decision trees tend to have low bias but high variance. Regularization methods like limiting
tree depth and pruning help reduce variance.

6. Random Forests for Classification and Regression

 A Random Forest is an ensemble of decision trees where:


o Each tree is trained on a bootstrapped sample of the data.
o Features for splitting are randomly selected at each node.
 Advantages:
o Combines multiple trees to reduce overfitting (low variance) while maintaining
accuracy (low bias).
o Robust to noise and works well with large datasets.
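
As an illustration, the sketch below compares a single decision tree with a random forest by cross-validation; the dataset and hyperparameters are assumptions made only for the example.

# Sketch: decision tree vs. random forest
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

tree = DecisionTreeClassifier(random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0)  # ensemble of trees

print("Decision tree CV accuracy:", cross_val_score(tree, X, y, cv=5).mean())
print("Random forest CV accuracy:", cross_val_score(forest, X, y, cv=5).mean())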

The Bayes Classifier

1. Introduction to the Bayes Classifier

 The Bayes classifier is based on Bayes' theorem and predicts the class that maximizes the
posterior probability: $\text{Class} = \arg\max_{k} P(C_k \mid X)$

2. Bayes’ Rule and Inference

Bayes' rule provides a way to update probabilities based on evidence:

$P(C_k \mid X) = \frac{P(X \mid C_k)\, P(C_k)}{P(X)}$

 $P(C_k \mid X)$: Posterior probability (class given data).
 $P(X \mid C_k)$: Likelihood (data given class).
 $P(C_k)$: Prior probability (class before data).
 $P(X)$: Evidence (overall data distribution).

3. The Bayes Classifier and Its Optimality

 The Bayes classifier is optimal in minimizing the classification error if the true probabilities
$P(C_k \mid X)$ are known.
 In practice, these probabilities are often estimated from the data.

4. Multi-Class Classification

 Extends the Bayes classifier to handle multiple classes by computing posterior probabilities
for each class and choosing the one with the highest value.

5. Class Conditional Independence and Naive Bayes Classifier (NBC)

 Class Conditional Independence: Assumes that features $X_1, X_2, \ldots, X_n$ are independent
given the class C: $P(X \mid C) = \prod_{i=1}^n P(X_i \mid C)$
 Naive Bayes Classifier:
o Based on the independence assumption, it calculates posterior probabilities
efficiently.
o Works well for text classification and spam filtering, despite its simplicity.
o Advantages: Fast, scalable, and effective with high-dimensional data.
o Disadvantages: Assumption of feature independence may not hold in practice.
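
A minimal Naive Bayes sketch with scikit-learn's GaussianNB (which additionally assumes each feature is normally distributed within a class) is shown below; the dataset is an illustrative choice.

# Sketch: Gaussian Naive Bayes classification
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

nb = GaussianNB().fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, nb.predict(X_test)))
print("Posterior probabilities of first test point:", nb.predict_proba(X_test[:1]))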

UNIT-4
1. Introduction to Linear Discriminants

 Linear discriminants separate data points using a hyperplane.


 The decision boundary is defined as: $w^T x + b = 0$, where
w is the weight vector, x is the input, and b is the bias term.
 Applications: Classification tasks where data is linearly separable.

2. Linear Discriminants for Classification

 A linear discriminant divides the feature space into regions corresponding to


different classes.
 For two classes, a simple discriminant function is: $y = \text{sign}(w^T x + b)$, giving $y = +1$ or $y = -1$,
depending on which side of the hyperplane a point lies.

3. Perceptron Classifier

 A perceptron is a linear classifier that assigns labels based on:


$y = \text{sign}(w^T x + b)$
 It works for linearly separable data and uses a simple learning algorithm to find
the hyperplane.

4. Perceptron Learning Algorithm

 Goal: Adjust weights w and bias b to correctly classify all training data.
 Algorithm:
1. Initialize $w = 0$, $b = 0$.
2. For each misclassified point $x_i$:
 Update w: $w = w + \eta y_i x_i$
 Update b: $b = b + \eta y_i$
3. Repeat until no misclassifications. $\eta$ is the learning rate.
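
The update rule above can be written from scratch in a few lines, as in the sketch below; the tiny linearly separable dataset is made up for illustration.

# Sketch: perceptron learning algorithm on a small separable dataset
import numpy as np

X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])            # labels must be +1 / -1

w = np.zeros(X.shape[1])
b = 0.0
eta = 1.0                               # learning rate

for epoch in range(100):
    errors = 0
    for xi, yi in zip(X, y):
        if yi * (np.dot(w, xi) + b) <= 0:   # misclassified (or on the boundary)
            w += eta * yi * xi
            b += eta * yi
            errors += 1
    if errors == 0:                     # converged: every point classified correctly
        break

print("w =", w, "b =", b)
print("predictions:", np.sign(X @ w + b))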

5. Support Vector Machines (SVMs)

 SVMs find the hyperplane that maximizes the margin (distance between the
hyperplane and the nearest points of each class).
 Objective: $\text{Minimize } \frac{1}{2} \| w \|^2 \quad \text{subject to } y_i(w^T x_i + b) \geq 1 \quad \forall i$
 Results in better generalization compared to perceptrons.

6. Linearly Non-Separable Case

 For non-separable data, SVMs introduce slack variables ($\xi_i$) to allow
some misclassification: $y_i(w^T x_i + b) \geq 1 - \xi_i \quad \text{and} \quad \xi_i \geq 0$
 Minimization becomes: $\text{Minimize } \frac{1}{2} \| w \|^2 + C \sum \xi_i$, where C controls the
trade-off between margin width and misclassification.

7. Non-Linear SVM

 Non-linear data can be classified by mapping it to a higher-dimensional space


where it becomes linearly separable.
 Mapping function: $\phi(x): \mathbb{R}^n \to \mathbb{R}^m$, where $m > n$.
8. Kernel Trick

 Explicitly mapping x to $\phi(x)$ can be computationally expensive.
 The kernel trick computes $\phi(x_i)^T \phi(x_j)$ directly
using a kernel function $K(x_i, x_j)$.
 Common kernels:
o Polynomial kernel: $K(x, x') = (x^T x' + 1)^d$
o Gaussian (RBF) kernel: $K(x, x') = \exp(-\|x - x'\|^2 / 2\sigma^2)$
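
A hedged sketch comparing linear, polynomial and RBF kernels on a dataset that is not linearly separable follows; make_moons and the kernel parameters are illustrative choices.

# Sketch: linear vs. kernel SVMs on non-linearly separable data
from sklearn.datasets import make_moons
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

linear_svm = SVC(kernel="linear", C=1.0)
rbf_svm = SVC(kernel="rbf", C=1.0, gamma="scale")   # Gaussian (RBF) kernel
poly_svm = SVC(kernel="poly", degree=3, C=1.0)      # polynomial kernel

for name, model in [("linear", linear_svm), ("rbf", rbf_svm), ("poly", poly_svm)]:
    print(name, "CV accuracy:", cross_val_score(model, X, y, cv=5).mean().round(3))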

9. Logistic Regression

 Logistic regression models the probability of a binary outcome using the sigmoid
function: $P(y = 1 \mid x) = \frac{1}{1 + \exp(-w^T x - b)}$
 Decision boundary: $w^T x + b = 0$.
 Optimized using maximum likelihood estimation.

10. Linear Regression

 Linear regression predicts a continuous output using a linear combination of
input features: $y = w^T x + b$
 Objective: Minimize the mean squared error (MSE): $\text{MSE} = \frac{1}{n} \sum_{i=1}^n (y_i - (w^T x_i + b))^2$

11. Multi-Layer Perceptrons (MLPs)

 An MLP is a type of neural network with:


o Input layer, one or more hidden layers, and an output layer.
o Each layer has neurons connected with weights and biases.
 Non-linear activation functions (e.g., ReLU, sigmoid) allow MLPs to model
complex relationships.

12. Backpropagation for Training an MLP

 Backpropagation adjusts weights to minimize the error (e.g., cross-entropy or


MSE):
o Forward Pass: Compute output and loss for given weights.
o Backward Pass:
 Compute gradients of the loss with respect to weights using the
chain rule.
 Update weights using gradient descent or its variants (e.g.,
Adam).
o Repeat for multiple epochs until convergence.
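
A minimal MLP sketch using scikit-learn's MLPClassifier (which performs backpropagation internally) follows; the architecture and solver settings are illustrative assumptions.

# Sketch: a small multi-layer perceptron trained with backpropagation
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scaler = StandardScaler().fit(X_train)
mlp = MLPClassifier(hidden_layer_sizes=(16, 8), activation="relu",
                    solver="adam", max_iter=1000, random_state=0)
mlp.fit(scaler.transform(X_train), y_train)
print("Test accuracy:", mlp.score(scaler.transform(X_test), y_test))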
UNIT-5
1. Introduction to Clustering
The task of grouping data points based on their similarity with each other is
called Clustering or Cluster Analysis. This method is defined under the branch
of Unsupervised Learning , which aims at gaining insights from unlabelled data
points, that is, unlike supervised learning we don’t have a target variable.
Clustering aims at forming groups of homogeneous data points from a
heterogeneous dataset. It evaluates the similarity based on a metric like Euclidean
distance, Cosine similarity, Manhattan distance, etc. and then group the points
with highest similarity score together.

For example, a dataset may clearly show three circular clusters forming on the basis of distance.

However, it is not necessary that the clusters formed are circular in shape. The
shape of clusters can be arbitrary, and there are many algorithms that work well at
detecting arbitrarily shaped clusters.
Types of Clustering Algorithms

At the surface level, clustering helps in the analysis of unstructured data. Graphing, the
shortest distance, and the density of the data points are a few of the elements that influence
cluster formation. Clustering is the process of determining how related the objects are based
on a metric called the similarity measure. Similarity metrics are easier to locate in smaller
sets of features. It gets harder to create similarity measures as the number of features
increases. Depending on the type of clustering algorithm being utilized in data mining,
several techniques are employed to group the data from the datasets. In this part, the
clustering techniques are described. Various types of clustering algorithms are:
1. Centroid-based Clustering (Partitioning methods)
2. Density-based Clustering (Model-based methods)
3. Connectivity-based Clustering (Hierarchical clustering)
4. Distribution-based Clustering
We will be going through each of these types in brief.
1. Centroid-based Clustering (Partitioning methods)
Partitioning methods are the simplest clustering algorithms. They group data points on
the basis of their closeness. Generally, the similarity measures chosen for these algorithms
are Euclidean distance, Manhattan distance or Minkowski distance. The dataset is
separated into a predetermined number of clusters, and each cluster is referenced by a
vector of values; each input data point is assigned to the cluster whose representative
vector it is closest to.
The primary drawback of these algorithms is the requirement that we establish the number
of clusters, “k,” either intuitively or scientifically (using the Elbow Method) before the
clustering system starts allocating the data points. Despite this, it is still
the most popular type of clustering. K-means and K-medoids clustering are some examples
of this type of clustering.
2. Density-based Clustering (Model-based methods)
Density-based clustering, a model-based method, finds groups based on the density of data
points. Contrary to centroid-based clustering, which requires that the number of clusters be
predefined and is sensitive to initialization, density-based clustering determines the number
of clusters automatically and is less susceptible to beginning positions. They are great at
handling clusters of different sizes and forms, making them ideally suited for datasets with
irregularly shaped or overlapping clusters. These methods manage both dense and sparse
data regions by focusing on local density and can distinguish clusters with a variety of
morphologies.

2. Partitioning of Data

Partitioning Method: This clustering method classifies the information into
multiple groups based on the characteristics and similarity of the data. It is up to the data
analyst to specify the number of clusters that has to be generated for the
clustering method. In the partitioning method, given a database (D) that contains
multiple (N) objects, the method constructs user-specified (K)
partitions of the data in which each partition represents a cluster and a particular
region. There are many algorithms that come under the partitioning method; some of
the popular ones are K-Means, PAM (K-Medoids), the CLARA algorithm (Clustering
Large Applications), etc. Below, we will see the working of the K-Means
algorithm in detail.
K-Mean (A centroid based Technique): The K means algorithm takes the input
parameter K from the user and partitions the dataset containing N objects into K
clusters so that resulting similarity among the data objects inside the group
(intracluster) is high but the similarity of data objects with the data objects from
outside the cluster is low (intercluster). The similarity of the cluster is determined
with respect to the mean value of the cluster. It is a type of square error
algorithm. At the start randomly k objects from the dataset are chosen in which
each of the objects represents a cluster mean(centre). For the rest of the data
objects, they are assigned to the nearest cluster based on their distance from the
cluster mean. The new mean of each of the cluster is then calculated with the
added data objects.
Algorithm:
K mean:
Input:
K: The number of clusters in which the dataset has to be divided
D: A dataset containing N number of objects

Output:
A dataset of K clusters
Method:
1. Randomly assign K objects from the dataset(D) as cluster centres(C)
2. (Re) Assign each object to which object is most similar based upon mean
values.
3. Update Cluster means, i.e., Recalculate the mean of each cluster with the
updated values.
4. Repeat Step 2 until no change occurs.
Figure – K-mean Clustering
Example: Suppose we want to group the visitors to a website using just their age
as follows:
16, 16, 17, 20, 20, 21, 21, 22, 23, 29, 36, 41, 42, 43, 44, 45, 61, 62, 66
Initial Cluster:
K=2
Centroid(C1) = 16 [16]
Centroid(C2) = 22 [22]
Note: These two points are chosen randomly from the dataset.
Iteration-1:
C1 = 16.33 [16, 16, 17]
C2 = 37.25 [20, 20, 21, 21, 22, 23, 29, 36, 41, 42, 43, 44, 45, 61, 62, 66]
Iteration-2:
C1 = 19.55 [16, 16, 17, 20, 20, 21, 21, 22, 23]
C2 = 46.90 [29, 36, 41, 42, 43, 44, 45, 61, 62, 66]
Iteration-3:
C1 = 20.50 [16, 16, 17, 20, 20, 21, 21, 22, 23, 29]
C2 = 48.89 [36, 41, 42, 43, 44, 45, 61, 62, 66]
Iteration-4:
C1 = 20.50 [16, 16, 17, 20, 20, 21, 21, 22, 23, 29]
C2 = 48.89 [36, 41, 42, 43, 44, 45, 61, 62, 66]
No change Between Iteration 3 and 4, so we stop. Therefore we get the
clusters (16-29) and (36-66) as 2 clusters we get using K Mean Algorithm.
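
The same grouping can be reproduced with scikit-learn's KMeans, as sketched below; note that k-means++ initialization is used instead of the random choice of 16 and 22 made above, so the intermediate iterations differ even if the final two groups typically match.

# Sketch: grouping website visitors by age with KMeans
import numpy as np
from sklearn.cluster import KMeans

ages = np.array([16, 16, 17, 20, 20, 21, 21, 22, 23, 29,
                 36, 41, 42, 43, 44, 45, 61, 62, 66]).reshape(-1, 1)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(ages)
print("Cluster centres:", kmeans.cluster_centers_.ravel())
print("Labels:", kmeans.labels_)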

3. Matrix Factorization

Matrix factorization is a simple embedding model. Given the feedback matrix A ∈Rm×n,
where m is the number of users (or queries) and n is the number of items, the model learns:
 A user embedding matrix U∈Rm×d, where row i is the embedding for user i.
 An item embedding matrix V∈Rn×d, where row j is the embedding for item j.

The embeddings are learned such that the product $UV^T$ is a good approximation of the
feedback matrix A. Observe that the (i, j) entry of $UV^T$ is simply the dot product $\langle U_i, V_j \rangle$ of
the embeddings of user i and item j, which you want to be close to $A_{i,j}$.
 Note: Matrix factorization typically gives a more compact representation than learning the
full matrix. The full matrix has O(nm) entries, while the embedding
matrices U, V have O((n+m)d) entries, where the embedding dimension d is typically much
smaller than m and n. As a result, matrix factorization finds latent structure in the data,
assuming that observations lie close to a low-dimensional subspace. In the preceding
example, the values of n, m, and d are so low that the advantage is negligible. In real-world
recommendation systems, however, matrix factorization can be significantly more compact
than learning the full matrix.
Choosing the objective function

One intuitive objective function is the squared distance. To do this, minimize the sum of
squared errors over all pairs of observed entries:

$\min_{U \in \mathbb{R}^{m \times d},\ V \in \mathbb{R}^{n \times d}} \sum_{(i,j) \in \text{obs}} (A_{ij} - \langle U_i, V_j \rangle)^2$

In this objective function, you only sum over observed pairs (i, j), that is, over non-zero
values in the feedback matrix. However, only summing over values of one is not a good idea
—a matrix of all ones will have a minimal loss and produce a model that can't make effective
recommendations and that generalizes poorly.

Perhaps you could treat the unobserved values as zero, and sum over all entries in the matrix. This corresponds to minimizing the
squared Frobenius distance between A and its approximation $UV^T$:

$\min_{U \in \mathbb{R}^{m \times d},\ V \in \mathbb{R}^{n \times d}} \| A - UV^T \|_F^2$
You can solve this quadratic problem through Singular Value Decomposition (SVD) of the
matrix. However, SVD is not a great solution either, because in real applications, the
matrix A may be very sparse. For example, think of all the videos on YouTube compared to
all the videos a particular user has viewed. The solution UVT (which corresponds to the
model's approximation of the input matrix) will likely be close to zero, leading to poor
generalization performance.

In contrast, Weighted Matrix Factorization decomposes the objective into the following
two sums:

 A sum over observed entries.


 A sum over unobserved entries (treated as zeroes).
$\min_{U \in \mathbb{R}^{m \times d},\ V \in \mathbb{R}^{n \times d}} \sum_{(i,j) \in \text{obs}} (A_{ij} - \langle U_i, V_j \rangle)^2 + w_0 \sum_{(i,j) \notin \text{obs}} (\langle U_i, V_j \rangle)^2$
Here, w0 is a hyperparameter that weights the two terms so that the objective is not
dominated by one or the other. Tuning this hyperparameter is very important.
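
A toy sketch of weighted matrix factorization trained by gradient descent on this objective is given below; the feedback matrix, embedding dimension d, weight w0 and learning rate are made-up illustrations, not a production recommender.

# Sketch: weighted matrix factorization by gradient descent
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[5.0, 3.0, 0.0],
              [4.0, 0.0, 0.0],
              [0.0, 1.0, 4.0]])          # 0 marks an unobserved entry
observed = (A > 0).astype(float)

m, n, d = A.shape[0], A.shape[1], 2
U = 0.1 * rng.standard_normal((m, d))    # user embeddings
V = 0.1 * rng.standard_normal((n, d))    # item embeddings
w0, lr = 0.05, 0.03                      # weight on unobserved term, step size

for step in range(3000):
    P = U @ V.T                          # current approximation U V^T
    # residual: observed entries pull P toward A; unobserved entries are
    # softly pulled toward zero with weight w0
    E = observed * (A - P) - w0 * (1.0 - observed) * P
    U_grad, V_grad = E @ V, E.T @ U      # proportional to the negative gradient
    U += lr * U_grad
    V += lr * V_grad

print(np.round(U @ V.T, 2))              # observed entries should be approximated well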

4. Clustering of Patterns

 Identifies groups (clusters) in data where patterns within the same cluster are similar, and
patterns in different clusters are dissimilar.
 Uses distance metrics like Euclidean, Manhattan, or cosine similarity.

5. Divisive Clustering

 A top-down hierarchical clustering approach:


o Starts with all data in one cluster.
o Recursively splits clusters until each data point is in its cluster or a stopping criterion
is met.
 Computationally intensive but provides a global view.

Divisive Clustering is the technique that starts with all data points in a single
cluster and recursively splits the clusters into smaller sub-clusters based on their
dissimilarity. It is also known as “top-down” clustering.
Unlike agglomerative clustering, which starts with each data point as its own
cluster and iteratively merges the most similar pairs of clusters, divisive
clustering is a “divide and conquer” approach that breaks a large cluster into
smaller sub-clusters

Example 1:

Here is a short example of agglomerative clustering using randomly generated
data in Python.
In this example, we first create a random dataset with 20 samples and two
features using NumPy’s randn function. Then, we use the linkage function from
SciPy’s cluster.hierarchy module to perform hierarchical clustering using
complete linkage method. The resulting linkage matrix Z contains information
about the cluster merging process.
Finally, we plot the dendrogram using the dendrogram function from the same
module. The dendrogram shows how the clusters were merged and at what
distance, starting from individual samples at the bottom and ending with a single
cluster at the top.


# Import the necessary libraries


import numpy as np

import matplotlib.pyplot as plt

from scipy.cluster.hierarchy import dendrogram, linkage

# Create a random dataset with two features and 50 samples

np.random.seed(0)

x = np.random.randn(20, 2)

# Perform hierarchical clustering using complete linkage

z = linkage(x, method='complete')

# Plot the dendrogram

plt.figure(figsize=(10, 5))

plt.title('Agglomerative Clustering Dendrogram')

plt.xlabel('Sample index')

plt.ylabel('Distance')

dendrogram(z)

plt.show()

Output:
Agglomerative Clustering

Note that there are other methods of linkage that can be used, such as single,
average, and ward. The method used will affect the resulting dendrogram and
clustering.

Example 2:

Here is a short example of agglomerative clustering using randomly generated
data in Python.
In this example, we first generate a sample dataset with 20 data points and 2
features using NumPy. We then perform agglomerative clustering using the
linkage function from SciPy, which takes the data matrix as input along with the
clustering method (ward in this case) and distance metric (euclidean in this case).
The output of linkage is a linkage matrix that represents the hierarchical
clustering structure.
We then plot the dendrogram using the dendrogram function from SciPy, which
takes the linkage matrix as input along with the plotting axis (ax). We set the
color threshold to 0 to display all clusters in the dendrogram.


# Import the necessary libraries

from scipy.cluster.hierarchy import dendrogram, linkage

import numpy as np

import matplotlib.pyplot as plt


# Generate sample data

np.random.seed(0)

X = np.random.randn(20, 2)

# Perform hierarchical clustering using Ward linkage
Z = linkage(X, method='ward', metric='euclidean')

# Plot the dendrogram

plt.figure(figsize=(10, 5))

plt.title('Agglomerative Clustering Dendrogram')

plt.xlabel('Sample index')

plt.ylabel('Distance')

dendrogram(Z)

plt.show()

Output:
Agglomerative clustering

The resulting plot will show the hierarchical clustering structure with each data
point as a leaf node in the dendrogram and the clusters at different levels of the
hierarchy.

6. Agglomerative Clustering

 A bottom-up hierarchical clustering approach:


o Starts with each data point as a separate cluster.
o Merges the two closest clusters iteratively based on a linkage criterion:
 Single Linkage: Closest pair of points.
 Complete Linkage: Farthest pair of points.
 Average Linkage: Mean distance between points.
o Stops when all data points form a single cluster or a criterion is met.
o Agglomerative clustering is a hierarchical clustering algorithm
that starts with each data point as its own cluster and iteratively
merges the closest clusters until a stopping criterion is reached.
It is a bottom-up approach that produces a dendrogram, which is
a tree-like diagram that shows the hierarchical relationship
between the clusters. The algorithm can be implemented using
the scikit-learn library in Python

Agglomerative Clustering Algorithm

Agglomerative Clustering is a hierarchical algorithm that creates a nested
hierarchy of clusters by merging clusters in a bottom-up approach. This
algorithm includes the following steps −

 Treat each data point as a single cluster


 Compute the proximity matrix using a distance metric
 Merge clusters based on a linkage criterion
 Update the distance matrix
 Repeat steps 3 and 4 until a single cluster remains
Why use Agglomerative Clustering?
The Agglomerative clustering allows easy interpretation of relationships
between data points. Unlike k-means clustering, we do not need to specify
the number of clusters. It is very efficient and can identify small clusters.

Implementation of Agglomerative Clustering in Python

We will use the iris dataset for demonstration. The first step is to import the
necessary libraries and load the dataset.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage

iris = load_iris()
X = iris.data
y = iris.target
The next step is to create a linkage matrix that contains the distances
between each pair of clusters. We can use the linkage function from the
scipy.cluster.hierarchy module to create the linkage matrix.
Z = linkage(X, 'ward')

The 'ward' method is used to calculate the distances between the clusters. It
minimizes the variance of the distances between the clusters being merged.

We can visualize the dendrogram using the dendrogram function from the
same module.

plt.figure(figsize=(7.5, 3.5))
plt.title("Iris Dendrogram") dendrogram(Z)
plt.show()
The resulting dendrogram (see the following plot) shows the hierarchical
relationship between the clusters. We can see that the algorithm has merged
the closest clusters first, and the distance between the clusters increases as
we move up the tree.

The final step is to apply the clustering algorithm and extract the cluster
labels. We can use the AgglomerativeClustering class from the sklearn.cluster
module to apply the algorithm.
model = AgglomerativeClustering(n_clusters=3)
model.fit(X)
labels = model.labels_
The n_clusters parameter specifies the number of clusters to be extracted
from the data. In this case, we have specified n_clusters=3 because we know
that the iris dataset has three classes.

We can visualize the resulting clusters using a scatter plot.

plt.figure(figsize=(7.5, 3.5))
plt.scatter(X[:, 0], X[:, 1], c=labels)
plt.xlabel("Sepal length")
plt.ylabel("Sepal width")
plt.title("Agglomerative Clustering Results
plt.show()
The resulting plot shows the three clusters identified by the algorithm. We
can see that the algorithm has successfully separated the data points into
their respective classes.

Example
Here is the complete implementation of Agglomerative Clustering in Python

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

Z = linkage(X, 'ward')

# Plot the dendrogram
plt.figure(figsize=(7.5, 3.5))
plt.title("Iris Dendrogram")
dendrogram(Z)
plt.show()

# Create an instance of the AgglomerativeClustering class
model = AgglomerativeClustering(n_clusters=3)
# Fit the model to the dataset
model.fit(X)
labels = model.labels_

# Plot the results
plt.figure(figsize=(7.5, 3.5))
plt.scatter(X[:, 0], X[:, 1], c=labels)
plt.xlabel("Sepal length")
plt.ylabel("Sepal width")
plt.title("Agglomerative Clustering Results")
plt.show()
Advantages of Agglomerative Clustering

Following are the advantages of using Agglomerative Clustering −

 Produces a dendrogram that shows the hierarchical relationship between the clusters.
 Can handle different types of distance metrics and linkage methods.
 Allows for a flexible number of clusters to be extracted from the data.
 Can handle large datasets with efficient implementations.

7. Partitional Clustering

 Divides the dataset into non-overlapping clusters, with each data point
belonging to exactly one cluster.
 Example algorithms: K-Means, K-Medoids.
 Focus is on optimizing a predefined objective function (e.g., minimizing
intra-cluster distance).

 The most popular class of clustering algorithms that we have is
the iterative relocation algorithms. These algorithms minimize a
given clustering criterion by iteratively relocating data points
between clusters until a (locally) optimal partition is attained.
There are many algorithms that come under the partitioning method;
some of the popular ones are K-Means, PAM (K-Medoid),
the CLARA algorithm (Clustering Large Applications), etc.


 K-Means Algorithm (A centroid-based Technique): It is one of the most commonly used algorithms for
partitioning a given data set into a set of k groups (i.e. k clusters), where k
represents the number of groups. It classifies objects in multiple groups
(i.e., clusters), such that objects within the same cluster are as similar as
possible (i.e., high intra-class similarity), whereas objects from different
clusters are as dissimilar as possible (i.e., low inter-class similarity). In k-
means clustering, each cluster is represented by its center (i.e, centroid)
which corresponds to the mean of points assigned to the cluster. The basic
idea behind k-means clustering consists of defining clusters so that the
total intra-cluster variation (known as total within-cluster variation) is
minimized.


Steps involved in K-Means Clustering :

 The first step when using k-means clustering is to indicate the number of clusters (k) that will be generated in the final solution.
 The algorithm starts by randomly selecting k objects from the data set to serve as the initial centers for the clusters. The selected objects are also known as cluster means or centroids.
 Next, each of the remaining objects is assigned to its closest centroid, where closest is defined using the Euclidean distance between the object and the cluster mean. This step is called the "cluster assignment step".
 After the assignment step, the algorithm computes the new mean value of each cluster. The term "centroid update" is used to designate this step. Now that the centers have been recalculated, every observation is checked again to see if it might be closer to a different cluster. All the objects are reassigned using the updated cluster means.
 The cluster assignment and centroid update steps are repeated iteratively until the cluster assignments stop changing (i.e., until convergence is achieved). That is, the clusters formed in the current iteration are the same as those obtained in the previous iteration. A minimal NumPy sketch of this loop is given below.
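The cluster-assignment and centroid-update loop described above can be written in a few lines of NumPy. The sketch below is only a minimal illustration of the iteration (random initialization from the data, Euclidean assignment, mean update); the function name kmeans, the random seed, and the toy data are assumptions made for the example, and empty-cluster handling is omitted.

import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: pick k random observations from X as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Cluster assignment step: label each point with its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Centroid update step: recompute each centroid as the mean of its points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Convergence: stop when the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Example usage on a small synthetic dataset with two well-separated blobs
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
labels, centroids = kmeans(X, k=2)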

K-Medoids Algorithm (Partitioning Around Medoid) :

1. A medoid can be defined as the point in the cluster whose average dissimilarity to all the other points in the cluster is minimal (equivalently, whose similarity to them is maximal).
2. In k-medoids clustering, each cluster is represented by one of the data points in the cluster. These points are named cluster medoids. The term medoid refers to an object within a cluster for which the average dissimilarity between it and all the other members of the cluster is minimal. It corresponds to the most centrally located point in the cluster.
3. These objects (one per cluster) can be considered as representative examples of the members of that cluster, which may be useful in some situations. Recall that, in k-means clustering, the center of a given cluster is calculated as the mean value of all the data points in the cluster.
4. K-medoids is a robust alternative to k-means clustering. The algorithm is less sensitive to noise and outliers than k-means because it uses medoids as cluster centers instead of means.
Steps involved in K-Medoids Clustering :

1. The PAM algorithm is based on the search for k representative objects or medoids among the observations of the data set.
2. After finding a set of k medoids, clusters are constructed by assigning each observation to the nearest medoid.
3. Next, each selected medoid m and each non-medoid data point are swapped and the objective function is computed. The objective function corresponds to the sum of the dissimilarities of all objects to their nearest medoid.
4. The SWAP step attempts to improve the quality of the clustering by exchanging selected objects (medoids) and non-selected objects. If the objective function can be reduced by interchanging a selected object with an unselected object, the swap is carried out. This is continued until the objective function can no longer be decreased. The goal is to find k representative objects which minimize the sum of the dissimilarities of the observations to their closest representative object.

1. K-means attempts to minimize the total squared error, while k-medoids minimizes the sum of dissimilarities between points labeled to be in a cluster and a point designated as the center of that cluster. In contrast to the k-means algorithm, k-medoids chooses actual data points as centers (medoids or exemplars).
2. K-medoids is more robust to noise and outliers than k-means because it minimizes a sum of general pairwise dissimilarities instead of a sum of squared Euclidean distances.
3. K-medoids has the advantage of working with distances other than numerical ones and lends itself well to analysing mixed-type data that include both numerical and categorical features. (A minimal sketch using a K-Medoids implementation follows this list.)
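scikit-learn itself does not include a K-Medoids estimator, so the sketch below assumes the optional scikit-learn-extra package (pip install scikit-learn-extra), whose KMedoids class follows the usual fit/predict API. Treat it as an illustrative sketch of a PAM-style clusterer rather than the canonical PAM implementation; the Iris data and the value k = 3 are only example choices.

# Assumes the scikit-learn-extra package is installed
from sklearn.datasets import load_iris
from sklearn_extra.cluster import KMedoids

X = load_iris().data

# Each cluster center is an actual observation (a medoid), not a mean
kmedoids = KMedoids(n_clusters=3, metric='euclidean', random_state=42)
labels = kmedoids.fit_predict(X)

print(labels[:10])                 # cluster labels of the first ten observations
print(kmedoids.cluster_centers_)   # the medoids themselves (rows of X)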
8. K-Means Clustering

K-Means Clustering is an Unsupervised Learning algorithm which groups an unlabeled dataset into different clusters. Here K defines the number of pre-defined clusters that need to be created in the process: if K=2, there will be two clusters, for K=3 there will be three clusters, and so on.

It is an iterative algorithm that divides the unlabeled dataset into k different clusters in such a way that each data point belongs to only one group of similar properties.

It allows us to cluster the data into different groups and provides a convenient way to discover the categories of groups in an unlabeled dataset on its own, without the need for any training.

It is a centroid-based algorithm, where each cluster is associated with a centroid. The main aim of this algorithm is to minimize the sum of distances between the data points and their corresponding cluster centroids.

The algorithm takes the unlabeled dataset as input, divides the dataset into k clusters, and repeats the process until the best clusters are found. The value of k should be predetermined in this algorithm.

The k-means clustering algorithm mainly performs two tasks:

o Determines the best value for the K center points or centroids by an iterative process.
o Assigns each data point to its closest k-center. The data points that are near a particular k-center create a cluster.
Hence each cluster has data points with some commonalities, and it is away from other clusters.

The below diagram explains the working of the K-means Clustering Algorithm:
How does the K-Means Algorithm Work?
The working of the K-Means algorithm is explained in the below steps:

Step-1: Select the number K to decide the number of clusters.

Step-2: Select K random points or centroids. (They can be points other than those from the input dataset.)

Step-3: Assign each data point to their closest centroid, which will form the
predefined K clusters.

Step-4: Calculate the variance and place a new centroid of each cluster.

Step-5: Repeat the third step, which means reassign each data point to the new closest centroid of each cluster.

Step-6: If any reassignment occurs, then go to step-4 else go to FINISH.

Step-7: The model is ready.

Let's understand the above steps by considering the visual plots:

Suppose we have two variables M1 and M2. The x-y axis scatter plot of these
two variables is given below:

o Let's take number k of clusters, i.e., K=2, to identify the dataset and to put
them into different clusters. It means here we will try to group these
datasets into two different clusters.
o We need to choose some random k points or centroid to form the cluster.
These points can be either the points from the dataset or any other point.
So, here we are selecting the below two points as k points, which are not
the part of our dataset. Consider the below image:

o Now we will assign each data point of the scatter plot to its closest K-
point or centroid. We will compute it by applying some mathematics that
we have studied to calculate the distance between two points. So, we will
draw a median between both the centroids. Consider the below image:

From the above image, it is clear that points on the left side of the line are near the K1 or blue centroid, and points to the right of the line are close to the yellow centroid. Let's color them blue and yellow for clear visualization.
o As we need to find the closest cluster, we will repeat the process by choosing new centroids. To choose the new centroids, we will compute the center of gravity of the points in each cluster and use these as the new centroids, as shown below:

o Next, we will reassign each datapoint to the new centroid. For this, we
will repeat the same process of finding a median line. The median will be
like below image:

From the above image, we can see that one yellow point is on the left side of the line, and two blue points are to the right of the line. So these three points will be assigned to new centroids.

As reassignment has taken place, we will again go to step-4, which is finding new centroids or K-points.
o We will repeat the process by finding the center of gravity of centroids,
so the new centroids will be as shown in the below image:

o As we got the new centroids so again will draw the median line and
reassign the data points. So, the image will be:

o We can see in the above image that no data points remain on the wrong side of the line, i.e., no further reassignments are needed, which means our model is formed. Consider the below image:

As our model is ready, so we can now remove the assumed centroids, and the
two final clusters will be as shown in the below image:

How to choose the value of "K number of clusters" in K-means Clustering?


The performance of the K-means clustering algorithm depends on the quality of the clusters it forms. But choosing the optimal number of clusters is a big task. There are different ways to find the optimal number of clusters; here we discuss the most appropriate method to find the number of clusters, or the value of K, which is given below:

Elbow Method

The Elbow method is one of the most popular ways to find the optimal number
of clusters. This method uses the concept of WCSS value. WCSS stands
for Within Cluster Sum of Squares, which defines the total variations within a
cluster. The formula to calculate the value of WCSS (for 3 clusters) is given
below:

WCSS = ∑Pi in Cluster1 distance(Pi, C1)² + ∑Pi in Cluster2 distance(Pi, C2)² + ∑Pi in Cluster3 distance(Pi, C3)²

In the above formula of WCSS,

∑Pi in Cluster1 distance(Pi, C1)² is the sum of the squared distances between each data point and its centroid within Cluster1, and similarly for the other two terms.

To measure the distance between data points and centroid, we can use any
method such as Euclidean distance or Manhattan distance.
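As a concrete check of the formula, the short snippet below computes WCSS by hand on a toy dataset and compares it with scikit-learn's inertia_ attribute, which stores the same quantity; the random data and k = 3 are illustrative assumptions.

import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(200, 2)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# WCSS from the formula: squared distance of every point to its assigned centroid
wcss = sum(np.sum((X[km.labels_ == j] - km.cluster_centers_[j]) ** 2)
           for j in range(3))

print(wcss, km.inertia_)   # the two values agree (up to floating-point error)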

To find the optimal value of clusters, the elbow method follows the below steps:

o It executes K-means clustering on a given dataset for different K values (typically ranging from 1 to 10).
o For each value of K, it calculates the WCSS value.
o It plots a curve between the calculated WCSS values and the number of clusters K.
o The sharp point of bend (where the plot looks like an arm) is considered the best value of K.
Since the graph shows a sharp bend that looks like an elbow, this is known as the elbow method. The graph for the elbow method looks like the below image:
Note: We can choose the number of clusters equal to the given data points.
If we choose the number of clusters equal to the data points, then the value
of WCSS becomes zero, and that will be the endpoint of the plot.
Python Implementation of K-means Clustering Algorithm
In the above section, we have discussed the K-means algorithm, now let's see
how it can be implemented using Python.

Before implementation, let's understand what type of problem we will solve here. We have a Mall_Customers dataset, which is the data of customers who visit the mall and spend there.

In the given dataset, we have Customer_Id, Gender, Age, Annual Income ($), and Spending Score (which is a calculated value of how much a customer has spent in the mall; the higher the value, the more they have spent). From this dataset, we need to find some patterns; as this is an unsupervised method, we don't know exactly what to compute in advance.

The steps to be followed for the implementation are given below:

o Data Pre-processing
o Finding the optimal number of clusters using the elbow method
o Training the K-means algorithm on the training dataset
o Visualizing the clusters

Step-1: Data pre-processing Step

The first step will be the data pre-processing, as we did in our earlier topics of
Regression and Classification. But for the clustering problem, it will be
different from other models. Let's discuss it:

o Importing Libraries
As we did in previous topics, firstly, we will import the libraries for our
model, which is part of data pre-processing. The code is given below:
1. # importing libraries
2. import numpy as nm
3. import matplotlib.pyplot as mtp
4. import pandas as pd
In the above code, numpy is imported for performing mathematical calculations, matplotlib for plotting graphs, and pandas for managing the dataset.

o Importing the Dataset:
Next, we will import the dataset that we need to use. Here we are using the Mall_Customers_data.csv dataset. It can be imported using the below code:
1. # Importing the dataset
2. dataset = pd.read_csv('Mall_Customers_data.csv')
By executing the above lines of code, we will get our dataset in the Spyder IDE. The dataset looks like the below image:

From the above dataset, we need to find some patterns in it.

o Extracting Independent Variables

Here we don't need any dependent variable for the data pre-processing step, as it is a clustering problem and we have no idea about what to determine. So we will just add a line of code for the matrix of features.

1. x = dataset.iloc[:, [3, 4]].values

As we can see, we are extracting only the 3rd and 4th features. This is because we need a 2D plot to visualize the model, and some features, such as customer_id, are not required.

Step-2: Finding the optimal number of clusters using the elbow method

In the second step, we will try to find the optimal number of clusters for our
clustering problem. So, as discussed above, here we are going to use the elbow
method for this purpose.

As we know, the elbow method uses the WCSS concept to draw the plot by
plotting WCSS values on the Y-axis and the number of clusters on the X-axis.
So we are going to calculate the value for WCSS for different k values ranging
from 1 to 10. Below is the code for it:
1. #finding optimal number of clusters using the elbow method
2. from sklearn.cluster import KMeans
3. wcss_list= []  #Initializing the list for the values of WCSS
4.
5. #Using for loop for iterations from 1 to 10.
6. for i in range(1, 11):
7.     kmeans = KMeans(n_clusters=i, init='k-means++', random_state= 42)
8.     kmeans.fit(x)
9.     wcss_list.append(kmeans.inertia_)
10. mtp.plot(range(1, 11), wcss_list)
11. mtp.title('The Elbow Method Graph')
12. mtp.xlabel('Number of clusters(k)')
13. mtp.ylabel('wcss_list')
14. mtp.show()
As we can see in the above code, we have used the KMeans class of the sklearn.cluster library to form the clusters.

Next, we have created the wcss_list variable as an empty list, which is used to hold the WCSS values computed for different values of k ranging from 1 to 10.

After that, we have initialized the for loop to iterate over values of k from 1 to 10; since range() in Python excludes the upper bound, it is written as range(1, 11) so that k = 10 is included.

The rest part of the code is similar as we did in earlier topics, as we have fitted
the model on a matrix of features and then plotted the graph between the
number of clusters and WCSS.

Output: After executing the above code, we will get the below output:

From the above plot, we can see the elbow point is at 5. So the number of
clusters here will be 5.
Step- 3: Training the K-means algorithm on the training dataset

As we have got the number of clusters, so we can now train the model on the
dataset.

To train the model, we will use the same two lines of code as we have used in
the above section, but here instead of using i, we will use 5, as we know there
are 5 clusters that need to be formed. The code is given below:

1. #training the K-means model on a dataset


2. kmeans = KMeans(n_clusters=5, init='k-means++', random_state= 42)
3. y_predict= kmeans.fit_predict(x)
The first line is the same as above for creating the object of KMeans class.

In the second line of code, we have created the variable y_predict to store the predicted cluster labels.

By executing the above lines of code, we will get the y_predict variable. We can check it under the variable explorer option in the Spyder IDE. We can now compare the values of y_predict with our original dataset. Consider the below image:
From the above image, we can see that CustomerID 1 belongs to cluster 3 (since the labels start from 0, a label of 2 corresponds to the 3rd cluster), CustomerID 2 belongs to cluster 4, and so on.

Step-4: Visualizing the Clusters

The last step is to visualize the clusters. As we have 5 clusters for our model, so
we will visualize each cluster one by one.

To visualize the clusters we will use a scatter plot, drawn with the mtp.scatter() function of matplotlib.

1. #visualizing the clusters
2. mtp.scatter(x[y_predict == 0, 0], x[y_predict == 0, 1], s = 100, c = 'blue', label = 'Cluster 1')  #for first cluster
3. mtp.scatter(x[y_predict == 1, 0], x[y_predict == 1, 1], s = 100, c = 'green', label = 'Cluster 2')  #for second cluster
4. mtp.scatter(x[y_predict == 2, 0], x[y_predict == 2, 1], s = 100, c = 'red', label = 'Cluster 3')  #for third cluster
5. mtp.scatter(x[y_predict == 3, 0], x[y_predict == 3, 1], s = 100, c = 'cyan', label = 'Cluster 4')  #for fourth cluster
6. mtp.scatter(x[y_predict == 4, 0], x[y_predict == 4, 1], s = 100, c = 'magenta', label = 'Cluster 5')  #for fifth cluster
7. mtp.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s = 300, c = 'yellow', label = 'Centroid')
8. mtp.title('Clusters of customers')
9. mtp.xlabel('Annual Income (k$)')
10. mtp.ylabel('Spending Score (1-100)')
11. mtp.legend()
12. mtp.show()
In the above lines of code, we have written one mtp.scatter() call for each of the five clusters. The first argument, e.g. x[y_predict == 0, 0], selects the first feature (Annual Income) of the points assigned to cluster 0, the second argument selects their second feature (Spending Score), and the cluster labels in y_predict range from 0 to 4.

Output:

The output image clearly shows five different clusters with different colors. The clusters are formed between two parameters of the dataset: Annual Income and Spending Score. We can change the colors and labels as per requirement or choice. We can also observe some points from the above patterns, which are given below:

o Cluster1 shows the customers with average salary and average spending, so we can categorize these customers as standard.
o Cluster2 shows customers with high income but low spending, so we can categorize them as careful.
o Cluster3 shows low income and low spending, so these customers can be categorized as sensible.
o Cluster4 shows customers with low income but very high spending, so they can be categorized as careless.
o Cluster5 shows customers with high income and high spending, so they can be categorized as target, and these customers can be the most profitable customers for the mall owner.

9. Soft Partitioning and Soft Clustering

 Soft Partitioning allows data points to belong to multiple clusters with membership probabilities.
 Soft Clustering methods, such as Fuzzy C-Means, assign degrees of membership rather than hard assignments.

In NLP, hard clustering and soft clustering refer to different methods of grouping
text data based on similarity.
Hard clustering, also known as crisp clustering or traditional clustering, is a
technique that assigns each data point, in this case, a document or word, to
exactly one cluster. The assignment is based on a similarity measure, such as
cosine similarity or Euclidean distance, which calculates the similarity between
the data points. Hard clustering algorithms like k-means or hierarchical
clustering partition the data into distinct, non-overlapping clusters. Each
document or word belongs exclusively to a single cluster, and there is no
ambiguity in the assignment.

On the other hand, soft clustering, also called fuzzy clustering or probabilistic
clustering, allows for more flexible assignments. Instead of assigning a data
point to a single cluster, soft clustering assigns a probability or degree of
membership to each cluster. This means that a document or word can belong to
multiple clusters simultaneously, with varying degrees of membership. Soft
clustering algorithms, such as fuzzy c-means or Gaussian mixture models,
calculate the likelihood of a data point belonging to each cluster based on
probabilistic models.

Hard clustering: Assigns each data point to a single, exclusive cluster based on
similarity. There is no ambiguity in the assignment.
Soft clustering: Assigns a degree of membership or probability to each cluster
for a data point. Allows for overlapping membership in multiple clusters.

Both hard and soft clustering approaches have their benefits and applications in
NLP.

Hard clustering is useful when clear-cut, non-overlapping clusters are desired and when there is no ambiguity in the assignment.

Soft clustering, on the other hand, provides more nuanced results, allowing for
overlapping clusters and capturing uncertainty in the data. It can be helpful when
documents or words can have multiple interpretations or belong to multiple
topics simultaneously.

One of the most commonly used methods for hard clustering is the k-means
algorithm. The k-means algorithm is an iterative clustering algorithm that aims
to partition a given dataset into k distinct clusters.

The k-means algorithm is widely used due to its simplicity and efficiency.
However, it is important to keep in mind that the algorithm’s performance can be
sensitive to the initial centroid selection, and it assumes that clusters have a
spherical shape and similar sizes. Various techniques, such as multiple
initializations and more advanced initialization methods like k-means++, can be
employed to mitigate these limitations.

k-means algorithm:

1. Initialization: Select k initial cluster centroids randomly or using a specific initialization method.

2. Assignment: Assign each data point to the nearest centroid based on a distance metric, typically Euclidean distance.

3. Update: Recalculate the centroid of each cluster by taking the mean of all
data points assigned to that cluster.

4. Repeat Steps 2 and 3 until convergence: Iterate the assignment and update
steps until the centroids stabilize or a predefined stopping criterion is met.
Convergence is typically assessed based on the change in the centroids’
positions or the overall objective function.

5. Final Result: Once the algorithm converges, each data point will be
assigned to a single cluster, resulting in a hard clustering solution.
Other commonly used hard clustering methods include hierarchical clustering
algorithms (e.g., agglomerative clustering, divisive clustering) that build a
hierarchical structure of clusters, as well as density-based clustering algorithms
like DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
that can discover clusters of arbitrary shapes based on density connectivity.
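As a quick illustration of the density-based methods mentioned above, the snippet below runs scikit-learn's DBSCAN on two interleaved half-moons, a non-spherical shape that k-means cannot separate cleanly; the eps and min_samples values are illustrative and normally require tuning.

from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN

# Two interleaving half-moons: clusters of arbitrary (non-spherical) shape
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# Hard, density-based clustering; the label -1 marks noise points
db = DBSCAN(eps=0.2, min_samples=5).fit(X)
labels = db.labels_

print(set(labels))   # e.g. {0, 1}, plus -1 if any points are labelled as noise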

One of the commonly used methods for soft clustering is the fuzzy c-means
(FCM) algorithm. FCM is a popular soft clustering algorithm that allows for
probabilistic or fuzzy assignments of data points to clusters.

Fuzzy c-means algorithm:

1. Initialization: Select the number of clusters (k) and initialize the cluster
centers and membership degrees randomly or using a specific
initialization method.

2. Membership Degree Calculation: Calculate the membership degree for each data point, indicating its degree of belongingness to each cluster. The membership degree represents the likelihood or probability of a data point being assigned to a particular cluster.

3. Update Cluster Centers: Calculate new cluster centers by considering the weighted average of data points based on their membership degrees. The weights reflect the degree of influence of each data point on the cluster center calculation.

4. Repeat Steps 2 and 3 until convergence: Iterate the membership degree calculation and cluster center update steps until the membership degrees and cluster centers stabilize or a predefined stopping criterion is met.

5. Final Result: Once the algorithm converges, each data point will have
membership degrees across all clusters, indicating the degree of its
association with each cluster. This provides a soft clustering solution with
data points having varying degrees of membership to multiple clusters.
The fuzzy c-means algorithm allows for more flexible assignments compared to
hard clustering algorithms. It captures the uncertainty or fuzziness in data point
assignments and provides a probabilistic representation of cluster membership.
The algorithm can accommodate overlapping clusters and is suitable for
scenarios where data points may have ambiguous or mixed characteristics.

Apart from fuzzy c-means, other soft clustering methods include Gaussian
mixture models (GMM), where data points are modeled as a mixture of
Gaussian distributions, and probabilistic latent semantic analysis (pLSA) or
latent Dirichlet allocation (LDA) for document clustering, which use
probabilistic models to assign documents to multiple topics with different
probabilities.
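To make the hard/soft distinction concrete, the sketch below fits a Gaussian mixture model with scikit-learn and prints per-cluster membership probabilities for a few points; predict() gives the corresponding hard labels. The Iris data and the three-component choice are assumptions made only for illustration.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.mixture import GaussianMixture

X = load_iris().data

# Fit a mixture of three Gaussians (the EM algorithm runs under the hood)
gmm = GaussianMixture(n_components=3, random_state=0).fit(X)

hard_labels = gmm.predict(X)              # hard clustering: one label per point
soft_memberships = gmm.predict_proba(X)   # soft clustering: a probability per cluster

np.set_printoptions(precision=3, suppress=True)
print(hard_labels[:5])
print(soft_memberships[:5])               # each row sums to 1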

10. Fuzzy C-Means Clustering

Fuzzy logic principles can be used to cluster multidimensional data, assigning each point a membership in each cluster center from 0 to 100 percent. This can be very powerful compared to traditional hard-threshold clustering, where every point is assigned a crisp, exact label. The algorithm works by assigning a membership to each data point for each cluster center on the basis of the distance between the cluster center and the data point: the nearer a data point is to a cluster center, the higher its membership in that cluster. Clearly, the memberships of each data point should sum to one.

It is an unsupervised clustering algorithm that permits us to build a fuzzy partition from data. The algorithm depends on a parameter m which corresponds to the degree of fuzziness of the solution. Large values of m will blur the classes and all elements tend to belong to all clusters. The solutions of the optimization problem depend on the parameter m; that is, different selections of m will typically lead to different partitions. (An animation illustrating the effect of m is available at https://2-bitbio.com/.)

K-Means versus Fuzzy C-Means

Let us compare these two powerful algorithms to get a clear idea of where the
fuzzy c-means algorithm fits in.

1. Attribution to a cluster: In fuzzy clustering, each point has a probability of belonging to each cluster, rather than completely belonging to just one cluster as is the case in traditional k-means. In Fuzzy C-Means clustering, each point has a weighting associated with a particular cluster, so a point doesn't sit "in a cluster" as much as it has a weak or strong association with the cluster, which is determined by the inverse distance to the center of the cluster.
2. Speed: Fuzzy-C means will tend to run slower than K means, since it’s
actually doing more work. Each point is evaluated with each cluster, and
more operations are involved in each evaluation. K-Means just needs to do a
distance calculation, whereas fuzzy c means needs to do a full inverse-
distance weighting.
3. Personal Opinion: FCM/Soft-K-Means is “less stupid” than Hard-K-
Means when it comes to elongated clusters (when points otherwise
consistent in other dimensions tend to scatter along a particular dimension or
two).

We should realize that k-means can be seen as a special case of fuzzy c-means in which the membership function is simply 1 if the data point is closest to a centroid and 0 otherwise.

Steps in Fuzzy C-Means


The process flow of fuzzy c-means is enumerated below:


1. Assume a fixed number of clusters k.
2. Initialization: Randomly initialize the cluster centers μk associated with the clusters and compute the probability that each data point xi is a member of a given cluster k, P(label k | xi).
3. Iteration: Recalculate the centroid of each cluster as the weighted centroid given the membership probabilities of all data points xi (each point contributes in proportion to its degree of membership in that cluster).
4. Termination: Iterate until convergence or until a user-specified number of iterations has been reached (the iteration may be trapped at some local maxima or minima).

A minimal implementation sketch in Python is given below, following the update rules stated in the steps above.
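The following is a from-scratch NumPy sketch of fuzzy c-means (membership-weighted centroid update with fuzzifier m, inverse-distance membership update). The function name, tolerance, and toy data are assumptions made for illustration; library implementations such as scikit-fuzzy add refinements omitted here.

import numpy as np

def fuzzy_c_means(X, c, m=2.0, max_iter=100, tol=1e-5, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    # Random membership matrix U (n x c); each row sums to 1
    U = rng.random((n, c))
    U /= U.sum(axis=1, keepdims=True)

    for _ in range(max_iter):
        Um = U ** m
        # Update cluster centers as membership-weighted means
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]
        # Distance of every point to every center (epsilon avoids division by zero)
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-10
        # Inverse-distance membership update: u_ij = 1 / sum_k (d_ij / d_ik)^(2/(m-1))
        new_U = 1.0 / np.sum((d[:, :, None] / d[:, None, :]) ** (2 / (m - 1)), axis=2)
        if np.linalg.norm(new_U - U) < tol:
            U = new_U
            break
        U = new_U
    return centers, U

# Example: two blobs clustered into c = 2 fuzzy clusters
X = np.vstack([np.random.randn(100, 2), np.random.randn(100, 2) + 4])
centers, U = fuzzy_c_means(X, c=2)
print(centers)
print(U[:5])   # each row: degrees of membership in the two clusters, summing to 1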

Evaluation Metrics for Clusters

Most of the measures used to evaluate clusters under other clustering algorithms can also be used for Fuzzy C-Means. Even though these methods rely on the subject-matter expertise of the analyst, some popular measures used to evaluate the clusters so formed are listed below:

1. Homogeneity analysis of the clusters formed: the clusters formed using Fuzzy C-Means need to be homogeneous and well separated from other clusters.
2. Coefficient of Variance analysis for each cluster.
3. Pearson Correlation can be used for validating the quality of clusters.
4. If we have ground-truth cluster values, precision, recall, and F-score can also be considered.
5. The Elbow Method and Silhouette score are also statistical measures for evaluating your clusters (I would rather use them when pre-defining the cluster number).
6. Entropy-based methods.

Pros and Cons

Now comes the time to evaluate the algorithm itself!

Pros

1. Gives better results for overlapping data sets and performs comparatively better than the k-means algorithm.
2. Unlike k-means, where a data point must exclusively belong to one cluster center, here a data point is assigned a membership to each cluster center, as a result of which it may belong to more than one cluster.

Cons

1. Apriori specification of the number of clusters.
2. With a lower value of β we get a better result, but at the expense of more iterations.
3. Euclidean distance measures can unequally weight underlying factors.
4. The performance of the FCM algorithm depends on the selection of the initial cluster centers and/or the initial membership values.

11. Rough Clustering

Rough Clustering is a clustering technique in machine learning that integrates the concepts
of rough set theory into traditional clustering methods. Unlike conventional clustering
approaches, where an object strictly belongs to one cluster, rough clustering allows for
uncertainty and ambiguity by associating each object with two boundary regions: a lower
approximation and an upper approximation.

Key Concepts in Rough Clustering:

1. Lower Approximation:
o The set of objects that definitely belong to a cluster.
o These objects have a high certainty of membership based on similarity measures.

2. Upper Approximation:
o The set of objects that possibly belong to a cluster.
o These objects have some level of similarity with the cluster but could also belong to
other clusters.

3. Boundary Region:
o The difference between the upper and lower approximations.
o It contains objects whose cluster memberships are uncertain or overlap multiple
clusters.

Characteristics:

 Handles Uncertainty: Rough clustering is particularly useful when dealing with noisy data or
overlapping clusters.
 Dual Membership: Objects can partially belong to multiple clusters through the boundary
region.
 Flexibility: By allowing overlapping memberships, rough clustering provides a more realistic
representation of data in scenarios where rigid partitions are unsuitable.

Applications:

 Image Segmentation: To classify pixels with ambiguous boundaries.
 Text Categorization: When documents can belong to multiple topics.
 Customer Segmentation: For grouping customers with overlapping characteristics in marketing.

Methods of Rough Clustering:

1. Rough K-Means:
o An extension of the K-Means algorithm that incorporates rough set theory.
o Each data point is classified into lower and upper approximations based on its
distance from cluster centroids.

2. Rough Fuzzy Clustering:
o Combines rough set theory with fuzzy clustering.
o Objects have membership degrees in addition to being in lower or upper approximations.

3. Rough Hierarchical Clustering:
o Adapts rough set concepts to hierarchical clustering methods.
o Useful for identifying uncertain relationships at various levels of the hierarchy.

Advantages:

 Better handling of overlapping clusters.
 Provides a structured way to deal with ambiguous or uncertain data.
 Offers more interpretability in complex datasets with overlapping regions.

Limitations:

 May involve additional computational complexity compared to traditional clustering methods.
 The choice of thresholds for determining lower and upper approximations can be subjective.

Rough clustering is especially valuable in domains where data boundaries are not well-
defined or when partial membership is an inherent characteristic of the dataset.
12. Rough K-Means Clustering Algorithm

The K-Means Clustering Algorithm is a widely used unsupervised machine learning algorithm for partitioning a dataset into k distinct groups (clusters). Here is an outline of the algorithm, including the rough steps it follows:

Algorithm: K-Means Clustering

1. Initialization:
Randomly select k initial centroids from the dataset. These centroids represent the centers of the k clusters.
2. Assignment Step:
Assign each data point to the cluster whose centroid is closest (based on a distance
metric, typically Euclidean distance).
3. Update Step:
Compute the new centroid for each cluster by averaging the positions of all points
assigned to that cluster.
4. Repeat:
Alternate between the assignment and update steps until the centroids stabilize (i.e.,
the changes in centroids between iterations are below a threshold) or a maximum
number of iterations is reached.
5. Output:
The algorithm outputs the final cluster assignments and their corresponding centroids.

Rough Pseudocode (Python-style)
function k_means(data, k, max_iters):
    # Step 1: Initialize centroids randomly
    centroids = initialize_random_centroids(data, k)

    for iter in range(max_iters):
        # Step 2: Assign each data point to the closest centroid
        clusters = assign_to_clusters(data, centroids)

        # Step 3: Update centroids based on the cluster assignments
        new_centroids = calculate_new_centroids(clusters)

        # Check for convergence (if centroids don't change significantly)
        if has_converged(centroids, new_centroids):
            break

        centroids = new_centroids

    return clusters, centroids

Key Features

1. Input Parameters:
o k: Number of clusters.
o Dataset: A set of n data points.
o Maximum iterations: To ensure the algorithm terminates.
2. Distance Metric:
Euclidean distance is the default, but other metrics like Manhattan or cosine distance
can also be used.
3. Stopping Criteria:
The algorithm stops when:
o Centroids do not change significantly between iterations.
o A maximum number of iterations is reached.

4. Output:
o Cluster assignments for all data points.
o Final centroid locations.

Limitations

1. Random Initialization: Poor initial centroid selection can lead to suboptimal solutions.
o Solution: Use K-Means++ for smarter centroid initialization.

2. Local Minima: The algorithm may converge to a local minimum.
o Solution: Run the algorithm multiple times with different initializations.

3. Cluster Shape: Assumes clusters are spherical and equally sized.
o Solution: Consider alternative clustering methods like DBSCAN or Gaussian Mixture Models for complex shapes.

4. Outliers: Sensitive to outliers, which can distort the centroids.
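Section 11 described Rough K-Means as an extension of the standard loop outlined above. Below is a minimal sketch in the spirit of the Lingras-West rough k-means: a point whose two (or more) nearest centroids are almost equally close is placed in the upper approximations (boundary) of all of them, otherwise it goes to the lower approximation of its nearest cluster, and centroids are updated as a weighted mix of the lower-approximation mean and the boundary mean. The threshold epsilon and the weights w_lower and w_upper are illustrative assumptions, not values fixed by these notes.

import numpy as np

def rough_k_means(X, k, epsilon=1.2, w_lower=0.7, w_upper=0.3,
                  max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)

    for _ in range(max_iters):
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        nearest = d.argmin(axis=1)
        lower = [[] for _ in range(k)]   # points that certainly belong to cluster j
        upper = [[] for _ in range(k)]   # points that possibly belong to cluster j

        for i, j in enumerate(nearest):
            # Clusters whose centroid is almost as close as the nearest one
            close = np.where(d[i] <= epsilon * d[i, j])[0]
            if len(close) > 1:
                for c in close:          # ambiguous point: boundary of several clusters
                    upper[c].append(i)
            else:
                lower[j].append(i)       # unambiguous point: lower approximation
                upper[j].append(i)       # the lower approximation is a subset of the upper

        new_centroids = centroids.copy()
        for j in range(k):
            boundary = [i for i in upper[j] if i not in lower[j]]
            if lower[j] and boundary:
                new_centroids[j] = (w_lower * X[lower[j]].mean(axis=0)
                                    + w_upper * X[boundary].mean(axis=0))
            elif lower[j]:
                new_centroids[j] = X[lower[j]].mean(axis=0)
            elif boundary:
                new_centroids[j] = X[boundary].mean(axis=0)

        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids

    return centroids, lower, upper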

13. Expectation-Maximization (EM)-Based Clustering

The Expectation-Maximization (EM)-Based Clustering Algorithm is a probabilistic approach to clustering that aims to find clusters in data by fitting a mixture of probability distributions, such as Gaussian Mixture Models (GMMs). It alternates between estimating the expected cluster memberships and updating the model parameters. The Expectation-Maximization algorithm can be used to handle situations where variables are partially observable. When certain variables are observable, we can use those instances to learn and estimate their values. Then, we can predict the values of these variables in instances when they are not observable.
The EM algorithm was proposed and named in a seminal paper published in
1977 by Arthur Dempster, Nan Laird, and Donald Rubin. Their work
formalized the algorithm and demonstrated its usefulness in statistical
modeling and estimation.
EM algorithm is applicable to latent variables, which are variables that are not
directly observable but are inferred from the values of other observed
variables. By leveraging the known general form of the probability distribution
governing these latent variables, the EM algorithm can predict their values.
The EM algorithm serves as the foundation for many unsupervised clustering
algorithms in the field of machine learning. It provides a framework to find the
local maximum likelihood parameters of a statistical model and infer latent
variables in cases where data is missing or incomplete.
Expectation-Maximization (EM) Algorithm
The Expectation-Maximization (EM) algorithm is an iterative optimization
method that combines different unsupervised machine learning algorithms to
find maximum likelihood or maximum posterior estimates of parameters in
statistical models that involve unobserved latent variables. The EM algorithm
is commonly used for latent variable models and can handle missing data. It
consists of an estimation step (E-step) and a maximization step (M-step),
forming an iterative process to improve model fit.
 In the E step, the algorithm computes the latent variables i.e. expectation of
the log-likelihood using the current parameter estimates.
 In the M step, the algorithm determines the parameters that maximize the
expected log-likelihood obtained in the E step, and corresponding model
parameters are updated based on the estimated latent variables.

Expectation-Maximization in EM Algorithm

By iteratively repeating these steps, the EM algorithm seeks to maximize the


likelihood of the observed data. It is commonly used for unsupervised learning
tasks, such as clustering, where latent variables are inferred and has
applications in various fields, including machine learning, computer vision,
and natural language processing.
Key Terms in Expectation-Maximization (EM) Algorithm
Some of the most commonly used key terms in the Expectation-Maximization
(EM) Algorithm are as follows:
 Latent Variables: Latent variables are unobserved variables in statistical
models that can only be inferred indirectly through their effects on
observable variables. They cannot be directly measured but can be detected
by their impact on the observable variables.
 Likelihood: It is the probability of observing the given data given the
parameters of the model. In the EM algorithm, the goal is to find the
parameters that maximize the likelihood.
 Log-Likelihood: It is the logarithm of the likelihood function, which
measures the goodness of fit between the observed data and the model. EM
algorithm seeks to maximize the log-likelihood.
 Maximum Likelihood Estimation (MLE): MLE is a method to estimate
the parameters of a statistical model by finding the parameter values that
maximize the likelihood function, which measures how well the model
explains the observed data.
 Posterior Probability: In the context of Bayesian inference, the EM
algorithm can be extended to estimate the maximum a posteriori (MAP)
estimates, where the posterior probability of the parameters is calculated
based on the prior distribution and the likelihood function.
 Expectation (E) Step: The E-step of the EM algorithm computes the
expected value or posterior probability of the latent variables given the
observed data and current parameter estimates. It involves calculating the
probabilities of each latent variable for each data point.
 Maximization (M) Step: The M-step of the EM algorithm updates the
parameter estimates by maximizing the expected log-likelihood obtained
from the E-step. It involves finding the parameter values that optimize the
likelihood function, typically through numerical optimization methods.
 Convergence: Convergence refers to the condition when the EM algorithm
has reached a stable solution. It is typically determined by checking if the
change in the log-likelihood or the parameter estimates falls below a
predefined threshold.
How Expectation-Maximization (EM) Algorithm Works:
The essence of the Expectation-Maximization algorithm is to use the available
observed data of the dataset to estimate the missing data and then use that data
to update the values of the parameters. Let us understand the EM algorithm in
detail.

EM Algorithm Flowchart

1. Initialization:
 Initially, a set of initial values of the parameters are considered. A set of
incomplete observed data is given to the system with the assumption
that the observed data comes from a specific model.
2. E-Step (Expectation Step): In this step, we use the observed data in order
to estimate or guess the values of the missing or incomplete data. It is
basically used to update the variables.

 Compute the posterior probability or responsibility of each latent variable given the observed data and current parameter estimates.
 Estimate the missing or incomplete data values using the current
parameter estimates.
 Compute the log-likelihood of the observed data based on the current
parameter estimates and estimated missing data.
3. M-step (Maximization Step): In this step, we use the complete data
generated in the preceding “Expectation” – step in order to update the
values of the parameters. It is basically used to update the hypothesis.
 Update the parameters of the model by maximizing the expected
complete data log-likelihood obtained from the E-step.
 This typically involves solving optimization problems to find the
parameter values that maximize the log-likelihood.
 The specific optimization technique used depends on the nature of the
problem and the model being used.
4. Convergence: In this step, it is checked whether the values are converging
or not, if yes, then stop otherwise repeat step-2 and step-3 i.e.
“Expectation” – step and “Maximization” – step until the convergence
occurs.
 Check for convergence by comparing the change in log-likelihood or the
parameter values between iterations.
 If the change is below a predefined threshold, stop and consider the
algorithm converged.
 Otherwise, go back to the E-step and repeat the process until
convergence is achieved.
Expectation-Maximization Algorithm Step by Step
Implementation
Import the necessary libraries:

import numpy as np
import seaborn as sns
from scipy.stats import norm
from scipy.stats import gaussian_kde
import matplotlib.pyplot as plt
Generate a dataset with two Gaussian components:

# Generate a dataset with two Gaussian components
mu1, sigma1 = 2, 1
mu2, sigma2 = -1, 0.8
X1 = np.random.normal(mu1, sigma1, size=200)
X2 = np.random.normal(mu2, sigma2, size=600)
X = np.concatenate([X1, X2])

# Plot the density estimation using seaborn
sns.kdeplot(X)
plt.xlabel('X')
plt.ylabel('Density')
plt.title('Density Estimation of X')
plt.show()

Output: (density plot of X)

Initialize parameters:

# Initialize parameters
mu1_hat, sigma1_hat = np.mean(X1), np.std(X1)
mu2_hat, sigma2_hat = np.mean(X2), np.std(X2)
pi1_hat, pi2_hat = len(X1) / len(X), len(X2) / len(X)

Perform EM algorithm
 Iterates for the specified number of epochs (20 in this case).
 In each epoch, the E-step calculates the responsibilities (gamma values) by
evaluating the Gaussian probability densities for each component and
weighting them by the corresponding proportions.
 The M-step updates the parameters by computing the weighted mean and
standard deviation for each component

# Perform EM algorithm for 20 epochs
num_epochs = 20
log_likelihoods = []

for epoch in range(num_epochs):
    # E-step: Compute responsibilities
    gamma1 = pi1_hat * norm.pdf(X, mu1_hat, sigma1_hat)
    gamma2 = pi2_hat * norm.pdf(X, mu2_hat, sigma2_hat)
    total = gamma1 + gamma2
    gamma1 /= total
    gamma2 /= total

    # M-step: Update parameters
    mu1_hat = np.sum(gamma1 * X) / np.sum(gamma1)
    mu2_hat = np.sum(gamma2 * X) / np.sum(gamma2)
    sigma1_hat = np.sqrt(np.sum(gamma1 * (X - mu1_hat)**2) / np.sum(gamma1))
    sigma2_hat = np.sqrt(np.sum(gamma2 * (X - mu2_hat)**2) / np.sum(gamma2))
    pi1_hat = np.mean(gamma1)
    pi2_hat = np.mean(gamma2)

    # Compute log-likelihood
    log_likelihood = np.sum(np.log(pi1_hat * norm.pdf(X, mu1_hat, sigma1_hat)
                                   + pi2_hat * norm.pdf(X, mu2_hat, sigma2_hat)))
    log_likelihoods.append(log_likelihood)

# Plot log-likelihood values over epochs
plt.plot(range(1, num_epochs + 1), log_likelihoods)
plt.xlabel('Epoch')
plt.ylabel('Log-Likelihood')
plt.title('Log-Likelihood vs. Epoch')
plt.show()

Output: (plot of log-likelihood vs. epoch)

Plot the final estimated density:

# Plot the final estimated density
X_sorted = np.sort(X)
density_estimation = (pi1_hat * norm.pdf(X_sorted, mu1_hat, sigma1_hat)
                      + pi2_hat * norm.pdf(X_sorted, mu2_hat, sigma2_hat))

plt.plot(X_sorted, gaussian_kde(X_sorted)(X_sorted), color='green', linewidth=2)
plt.plot(X_sorted, density_estimation, color='red', linewidth=2)
plt.xlabel('X')
plt.ylabel('Density')
plt.title('Density Estimation of X')
plt.legend(['Kernel Density Estimation', 'Mixture Density'])
plt.show()

Output: (plot of the estimated mixture density alongside the kernel density estimate)

Applications of the EM algorithm


 It can be used to fill in the missing data in a sample.
 It can be used as the basis of unsupervised learning of clusters.
 It can be used for the purpose of estimating the parameters of the Hidden
Markov Model (HMM).
 It can be used for discovering the values of latent variables.
Advantages of EM algorithm
 It is always guaranteed that likelihood will increase with each iteration.
 The E-step and M-step are often pretty easy for many problems in terms of
implementation.
 Solutions to the M-steps often exist in the closed form.
Disadvantages of EM algorithm
 It has slow convergence.
 It makes convergence to the local optima only.
 It requires both the probabilities, forward and backward (numerical
optimization requires only forward probability).

14. Spectral Clustering

Spectral Clustering is a variant of clustering that uses the connectivity between the data points to form the clusters. It uses eigenvalues and eigenvectors of the data's similarity matrix to project the data into a lower-dimensional space in which the data points are clustered. It is based on the idea of a graph representation of data, where the data points are represented as nodes and the similarity between data points is represented by weighted edges.

Steps performed for spectral Clustering

Building the Similarity Graph Of The Data: This step builds the Similarity
Graph in the form of an adjacency matrix which is represented by A. The
adjacency matrix can be built in the following manners:
 Epsilon-neighbourhood Graph: A parameter epsilon is fixed beforehand.
Then, each point is connected to all the points which lie in its epsilon-
radius. If all the distances between any two points are similar in scale then
typically the weights of the edges ie the distance between the two points are
not stored since they do not provide any additional information. Thus, in
this case, the graph built is an undirected and unweighted graph.
 K-Nearest Neighbours: A parameter k is fixed beforehand. Then, for two
vertices u and v, an edge is directed from u to v only if v is among the k-
nearest neighbours of u. Note that this leads to the formation of a weighted
and directed graph because it is not always the case that for each u having v
as one of the k-nearest neighbours, it will be the same case for v having u
among its k-nearest neighbours. To make this graph undirected, one of the
following approaches is followed:-
1. Direct an edge from u to v and from v to u if either v is among the k-
nearest neighbours of u OR u is among the k-nearest neighbours of v.
2. Direct an edge from u to v and from v to u if v is among the k-nearest
neighbours of u AND u is among the k-nearest neighbours of v.
 Fully-Connected Graph: To build this graph, each point is connected to every other point with an undirected edge weighted by the distance between the two points. Since this approach is used to model local neighbourhood relationships, the Gaussian similarity metric is typically used to calculate the distance.
Projecting the data onto a lower Dimensional Space: This step is done to
account for the possibility that members of the same cluster may be far away
in the given dimensional space. Thus the dimensional space is reduced so that
those points are closer in the reduced dimensional space and thus can be
clustered together by a traditional clustering algorithm. It is done by
computing the Graph Laplacian Matrix.

Python Code for the Graph Laplacian Matrix

To compute it, the degree of a node first needs to be defined. The degree of the i-th node is given by

di = ∑ wij, where the sum runs over all j such that (i, j) ∈ E.

Note that wij is the weight of the edge between nodes i and j, as defined in the adjacency matrix above.
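A short sketch of the computation promised by the heading above: it builds a k-nearest-neighbour similarity graph with scikit-learn, symmetrizes it so the graph is undirected, and forms the unnormalized graph Laplacian L = D - A. The Iris data and the choice of 10 neighbours are assumptions made only for illustration.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import kneighbors_graph

X = load_iris().data

# Adjacency matrix A of a k-nearest-neighbour graph, symmetrized to be undirected
A = kneighbors_graph(X, n_neighbors=10, mode='connectivity', include_self=False)
A = 0.5 * (A + A.T)
A = A.toarray()

# Degree matrix D, with d_i = sum over j of w_ij
D = np.diag(A.sum(axis=1))

# Unnormalized graph Laplacian
L = D - A

# The eigenvectors of L give the low-dimensional embedding used by spectral clustering
eigvals, eigvecs = np.linalg.eigh(L)
print(eigvals[:5])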
Credit Card Data Clustering Using Spectral Clustering
The below steps demonstrate how to implement Spectral Clustering using
Sklearn. The data for the following steps is the Credit Card Data which can
be downloaded from Kaggle

Step 1: Importing the required libraries


We will first import all the libraries that are needed for this project

 Python3

import pandas as pd

import matplotlib.pyplot as plt

from sklearn.cluster import SpectralClustering

from sklearn.preprocessing import StandardScaler, normalize

from sklearn.decomposition import PCA

from sklearn.metrics import silhouette_score

Step 2: Loading and Cleaning the Data


 Python3

# Changing the working location to the location of the data

cd "C:\Users\Dev\Desktop\Kaggle\Credit_Card"
# Loading the data

X = pd.read_csv('CC_GENERAL.csv')

# Dropping the CUST_ID column from the data

X = X.drop('CUST_ID', axis = 1)

# Handling the missing values if any

X.fillna(method ='ffill', inplace = True)

X.head()

Output:

Step 3: Preprocessing the data to make the data visualizable


 Python3

# Preprocessing the data to make it visualizable

# Scaling the Data

scaler = StandardScaler()

X_scaled = scaler.fit_transform(X)
# Normalizing the Data

X_normalized = normalize(X_scaled)

# Converting the numpy array into a pandas DataFrame

X_normalized = pd.DataFrame(X_normalized)

# Reducing the dimensions of the data

pca = PCA(n_components = 2)

X_principal = pca.fit_transform(X_normalized)

X_principal = pd.DataFrame(X_principal)

X_principal.columns = ['P1', 'P2']

X_principal.head()

Step 4: Building the Clustering models and Visualizing the Clustering
In the below steps, two different Spectral Clustering models are built with different values for the parameter 'affinity' (see the scikit-learn documentation of the SpectralClustering class for details).
a) affinity = 'rbf'
 Python3

# Building the clustering model


spectral_model_rbf = SpectralClustering(n_clusters = 2, affinity ='rbf')

# Training the model and Storing the predicted cluster labels

labels_rbf = spectral_model_rbf.fit_predict(X_principal)

 Python3

# Building the label to colour mapping

colours = {}

colours[0] = 'b'

colours[1] = 'y'

# Building the colour vector for each data point

cvec = [colours[label] for label in labels_rbf]

# Plotting the clustered scatter plot

b = plt.scatter(X_principal['P1'], X_principal['P2'], color ='b');

y = plt.scatter(X_principal['P1'], X_principal['P2'], color ='y');

plt.figure(figsize =(9, 9))

plt.scatter(X_principal['P1'], X_principal['P2'], c = cvec)

plt.legend((b, y), ('Label 0', 'Label 1'))


plt.show()

Output:

b) affinity = ‘nearest_neighbors’
 Python3

# Building the clustering model

spectral_model_nn = SpectralClustering(n_clusters = 2, affinity ='nearest_neighbors')

# Training the model and Storing the predicted cluster labels

labels_nn = spectral_model_nn.fit_predict(X_principal)

Output:
Step 5: Evaluating the performances
 Python3

# List of different values of affinity

affinity = ['rbf', 'nearest-neighbours']

# List of Silhouette Scores

s_scores = []

# Evaluating the performance

s_scores.append(silhouette_score(X, labels_rbf))

s_scores.append(silhouette_score(X, labels_nn))

print(s_scores)

Step 6: Comparing the performances


 Python3

# Plotting a Bar Graph to compare the models

plt.bar(affinity, s_scores)

plt.xlabel('Affinity')

plt.ylabel('Silhouette Score')

plt.title('Comparison of different Clustering Models')

plt.show()

Output:

Spectral Clustering is a type of clustering algorithm in machine learning that uses eigenvectors of a similarity matrix to divide a set of data points into clusters. The basic idea behind spectral clustering is to use the eigenvectors of the Laplacian matrix of a graph to represent the data points and to find clusters by applying k-means or another clustering algorithm to those eigenvectors.

Advantages

 Effective for non-convex clusters.
 Can identify complex cluster shapes that traditional methods might miss.
 Relies on the graph representation, making it adaptable to various similarity measures.

Disadvantages
 Requires choosing parameters like the number of clusters k, the similarity measure, and its parameters (e.g., σ in the Gaussian similarity).
 Computationally expensive for large datasets due to eigenvalue decomposition.

Applications

 Image segmentation.
 Social network analysis.
 Document clustering.
 Bioinformatics (e.g., clustering genes or proteins).
