Clustering
[Link] Sharma
[Link]@[Link]
LinkedIn: [Link]
Types of Machine Learning
Supervised Learning
• Supervised machine
learning algorithms uncover
insights, patterns, and relationships
from a labelled training dataset –
that is, a dataset that already
contains a known value for
the target variable for each record.
Supervised Machine Learning Techniques
1. Regression: Regression algorithms are used if there is a relationship between the input variable and the output variable. Regression is used for the prediction of continuous variables, such as weather forecasting, market trends, etc. Some popular regression algorithms which come under supervised learning:
• Linear Regression
• Regression Trees
• Non-Linear Regression
• Bayesian Linear Regression
• Polynomial Regression
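As a quick illustration of the regression setting described above, the sketch below fits a plain linear regression on a small synthetic dataset; the data and variable names are assumptions for demonstration, not part of the original slides.

```python
# Minimal sketch of supervised regression with scikit-learn (synthetic data assumed).
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy continuous target: y is roughly 3*x + 5 plus noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))          # single input variable
y = 3 * X.ravel() + 5 + rng.normal(0, 1, 100)  # continuous output variable

model = LinearRegression().fit(X, y)           # learn the input-output relationship
print(model.coef_, model.intercept_)           # should be close to 3 and 5
print(model.predict([[4.0]]))                  # predict a continuous value for a new input
```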
• Suppose we have a dataset of different types of shapes, which includes square, rectangle, triangle, and polygon. Now the first step is to train the model for each shape.
• If the given shape has four sides, and all the sides are
equal, then it will be labelled as a Square.
• If the given shape has three sides, then it will be
labelled as a triangle.
• If the given shape has six equal sides, then it will be labelled as a hexagon.
• Now, after training, we test our model using the test
set, and the task of the model is to identify the shape.
• The machine is already trained on all types of shapes, and when it finds a new shape, it classifies the shape on the basis of the number of sides and predicts the output.
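The shape example above can be turned into a tiny labelled dataset; the sketch below is an illustrative assumption (features: number of sides and whether all sides are equal) showing how a trained classifier predicts the shape of a new observation.

```python
# Illustrative sketch of the shape example: train on labelled shapes, then classify a new one.
from sklearn.tree import DecisionTreeClassifier

# Features: [number_of_sides, all_sides_equal (1/0)] -- an assumed encoding for this example.
X_train = [[4, 1], [4, 0], [3, 1], [3, 0], [6, 1]]
y_train = ["square", "rectangle", "triangle", "triangle", "hexagon"]

clf = DecisionTreeClassifier().fit(X_train, y_train)

# A new shape with four equal sides should be classified as a square.
print(clf.predict([[4, 1]]))   # ['square']
```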
2. Classification: Classification algorithms are used when the output variable is categorical, which means it takes a discrete set of classes such as Yes/No, Male/Female, True/False, etc.
Example application: spam filtering. Some popular classification algorithms which come under supervised learning:
• Random Forest
• Decision Trees
• Logistic Regression
• Support Vector Machines
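For instance, a minimal spam-filtering sketch with logistic regression might look like the following; the toy messages and labels are invented purely for illustration.

```python
# Toy spam-filtering sketch: bag-of-words features + logistic regression (invented data).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

messages = ["win a free prize now", "meeting at 10 am tomorrow",
            "free offer click now", "lunch with the project team"]
labels = ["spam", "ham", "spam", "ham"]            # categorical (two-class) output

spam_filter = make_pipeline(CountVectorizer(), LogisticRegression())
spam_filter.fit(messages, labels)

print(spam_filter.predict(["free prize offer"]))   # expected: ['spam']
```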
Unsupervised learning
• Unsupervised learning refers to the use of artificial intelligence
(AI) algorithms to identify patterns in data sets containing data
points that are neither classified nor labelled.
• In other words, unsupervised learning allows the system
to identify patterns within data sets on its own.
• In unsupervised learning, an AI system will group unsorted
information according to similarities and differences even
though there are no categories provided.
• The goal of unsupervised learning is to find the underlying structure of the dataset, group the data according to similarities, and represent the dataset in a compressed format.
What is Clustering?
• Clustering is the task of dividing the population or data points into a number of groups such that data points in the same group are more similar to each other than to data points in other groups.
• In simple words, the aim is to segregate groups with similar traits and assign them into clusters.
• Clustering is the act of organizing similar objects into groups within a machine learning algorithm.
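As a concrete sketch of this idea, the snippet below groups a small unlabelled synthetic dataset with K-Means (the algorithm used in the following slides); the data and parameter values are assumptions for illustration.

```python
# Minimal clustering sketch: group unlabelled points into K clusters (synthetic data assumed).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
# Two blobs of points with no labels attached.
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_[:5], kmeans.labels_[-5:])   # cluster assignments for some points
print(kmeans.cluster_centers_)                   # one centroid per cluster
```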
Data Preparation –
Feature Scaling
• All machine learning algorithms that rely on a distance-based metric, such as Euclidean distance, are adversely affected if the input values are not standardized to a common range.
• For example, suppose our K-Means model is based on two variables, age and salary. The age variable will have much smaller values than the salary variable, so when the distance from the centroid is calculated for each observation, the salary variable will dominate the distance calculation, effectively making the age variable unimportant. It therefore becomes important to scale the values of the two variables to the same fixed range before training the model on them (see the sketch below).
• Generally, it is considered good practice to scale your input data even if you are not expecting it to make much difference, because even algorithms that are not based on Euclidean distance (such as gradient descent) are known to converge faster to a solution when the data is scaled. However, the actual decision also depends on the context and the problem at hand.
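A small sketch of the age/salary example using scikit-learn's StandardScaler is shown below; the numbers are made up for illustration.

```python
# Scale age and salary to comparable ranges before distance-based clustering (toy values assumed).
import numpy as np
from sklearn.preprocessing import StandardScaler

# Columns: [age, salary] -- salary values are orders of magnitude larger than age.
X = np.array([[25, 30000.0],
              [40, 80000.0],
              [58, 120000.0]])

X_scaled = StandardScaler().fit_transform(X)   # zero mean, unit variance per column
print(X_scaled)   # both columns now contribute comparably to Euclidean distances
```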
Finding the Optimal Number of
Clusters (K): Scree Plot
• K-Means clustering is widely used for exploratory data analysis. While its dependence on initialization is well known, it is common practice to assume that the partition with the lowest total sum of squares (SSQ), i.e., within-cluster variance, is both reproducible under repeated initializations and the closest that K-Means can get to the true structure when applied to synthetic data.
• There is no general theoretical solution for finding the optimal number of clusters for any given dataset.
• One way to find the optimal value of K is to compare the resulting SSQ of multiple runs of the K-Means algorithm with different K values and choose the best one (having minimum SSQ). We can do this by plotting the K values against the SSQ values and picking the optimal K from the graph. This method is called the elbow method, as we look for a bend or "elbow point" in the graph; the graph itself is called a scree plot.
• We need to be careful while picking a K value, because as K increases the SSQ will keep decreasing: with more clusters, points will be closer to their centroids.
Finding the
Optimal Number
of Clusters (K)
• For example, in the scree plot shown on the right, we can see that the SSQ initially decreased rapidly with an increasing number of clusters and soon started to decrease only gradually. We can set the number of clusters to 3 in this case.
• For each fitted K-Means model, Scikit-Learn provides an attribute named "inertia_" that holds the total within-cluster sum of squares (the cluster inertia). We can directly plot this value against the corresponding K to visualize the scree plot, as sketched below.
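The scree plot described above can be produced with a loop over candidate K values, reading each fitted model's inertia_ attribute; the synthetic data below is assumed for illustration.

```python
# Elbow/scree plot sketch: total within-cluster SSQ (inertia_) versus K (synthetic data assumed).
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three well-separated blobs, so the "elbow" should appear near K = 3.
X = np.vstack([rng.normal(c, 0.4, (50, 2)) for c in (0, 4, 8)])

ks = range(1, 9)
ssq = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(ks, ssq, marker="o")
plt.xlabel("Number of clusters (K)")
plt.ylabel("Total within-cluster SSQ (inertia_)")
plt.show()
```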
Cluster Analysis
Hierarchical Clustering
• Hierarchical clustering is an alternative to K-Means clustering that can yield very different clusters by seeking to build a hierarchy of clusters.
• It is a distance-based algorithm that is based on the core idea of objects being more
related to nearby objects than to objects farther away.
• This clustering method lends itself to an intuitive graphical display in the form of a tree-
based representation of the observations, called a dendrogram, leading to easier
interpretation of the clusters.
• Another advantage of hierarchical clustering is that it doesn’t require us to pre-specify the
number of clusters K.
• Hierarchical clustering, however, is computationally expensive and doesn't scale well to large datasets.
Interpreting a Dendrogram
• The name comes from the Greek words dendro (tree) and
gramma (drawing).
• A dendrogram is a visual representation of the observations and
the hierarchy of clusters to which they belong.
• We start from the bottom, where each observation is in a cluster of its own.
• Next, based on some measure of distance, we group the most similar clusters together; this is represented by fusing the clusters into one branch. The height at which the branch forms is directly related to how similar the observations are.
• Thus, observations that fuse at the very bottom of the tree are quite similar to each other, whereas observations that fuse close to the top of the tree tend to be quite different. We interpret "similarity" between clusters by looking at the height on the vertical axis at which they fuse, not their proximity on the horizontal axis.
• We keep doing this while moving upwards until there is only
one cluster left.
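A minimal sketch of building and drawing a dendrogram with SciPy is shown below; the random data and the choice of Ward linkage are assumptions for illustration.

```python
# Hierarchical clustering sketch: agglomerate points and plot the resulting dendrogram.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (10, 2)), rng.normal(4, 0.5, (10, 2))])

Z = linkage(X, method="ward")   # bottom-up merging of the most similar clusters
dendrogram(Z)                   # observations fuse lower down when they are more similar
plt.ylabel("Fusion height (dissimilarity)")
plt.show()
```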
Distance Metrics
• Hierarchical clustering supports various metrics for
measuring distance between clusters.
– Euclidean distance: The Euclidean distance
between two points is the length of the shortest
path connecting them.
– Manhattan distance: The Manhattan distance is the sum of the side lengths (length plus width) of the axis-aligned rectangle formed by the two points, i.e., the sum of the absolute differences of their coordinates.
– Cosine distance: The Cosine distance is based on the angle subtended at the origin between the two points (one minus the cosine of that angle).
• Apart from the above metrics, you can also use a
precomputed distance matrix as a metric of distance.
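The sketch below compares the three metrics on a single pair of points with SciPy's pdist, and shows how a metric name is passed to the linkage step; the points are chosen purely for illustration.

```python
# Comparing distance metrics on one pair of points (values chosen for illustration).
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage

X = np.array([[2.0, 1.0], [5.0, 5.0]])
print(pdist(X, metric="euclidean"))   # [5.0]  length of the straight line between the points
print(pdist(X, metric="cityblock"))   # [7.0]  Manhattan distance: |5-2| + |5-1|
print(pdist(X, metric="cosine"))      # 1 - cos(angle between the two position vectors)

# The same metric names can be used when building the hierarchy:
Z = linkage(X, method="average", metric="cityblock")
```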
Cutting a Dendrogram
• After creating a dendrogram we set a dissimilarity
threshold such that no cluster has more within-cluster
dissimilarity than this threshold value.
• We can visualize this as a horizontal line cutting through
our dendrogram.
• The ideal value of the threshold depends upon the context
and business case and will vary from case-to-case.
• The height of the cut through the dendrogram serves the same role as K in K-Means clustering: it controls the number of clusters obtained. However, we only need one dendrogram to obtain any number of clusters.
(Figure: original dendrogram; 2 clusters at a cut height of 9; 3 clusters at a cut height of 5.)
• One rule of thumb is to cut the dendrogram where the gap
between two successive combination similarities is largest.
Such large gaps arguably indicate "natural" clusterings.
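Cutting the dendrogram at a height threshold can be sketched with SciPy's fcluster; the data and the threshold value below are assumptions for illustration.

```python
# Cutting a dendrogram at a chosen height to obtain flat cluster labels (toy data assumed).
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(c, 0.3, (15, 2)) for c in (0, 3, 7)])

Z = linkage(X, method="ward")
labels = fcluster(Z, t=5.0, criterion="distance")   # horizontal cut at an assumed height of 5
print(np.unique(labels))                            # cluster ids obtained below the cut
```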
Measuring Goodness of fit – Silhouette Coefficient
• Silhouette coefficient is an intrinsic method to evaluate the quality of a clustering algorithm. It is a measure
of how well each object lies within its cluster.
• Silhouette Coefficient is calculated in three steps:
1. Calculate the cluster cohesion: For any observation i, the cohesion a(i) is defined as the average distance between i and all other observations within the same cluster. It is a measure of how well i is assigned to its cluster: the smaller the value, the better the assignment.
2. Calculate the cluster separation: For any observation i, the separation b(i) is defined as the lowest average distance from i to all points in any other cluster of which i is not a member. The cluster with this lowest average dissimilarity is said to be the "neighboring cluster" of i, because it is the next best fit cluster for point i.
3. Calculate the silhouette as:
s(i) = (b(i) − a(i)) / max(a(i), b(i))
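scikit-learn exposes this measure through silhouette_score (the mean s(i) over all observations) and silhouette_samples (per-observation values); the sketch below applies them to an assumed toy K-Means result.

```python
# Silhouette sketch: average s(i) over all points for a fitted K-Means model (toy data assumed).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, silhouette_samples

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(c, 0.5, (40, 2)) for c in (0, 5)])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(silhouette_score(X, labels))        # close to 1 => compact, well-separated clusters
print(silhouette_samples(X, labels)[:5])  # per-observation silhouette values s(i)
```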
• Hands-on with Orange…