Clustering
[Link] Sharma
[Link]@[Link]
LinkedIn: [Link]
Types of Machine Learning
Supervised Learning
• Supervised machine
learning algorithms uncover
insights, patterns, and relationships
from a labelled training dataset –
that is, a dataset that already
contains a known value for
the target variable for each record.
Supervised Machine Learning Techniques
1. Regression: Regression algorithms are used if there is a relationship between the input variable and the output variable. Regression is used for the prediction of continuous variables, such as weather forecasting, market trends, etc. Some popular regression algorithms which come under supervised learning:
• Linear Regression
• Regression Trees
• Non-Linear Regression
• Bayesian Linear Regression
• Polynomial Regression
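As a quick illustration of the regression setting described above, the sketch below fits a plain linear regression on a small synthetic dataset; the data and variable names are assumptions for demonstration, not part of the original slides.

```python
# Minimal sketch of supervised regression with scikit-learn (synthetic data assumed).
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy continuous target: y is roughly 3*x + 5 plus noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))          # single input variable
y = 3 * X.ravel() + 5 + rng.normal(0, 1, 100)  # continuous output variable

model = LinearRegression().fit(X, y)           # learn the input-output relationship
print(model.coef_, model.intercept_)           # should be close to 3 and 5
print(model.predict([[4.0]]))                  # predict a continuous value for a new input
```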
• Suppose we have a dataset of different types of shapes, which includes square, rectangle, triangle, and polygon. Now the first step is to train the model for each shape.
• If the given shape has four sides, and all the sides are
equal, then it will be labelled as a Square.
• If the given shape has three sides, then it will be
labelled as a triangle.
• If the given shape has six equal sides, then it will be labelled as a hexagon.
• Now, after training, we test our model using the test
set, and the task of the model is to identify the shape.
• The machine is already trained on all types of shapes, and when it finds a new shape, it classifies the shape on the basis of the number of sides and predicts the output.
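The shape example above can be turned into a tiny labelled dataset; the sketch below is an illustrative assumption (features: number of sides and whether all sides are equal) showing how a trained classifier predicts the shape of a new observation.

```python
# Illustrative sketch of the shape example: train on labelled shapes, then classify a new one.
from sklearn.tree import DecisionTreeClassifier

# Features: [number_of_sides, all_sides_equal (1/0)] -- an assumed encoding for this example.
X_train = [[4, 1], [4, 0], [3, 1], [3, 0], [6, 1]]
y_train = ["square", "rectangle", "triangle", "triangle", "hexagon"]

clf = DecisionTreeClassifier().fit(X_train, y_train)

# A new shape with four equal sides should be classified as a square.
print(clf.predict([[4, 1]]))   # ['square']
```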
2. Classification: Classification algorithms are used when the output variable is categorical, which means it takes a discrete set of classes such as Yes/No, Male/Female, True/False, etc.
Example application: spam filtering. Some popular classification algorithms which come under supervised learning:
• Random Forest
• Decision Trees
• Logistic Regression
• Support Vector Machines
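For instance, a minimal spam-filtering sketch with logistic regression might look like the following; the toy messages and labels are invented purely for illustration.

```python
# Toy spam-filtering sketch: bag-of-words features + logistic regression (invented data).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

messages = ["win a free prize now", "meeting at 10 am tomorrow",
            "free offer click now", "lunch with the project team"]
labels = ["spam", "ham", "spam", "ham"]            # categorical (two-class) output

spam_filter = make_pipeline(CountVectorizer(), LogisticRegression())
spam_filter.fit(messages, labels)

print(spam_filter.predict(["free prize offer"]))   # expected: ['spam']
```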
Unsupervised learning
• Unsupervised learning refers to the use of artificial intelligence
(AI) algorithms to identify patterns in data sets containing data
points that are neither classified nor labelled.
• In other words, unsupervised learning allows the system
to identify patterns within data sets on its own.
• In unsupervised learning, an AI system will group unsorted
information according to similarities and differences even
though there are no categories provided.
• The goal of unsupervised learning is to find the underlying structure of the dataset, group the data according to similarities, and represent the dataset in a compressed format.
What is Clustering?
• Clustering is the task of dividing the population or data points into a number of groups such that data points in the same group are more similar to each other than to data points in other groups.
• In simple words, the aim is to segregate groups with similar traits and assign them into clusters.
• Clustering is the act of organizing similar objects into groups within a machine learning algorithm.
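As a concrete sketch of this idea, the snippet below groups a small unlabelled synthetic dataset with K-Means (the algorithm used in the following slides); the data and parameter values are assumptions for illustration.

```python
# Minimal clustering sketch: group unlabelled points into K clusters (synthetic data assumed).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
# Two blobs of points with no labels attached.
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_[:5], kmeans.labels_[-5:])   # cluster assignments for some points
print(kmeans.cluster_centers_)                   # one centroid per cluster
```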
Data Preparation –
Feature Scaling
• All machine learning algorithms that rely on a distance-based metric, such as Euclidean distance, are adversely affected if the input values are not standardized to a common range.
• For example, suppose our K-Means model is based on two variables, age and salary. The age variable will have much smaller values than the salary variable, so when the distance from the centroid is calculated for each observation, the salary variable will dominate the distance calculation, effectively making the age variable unimportant. It therefore becomes important to scale the values of the two variables to the same fixed range before training the model on them (see the sketch below).
• Generally, it is considered good practice to scale your input data even if you are not expecting it to make much difference, because even algorithms that are not based on Euclidean distance (such as gradient descent) are known to converge faster to a solution when the data is scaled. However, the actual decision also depends on the context and the problem at hand.
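A small sketch of the age/salary example using scikit-learn's StandardScaler is shown below; the numbers are made up for illustration.

```python
# Scale age and salary to comparable ranges before distance-based clustering (toy values assumed).
import numpy as np
from sklearn.preprocessing import StandardScaler

# Columns: [age, salary] -- salary values are orders of magnitude larger than age.
X = np.array([[25, 30000.0],
              [40, 80000.0],
              [58, 120000.0]])

X_scaled = StandardScaler().fit_transform(X)   # zero mean, unit variance per column
print(X_scaled)   # both columns now contribute comparably to Euclidean distances
```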
Finding the Optimal Number of
Clusters (K): Scree Plot
• K-Means clustering is widely used for exploratory data analysis. While its dependence on initialization is well known, it is common practice to assume that the partition with the lowest total sum of squares (SSQ), i.e., within-cluster variance, is both reproducible under repeated initializations and the closest that K-Means can get to the true structure when applied to synthetic data.
• There is no general theoretical solution for finding the optimal number of clusters for any given dataset.
• One way to find the optimal value of K is to compare the resulting SSQ of multiple runs of the K-Means algorithm with different K values and choose the best one (having minimum SSQ). We can do this by plotting the K values against the SSQ values and picking the optimal K from the graph. This method is called the elbow method, as we look for a bend or "elbow point" in the graph; the graph itself is called a scree plot.
• We need to be careful while picking a K value, because as K increases the SSQ will keep decreasing: with more clusters, points will be closer to their centroids.
Finding the
Optimal Number
of Clusters (K)
• For example, in the scree plot shown on the right, we can see that the SSQ initially decreased rapidly with an increasing number of clusters and soon started to decrease only gradually. We can set the number of clusters to 3 in this case.
• For each fitted K-Means model, Scikit-Learn provides an attribute named "inertia_" that holds the total within-cluster sum of squares (the cluster inertia). We can directly plot this value against the corresponding K to visualize the scree plot, as sketched below.
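The scree plot described above can be produced with a loop over candidate K values, reading each fitted model's inertia_ attribute; the synthetic data below is assumed for illustration.

```python
# Elbow/scree plot sketch: total within-cluster SSQ (inertia_) versus K (synthetic data assumed).
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three well-separated blobs, so the "elbow" should appear near K = 3.
X = np.vstack([rng.normal(c, 0.4, (50, 2)) for c in (0, 4, 8)])

ks = range(1, 9)
ssq = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(ks, ssq, marker="o")
plt.xlabel("Number of clusters (K)")
plt.ylabel("Total within-cluster SSQ (inertia_)")
plt.show()
```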
Cluster Analysis
Hierarchical Clustering
• Hierarchical clustering is an alternative to K-Means clustering that can yield very different clusters by seeking to build a hierarchy of clusters.
• It is a distance-based algorithm that is based on the core idea of objects being more
related to nearby objects than to objects farther away.
• This clustering method lends itself to an intuitive graphical display in the form of a tree-
based representation of the observations, called a dendrogram, leading to easier
interpretation of the clusters.
• Another advantage of hierarchical clustering is that it doesn’t require us to pre-specify the
number of clusters K.
• Hierarchical clustering, however, is computationally expensive and doesn't scale well to large datasets.
Interpreting a Dendrogram
• The name comes from the Greek words dendro (tree) and
gramma (drawing).
• A dendrogram is a visual representation of the observations and
the hierarchy of clusters to which they belong.
• We start from the bottom, where each observation is in a cluster of its own.
• Next, based on some measure of distance, we group the most similar clusters together; this is represented by fusing the clusters into one branch. The height at which the branch forms is directly related to how similar the observations are.
• Thus, observations that fuse at the very bottom of the tree are quite similar to each other, whereas observations that fuse close to the top of the tree tend to be quite different. We interpret "similarity" between clusters by looking at the height on the vertical axis at which they fuse, not their proximity on the horizontal axis.
• We keep doing this while moving upwards until there is only
one cluster left.
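A minimal sketch of building and drawing a dendrogram with SciPy is shown below; the random data and the choice of Ward linkage are assumptions for illustration.

```python
# Hierarchical clustering sketch: agglomerate points and plot the resulting dendrogram.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (10, 2)), rng.normal(4, 0.5, (10, 2))])

Z = linkage(X, method="ward")   # bottom-up merging of the most similar clusters
dendrogram(Z)                   # observations fuse lower down when they are more similar
plt.ylabel("Fusion height (dissimilarity)")
plt.show()
```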
Distance Metrics
• Hierarchical clustering supports various metrics for
measuring distance between clusters.
– Euclidean distance: The Euclidean distance
between two points is the length of the shortest
path connecting them.
– Manhattan distance: The Manhattan distance is the sum of the side lengths (length plus width) of the axis-aligned rectangle formed by the two points, i.e., the sum of the absolute differences of their coordinates.
– Cosine distance: The Cosine distance is based on the angle subtended at the origin between the two points (one minus the cosine of that angle).
• Apart from the above metrics, you can also use a
precomputed distance matrix as a metric of distance.
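The sketch below compares the three metrics on a single pair of points with SciPy's pdist, and shows how a metric name is passed to the linkage step; the points are chosen purely for illustration.

```python
# Comparing distance metrics on one pair of points (values chosen for illustration).
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage

X = np.array([[2.0, 1.0], [5.0, 5.0]])
print(pdist(X, metric="euclidean"))   # [5.0]  length of the straight line between the points
print(pdist(X, metric="cityblock"))   # [7.0]  Manhattan distance: |5-2| + |5-1|
print(pdist(X, metric="cosine"))      # 1 - cos(angle between the two position vectors)

# The same metric names can be used when building the hierarchy:
Z = linkage(X, method="average", metric="cityblock")
```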
Cutting a Dendrogram
• After creating a dendrogram we set a dissimilarity
threshold such that no cluster has more within-cluster
dissimilarity than this threshold value.
• We can visualize this as a horizontal line cutting through
our dendrogram.
• The ideal value of the threshold depends upon the context
and business case and will vary from case-to-case.
• The height of the cut through the dendrogram serves the same role as K in K-Means clustering: it controls the number of clusters obtained. However, we only need one dendrogram to obtain any number of clusters.
(Figure: original dendrogram; 2 clusters at a cut height of 9; 3 clusters at a cut height of 5.)
• One rule of thumb is to cut the dendrogram where the gap
between two successive combination similarities is largest.
Such large gaps arguably indicate "natural" clusterings.
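Cutting the dendrogram at a height threshold can be sketched with SciPy's fcluster; the data and the threshold value below are assumptions for illustration.

```python
# Cutting a dendrogram at a chosen height to obtain flat cluster labels (toy data assumed).
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(c, 0.3, (15, 2)) for c in (0, 3, 7)])

Z = linkage(X, method="ward")
labels = fcluster(Z, t=5.0, criterion="distance")   # horizontal cut at an assumed height of 5
print(np.unique(labels))                            # cluster ids obtained below the cut
```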
Measuring Goodness of fit – Silhouette Coefficient
• Silhouette coefficient is an intrinsic method to evaluate the quality of a clustering algorithm. It is a measure
of how well each object lies within its cluster.
• Silhouette Coefficient is calculated in three steps:
1. Calculate the cluster cohesion: For any observation i, the cohesion a(i) is defined as the average distance between i and all other observations within the same cluster. It is a measure of how well i is assigned to its cluster: the smaller the value, the better the assignment.
2. Calculate the cluster separation: For any observation i, the separation b(i) is defined as the lowest average distance from i to all points in any other cluster of which i is not a member. The cluster with this lowest average dissimilarity is said to be the "neighboring cluster" of i, because it is the next best fit cluster for point i.
3. Calculate the silhouette as:
s(i) = (b(i) − a(i)) / max(a(i), b(i))
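scikit-learn exposes this measure through silhouette_score (the mean s(i) over all observations) and silhouette_samples (per-observation values); the sketch below applies them to an assumed toy K-Means result.

```python
# Silhouette sketch: average s(i) over all points for a fitted K-Means model (toy data assumed).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, silhouette_samples

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(c, 0.5, (40, 2)) for c in (0, 5)])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(silhouette_score(X, labels))        # close to 1 => compact, well-separated clusters
print(silhouette_samples(X, labels)[:5])  # per-observation silhouette values s(i)
```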
• Hands-on with Orange…