MACHINE LEARNING
Machine learning is an application of artificial intelligence (AI) that gives systems the ability to automatically learn and improve from experience without being explicitly programmed.
Types of machine learning
• K-Means Clustering
• Gaussian Mixture Models
• Dirichlet Process
K-Means Clustering
Clustering is the classification of objects into different groups or, more precisely, the partitioning of a data set into subsets (clusters) so that the data in each subset share some common trait.
K-means Clustering
Types of Clustering:
1. Hierarchical
2. Partitional
3. Density-based
4. Fuzzy logic
K-means Clustering
• A clustering algorithm in which the K clusters are formed by the closeness of data points to a reference point (the centroid of a cluster).
• It clusters n objects into k partitions based on their attributes, where k < n.

Terminology
Centroid
• The reference point of a given cluster. Centroids are used to label new data, i.e. to determine which cluster a new data point belongs to.
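Labeling a new data point with a centroid can be sketched in a few lines of NumPy. This is a minimal illustration, not part of the original slides; the centroid coordinates are made-up toy values.

```python
import numpy as np

# Hypothetical centroids for k = 2 clusters in a 2-D feature space.
centroids = np.array([[0.0, 0.0],
                      [5.0, 5.0]])

def assign_cluster(point, centroids):
    """Label a new data point with the index of its nearest centroid."""
    distances = np.linalg.norm(centroids - point, axis=1)  # Euclidean distance
    return int(np.argmin(distances))

label = assign_cluster(np.array([4.0, 6.0]), centroids)  # nearest to (5, 5)
```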
K-means Clustering (How it works)
The algorithm performs two major steps:
1. Data assignment
   • Selection of centroids:
     • Random selection
     • Random generation
     • K-means++
   • Assign data points to centroids based on distance:
     • Euclidean distance
     • Manhattan distance
     • Hamming distance
     • Inner product space
K-means Clustering (How it works)
2. Centroid update
   • Centroids are recomputed as the mean of all data points assigned to each cluster in step 1.
   • Steps 1 and 2 are run iteratively until:
     • the centroids no longer change, i.e. distances stay the same and no data points change clusters;
     • some maximum number of iterations is reached; or
     • some other condition is fulfilled (e.g. a minimum distance is achieved).
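The two steps (assignment and centroid update) can be sketched as a short NumPy function. This is a minimal illustration, assuming random selection of initial centroids and Euclidean distance; empty clusters simply keep their previous centroid, which a production implementation would handle more carefully.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Minimal K-means sketch: random initial centroids, Euclidean
    assignment, mean-based update, stop when centroids stop moving."""
    rng = np.random.default_rng(seed)
    # Random selection: pick k distinct data points as initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 1: assign each point to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 2: recompute each centroid as the mean of its assigned points
        # (an empty cluster keeps its old centroid in this sketch).
        new_centroids = np.array([X[labels == j].mean(axis=0)
                                  if np.any(labels == j) else centroids[j]
                                  for j in range(k)])
        if np.allclose(new_centroids, centroids):  # convergence: no movement
            break
        centroids = new_centroids
    return labels, centroids
```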
NOTE
The centroids do not change after the 2nd iteration, so we stop updating them.
Remember! Our goal is to identify the optimum value of K.
K-means Clustering (How to pick the optimum k)
• Minimize the within-cluster sum of squares (tighter clusters) and increase the between-cluster sum of squares:

  WCSS = Σ (j = 1..K) Σ (xn ∈ Sj) ||xn − µj||²

Where:
• Sj is a specific cluster out of the K clusters.
• xn is a data point within cluster Sj.
• µj is the centroid of cluster Sj.
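The within-cluster sum of squares defined above translates directly into code. A minimal NumPy sketch (the function name is illustrative):

```python
import numpy as np

def wcss(X, labels, centroids):
    """Within-cluster sum of squares: for each cluster S_j, sum
    ||x_n - mu_j||^2 over every point x_n assigned to S_j."""
    return sum(((X[labels == j] - mu) ** 2).sum()
               for j, mu in enumerate(centroids))
```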
K-means Clustering (How to pick the optimum k)
There are two common methods:
1. The Elbow Method
   I. Calculate the within-cluster sum of squares for each candidate k.
   II. Plot the sum of squares against the number of clusters, k.
   III. Observe the change in the sum of squares to select the optimum k.
Observing the graph at k = 3 (the elbow), the sum of squares does not reduce significantly for k > 3.
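The elbow procedure can be sketched with scikit-learn (assuming it is installed; the three toy blobs below are illustrative, chosen so the elbow appears at k = 3):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Toy data: three well-separated blobs, so the elbow should sit at k = 3.
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(30, 2))
               for c in ([0, 0], [8, 0], [4, 7])])

inertias = []
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)  # within-cluster sum of squares

# Plotting inertias against k would show the curve flattening after k = 3.
```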
K-means Clustering (How to pick the optimum k)
A limitation of the elbow method is that the elbow might not be well defined. This can be overcome using the silhouette method.
2. Silhouette Method
• The silhouette value is a measure of how similar an object is to its own cluster (cohesion) compared to other clusters (separation).
• It ranges from −1 to +1: a value near +1 shows that the point is very close to its own cluster, while a value near −1 shows that the point is more similar to the neighboring cluster.
K-means Clustering (How to pick the optimum k)
• The silhouette value s(i) of a point i is mathematically defined as

  s(i) = (b(i) − a(i)) / max(a(i), b(i))

Where:
• b(i) is the mean distance of point i to the points in its neighboring cluster.
• a(i) is the mean distance of point i to the other points in its own cluster.
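The definition of s(i) can be implemented directly. A minimal sketch (the function name and toy data are illustrative; a cluster containing only point i itself is not handled):

```python
import numpy as np

def silhouette(i, X, labels):
    """s(i) = (b(i) - a(i)) / max(a(i), b(i)) for a single point i."""
    own = labels[i]
    # a(i): mean distance to the other points in point i's own cluster.
    a = np.mean([np.linalg.norm(X[i] - X[j])
                 for j in range(len(X)) if labels[j] == own and j != i])
    # b(i): mean distance to the points of the nearest other cluster.
    b = min(np.mean([np.linalg.norm(X[i] - X[j])
                     for j in range(len(X)) if labels[j] == c])
            for c in set(labels) - {own})
    return (b - a) / max(a, b)
```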
K-means Clustering (Advantages)
• It is guaranteed to converge
• Easily scales to large datasets
• Has a linear time complexity O(tkn), where:
  • t – number of iterations
  • k – number of clusters
  • n – number of data points
K-means Clustering (Limitations)
• k is chosen manually
• Clusters typically depend on the initial centroids
• Outliers can drastically affect centroids
• Can give unrealistic clusters, i.e. converge to a local optimum
• The organization/order of the data may have an impact on results
• Sensitive to the scale of the features