K-Means Clustering
What is K-Means Clustering?
What is K-Means Clustering? K-Means divides objects into clusters such that objects within the same cluster are "similar" to each other and "dissimilar" to objects belonging to other clusters
What is K-Means Clustering? Can you explain this with an example?
What is K-Means Clustering? Sure. To understand K-Means better, let's take an example from cricket
What is K-Means Clustering? Task: Identify bowlers and batsmen
What is K-Means Clustering? Task: Identify bowlers and batsmen. The data contains the runs scored and wickets taken in the last 10 matches, so a bowler will have more wickets and a batsman will have higher runs
What is K-Means Clustering? Assign data points: here we have our dataset with x and y coordinates (runs and wickets), and we want to cluster this data using K-Means
What is K-Means Clustering? Cluster 1: we can see that this cluster contains players with high runs and low wickets
What is K-Means Clustering? Cluster 2: and here, we can see that this cluster contains players with high wickets and low runs
What is K-Means Clustering? Consider the same cricket dataset and solve the problem using K-Means
What is K-Means Clustering? Initially, two centroids are assigned randomly. The Euclidean distance is used to find out which centroid is closest to each data point, and the data points are assigned to the corresponding centroids
What is K-Means Clustering? Reposition the two centroids for optimization.
What is K-Means Clustering? The process is iteratively repeated until our centroids become static
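As an illustrative sketch only, the cricket example above could be clustered with scikit-learn as follows; the (runs, wickets) numbers are invented for demonstration and are not from the slides.

# hypothetical toy data: each row is a player as [runs, wickets] over the last 10 matches
import numpy as np
from sklearn.cluster import KMeans

players = np.array([
    [510, 1], [480, 0], [430, 2],    # high runs, few wickets (batsmen)
    [120, 18], [90, 22], [150, 16],  # low runs, many wickets (bowlers)
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(players)
print(kmeans.labels_)           # cluster assignment for each player
print(kmeans.cluster_centers_)  # final centroid positions

With two clusters, the players separate into a high-runs group and a high-wickets group, which is exactly the batsman/bowler split described above.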
What’s in it for you? Types of Clustering What is K-Means Clustering? Applications of K-Means clustering Common distance measure How does K-Means clustering work? K-Means Clustering Algorithm Demo: K-Means Clustering Use Case: Color Compression
Types of Clustering
Types of Clustering Clustering is broadly divided into Hierarchical Clustering (Agglomerative and Divisive) and Partitional Clustering (K-Means and Fuzzy C-Means)
Types of Clustering In hierarchical clustering, clusters have a tree-like structure or a parent-child relationship
Types of Clustering Agglomerative clustering is a "bottom-up" approach: begin with each element as a separate cluster and merge them into successively larger clusters
Types of Clustering Divisive clustering is a "top-down" approach: begin with the whole set and proceed to divide it into successively smaller clusters
Types of Clustering K-Means is a partitional clustering method: objects are divided into clusters such that each object is in exactly one cluster, not several
Types of Clustering Fuzzy C-Means is a partitional clustering method in which each object can belong to multiple clusters
Applications of K-Means Clustering
Applications of K-Means Clustering Academic performance, wireless sensor networks, diagnostic systems, and search engines
Distance Measure
Distance Measure The distance measure determines the similarity between two elements and influences the shape of the clusters. Common measures are the Euclidean, Squared Euclidean, Manhattan, and Cosine distance measures; a short code sketch after the cosine distance slide illustrates all four
Euclidean Distance Measure The Euclidean distance is the "ordinary" straight-line distance between two points p and q in Euclidean space: d = \sqrt{\sum_{i=1}^{n} (q_i - p_i)^2}
Squared Euclidean Distance Measure The squared Euclidean distance uses the same equation as the Euclidean distance but does not take the square root: d = \sum_{i=1}^{n} (q_i - p_i)^2
Manhattan Distance Measure The Manhattan distance is the simple sum of the horizontal and vertical components, i.e., the distance between two points measured along axes at right angles: d = \sum_{i=1}^{n} |q_i - p_i|; for two points p = (p_x, p_y) and q = (q_x, q_y) this is d = |q_x - p_x| + |q_y - p_y|
Cosine Distance Measure The cosine measure captures the angle between the two vectors p and q. The cosine similarity is \frac{\sum_{i=0}^{n-1} q_i p_i}{\sqrt{\sum_{i=0}^{n-1} q_i^2} \times \sqrt{\sum_{i=0}^{n-1} p_i^2}}, and the cosine distance d is one minus this value
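As a quick illustration of these four measures, here is a short Python sketch; the two points p and q are arbitrary example values, and the cosine distance is computed as one minus the cosine similarity.

import numpy as np

p = np.array([1.0, 2.0, 3.0])
q = np.array([4.0, 6.0, 3.0])

euclidean = np.sqrt(np.sum((q - p) ** 2))   # "ordinary" straight-line distance
squared_euclidean = np.sum((q - p) ** 2)    # same, without the square root
manhattan = np.sum(np.abs(q - p))           # sum of absolute axis-wise differences
cosine_similarity = np.dot(q, p) / (np.linalg.norm(q) * np.linalg.norm(p))
cosine_distance = 1 - cosine_similarity     # angle-based dissimilarity

print(euclidean, squared_euclidean, manhattan, cosine_distance)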
How does K-Means clustering work?
How does K-Means clustering work? Start β†’ choose k at the elbow point β†’ measure the distance of each point to the centroids β†’ group points based on minimum distance β†’ reposition the centroids β†’ check for convergence: if the clusters are unstable, repeat from the distance-measurement step; if the clusters are stable, stop
How does K-Means clustering work? Let's say you have a dataset for a grocery shop. Now, the important question is: how would you choose the optimum number of clusters?
How does K-Means clustering work? The best way to do this is the elbow method. The idea is to run K-Means clustering on the dataset for a range of values of k, the number of clusters. The within-cluster sum of squares (WSS) is defined as the sum of the squared distances between each member of a cluster and its centroid: WSS = \sum_{i=1}^{m} (x_i - c_i)^2, where x_i is a data point and c_i is the centroid closest to x_i
How does K-Means clustering work? Now, we draw a curve of WSS (within-cluster sum of squares) against the number of clusters. Here, the value of WSS changes very slowly after k = 2, so that elbow point should be taken as the final number of clusters
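The following is a small illustrative sketch of the elbow method using scikit-learn; it assumes X is a NumPy array of data points (for instance, the X generated with make_blobs in the demo later in this deck) and uses the inertia_ attribute, which is the within-cluster sum of squares.

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

wss = []
k_values = range(1, 10)
for k in k_values:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wss.append(km.inertia_)  # WSS for this value of k

plt.plot(k_values, wss, marker='o')
plt.xlabel('Number of clusters (k)')
plt.ylabel('WSS')
plt.title('Elbow method')
plt.show()

The value of k at which the curve bends sharply and then flattens is the elbow point described above.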
How does K-Means clustering work? Step 1: The given data points below are assumed to be delivery points
How does K-Means clustering work? Step 2: We randomly initialize two points, called the cluster centroids. Euclidean distance is then used to determine which centroid each data point is closest to
How does K-Means clustering work? Step 3: Based on their distances from the centroids c1 and c2, the data points group themselves into clusters
How does K-Means clustering work? Step 4: Compute the centroid of the data points inside the blue cluster. Step 5: Reposition the blue cluster's centroid to this new centroid
How does K-Means clustering work? Step 6: Now, compute the centroid of the data points inside the orange cluster. Step 7: Reposition the orange cluster's centroid to this new centroid
How does K-Means clustering work? Step 8: Once the clusters become static, the K-Means clustering algorithm is said to have converged
K-Means Clustering Algorithm
K-Means Clustering Algorithm Assuming we have inputs x1, x2, x3, … and a value of K: Step 1: Pick K random points as cluster centers, called centroids. Step 2: Assign each xi to the nearest cluster by calculating its distance to each centroid. Step 3: Find the new cluster centers by taking the average of the assigned points. Step 4: Repeat steps 2 and 3 until none of the cluster assignments change
K-Means Clustering Algorithm Step 1: We randomly pick K cluster centers (centroids). Let's call them c_1, c_2, …, c_k, and let C = \{c_1, c_2, …, c_k\} be the set of all centroids. Step 2: We assign each data point x to the closest center by computing \arg\min_{c_i \in C} \mathrm{dist}(c_i, x)^2, where dist() is the Euclidean distance
K-Means Clustering Algorithm Step 3: We find the new centroid of each cluster by taking the average of all the points assigned to that cluster: c_i = \frac{1}{|S_i|} \sum_{x_i \in S_i} x_i, where S_i is the set of all points assigned to the i-th cluster. Step 4: We repeat steps 2 and 3 until none of the cluster assignments change, that is, until the clusters remain stable
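Below is a minimal NumPy sketch of steps 2 and 3, written only for illustration; the names assign_points, update_centroids, X, and centroids are ours, not from the slides. The find_clusters function in the demo section wraps these same two steps in a loop until the assignments stop changing.

import numpy as np

def assign_points(X, centroids):
    # step 2: index of the nearest centroid for every point (squared Euclidean distance)
    distances = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return distances.argmin(axis=1)

def update_centroids(X, labels, k):
    # step 3: each new centroid is the mean of the points assigned to it
    # (a full implementation would also handle clusters that end up empty)
    return np.array([X[labels == i].mean(axis=0) for i in range(k)])

# one iteration, assuming X and an initial centroids array are defined:
# labels = assign_points(X, centroids)
# centroids = update_centroids(X, labels, len(centroids))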
Demo: K-Means Clustering
Demo: K-Means Clustering Problem Statement: Walmart wants to open a chain of stores across Florida and wants to find the optimal store locations to maximize revenue. Solution: Walmart already has a strong e-commerce presence, so it can use its online customer data to analyze customer locations along with monthly sales
Demo: K-Means Clustering
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()  # for plot styling
import numpy as np
from sklearn.datasets import make_blobs  # samples_generator is deprecated in newer scikit-learn

X, y_true = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)
plt.scatter(X[:, 0], X[:, 1], s=50);
Demo: K-Means Clustering # output: scatter plot of the 300 generated data points
Demo: K-Means Clustering
# assign four clusters
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=4)
kmeans.fit(X)
y_kmeans = kmeans.predict(X)

# import library
from sklearn.metrics import pairwise_distances_argmin

def find_clusters(X, n_clusters, rseed=2):
    # 1. randomly choose clusters
    rng = np.random.RandomState(rseed)
    i = rng.permutation(X.shape[0])[:n_clusters]
    centers = X[i]
    while True:
Demo: K-Means Clustering
        # 2. assign labels based on closest center
        labels = pairwise_distances_argmin(X, centers)
        # 3. find new centers from means of points
        new_centers = np.array([X[labels == i].mean(0) for i in range(n_clusters)])
Demo: K-Means Clustering
        # 4. check for convergence
        if np.all(centers == new_centers):
            break
        centers = new_centers
    return centers, labels

centers, labels = find_clusters(X, 4)
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50, cmap='viridis')
plt.scatter(centers[:, 0], centers[:, 1], c='black', s=200, alpha=0.5);
Demo: K-Means Clustering # output: the four clusters, with their centroids marked in black. Conclusion Congratulations! We have demonstrated K-Means clustering for establishing Walmart stores across Florida in an optimized way
Use case – Color compression
Use Case: K-Means for Color Compression Problem Statement: To perform color compression on images using the K-Means algorithm
Use Case: K-Means for Color Compression
# example 1:
# note: this requires the ``pillow`` package to be installed
from sklearn.datasets import load_sample_image
china = load_sample_image("flower.jpg")  # variable keeps the name china for consistency with example 2
ax = plt.axes(xticks=[], yticks=[])
ax.imshow(china);
# output:
Use Case: K-Means for Color Compression
# returns the dimensions of the array
china.shape

# reshape the data to [n_samples x n_features], and rescale the colors so that they lie between 0 and 1
data = china / 255.0  # use 0...1 scale
data = data.reshape(427 * 640, 3)
data.shape

# visualize these pixels in this color space, using a subset of 10,000 pixels for efficiency
def plot_pixels(data, title, colors=None, N=10000):
    if colors is None:
        colors = data
Use Case: K-Means for Color Compression
    # choose a random subset
    rng = np.random.RandomState(0)
    i = rng.permutation(data.shape[0])[:N]
    colors = colors[i]
    R, G, B = data[i].T

    fig, ax = plt.subplots(1, 2, figsize=(16, 6))
    ax[0].scatter(R, G, color=colors, marker='.')
    ax[0].set(xlabel='Red', ylabel='Green', xlim=(0, 1), ylim=(0, 1))
    ax[1].scatter(R, B, color=colors, marker='.')
    ax[1].set(xlabel='Red', ylabel='Blue', xlim=(0, 1), ylim=(0, 1))
    fig.suptitle(title, size=20);
Use Case: K-Means for Color Compression plot_pixels(data, title='Input color space: 16 million possible colors')
Use Case: K-Means for Color Compression
# fix NumPy issues
import warnings; warnings.simplefilter('ignore')

# reducing these 16 million colors to just 16 colors
from sklearn.cluster import MiniBatchKMeans
kmeans = MiniBatchKMeans(16)
kmeans.fit(data)
new_colors = kmeans.cluster_centers_[kmeans.predict(data)]
plot_pixels(data, colors=new_colors, title="Reduced color space: 16 colors")
Use Case: K-Means for Color Compression
china_recolored = new_colors.reshape(china.shape)

fig, ax = plt.subplots(1, 2, figsize=(16, 6), subplot_kw=dict(xticks=[], yticks=[]))
fig.subplots_adjust(wspace=0.05)
ax[0].imshow(china)
ax[0].set_title('Original Image', size=16)
ax[1].imshow(china_recolored)
ax[1].set_title('16-color Image', size=16);

# the result is a re-coloring of the original pixels, where each pixel is assigned the color of its closest cluster center
# output:
Use Case: K-Means for Color Compression # output: the original image and the 16-color image shown side by side
Use Case: K-Means for Color Compression
# example 2:
from sklearn.datasets import load_sample_image
china = load_sample_image("china.jpg")
ax = plt.axes(xticks=[], yticks=[])
ax.imshow(china);
Use Case: K-Means for Color Compression
# returns the dimensions of the array
china.shape

# reshape the data to [n_samples x n_features], and rescale the colors so that they lie between 0 and 1
data = china / 255.0  # use 0...1 scale
data = data.reshape(427 * 640, 3)
data.shape

# visualize these pixels in this color space, using a subset of 10,000 pixels for efficiency
def plot_pixels(data, title, colors=None, N=10000):
    if colors is None:
        colors = data
Use Case: K-Means for Color Compression
    # choose a random subset
    rng = np.random.RandomState(0)
    i = rng.permutation(data.shape[0])[:N]
    colors = colors[i]
    R, G, B = data[i].T

    fig, ax = plt.subplots(1, 2, figsize=(16, 6))
    ax[0].scatter(R, G, color=colors, marker='.')
    ax[0].set(xlabel='Red', ylabel='Green', xlim=(0, 1), ylim=(0, 1))
    ax[1].scatter(R, B, color=colors, marker='.')
    ax[1].set(xlabel='Red', ylabel='Blue', xlim=(0, 1), ylim=(0, 1))
    fig.suptitle(title, size=20);
Use Case: K-Means for Color Compression plot_pixels(data, title='Input color space: 16 million possible colors')
Use Case: K-Means for Color Compression
# fix NumPy issues
import warnings; warnings.simplefilter('ignore')

# reducing these 16 million colors to just 16 colors
from sklearn.cluster import MiniBatchKMeans
kmeans = MiniBatchKMeans(16)
kmeans.fit(data)
new_colors = kmeans.cluster_centers_[kmeans.predict(data)]
plot_pixels(data, colors=new_colors, title="Reduced color space: 16 colors")
Use Case: K-Means for Color Compression
china_recolored = new_colors.reshape(china.shape)

fig, ax = plt.subplots(1, 2, figsize=(16, 6), subplot_kw=dict(xticks=[], yticks=[]))
fig.subplots_adjust(wspace=0.05)
ax[0].imshow(china)
ax[0].set_title('Original Image', size=16)
ax[1].imshow(china_recolored)
ax[1].set_title('16-color Image', size=16);

# the result is a re-coloring of the original pixels, where each pixel is assigned the color of its closest cluster center
# output
Use Case: K-Means for Color Compression # output Conclusion Congratulations! We have demonstrated K-Means for color compression. This hands-on example will help you tackle any K-Means project in the future.
Key Takeaways