K-Means Clustering
What is K-Means Clustering?
What is K-Means Clustering? K-Means divides objects into clusters such that objects within the same cluster are "similar" to each other and "dissimilar" to objects belonging to other clusters
What is K-Means Clustering? Can you explain this with an example?
What is K-Means Clustering? Sure. To understand K-Means better, let's take an example from cricket
What is K-Means Clustering? Task: Identify bowlers and batsmen
What is K-Means Clustering? Task: Identify bowlers and batsmen. The data contains the runs scored and wickets taken in the last 10 matches, so a bowler will have more wickets and a batsman will have higher runs
What is K-Means Clustering? Assign data points: here we have our dataset with x and y coordinates (runs and wickets), and we want to cluster this data using K-Means
What is K-Means Clustering? Cluster 1: we can see that this cluster contains players with high runs and low wickets
What is K-Means Clustering? Cluster 2: and here, we can see that this cluster contains players with high wickets and low runs
What is K-Means Clustering? Consider the same cricket dataset and solve the problem using K-Means
What is K-Means Clustering? Initially, two centroids are assigned randomly. The Euclidean distance is used to find out which centroid is closest to each data point, and the data points are assigned to the corresponding centroids
What is K-Means Clustering? Reposition the two centroids for optimization.
What is K-Means Clustering? The process is iteratively repeated until our centroids become static
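As an illustrative sketch only, the cricket example above could be clustered with scikit-learn as follows; the (runs, wickets) numbers are invented for demonstration and are not from the slides.

# hypothetical toy data: each row is a player as [runs, wickets] over the last 10 matches
import numpy as np
from sklearn.cluster import KMeans

players = np.array([
    [510, 1], [480, 0], [430, 2],    # high runs, few wickets (batsmen)
    [120, 18], [90, 22], [150, 16],  # low runs, many wickets (bowlers)
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(players)
print(kmeans.labels_)           # cluster assignment for each player
print(kmeans.cluster_centers_)  # final centroid positions

With two clusters, the players separate into a high-runs group and a high-wickets group, which is exactly the batsman/bowler split described above.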
What’s in it for you? Types of Clustering What is K-Means Clustering? Applications of K-Means clustering Common distance measure How does K-Means clustering work? K-Means Clustering Algorithm Demo: K-Means Clustering Use Case: Color Compression
Types of Clustering
Types of Clustering Clustering is broadly divided into Hierarchical Clustering (Agglomerative and Divisive) and Partitional Clustering (K-Means and Fuzzy C-Means)
Types of Clustering In hierarchical clustering, clusters have a tree-like structure or a parent-child relationship
Types of Clustering Agglomerative clustering is a "bottom-up" approach: begin with each element as a separate cluster and merge them into successively larger clusters
Types of Clustering Divisive clustering is a "top-down" approach: begin with the whole set and proceed to divide it into successively smaller clusters
Types of Clustering K-Means is a partitional clustering method: objects are divided into clusters such that each object is in exactly one cluster, not several
Types of Clustering Fuzzy C-Means is a partitional clustering method in which each object can belong to multiple clusters
Applications of K-Means Clustering
Applications of K-Means Clustering Academic performance, wireless sensor networks, diagnostic systems, and search engines
Distance Measure
Distance Measure The distance measure determines the similarity between two elements and influences the shape of the clusters. Common measures are the Euclidean, Squared Euclidean, Manhattan, and Cosine distance measures; a short code sketch after the cosine distance slide illustrates all four
Euclidean Distance Measure The Euclidean distance is the "ordinary" straight-line distance between two points p and q in Euclidean space: d = \sqrt{\sum_{i=1}^{n} (q_i - p_i)^2}
Squared Euclidean Distance Measure The squared Euclidean distance uses the same equation as the Euclidean distance but does not take the square root: d = \sum_{i=1}^{n} (q_i - p_i)^2
Manhattan Distance Measure The Manhattan distance is the simple sum of the horizontal and vertical components, i.e., the distance between two points measured along axes at right angles: d = \sum_{i=1}^{n} |q_i - p_i|; for two points p = (p_x, p_y) and q = (q_x, q_y) this is d = |q_x - p_x| + |q_y - p_y|
Cosine Distance Measure The cosine measure captures the angle between the two vectors p and q. The cosine similarity is \frac{\sum_{i=0}^{n-1} q_i p_i}{\sqrt{\sum_{i=0}^{n-1} q_i^2} \times \sqrt{\sum_{i=0}^{n-1} p_i^2}}, and the cosine distance d is one minus this value
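As a quick illustration of these four measures, here is a short Python sketch; the two points p and q are arbitrary example values, and the cosine distance is computed as one minus the cosine similarity.

import numpy as np

p = np.array([1.0, 2.0, 3.0])
q = np.array([4.0, 6.0, 3.0])

euclidean = np.sqrt(np.sum((q - p) ** 2))   # "ordinary" straight-line distance
squared_euclidean = np.sum((q - p) ** 2)    # same, without the square root
manhattan = np.sum(np.abs(q - p))           # sum of absolute axis-wise differences
cosine_similarity = np.dot(q, p) / (np.linalg.norm(q) * np.linalg.norm(p))
cosine_distance = 1 - cosine_similarity     # angle-based dissimilarity

print(euclidean, squared_euclidean, manhattan, cosine_distance)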
How does K-Means clustering work?
How does K-Means clustering work? Start β†’ choose k at the elbow point β†’ measure the distance of each point to the centroids β†’ group points based on minimum distance β†’ reposition the centroids β†’ check for convergence: if the clusters are unstable, repeat from the distance-measurement step; if the clusters are stable, stop
How does K-Means clustering work? Let's say you have a dataset for a grocery shop. Now, the important question is: how would you choose the optimum number of clusters?
How does K-Means clustering work? The best way to do this is the elbow method. The idea is to run K-Means clustering on the dataset for a range of values of k, the number of clusters. The within-cluster sum of squares (WSS) is defined as the sum of the squared distances between each member of a cluster and its centroid: WSS = \sum_{i=1}^{m} (x_i - c_i)^2, where x_i is a data point and c_i is the centroid closest to x_i
How does K-Means clustering work? Now, we draw a curve of WSS (within-cluster sum of squares) against the number of clusters. Here, the value of WSS changes very slowly after k = 2, so that elbow point should be taken as the final number of clusters
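The following is a small illustrative sketch of the elbow method using scikit-learn; it assumes X is a NumPy array of data points (for instance, the X generated with make_blobs in the demo later in this deck) and uses the inertia_ attribute, which is the within-cluster sum of squares.

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

wss = []
k_values = range(1, 10)
for k in k_values:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wss.append(km.inertia_)  # WSS for this value of k

plt.plot(k_values, wss, marker='o')
plt.xlabel('Number of clusters (k)')
plt.ylabel('WSS')
plt.title('Elbow method')
plt.show()

The value of k at which the curve bends sharply and then flattens is the elbow point described above.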
How does K-Means clustering work? Step 1: The given data points below are assumed to be delivery points
How does K-Means clustering work? Step 2: We randomly initialize two points, called the cluster centroids. Euclidean distance is then used to determine which centroid each data point is closest to
How does K-Means clustering work? Step 3: Based on their distances from the centroids c1 and c2, the data points group themselves into clusters
How does K-Means clustering work? Step 4: Compute the centroid of the data points inside the blue cluster. Step 5: Reposition the blue cluster's centroid to this new centroid
How does K-Means clustering work? Step 6: Now, compute the centroid of the data points inside the orange cluster. Step 7: Reposition the orange cluster's centroid to this new centroid
How does K-Means clustering work? Step 8: Once the clusters become static, the K-Means clustering algorithm is said to have converged
K-Means Clustering Algorithm
K-Means Clustering Algorithm Assuming we have inputs x1, x2, x3, … and a value of K: Step 1: Pick K random points as cluster centers, called centroids. Step 2: Assign each xi to the nearest cluster by calculating its distance to each centroid. Step 3: Find the new cluster centers by taking the average of the assigned points. Step 4: Repeat steps 2 and 3 until none of the cluster assignments change
K-Means Clustering Algorithm Step 1: We randomly pick K cluster centers (centroids). Let's call them c_1, c_2, …, c_k, and let C = \{c_1, c_2, …, c_k\} be the set of all centroids. Step 2: We assign each data point x to the closest center by computing \arg\min_{c_i \in C} \mathrm{dist}(c_i, x)^2, where dist() is the Euclidean distance
K-Means Clustering Algorithm Step 3: We find the new centroid of each cluster by taking the average of all the points assigned to that cluster: c_i = \frac{1}{|S_i|} \sum_{x_i \in S_i} x_i, where S_i is the set of all points assigned to the i-th cluster. Step 4: We repeat steps 2 and 3 until none of the cluster assignments change, that is, until the clusters remain stable
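Below is a minimal NumPy sketch of steps 2 and 3, written only for illustration; the names assign_points, update_centroids, X, and centroids are ours, not from the slides. The find_clusters function in the demo section wraps these same two steps in a loop until the assignments stop changing.

import numpy as np

def assign_points(X, centroids):
    # step 2: index of the nearest centroid for every point (squared Euclidean distance)
    distances = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return distances.argmin(axis=1)

def update_centroids(X, labels, k):
    # step 3: each new centroid is the mean of the points assigned to it
    # (a full implementation would also handle clusters that end up empty)
    return np.array([X[labels == i].mean(axis=0) for i in range(k)])

# one iteration, assuming X and an initial centroids array are defined:
# labels = assign_points(X, centroids)
# centroids = update_centroids(X, labels, len(centroids))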
Demo: K-Means Clustering
Demo: K-Means Clustering Problem Statement: Walmart wants to open a chain of stores across Florida and wants to find the optimal store locations to maximize revenue. Solution: Walmart already has a strong e-commerce presence, so it can use its online customer data to analyze customer locations along with monthly sales
Demo: K-Means Clustering
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()  # for plot styling
import numpy as np
from sklearn.datasets import make_blobs  # samples_generator is deprecated in newer scikit-learn

X, y_true = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)
plt.scatter(X[:, 0], X[:, 1], s=50);
Demo: K-Means Clustering # output: scatter plot of the 300 generated data points
Demo: K-Means Clustering
# assign four clusters
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=4)
kmeans.fit(X)
y_kmeans = kmeans.predict(X)

# import library
from sklearn.metrics import pairwise_distances_argmin

def find_clusters(X, n_clusters, rseed=2):
    # 1. randomly choose clusters
    rng = np.random.RandomState(rseed)
    i = rng.permutation(X.shape[0])[:n_clusters]
    centers = X[i]
    while True:
Demo: K-Means Clustering
        # 2. assign labels based on closest center
        labels = pairwise_distances_argmin(X, centers)
        # 3. find new centers from means of points
        new_centers = np.array([X[labels == i].mean(0) for i in range(n_clusters)])
Demo: K-Means Clustering
        # 4. check for convergence
        if np.all(centers == new_centers):
            break
        centers = new_centers
    return centers, labels

centers, labels = find_clusters(X, 4)
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50, cmap='viridis')
plt.scatter(centers[:, 0], centers[:, 1], c='black', s=200, alpha=0.5);
Demo: K-Means Clustering # output: the four clusters, with their centroids marked in black. Conclusion Congratulations! We have demonstrated K-Means clustering for establishing Walmart stores across Florida in an optimized way
Use case – Color compression
Use Case: K-Means for Color Compression Problem Statement: To perform color compression on images using the K-Means algorithm
Use Case: K-Means for Color Compression
# example 1:
# note: this requires the ``pillow`` package to be installed
from sklearn.datasets import load_sample_image
china = load_sample_image("flower.jpg")  # variable keeps the name china for consistency with example 2
ax = plt.axes(xticks=[], yticks=[])
ax.imshow(china);
# output:
Use Case: K-Means for Color Compression
# returns the dimensions of the array
china.shape

# reshape the data to [n_samples x n_features], and rescale the colors so that they lie between 0 and 1
data = china / 255.0  # use 0...1 scale
data = data.reshape(427 * 640, 3)
data.shape

# visualize these pixels in this color space, using a subset of 10,000 pixels for efficiency
def plot_pixels(data, title, colors=None, N=10000):
    if colors is None:
        colors = data
Use Case: K-Means for Color Compression
    # choose a random subset
    rng = np.random.RandomState(0)
    i = rng.permutation(data.shape[0])[:N]
    colors = colors[i]
    R, G, B = data[i].T

    fig, ax = plt.subplots(1, 2, figsize=(16, 6))
    ax[0].scatter(R, G, color=colors, marker='.')
    ax[0].set(xlabel='Red', ylabel='Green', xlim=(0, 1), ylim=(0, 1))
    ax[1].scatter(R, B, color=colors, marker='.')
    ax[1].set(xlabel='Red', ylabel='Blue', xlim=(0, 1), ylim=(0, 1))
    fig.suptitle(title, size=20);
Use Case: K-Means for Color Compression plot_pixels(data, title='Input color space: 16 million possible colors')
Use Case: K-Means for Color Compression
# fix NumPy issues
import warnings; warnings.simplefilter('ignore')

# reducing these 16 million colors to just 16 colors
from sklearn.cluster import MiniBatchKMeans
kmeans = MiniBatchKMeans(16)
kmeans.fit(data)
new_colors = kmeans.cluster_centers_[kmeans.predict(data)]
plot_pixels(data, colors=new_colors, title="Reduced color space: 16 colors")
Use Case: K-Means for Color Compression
china_recolored = new_colors.reshape(china.shape)

fig, ax = plt.subplots(1, 2, figsize=(16, 6), subplot_kw=dict(xticks=[], yticks=[]))
fig.subplots_adjust(wspace=0.05)
ax[0].imshow(china)
ax[0].set_title('Original Image', size=16)
ax[1].imshow(china_recolored)
ax[1].set_title('16-color Image', size=16);

# the result is a re-coloring of the original pixels, where each pixel is assigned the color of its closest cluster center
# output:
Use Case: K-Means for Color Compression # output: the original image and the 16-color image shown side by side
Use Case: K-Means for Color Compression
# example 2:
from sklearn.datasets import load_sample_image
china = load_sample_image("china.jpg")
ax = plt.axes(xticks=[], yticks=[])
ax.imshow(china);
Use Case: K-Means for Color Compression
# returns the dimensions of the array
china.shape

# reshape the data to [n_samples x n_features], and rescale the colors so that they lie between 0 and 1
data = china / 255.0  # use 0...1 scale
data = data.reshape(427 * 640, 3)
data.shape

# visualize these pixels in this color space, using a subset of 10,000 pixels for efficiency
def plot_pixels(data, title, colors=None, N=10000):
    if colors is None:
        colors = data
Use Case: K-Means for Color Compression
    # choose a random subset
    rng = np.random.RandomState(0)
    i = rng.permutation(data.shape[0])[:N]
    colors = colors[i]
    R, G, B = data[i].T

    fig, ax = plt.subplots(1, 2, figsize=(16, 6))
    ax[0].scatter(R, G, color=colors, marker='.')
    ax[0].set(xlabel='Red', ylabel='Green', xlim=(0, 1), ylim=(0, 1))
    ax[1].scatter(R, B, color=colors, marker='.')
    ax[1].set(xlabel='Red', ylabel='Blue', xlim=(0, 1), ylim=(0, 1))
    fig.suptitle(title, size=20);
Use Case: K-Means for Color Compression plot_pixels(data, title='Input color space: 16 million possible colors')
Use Case: K-Means for Color Compression
# fix NumPy issues
import warnings; warnings.simplefilter('ignore')

# reducing these 16 million colors to just 16 colors
from sklearn.cluster import MiniBatchKMeans
kmeans = MiniBatchKMeans(16)
kmeans.fit(data)
new_colors = kmeans.cluster_centers_[kmeans.predict(data)]
plot_pixels(data, colors=new_colors, title="Reduced color space: 16 colors")
Use Case: K-Means for Color Compression
china_recolored = new_colors.reshape(china.shape)

fig, ax = plt.subplots(1, 2, figsize=(16, 6), subplot_kw=dict(xticks=[], yticks=[]))
fig.subplots_adjust(wspace=0.05)
ax[0].imshow(china)
ax[0].set_title('Original Image', size=16)
ax[1].imshow(china_recolored)
ax[1].set_title('16-color Image', size=16);

# the result is a re-coloring of the original pixels, where each pixel is assigned the color of its closest cluster center
# output
Use Case: K-Means for Color Compression # output Conclusion Congratulations! We have demonstrated K-Means for color compression. This hands-on example will help you tackle any K-Means project in the future.
Key Takeaways