MACHINE LEARNING Machine learning is an application of artificial intelligence (AI) that provides systems the ability to automatically learn and improve from experience without being explicitly programmed.
Clustering methods in machine learning • K-Means Clustering • Gaussian Mixture Models • Dirichlet Process
K-Means Clustering Clustering • The classification of objects into different groups, or more precisely, the partitioning of a data set into subsets (clusters), so that the data in each subset share some common trait.
K-means Clustering Types of Clustering 1. Hierarchical 2. Partitional 3. Density-based 4. Fuzzy
K-means Clustering • A clustering algorithm in which the k clusters are formed based on the closeness of data points to a reference point (the centroid of a cluster). • It clusters n objects based on their attributes into k partitions, where k < n. Terminology Centroid • The reference point of a given cluster. Centroids are used to label new data, i.e. to determine which cluster a new data point belongs to.
K-means Clustering How it works The algorithm performs two major steps: 1. Data Assignment • Selection of centroids • Random selection • Random generation • K-means++ • Assignment of each data point to the nearest centroid based on a distance measure • Euclidean distance • Manhattan distance • Hamming distance • Inner product space
Euclidean distance: $d(x, y) = \sqrt{\sum_{i=1}^{m} (x_i - y_i)^2}$ • Manhattan distance: $d(x, y) = \sum_{i=1}^{m} |x_i - y_i|$
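To make the assignment step concrete, here is a minimal Python sketch, assuming NumPy is available; the function names euclidean, manhattan, and assign_to_centroids are illustrative, not from the slides:

```python
import numpy as np

def euclidean(x, y):
    # Square root of the sum of squared coordinate differences.
    return np.sqrt(np.sum((x - y) ** 2))

def manhattan(x, y):
    # Sum of absolute coordinate differences.
    return np.sum(np.abs(x - y))

def assign_to_centroids(points, centroids, dist=euclidean):
    # Label each point with the index of its nearest centroid.
    return np.array([
        np.argmin([dist(p, c) for c in centroids]) for p in points
    ])

points = np.array([[1.0], [2.0], [9.0]])
centroids = np.array([[2.0], [8.0]])
print(assign_to_centroids(points, centroids))  # [0 0 1]
```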
Illustration Objective: to create 2 clusters from the set of numbers 1–10 (K = 2, n = 10).
Data: the numbers 1 through 10.
Centroid Initialization: the two centroids are initialized at 2 and 8.
Step 1 (Data Assignment): each number is assigned to the nearest centroid.

n | distance to 2 | distance to 8
1 | 1 | 7
2 | 0 | 6
3 | 1 | 5
4 | 2 | 4
5 | 3 | 3
6 | 4 | 2
7 | 5 | 1
8 | 6 | 0
9 | 7 | 1
10 | 8 | 2

Points 1–5 go to the cluster of centroid 2 and points 6–10 to the cluster of centroid 8 (point 5 is equidistant; the tie is broken toward the first centroid).
K-means Clustering (How it works) 2. Centroid Update • Centroids are recomputed as the mean of all data points assigned to each cluster (in step 1) • Steps 1 and 2 are run iteratively until: • the centroids don't change, i.e. the distances stay the same and data points do not change clusters • some maximum number of iterations is reached • some other condition is fulfilled (e.g. a minimum distance is achieved). A minimal sketch of the full loop follows below.
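A compact NumPy sketch of the two-step loop described above; this is an illustration under the stated stopping rule, not a production implementation, and the helper name kmeans is an assumption:

```python
import numpy as np

def kmeans(points, centroids, max_iters=100):
    # Alternate data assignment and centroid update until the
    # centroids stop moving or max_iters is reached.
    # (Assumes every cluster keeps at least one point.)
    for _ in range(max_iters):
        # Step 1: assign each point to its nearest centroid (Euclidean).
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = np.argmin(dists, axis=1)
        # Step 2: recompute each centroid as the mean of its points.
        new_centroids = np.array([
            points[labels == j].mean(axis=0) for j in range(len(centroids))
        ])
        if np.allclose(new_centroids, centroids):
            break  # converged: no centroid moved
        centroids = new_centroids
    return centroids, labels

centers, labels = kmeans(np.arange(1, 11, dtype=float).reshape(-1, 1),
                         np.array([[2.0], [8.0]]))
print(centers.ravel())  # [3. 8.] for the illustration above
```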
Step 2 (Centroid Update): the new centroids are the means of the assigned points: mean(1,2,3,4,5) = 3 and mean(6,7,8,9,10) = 8. Reassigning with centroids 3 and 8:

n | distance to 3 | distance to 8
1 | 2 | 7
2 | 1 | 6
3 | 0 | 5
4 | 1 | 4
5 | 2 | 3
6 | 3 | 2
7 | 4 | 1
8 | 5 | 0
9 | 6 | 1
10 | 7 | 2

The assignments are unchanged: points 1–5 belong to centroid 3 and points 6–10 to centroid 8.
Step 3 (Centroid Update): recomputing the means gives mean(1,2,3,4,5) = 3 and mean(6,7,8,9,10) = 8 again; the distance table is identical to the previous step, so the centroids have stopped moving.
NOTE The centroids do not change after the 2nd iteration, therefore we stop updating them: the algorithm has converged. Remember! Here K was fixed at 2; the next question is how to identify the optimum value of K.
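The worked example can also be reproduced with scikit-learn, assuming it is installed; seeding KMeans with the initial centroids 2 and 8 yields the fitted centers 3 and 8 found above:

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.arange(1, 11, dtype=float).reshape(-1, 1)  # the points 1..10
init = np.array([[2.0], [8.0]])                   # initial centroids from the illustration

km = KMeans(n_clusters=2, init=init, n_init=1).fit(X)
print(km.cluster_centers_.ravel())  # [3. 8.]
print(km.labels_)                   # points 1-5 in one cluster, 6-10 in the other
```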
K-means Clustering (How to pick the optimum k) • Minimize the within-cluster sum of squares (tighten clusters) and increase the between-cluster sum of squares:

$\sum_{j=1}^{K} \sum_{x_n \in S_j} \lVert x_n - \mu_j \rVert^2$

Where • $S_j$ is a specific cluster among the K clusters • $x_n$ is a data point within cluster $S_j$ • $\mu_j$ is the centroid of cluster $S_j$.
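For the worked example, the within-cluster sum of squares can be checked by hand: with centroids 3 and 8, each cluster contributes 2² + 1² + 0² + 1² + 2² = 10, for a total of 20. A few lines of Python (the variable names are illustrative) confirm this:

```python
# Each centroid mapped to the points assigned to it.
clusters = {3.0: [1, 2, 3, 4, 5], 8.0: [6, 7, 8, 9, 10]}
wcss = sum((x - mu) ** 2 for mu, pts in clusters.items() for x in pts)
print(wcss)  # 20.0
```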
K-means Clustering (How to pick the optimum k) There are 2 common methods: 1. The Elbow Method I. Calculate the within-cluster sum of squares for each candidate number of clusters k. II. Plot the sum of squares against k. III. Observe the change in the sum of squares to select the optimum k. In the example plot, the elbow is at k = 3: the sum of squares does not decrease significantly for k > 3. A sketch of the procedure follows.
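A hedged sketch of the elbow method using scikit-learn and Matplotlib (both assumed available); inertia_ is scikit-learn's attribute for the within-cluster sum of squares:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

X = np.arange(1, 11, dtype=float).reshape(-1, 1)  # toy data; substitute your own

ks = range(1, 8)
wcss = [KMeans(n_clusters=k, n_init=10).fit(X).inertia_ for k in ks]

plt.plot(ks, wcss, marker="o")
plt.xlabel("number of clusters k")
plt.ylabel("within-cluster sum of squares")
plt.title("Elbow method")
plt.show()
```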
K-means Clustering (How to pick the optimum k) A limitation of the elbow method is that the elbow might not be well defined. This can be overcome using the silhouette method. 2. Silhouette method • The silhouette value is a measure of how similar an object is to its own cluster (cohesion) compared to other clusters (separation). • It ranges from -1 to +1, with +1 showing that the point is very close to its own cluster and -1 showing that the point is more similar to the neighboring cluster.
K-means Clustering (How to pick the optimum k) • The silhouette value s(i) of a point i is mathematically defined as

$s(i) = \dfrac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}$

Where • b(i) is the mean distance of point i to the points in its nearest neighboring cluster • a(i) is the mean distance of point i to the other points in its own cluster.
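A short sketch (assuming scikit-learn) that scores candidate values of k by the mean silhouette; the optimum k is the one that maximizes the score:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.arange(1, 11, dtype=float).reshape(-1, 1)  # toy data; substitute your own

for k in range(2, 6):  # the silhouette needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))
```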
K-means Clustering (Advantages) • It is guaranteed to converge (though possibly to a local optimum) • Easily scales to large datasets • Has a linear time complexity O(tkn), where • t – number of iterations • k – number of clusters • n – number of data points
K-means Clustering (Limitations) • k is chosen manually • Clusters typically depend on the initial centroids • Outliers can drastically affect centroids • Can give unrealistic clusters, i.e. it may settle in a local optimum • The organization/order of the data may have an impact on the results • Sensitive to the scale of the features (a mitigation sketch follows below)
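Two of these limitations are commonly mitigated in practice: multiple random restarts (n_init) reduce dependence on the initial centroids, and standardizing features addresses scale sensitivity. A sketch with made-up toy data, assuming scikit-learn:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Toy data where one feature's scale would otherwise dominate the distance.
rng = np.random.default_rng(0)
X = np.column_stack([rng.normal(0, 1, 100), rng.normal(0, 1000, 100)])

X_scaled = StandardScaler().fit_transform(X)  # zero mean, unit variance per feature
labels = KMeans(n_clusters=2, n_init=10).fit_predict(X_scaled)  # 10 restarts
```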
Applications • Pattern recognition • Classification analysis • Image processing • Machine vision
