K-means clustering algorithm Kasun Ranga Wijeweera (krw19870829@gmail.com)
What is Clustering? • Organizing data into classes such that there is • high intra-class similarity • low inter-class similarity • Finding the class labels and the number of classes directly from the data (in contrast to classification). • More informally, finding natural groupings among objects.
What is a natural grouping among these objects?
What is a natural grouping among these objects? Clustering is subjective Simpson's Family School Employees Females Males
Defining Distance Measures Definition: Let O1 and O2 be two objects from the universe of possible objects. The distance (dissimilarity) between O1 and O2 is a real number denoted by D(O1,O2) Kasun Kiosn 0.23 3 342.7
Consider a Set of Data Points, And a Set of Clusters,
The Goal,
Algorithm k-means 1. Randomly choose K data items from X as initial centroids. 2. Repeat  Assign each data point to the cluster which has the closest centroid.  Calculate new cluster centroids. Until the convergence criteria is met.
The data points
Initialization
#Runs = 1
#Runs = 2
#Runs = 3
K-means gets stuck in a local optima
The data points
Initialization
#Runs = 1
#Runs = 2
#Runs = 3
#Runs = 4
Applications of K-means Method • Optical Character Recognition • Biometrics • Diagnostic Systems • Military Applications
Comments on the K-Means Method • Strength – Relatively efficient: O(tkn), where n is # objects, k is # clusters, and t is # iterations. Normally, k, t << n. – Often terminates at a local optimum. The global optimum may be found using techniques such as: deterministic annealing and genetic algorithms • Weakness – Applicable only when mean is defined, then what about categorical data? – Need to specify k, the number of clusters, in advance – Unable to handle noisy data and outliers – Not suitable to discover clusters with non-convex shapes
Any Questions ?
Thanks for your attention !

K means Clustering Algorithm

  • 1.
    K-means clustering algorithm Kasun Ranga Wijeweera (krw19870829@gmail.com)
  • 2.
    What is Clustering? •Organizing data into classes such that there is • high intra-class similarity • low inter-class similarity • Finding the class labels and the number of classes directly from the data (in contrast to classification). • More informally, finding natural groupings among objects.
  • 3.
    What is anatural grouping among these objects?
  • 4.
    What is anatural grouping among these objects? Clustering is subjective Simpson's Family School Employees Females Males
  • 5.
    Defining Distance Measures Definition:Let O1 and O2 be two objects from the universe of possible objects. The distance (dissimilarity) between O1 and O2 is a real number denoted by D(O1,O2) Kasun Kiosn 0.23 3 342.7
  • 6.
    Consider a Setof Data Points, And a Set of Clusters,
  • 7.
  • 8.
    Algorithm k-means 1. Randomlychoose K data items from X as initial centroids. 2. Repeat  Assign each data point to the cluster which has the closest centroid.  Calculate new cluster centroids. Until the convergence criteria is met.
  • 9.
  • 10.
  • 11.
  • 12.
  • 13.
  • 14.
    K-means gets stuckin a local optima
  • 15.
  • 16.
  • 17.
  • 18.
  • 19.
  • 20.
  • 21.
    Applications of K-meansMethod • Optical Character Recognition • Biometrics • Diagnostic Systems • Military Applications
  • 22.
    Comments on theK-Means Method • Strength – Relatively efficient: O(tkn), where n is # objects, k is # clusters, and t is # iterations. Normally, k, t << n. – Often terminates at a local optimum. The global optimum may be found using techniques such as: deterministic annealing and genetic algorithms • Weakness – Applicable only when mean is defined, then what about categorical data? – Need to specify k, the number of clusters, in advance – Unable to handle noisy data and outliers – Not suitable to discover clusters with non-convex shapes
  • 23.
  • 24.
    Thanks for yourattention !