MIXED NUMERIC AND CATEGORICAL ATTRIBUTE CLUSTERING ALGORITHM MODELING
DR. ASOKA KORALE, C.ENG. MIET & MIESL

Slide | 2
ADVANTAGES OF NUMERIC AND CATEGORICAL ATTRIBUTE CLUSTERING
• Improved targeting in campaigns and insight into segments
• Clustering is currently done on the numeric variables Age, Net Stay and ARPU
• The Fuzzy C-Means algorithm is currently used for clustering
• Primary attributes that can be included with mixed attribute type clustering: account type, gender, geo location, ……
• Digital advertising segmentations are increasingly based on clustering
• Other categorical attributes can be included, depending on the interest segment, to create "micro segments"

Slide | 3
WIDENING POTENTIAL INSIGHTS THROUGH CATEGORICAL CLUSTERING
• Improved targeting in campaigns
• All attributes can be clustered, leading to a wider array of very specific segments
• Geographic attribute clustering can incorporate income/ARPU hotspots at micro level

Slide | 4
CONCEPT UNDERLYING THE MIXED K-PROTOTYPES ALGORITHM [1]
• The numeric and categorical parts of a data point can be considered separately: in each cluster, two sets of centroids act as attractors, one for each attribute type
• The influence (contribution) of the numeric and categorical attributes of a data point can be controlled via a parameter "gamma"
• Points "c" and "d" may switch sides depending on how similar their numeric and categorical parts are to the numeric and categorical parts of the centroid (prototype)
• Point "a" may switch if its categorical part is closer to the categorical centroid (prototype) than its numeric part is to the numeric part of the centroid
[Figure: data points plotted against Numeric Attribute 1 and Numeric Attribute 2; shapes represent two values of a single categorical variable.]
[1] Huang, CSIRO, Australia

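The switching behaviour described on this slide can be illustrated with a minimal sketch. It assumes a squared Euclidean distance on the numeric part and a simple mismatch count on the categorical part; the function names, prototypes and the point "a" are hypothetical and are not taken from the slides.

```python
def mixed_distance(point, prototype, gamma):
    """Squared Euclidean distance on the numeric part plus gamma times the
    number of categorical mismatches (simple matching is an assumed choice)."""
    num_p, cat_p = point
    num_q, cat_q = prototype
    numeric = sum((a - b) ** 2 for a, b in zip(num_p, num_q))
    mismatches = sum(a != b for a, b in zip(cat_p, cat_q))
    return numeric + gamma * mismatches


def nearest(point, prototypes, gamma):
    """Index of the prototype with the smallest mixed distance."""
    return min(range(len(prototypes)),
               key=lambda k: mixed_distance(point, prototypes[k], gamma))


# Hypothetical data: point "a" is numerically closer to prototype 0 but
# shares its categorical value ("square") with prototype 1.
prototypes = [((0.0, 0.0), ("circle",)), ((2.0, 0.0), ("square",))]
point_a = ((0.8, 0.0), ("square",))

for gamma in (0.0, 0.5, 1.0, 2.0):
    print(gamma, nearest(point_a, prototypes, gamma))
```

In this toy setup the point stays with prototype 0 for small gamma and moves to prototype 1 once the categorical mismatch penalty outweighs the numeric advantage.
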
Slide | 5
MIXED K-PROTOTYPES ALGORITHM [1]
• The distance measure to a prototype (center) has two parts: numeric and categorical
• Numeric attributes: Euclidean distance
• Categorical attributes: dissimilarity measure
• Centroid of numeric attributes: a simple average of the points in that cluster
• Includes "Yij", a fuzzy membership function, if we wish to go in that direction

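The equations on this slide did not survive extraction. As a sketch of what the bullets describe, the distance in Huang's k-prototypes formulation is usually written as below, with a squared Euclidean numeric part and a simple matching dissimilarity. The symbols (x_i, q_l, m_r, m_c, C_l, n_l) are assumed notation, not necessarily the one used on the slide.

```latex
% Assumed notation: x_i is data point i, q_l is the prototype of cluster l,
% the first m_r attributes are numeric and the remaining m_c are categorical.
d(x_i, q_l) = \sum_{j=1}^{m_r} \left(x_{ij} - q_{lj}\right)^2
            + \gamma \sum_{j=m_r+1}^{m_r+m_c} \delta\left(x_{ij}, q_{lj}\right),
\qquad
\delta(a, b) = \begin{cases} 0 & a = b \\ 1 & a \neq b \end{cases}

% Numeric part of the prototype: the average over the n_l points in cluster C_l.
q_{lj} = \frac{1}{n_l} \sum_{i \in C_l} x_{ij}, \qquad j = 1, \dots, m_r
```

With hard assignments the membership of a point is simply 1 for its own cluster and 0 for all others; a fuzzy variant would replace this with graded memberships.
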
Slide | 6
MIXED K-PROTOTYPES ALGORITHM [1]
• Minimize the total cost "E", which is the sum of the distances to the numeric and categorical parts of the centroid (prototype)
• Centroid of categorical attributes: the attribute value with the highest frequency in each cluster

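A minimal sketch of the prototype update and of the cost "E" is given below. It assumes hard cluster assignments, a squared Euclidean numeric distance and simple matching on the categorical part; the function names and the three sample records are hypothetical.

```python
from collections import Counter


def mixed_distance(point, prototype, gamma):
    """Numeric squared distance plus gamma times the categorical mismatches."""
    num_p, cat_p = point
    num_q, cat_q = prototype
    numeric = sum((a - b) ** 2 for a, b in zip(num_p, num_q))
    mismatches = sum(a != b for a, b in zip(cat_p, cat_q))
    return numeric + gamma * mismatches


def update_prototype(members):
    """Mean of each numeric attribute and most frequent value of each
    categorical attribute (ties go to the value seen first)."""
    nums = [p[0] for p in members]
    cats = [p[1] for p in members]
    mean = tuple(sum(col) / len(col) for col in zip(*nums))
    mode = tuple(Counter(col).most_common(1)[0][0] for col in zip(*cats))
    return mean, mode


def total_cost(points, labels, prototypes, gamma):
    """Cost E: sum of mixed distances from each point to its own prototype."""
    return sum(mixed_distance(p, prototypes[k], gamma)
               for p, k in zip(points, labels))


# Hypothetical records: (numeric part, categorical part), all in one cluster.
points = [
    ((1.0, 2.0), ("prepaid", "F")),
    ((3.0, 4.0), ("prepaid", "M")),
    ((2.0, 3.0), ("postpaid", "F")),
]
prototype = update_prototype(points)
print(prototype)
print(total_cost(points, [0, 0, 0], [prototype], gamma=1.0))
```
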
Slide | 7
CONVERGENCE PERFORMANCE
[Figure, three panels plotted against Iteration Number (0 to 40): "Total no of switches at each iteration"; "Total Distance at each iteration" (legend 1 to 8); "Total Categorical Distance at each iteration" (legend 1 to 8).]

Slide | 8
CLUSTER & SEGMENT PROFILE
[Figure, four panels: "Number of Cx in each Cluster" against Cluster ID (1 to 8); Age, Net Stay and ARPU against Cluster/Segment ID (1 to 8).]

Slide | 9
VALIDATION WITH DISTRIBUTION ANALYSIS
Cluster ID  Cx in Cluster  Avg. Age  Spread Age  Avg. Net-Stay  Spread Net-Stay  Avg. ARPU  Spread ARPU  Post Paid  Pre Paid  Female  Male
1           913            27        5           28             26               1231       1427         90         823       913     0
2           930            28        5           19             16               1407       1699         159        771       0       930
3           407            53        8           46             35               1095       1303         34         373       407     0
4           409            54        8           34             24               967        919          66         343       0       409
5           556            36        11          82             43               2601       2399         546        10        556     0
6           542            32        5           95             27               1031       927          0          542       67      475
7           1116           36        9           96             44               2917       2669         1116       0         0       1116
8           348            57        7           131            33               1205       853          147        201       33      315
[Figure, two panels: "Histogram Cx Age, Male" and "Histogram Cx Age, Female"; x-axis Age (years), y-axis Frequency.]
• Because the Age histograms are somewhat bimodal, the clustering is able to identify their modes

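A profile table of this kind can be derived from the cluster labels. The sketch below is only illustrative: it assumes that "Spread" means a standard deviation (the slides do not say), profiles a single numeric attribute and a single categorical one, and uses hypothetical records rather than the slide data.

```python
from collections import Counter, defaultdict
from statistics import mean, pstdev


def cluster_profile(ages, account_types, labels):
    """Per-cluster size, average age, age spread and account-type counts."""
    groups = defaultdict(list)
    for age, acct, lab in zip(ages, account_types, labels):
        groups[lab].append((age, acct))
    profile = {}
    for lab, rows in sorted(groups.items()):
        cluster_ages = [age for age, _ in rows]
        profile[lab] = {
            "size": len(rows),
            "avg_age": mean(cluster_ages),
            # Population standard deviation; an assumed reading of "Spread".
            "spread_age": pstdev(cluster_ages),
            "account_types": Counter(acct for _, acct in rows),
        }
    return profile


# Hypothetical records, not taken from the slide data.
ages = [25, 29, 27, 52, 56]
account_types = ["prepaid", "prepaid", "postpaid", "prepaid", "postpaid"]
labels = [1, 1, 1, 2, 2]

profile = cluster_profile(ages, account_types, labels)
for cluster_id, row in profile.items():
    print(cluster_id, row)
```
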
Slide | 10
VALIDATION WITH DISTRIBUTION ANALYSIS
Cluster Segment Profile (the last four columns are numbers of data points):
Cluster ID  Data points    Avg. Age  Spread Age  Avg. Net-Stay  Spread Net-Stay  Avg. ARPU  Spread ARPU  Post Paid  Pre Paid  Female  Male
1           913            27        5           28             26               1231       1427         90         823       913     0
2           930            28        5           19             16               1407       1699         159        771       0       930
3           407            53        8           46             35               1095       1303         34         373       407     0
4           409            54        8           34             24               967        919          66         343       0       409
5           556            36        11          82             43               2601       2399         546        10        556     0
6           542            32        5           95             27               1031       927          0          542       67      475
7           1116           36        9           96             44               2917       2669         1116       0         0       1116
8           348            57        7           131            33               1205       853          147        201       33      315
[Figure: "Histogram Cx Network Stay"; x-axis Net Stay (months), 0 to 240; y-axis Frequency.]
• No identifiable structure in the Net Stay distribution

Slide | 11
CLUSTERING NUMERIC PART OF SEGMENTS IN 3D
[Figure: 3D plot "Segmental Analysis: Age, Net Stay and ARPU"; axes Age (normalized), Net-Stay (normalized), ARPU (normalized); legend 1 to 8.]

Slide | 12
NOTABLE POINTS
• Allows us to cluster most attributes (within reason)
• Particularly if the categorical attributes do not have many different component values
• Reasonable convergence performance in terms of both run time and number of iterations
• Different dissimilarity measures and distance criteria will give differing results
• The influence of the categorical part via gamma may also need to change with the method used
• The algorithm is somewhat sensitive to initial conditions (initialization of centroids)
• Explore the likelihood of falling into a local minimum and getting trapped there, leading to a suboptimal final solution
• To do…..
• Each drop can result in a non-unique final result, but this will not impact the underlying trends and insights into each segment
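
One common way to reduce the sensitivity to initial centroids is to run the algorithm from several random initializations and keep the run with the lowest total cost. The sketch below is not the author's implementation; it assumes hard assignments, a squared Euclidean numeric distance, simple matching on the categorical part, and prototypes initialized from randomly chosen data points. All names, parameters and the synthetic data are hypothetical.

```python
import random
from collections import Counter


def mixed_distance(point, proto, gamma):
    """Numeric squared distance plus gamma times the categorical mismatches."""
    numeric = sum((a - b) ** 2 for a, b in zip(point[0], proto[0]))
    mismatches = sum(a != b for a, b in zip(point[1], proto[1]))
    return numeric + gamma * mismatches


def k_prototypes(points, k, gamma, rng, max_iter=100):
    """One run from a random initialization; returns (cost, labels)."""
    protos = rng.sample(points, k)  # initial prototypes: k random data points
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda c: mixed_distance(p, protos[c], gamma))
            for p in points
        ]
        if new_labels == labels:  # no point switched cluster: converged
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old prototype if a cluster becomes empty
                mean = tuple(sum(col) / len(col)
                             for col in zip(*(m[0] for m in members)))
                mode = tuple(Counter(col).most_common(1)[0][0]
                             for col in zip(*(m[1] for m in members)))
                protos[c] = (mean, mode)
    cost = sum(mixed_distance(p, protos[lab], gamma)
               for p, lab in zip(points, labels))
    return cost, labels


def best_of(points, k, gamma, n_init=10, seed=0):
    """Repeat the run n_init times and return the cheapest run plus all runs."""
    rng = random.Random(seed)
    runs = [k_prototypes(points, k, gamma, rng) for _ in range(n_init)]
    return min(runs, key=lambda run: run[0]), runs


# Synthetic data for illustration only: two numeric groups, one categorical attribute.
data_rng = random.Random(1)
points = [((data_rng.gauss(0, 1), data_rng.gauss(0, 1)), ("prepaid",))
          for _ in range(30)]
points += [((data_rng.gauss(4, 1), data_rng.gauss(4, 1)), ("postpaid",))
           for _ in range(30)]

(best_cost, best_labels), runs = best_of(points, k=3, gamma=1.0)
print(best_cost, sorted(cost for cost, _ in runs))
```

Comparing the sorted costs of the individual runs gives a rough indication of how often a run ends in a poorer local minimum.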
