ML::Clustering

This blog post proclaims and describes the Raku package “ML::Clustering”, which provides Machine Learning (ML) Clustering (or Cluster analysis) functions, [Wk1].

The Clustering framework includes:

  • The algorithms K-means and K-medoids, and others
  • The distance functions Euclidean, Cosine, Hamming, Manhattan, and others, and their corresponding similarity functions (a quick sketch follows this list)
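As a quick preview, here is a small sketch of how those choices are made through options of the central function find-clusters. (The option name distance-function and the string specs below are assumptions for illustration; the method option and verified calls are shown in the usage sections below.)

use ML::Clustering;

my @points = ([1, 2], [1.2, 2.1], [8, 9], [8.5, 9.4], [9, 8.8]);

# K-medoids clustering with an (assumed) Cosine distance specification;
# the default is K-means with Euclidean distance
my @clusters = |find-clusters(@points, 2,
                              method            => 'K-medoids',
                              distance-function => 'Cosine');

say @clusters.elems;   # expected: 2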

The data in the examples below is generated and manipulated with the packages “Data::Generators”, “Data::Reshapers”, and “Data::Summarizers”, described in the article “Introduction to data wrangling with Raku”, [AA1].

The plots are made with the package “Text::Plot”, [AAp6].


Installation

From the Zef ecosystem:

zef install ML::Clustering

From GitHub:

zef install https://github.com/antononcube/Raku-ML-Clustering

Usage example

Here we derive a set of random points, and summarize it:

use Data::Generators;
use Data::Summarizers;
use Text::Plot;

my $n = 100;
my @data1 = (random-variate(NormalDistribution.new(5,1.5), $n) X random-variate(NormalDistribution.new(5,1), $n)).pick(30);
my @data2 = (random-variate(NormalDistribution.new(10,1), $n) X random-variate(NormalDistribution.new(10,1), $n)).pick(50);
my @data3 = [|@data1, |@data2].pick(*);
records-summary(@data3)
# +------------------------------+-----------------------------+
# | 1                            | 0                           |
# +------------------------------+-----------------------------+
# | Min    => 2.3898838030195453 | Min    => 2.304900205776566 |
# | 1st-Qu => 5.706881157103716  | 1st-Qu => 5.736769825514594 |
# | Mean   => 7.784565074436171  | Mean   => 8.02083978767615  |
# | Median => 8.324205488000889  | Median => 9.333349983753054 |
# | 3rd-Qu => 9.667770938027495  | 3rd-Qu => 9.951571353489859 |
# | Max    => 12.366646976770186 | Max    => 11.87813636253523 |
# +------------------------------+-----------------------------+

Here we plot the points:

use Text::Plot; text-list-plot(@data3)
(Text plot of @data3: two point clouds, one centered near (5, 5) and the other near (10, 10); both axes range from about 2 to 12.)

Problem: Group the points in such a way that each group has close (or similar) points.

Here is how we use the function find-clusters to give an answer:

use ML::Clustering;

my %res = find-clusters(@data3, 2, prop => 'All');
%res<Clusters>>>.elems
# (31 49)

Remark: The first argument is the data, a list of numeric lists (the data points). The second argument is the number of clusters to be found. (It is on the TODO list to have the number of clusters automatically determined; currently it has to be specified.)

Remark: The function find-clusters can return results of different types controlled with the named argument “prop”. Using prop => 'All' returns a hash with all properties of the cluster finding result.
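For example, here is a sketch of requesting a single property; the individual property specs 'Clusters' and 'MeanPoints' are assumed here from the keys of the 'All' result and are not verified calls:

my @means = |find-clusters(@data3, 2, prop => 'MeanPoints');   # assumed single-property spec
my @cls   = |find-clusters(@data3, 2, prop => 'Clusters');     # presumably the default result

say @means.elems, ' ', @cls.elems;   # expected: 2 2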

Here are sample points from each found cluster:

.say for %res<Clusters>>>.pick(3);
# ((7.739550750023431 7.869526528329702) (7.436113407675195 5.047068255152369) (3.137868648226576 6.18246060543501))
# ((10.196518357205878 10.291337792828818) (9.514778751211171 10.904815191998523) (10.118479992486252 8.418809517175601))

Here are the centers of the clusters (the mean points):

%res<MeanPoints>
# [(10.013273073426063 9.513630351537644) (5.805077424361722 6.072817033230708)]

We can verify the result by looking at the plot of the found clusters:

text-list-plot((|%res<Clusters>, %res<MeanPoints>), point-char => <▽ ☐ ●>, title => '▽ - 1st cluster; ☐ - 2nd cluster; ● - cluster centers')
(Text plot of the found clusters: ▽ marks the 1st cluster, ☐ marks the 2nd cluster, and ● marks the cluster centers; the ☐ points form the upper right cloud and the ▽ points the lower left one.)

Remark: By default find-clusters uses the K-means algorithm. The functions k-means and k-medoids call find-clusters with the option settings method=>'K-means' and method=>'K-medoids' respectively.
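For example, the following calls are expected to produce equivalent kinds of results (a sketch; it assumes that, by default, each function returns the list of found clusters):

# Dedicated functions
my @km  = |k-means(@data3, 2);
my @kmd = |k-medoids(@data3, 2);

# The corresponding find-clusters calls
my @km2  = |find-clusters(@data3, 2, method => 'K-means');
my @kmd2 = |find-clusters(@data3, 2, method => 'K-medoids');

say (@km.elems, @kmd.elems, @km2.elems, @kmd2.elems);   # expected: (2 2 2 2)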


More interesting looking data

Here is a more interesting looking set of two-dimensional data, @data2D5:

use Data::Reshapers;

my $pointsPerCluster = 200;
my @data2D5 = [[10,20,4],[20,60,6],[40,10,6],[-30,0,4],[100,100,8]].map({
    random-variate(NormalDistribution.new($_[0], $_[2]), $pointsPerCluster) Z
    random-variate(NormalDistribution.new($_[1], $_[2]), $pointsPerCluster)
}).Array;
@data2D5 = flatten(@data2D5, max-level=>1).pick(*);
@data2D5.elems
# 1000

Here is a plot of that data:

text-list-plot(@data2D5)
(Text plot of @data2D5: five point clouds centered near (10, 20), (20, 60), (40, 10), (-30, 0), and (100, 100).)

Here we find clusters and plot them together with their mean points:

srand(32);
my %clRes = find-clusters(@data2D5, 5, prop => 'All');
text-list-plot([|%clRes<Clusters>, %clRes<MeanPoints>], point-char => <1 2 3 4 5 ●>)
(Text plot of the five found clusters, drawn with the point characters 1 through 5, together with their mean points drawn with ●.)

Detailed function pages

Detailed parameter explanations and usage examples are given in the function pages of the package.


Implementation considerations

UML diagram

Here is a UML diagram that shows the package’s structure:

(class-diagram image)

The PlantUML spec and the diagram were obtained with the CLI script to-uml-spec of the package “UML::Translators”, [AAp5].

Here we get the PlantUML spec:

to-uml-spec ML::Clustering > ./resources/class-diagram.puml

Here we get the diagram:

to-uml-spec ML::Clustering | java -jar ~/PlantUML/plantuml-1.2022.5.jar -pipe > ./resources/class-diagram.png

Remark: Maybe it is a good idea to have an abstract class named, say, ML::Clustering::AbstractFinder that is a parent of ML::Clustering::KMeans, ML::Clustering::KMedoids, ML::Clustering::BiSectionalKMeans, etc., but I have not found that to be necessary at this point of development.
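For illustration, here is a hypothetical sketch of such a hierarchy (not code from the package):

role ML::Clustering::AbstractFinder {
    # Common interface each concrete cluster finder has to provide
    method find-clusters(@data, UInt $k, *%opts) { ... }
}

class ML::Clustering::KMeans does ML::Clustering::AbstractFinder {
    method find-clusters(@data, UInt $k, *%opts) {
        # K-means specific iterations would go here
    }
}

class ML::Clustering::KMedoids does ML::Clustering::AbstractFinder {
    method find-clusters(@data, UInt $k, *%opts) {
        # K-medoids specific iterations would go here
    }
}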

Remark: It seems it is better to have a separate package for the distance functions, named, say, “ML::DistanceFunctions”. (Although distance functions are not just for ML…) After thinking over package and function names I will make such a package.
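To make the idea concrete, here is a sketch of the kind of standalone functions such a package could export (hypothetical names, not an existing API):

# Hypothetical standalone distance functions
sub euclidean-distance(@p, @q --> Numeric) {
    sqrt([+] ((@p Z- @q) X** 2))
}

sub manhattan-distance(@p, @q --> Numeric) {
    [+] (@p Z- @q)>>.abs
}

say euclidean-distance([0, 0], [3, 4]);   # 5
say manhattan-distance([0, 0], [3, 4]);   # 7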


References

Articles

[Wk1] Wikipedia entry, “Cluster Analysis”.

[AA1] Anton Antonov, “Introduction to data wrangling with Raku”, (2021), RakuForPrediction at WordPress.

Packages

[AAp1] Anton Antonov, Bi-sectional K-means algorithm in Mathematica, (2020), MathematicaForPrediction at GitHub/antononcube.

[AAp2] Anton Antonov, Data::Generators Raku package, (2021), GitHub/antononcube.

[AAp3] Anton Antonov, Data::Reshapers Raku package, (2021), GitHub/antononcube.

[AAp4] Anton Antonov, Data::Summarizers Raku package, (2021), GitHub/antononcube.

[AAp5] Anton Antonov, UML::Translators Raku package, (2022), GitHub/antononcube.

[AAp6] Anton Antonov, Text::Plot Raku package, (2022), GitHub/antononcube.
