Cluster minimizing sum of cost of clusters

Question

Given a dataset $X,$ having $p$ features, organize the units $x_i \in X $ into fixed number of clusters $g,$ with fixed cluster size $B.$

Clustering policy: minimize the sum of a linear combination of the average of the features inside the clusters. That is $x_i \in R^p $ we must find $[X_1,X_2,\ldots,X_g]$ st.: $$\min\sum_{s \in {1,\ldots,g}}\sum_{x_i \in X_s}\sum^p_{j=1}w^T_j\frac{\sum^B_{k=1}x_{kj}}{B}$$ where w is the vector of the weight of the linear combination on the averages of the feature inside the clusters.

I'm wondering how to solve such a problem, keeping in mind that my dataset is very big (50k units) and also lots of clusters must be found. In particular: $n = 50000, p=3, g=500$.

I tried approaching the problem as suggested here https://www.quora.com/Which-clustering-algorithms-can-be-run-on-a-dataset-of-3-million-samples, so I wrote my implementation of online k-means. However still for a single iterations it takes more than 20 minutes.

What is a good approach to the problem?

Is $w$ a single fixed vector in $R^p$, or does it depend on the cluster? The formula, as written, suggests the former, but then the result does not depend on the choice of clusters at all and the question becomes meaningless. Or, perhaps, I completely misinterpret the entire setup? — fedja
– fedja, Commented Sep 1, 2024 at 2:39
@fedja yes, is a fixed vector. Why the result shouldn't depends on the choice of clusters? The problem is to organize the units of the dataset in clusters, in order to minimize the sum of the costs of each cluster, which is the linear combination of feature's average in each cluster. — BiasedBayes
– BiasedBayes, Commented Sep 1, 2024 at 10:09
The sum you wrote is just $1/B$ times the sum of all $w^Tx_k$, isn't it? — fedja
– fedja, Commented Sep 1, 2024 at 11:44

Stack Exchange Network

Cluster minimizing sum of cost of clusters

0

You must log in to answer this question.

Cluster minimizing sum of cost of clusters

0

You must log in to answer this question.

Related