@n9Mtq4 n9Mtq4 commented Mar 22, 2023

This PR fixes an issue where prediction data does not use cluster_selection_epsilon. The bug surfaces as wrong labels from approximate_predict and incorrect exemplars_.

Code to reproduce the problem:

```python
import hdbscan
from sklearn.datasets import make_blobs

blobs, _ = make_blobs(100, n_features=8, centers=10, random_state=42)

# use a high epsilon to force fewer clusters. real world data this happens more easily
clusterer = hdbscan.HDBSCAN(cluster_selection_epsilon=12.0, prediction_data=True)
clusterer.fit(blobs)

# 7 clusters from labels
clusterer.labels_.max() + 1

# 10 clusters from exemplars
len(clusterer.exemplars_)

# [5, 4, 3, 0, 5, 5, 6, 0, 5, 1]
clusterer.labels_[:10]

# predicting assigns points to completely different clusters (and number of clusters!)
# [6, 5, 4, 0, 6, 6, 9, 0, 6, 2]
hdbscan.approximate_predict(clusterer, blobs[:10])
```

I tracked the issue down to the prediction data selecting clusters from the tree differently from how it is done in _hdbscan_tree.pyx. The fix is to return the selected clusters from get_clusters in _hdbscan_tree.pyx and reuse those same clusters when building the prediction data.

With this PR:

```python
# 7 clusters from labels
clusterer.labels_.max() + 1

# 7 clusters from exemplars
len(clusterer.exemplars_)

# [5, 4, 3, 0, 5, 5, 6, 0, 5, 1]
clusterer.labels_[:10]

# predicting assigns points to correct clusters
# [5, 4, 3, 0, 5, 5, 6, 0, 5, 1]
hdbscan.approximate_predict(clusterer, blobs[:10])
```

This likely fixes #308
