Add branch detection functionality #648

JelmerBot · 2024-07-24T13:07:46Z

Dear maintainers,

This pull request adds the branch detection functionality from our preprint (https://arxiv.org/abs/2311.15887). Our main goal was to detect branch hierarchies within already detected clusters. These hierarchies describe cluster shapes, which can reveal subgroups not expressed in the density profile.

I have tried to add the functionality in way that minimises its impact on the codebase. I settled on the pattern used by the prediction module. The main usage pattern now looks like this:

from hdbscan import HDBSCAN, BranchDetector clusterer = HDBSCAN(branch_detection_data=True).fit(data) branch_detector = BranchDetector().fit(clusterer)

The BranchDetector class mimics the HDBSCAN class and provides access to labels, membership probabilities, the detected hierarchies, and more. This way, end-users that just want clusters do not have to interact with the branch detection functionality at all.

I needed to make a couple of unrelated changes in Cython code to make all tests pass on my machine. I will try to mark these changes with review comments in the PR. Please advice on whether I should remove these changes from the PR or keep them in.

I hope you will consider merging this PR. Let me know if things need to be fixed / changed to better match your vision for the project.

Kind regards,

Jelmer Bot

review-notebook-app · 2024-07-24T13:07:51Z

Check out this pull request on

See visual diffs & provide feedback on Jupyter Notebooks.

Powered by ReviewNB

JelmerBot

Additional comments and questions about specific parts of the code:

JelmerBot · 2024-07-24T13:11:16Z

hdbscan/_hdbscan_linkage.pyx


 right_value = current_distances[j]
- right_source = current_sources[j]
+ right_source = <np.intp_t> current_sources[j]


The changes here solved the following test errors on my machine:

FAILED hdbscan/tests/test_hdbscan.py::test_hdbscan_prims_kdtree - TypeError: 'float' object cannot be interpreted as an integer FAILED hdbscan/tests/test_hdbscan.py::test_hdbscan_prims_balltree - TypeError: 'float' object cannot be interpreted as an integer FAILED hdbscan/tests/test_hdbscan.py::test_hdbscan_high_dimensional - TypeError: 'float' object cannot be interpreted as an integer FAILED hdbscan/tests/test_rsl.py::test_rsl_prims_balltree - TypeError: 'float' object cannot be interpreted as an integer FAILED hdbscan/tests/test_rsl.py::test_rsl_prims_kdtree - TypeError: 'float' object cannot be interpreted as an integer

These tests do not appear to be broken on the test runner for the master branch but were crashing my machine.
Could this be related to a specific Cython version?

JelmerBot · 2024-07-24T13:16:13Z

hdbscan/_hdbscan_tree.pyx

- union_find.union_(<np.intp_t> row[0], cluster)
- union_find.union_(<np.intp_t> row[1], cluster)
+ union_find.union_(np.intp(row[0]), cluster)
+ union_find.union_(np.intp(row[1]), cluster)


The changes here solved to following tests on my machine:

FAILED hdbscan/tests/test_hdbscan.py::test_hdbscan_dbscan_clustering - TypeError: 'numpy.float64' object cannot be interpreted as an integer FAILED hdbscan/tests/test_rsl.py::test_rsl_distance_matrix - TypeError: 'numpy.float64' object cannot be interpreted as an integer FAILED hdbscan/tests/test_rsl.py::test_rsl_feature_vector - TypeError: 'numpy.float64' object cannot be interpreted as an integer FAILED hdbscan/tests/test_rsl.py::test_rsl_callable_metric - TypeError: 'numpy.float64' object cannot be interpreted as an integer FAILED hdbscan/tests/test_rsl.py::test_rsl_boruvka_balltree - TypeError: 'numpy.float64' object cannot be interpreted as an integer FAILED hdbscan/tests/test_rsl.py::test_rsl_prims_balltree - TypeError: 'numpy.float64' object cannot be interpreted as an integer FAILED hdbscan/tests/test_rsl.py::test_rsl_prims_kdtree - TypeError: 'numpy.float64' object cannot be interpreted as an integer FAILED hdbscan/tests/test_rsl.py::test_rsl_hierarchy - TypeError: 'numpy.float64' object cannot be interpreted as an integer FAILED hdbscan/tests/test_rsl.py::test_rsl_high_dimensional - TypeError: 'numpy.float64' object cannot be interpreted as an integer

Similar to the previous comment, these tests do not appear to be broken on the test runner for the master branch but were crashing my machine.

JelmerBot · 2024-07-24T13:18:16Z

hdbscan/_hdbscan_tree.pyx


-cdef get_probabilities(np.ndarray tree, dict cluster_map, np.ndarray labels):
-
+cdef get_probabilities(np.ndarray tree, dict cluster_map, np.ndarray labels, np.ndarray deaths):


Moving deaths to be an input parameter allows me to re-use the function for branch hierarchies.

JelmerBot · 2024-07-24T13:22:59Z

hdbscan/_hdbscan_tree.pyx


 root = cluster_tree['parent'].min()
- parent = cluster_tree[cluster_tree['child'] == leaf]['parent']
+ parent = cluster_tree[cluster_tree['child'] == leaf]['parent'][0]


This change and the one on line 691 fixed the following tests on my machine:

FAILED hdbscan/tests/test_flat.py::test_flat_base_epsilon - TypeError: only integer scalar arrays can be converted to a scalar index FAILED hdbscan/tests/test_flat.py::test_switch_to_leaf - TypeError: only integer scalar arrays can be converted to a scalar index FAILED hdbscan/tests/test_flat.py::test_approx_predict_same_clusters - TypeError: only integer scalar arrays can be converted to a scalar index FAILED hdbscan/tests/test_flat.py::test_approx_predict_diff_clusters - TypeError: only integer scalar arrays can be converted to a scalar index FAILED hdbscan/tests/test_flat.py::test_mem_vec_same_clusters - TypeError: only integer scalar arrays can be converted to a scalar index FAILED hdbscan/tests/test_flat.py::test_mem_vec_diff_clusters - TypeError: only integer scalar arrays can be converted to a scalar index FAILED hdbscan/tests/test_flat.py::test_all_points_mem_vec_same_clusters - TypeError: only integer scalar arrays can be converted to a scalar index FAILED hdbscan/tests/test_flat.py::test_all_points_mem_vec_diff_clusters - TypeError: only integer scalar arrays can be converted to a scalar index

Again, these tests do not appear to be broken on the test runner but were crashing my machine.

JelmerBot · 2024-07-24T13:28:01Z

hdbscan/_hdbscan_tree.pyx

 return set(selected_clusters)

+
+cpdef np.ndarray simplify_branch_hierarchy(np.ndarray condensed_tree, 


The idea of persistence differs for clusters and branches. With cluster, all data points enter the filtration at distance=0 so a persistence threshold can be implemented as a minimum epsilon value. With branches, points themselves have an eccentricity value, so we need to look at the eccentricity range to determine persistence. It was easier to implement the persistence threshold by simplifying the hierarchy rather than to adapt the epsilon search approach.

JelmerBot · 2024-07-24T13:51:39Z

hdbscan/hdbscan_.py

 "than mere distances is required!"
 )

+ def generate_branch_detection_data(self):


Not sure how useful this function is. It does not re-compute the minimum spanning tree (MST) because that is quite involved, but the MST is required for detecting branches. That leaves a potentially common unsupported pattern:

# This works clusterer = HDBSCAN().fit(data) clusterer.generate_branch_detection_data() # This breaks because MST not available... branch_detector = BranchDetector(clusterer)

Is there a better way to deal with this case? Should I fail early within generate_branch_detection_data() when an MST is not available? Is there an easy way to re-compute the MST?

I think failing early with a clear error message and suggestion for how to fix it would be the best approach here.

lmcinnes

Thanks for such a complete and comprehensive PR! I especially appreciate the comprehensive documentation and tests. That makes this process so much easier for me. I'm just going to merge as is because I would really like to have this functionality available, and I'll handle the early error on lack on min_spanning_tree.

Thanks again for all your work on this. I really do think this is a major contribution and will be very useful for a great many people.

lmcinnes · 2024-08-05T16:42:15Z

hdbscan/hdbscan_.py

 "than mere distances is required!"
 )

+ def generate_branch_detection_data(self):


I think failing early with a clear error message and suggestion for how to fix it would be the best approach here.

Add branch detection functionality

f58b545

JelmerBot commented Jul 24, 2024

View reviewed changes

Deal with os variance

04c77fe

JelmerBot commented Jul 24, 2024

View reviewed changes

lmcinnes reviewed Aug 5, 2024

View reviewed changes

lmcinnes merged commit a9f74b2 into scikit-learn-contrib:master Aug 5, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Add branch detection functionality #648

Add branch detection functionality #648

Uh oh!

JelmerBot commented Jul 24, 2024

review-notebook-app bot commented Jul 24, 2024

JelmerBot left a comment

JelmerBot Jul 24, 2024

JelmerBot Jul 24, 2024

JelmerBot Jul 24, 2024

JelmerBot Jul 24, 2024

JelmerBot Jul 24, 2024

JelmerBot Jul 24, 2024

lmcinnes Aug 5, 2024

lmcinnes left a comment

lmcinnes Aug 5, 2024

Labels

2 participants


		cdef get_probabilities(np.ndarray tree, dict cluster_map, np.ndarray labels):

		cdef get_probabilities(np.ndarray tree, dict cluster_map, np.ndarray labels, np.ndarray deaths):

		return set(selected_clusters)


		cpdef np.ndarray simplify_branch_hierarchy(np.ndarray condensed_tree,

Add branch detection functionality #648

Add branch detection functionality #648

Uh oh!

Conversation

JelmerBot commented Jul 24, 2024

review-notebook-app bot commented Jul 24, 2024

JelmerBot left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

lmcinnes left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Labels

2 participants