Skip to content

Conversation

@JelmerBot
Copy link
Collaborator

Dear maintainers,

This pull request adds the branch detection functionality from our preprint (https://arxiv.org/abs/2311.15887). Our main goal was to detect branch hierarchies within already detected clusters. These hierarchies describe cluster shapes, which can reveal subgroups not expressed in the density profile.

I have tried to add the functionality in way that minimises its impact on the codebase. I settled on the pattern used by the prediction module. The main usage pattern now looks like this:

from hdbscan import HDBSCAN, BranchDetector clusterer = HDBSCAN(branch_detection_data=True).fit(data) branch_detector = BranchDetector().fit(clusterer)

The BranchDetector class mimics the HDBSCAN class and provides access to labels, membership probabilities, the detected hierarchies, and more. This way, end-users that just want clusters do not have to interact with the branch detection functionality at all.

I needed to make a couple of unrelated changes in Cython code to make all tests pass on my machine. I will try to mark these changes with review comments in the PR. Please advice on whether I should remove these changes from the PR or keep them in.

I hope you will consider merging this PR. Let me know if things need to be fixed / changed to better match your vision for the project.

Kind regards,

Jelmer Bot

@review-notebook-app
Copy link

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

Copy link
Collaborator Author

@JelmerBot JelmerBot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additional comments and questions about specific parts of the code:


right_value = current_distances[j]
right_source = current_sources[j]
right_source = <np.intp_t> current_sources[j]
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes here solved the following test errors on my machine:

FAILED hdbscan/tests/test_hdbscan.py::test_hdbscan_prims_kdtree - TypeError: 'float' object cannot be interpreted as an integer FAILED hdbscan/tests/test_hdbscan.py::test_hdbscan_prims_balltree - TypeError: 'float' object cannot be interpreted as an integer FAILED hdbscan/tests/test_hdbscan.py::test_hdbscan_high_dimensional - TypeError: 'float' object cannot be interpreted as an integer FAILED hdbscan/tests/test_rsl.py::test_rsl_prims_balltree - TypeError: 'float' object cannot be interpreted as an integer FAILED hdbscan/tests/test_rsl.py::test_rsl_prims_kdtree - TypeError: 'float' object cannot be interpreted as an integer 

These tests do not appear to be broken on the test runner for the master branch but were crashing my machine.
Could this be related to a specific Cython version?

union_find.union_(<np.intp_t> row[0], cluster)
union_find.union_(<np.intp_t> row[1], cluster)
union_find.union_(np.intp(row[0]), cluster)
union_find.union_(np.intp(row[1]), cluster)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes here solved to following tests on my machine:

FAILED hdbscan/tests/test_hdbscan.py::test_hdbscan_dbscan_clustering - TypeError: 'numpy.float64' object cannot be interpreted as an integer FAILED hdbscan/tests/test_rsl.py::test_rsl_distance_matrix - TypeError: 'numpy.float64' object cannot be interpreted as an integer FAILED hdbscan/tests/test_rsl.py::test_rsl_feature_vector - TypeError: 'numpy.float64' object cannot be interpreted as an integer FAILED hdbscan/tests/test_rsl.py::test_rsl_callable_metric - TypeError: 'numpy.float64' object cannot be interpreted as an integer FAILED hdbscan/tests/test_rsl.py::test_rsl_boruvka_balltree - TypeError: 'numpy.float64' object cannot be interpreted as an integer FAILED hdbscan/tests/test_rsl.py::test_rsl_prims_balltree - TypeError: 'numpy.float64' object cannot be interpreted as an integer FAILED hdbscan/tests/test_rsl.py::test_rsl_prims_kdtree - TypeError: 'numpy.float64' object cannot be interpreted as an integer FAILED hdbscan/tests/test_rsl.py::test_rsl_hierarchy - TypeError: 'numpy.float64' object cannot be interpreted as an integer FAILED hdbscan/tests/test_rsl.py::test_rsl_high_dimensional - TypeError: 'numpy.float64' object cannot be interpreted as an integer 

Similar to the previous comment, these tests do not appear to be broken on the test runner for the master branch but were crashing my machine.


cdef get_probabilities(np.ndarray tree, dict cluster_map, np.ndarray labels):

cdef get_probabilities(np.ndarray tree, dict cluster_map, np.ndarray labels, np.ndarray deaths):
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moving deaths to be an input parameter allows me to re-use the function for branch hierarchies.


root = cluster_tree['parent'].min()
parent = cluster_tree[cluster_tree['child'] == leaf]['parent']
parent = cluster_tree[cluster_tree['child'] == leaf]['parent'][0]
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change and the one on line 691 fixed the following tests on my machine:

FAILED hdbscan/tests/test_flat.py::test_flat_base_epsilon - TypeError: only integer scalar arrays can be converted to a scalar index FAILED hdbscan/tests/test_flat.py::test_switch_to_leaf - TypeError: only integer scalar arrays can be converted to a scalar index FAILED hdbscan/tests/test_flat.py::test_approx_predict_same_clusters - TypeError: only integer scalar arrays can be converted to a scalar index FAILED hdbscan/tests/test_flat.py::test_approx_predict_diff_clusters - TypeError: only integer scalar arrays can be converted to a scalar index FAILED hdbscan/tests/test_flat.py::test_mem_vec_same_clusters - TypeError: only integer scalar arrays can be converted to a scalar index FAILED hdbscan/tests/test_flat.py::test_mem_vec_diff_clusters - TypeError: only integer scalar arrays can be converted to a scalar index FAILED hdbscan/tests/test_flat.py::test_all_points_mem_vec_same_clusters - TypeError: only integer scalar arrays can be converted to a scalar index FAILED hdbscan/tests/test_flat.py::test_all_points_mem_vec_diff_clusters - TypeError: only integer scalar arrays can be converted to a scalar index 

Again, these tests do not appear to be broken on the test runner but were crashing my machine.

return set(selected_clusters)


cpdef np.ndarray simplify_branch_hierarchy(np.ndarray condensed_tree,
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea of persistence differs for clusters and branches. With cluster, all data points enter the filtration at distance=0 so a persistence threshold can be implemented as a minimum epsilon value. With branches, points themselves have an eccentricity value, so we need to look at the eccentricity range to determine persistence. It was easier to implement the persistence threshold by simplifying the hierarchy rather than to adapt the epsilon search approach.

"than mere distances is required!"
)

def generate_branch_detection_data(self):
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure how useful this function is. It does not re-compute the minimum spanning tree (MST) because that is quite involved, but the MST is required for detecting branches. That leaves a potentially common unsupported pattern:

# This works clusterer = HDBSCAN().fit(data) clusterer.generate_branch_detection_data() # This breaks because MST not available... branch_detector = BranchDetector(clusterer)

Is there a better way to deal with this case? Should I fail early within generate_branch_detection_data() when an MST is not available? Is there an easy way to re-compute the MST?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think failing early with a clear error message and suggestion for how to fix it would be the best approach here.

Copy link
Collaborator

@lmcinnes lmcinnes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for such a complete and comprehensive PR! I especially appreciate the comprehensive documentation and tests. That makes this process so much easier for me. I'm just going to merge as is because I would really like to have this functionality available, and I'll handle the early error on lack on min_spanning_tree.

Thanks again for all your work on this. I really do think this is a major contribution and will be very useful for a great many people.

"than mere distances is required!"
)

def generate_branch_detection_data(self):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think failing early with a clear error message and suggestion for how to fix it would be the best approach here.

@lmcinnes lmcinnes merged commit a9f74b2 into scikit-learn-contrib:master Aug 5, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

2 participants