Skip to content

Conversation

@eroell
Copy link
Contributor

@eroell eroell commented Sep 5, 2023

Reference Issue

Fixes #1040.

What does this implement/fix? Explain your changes.

This is a suggestion to use the same dataset generator sklearn.datasets.make_classification as is used in the doc of e.g. ClusterCentroids instead of the dead sklearn.datasets.fetch_mldata.

Any other comments?

This is a suggestion, if you are interested in a low-cost fix. There may be more interesting datasets which could be used, or ones you find more illustrative.

btw pycodestyle gave me a few warnings of E501 about lines being longer than 79 characters, but I think you have a max line length of 88 in your checks which I think is also fine.

@eroell
Copy link
Contributor Author

eroell commented Sep 6, 2023

It seems some tests have failed, but it seems to me these would be unrelated to the suggested changes.

@glemaitre
Copy link
Member

Yes the test is failing for numerical issue. 315 / 404 is 0.779 while we expect > 0.78.
Not a big deal for this PR.

--------
>>> from collections import Counter # doctest: +SKIP
>>> from sklearn.datasets import fetch_mldata # doctest: +SKIP
>>> from sklearn.datasets import make_classification # doctest: +SKIP
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead I would prefer that we use fetch_openml and still fetch the pima dataset.
It should look like:

from sklearn.datasets import fetch_openml from sklearn.preprocesing import scale X, y = fetch_openml("diabetes", version=1, return_X_y=True) X = scale(X)

In this manner, we should still deal with the same dataset.

Copy link
Contributor Author

@eroell eroell Sep 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Used this dataset now. As opposed to
Resampled dataset shape Counter({-1: 268, 1: 227})
in the original example, the counts have now changed to
Resampled dataset shape Counter({{'tested_positive': 268, 'tested_negative': 181}}).

It also seems that omitting the scaling does not retrieve the counts in the original example. Might be due to a difference to the original, now not anymore findable dataset - noticed that also labels have been changed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a huge deal (we might not apply exactly the same scaling). But this is closer to the original example. Updating the labels and counts would be enough here.

@glemaitre glemaitre changed the title [MRG] suggested new dataset in doc of CondensedNearestNeighbour DOC udpate the datasets used in CondensedNearestNeighbour docstring Sep 7, 2023
@eroell eroell marked this pull request as draft September 7, 2023 09:53
@eroell eroell marked this pull request as ready for review September 7, 2023 11:07
@glemaitre glemaitre merged commit dcb476d into scikit-learn-contrib:master Sep 7, 2023
@glemaitre
Copy link
Member

Thanks @eroell LGTM.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

2 participants