[WIP] ENH: SMOTE for pure categorical data #565

ThomasKluiters · 2019-05-05T17:38:24Z

Reference Issue

What does this implement/fix? Explain your changes.

I've seen some issues regarding the need for SMOTE with just categorical data.

Therefore I've added a class SMOTEN which operates on purely categorical data.

Any other comments?

This is not the final version but I'd love some feedback on the code - I understand there might be more elegant ways to implement said class as there is some code duplication going on between SMOTENC and SMOTEN - I can look into this issue.

pep8speaks · 2019-05-05T17:38:27Z

Hello @ThomasKluiters! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

In the file imblearn/over_sampling/_smote.py:

Line 953:80: E501 line too long (86 > 79 characters)
Line 1215:80: E501 line too long (81 > 79 characters)
Line 1248:1: W391 blank line at end of file

Comment last updated at 2019-05-07 20:41:38 UTC

chkoar · 2019-05-05T18:11:04Z

Thanks for this. CI is failing, though.
I suppose that we need an example for this (or to include it in an existing one) and some user guide documentation as well.

ThomasKluiters · 2019-05-05T19:05:26Z

> Thanks for this. CI is failing, though.
> I suppose that we need an example for this (or to include it in an existing one) and some user guide documentation as well.

Will do! Do you think a user guide will suffice for this to be merged, or is more refactoring required?

chkoar · 2019-05-05T19:13:20Z

As you said

> there is some code duplication going on between SMOTENC and SMOTEN - I can look into this issue.

I think that we are not in a hurry.

ThomasKluiters · 2019-05-05T19:26:53Z

> As you said
>
>> there is some code duplication going on between SMOTENC and SMOTEN - I can look into this issue.
>
> I think that we are not in a hurry.

Yeah, of course! Just making sure what I should be doing :)

codecov · 2019-05-05T19:53:00Z

Codecov Report

Merging #565 into master will decrease coverage by 1.07%.
The diff coverage is 99.23%.

    @@            Coverage Diff             @@
    ##           master     #565      +/-   ##
    ==========================================
    - Coverage   97.93%   96.85%   -1.08%
    ==========================================
      Files          83       85       +2
      Lines        4784     5278     +494
    ==========================================
    + Hits         4685     5112     +427
    - Misses         99      166      +67

Impacted Files	Coverage Δ
imblearn/over_sampling/_smote.py	`97.75% <100%> (+0.54%)`	⬆️
imblearn/over_sampling/__init__.py	`100% <100%> (ø)`	⬆️
imblearn/utils/estimator_checks.py	`96.74% <100%> (+7.85%)`	⬆️
imblearn/over_sampling/tests/test_smote_n.py	`99.02% <99.02%> (ø)`
imblearn/keras/tests/test_generator.py	`8.92% <0%> (-91.08%)`	⬇️
imblearn/tensorflow/_generator.py	`34.28% <0%> (-65.72%)`	⬇️
imblearn/keras/_generator.py	`40.35% <0%> (-58.16%)`	⬇️
imblearn/datasets/_zenodo.py	`93.75% <0%> (-2.92%)`	⬇️
imblearn/ensemble/_weight_boosting.py	`96.8% <0%> (-0.9%)`	⬇️
imblearn/ensemble/_forest.py	`98.09% <0%> (-0.07%)`	⬇️
... and 64 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 68123d0...c7e7036.

ThomasKluiters · 2019-05-07T21:06:52Z

@chkoar could you look into my PR :)

glemaitre · 2019-06-07T12:58:48Z

We will look at this before the release of 0.5

glemaitre · 2019-06-12T21:57:58Z

I think that we should implement the VDM distance as in Chawla et al. instead of reusing the SMOTE-NC implementation.
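
For readers unfamiliar with it, the Value Difference Metric (VDM) described for SMOTE-N in Chawla et al. compares two categories of a feature through their class distributions, δ(v1, v2) = Σ_c |P(c | v1) − P(c | v2)|^k, and then sums δ^r over the features. The pure-Python sketch below works through that definition on toy data. The helper names (`category_class_proba`, `vdm`) and the data are illustrative assumptions and are not part of this PR.

```python
from collections import Counter, defaultdict


def category_class_proba(column, y):
    """Return {category: {class: P(class | category)}} for one feature."""
    counts = defaultdict(Counter)
    for value, label in zip(column, y):
        counts[value][label] += 1
    probas = {}
    for value, per_class in counts.items():
        total = sum(per_class.values())
        probas[value] = {c: n / total for c, n in per_class.items()}
    return probas


def vdm(x1, x2, probas, classes, k=1, r=2):
    """VDM distance between two samples given per-feature probabilities."""
    distance = 0.0
    for feature_idx, (v1, v2) in enumerate(zip(x1, x2)):
        p1, p2 = probas[feature_idx][v1], probas[feature_idx][v2]
        # delta: difference between the class distributions of v1 and v2
        delta = sum(
            abs(p1.get(c, 0.0) - p2.get(c, 0.0)) ** k for c in classes
        )
        distance += delta ** r
    return distance


# Toy data: a single categorical feature and two classes.
X = [["red"], ["red"], ["blue"], ["blue"], ["blue"], ["green"]]
y = [0, 0, 0, 1, 1, 1]
classes = sorted(set(y))
probas = [category_class_proba([row[0] for row in X], y)]

d_red_red = vdm(["red"], ["red"], probas, classes)
d_red_blue = vdm(["red"], ["blue"], probas, classes)
d_red_green = vdm(["red"], ["green"], probas, classes)
```

Here "red" only occurs in class 0 and "green" only in class 1, so they are as far apart as two categories can be, while "blue" sits in between.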

ThomasKluiters · 2019-06-13T17:30:42Z

> I think that we should implement the VDM distance as in Chawla et al. instead of reusing the SMOTE-NC implementation.

Alright! I can do that - any other things you'd like to see change?

glemaitre · 2019-06-13T18:20:56Z

> Alright! I can do that - any other things you'd like to see change?

I see that you introduced the parameter kind='regular'. We actually dropped those. I think that it is better to inherit directly from BaseSMOTE instead of SMOTENC.

ThomasKluiters · 2019-06-25T11:31:45Z

>> Alright! I can do that - any other things you'd like to see change?
>
> I see that you introduced the parameter kind='regular'. We actually dropped those. I think that it is better to inherit directly from BaseSMOTE instead of SMOTENC.

Alright, I will start working on these proposed changes. I'm a bit busy with other contributions.

glemaitre · 2021-02-12T21:49:35Z

I am putting an implementation of the VDM distance here. I don't know how long it would take in practice.
We might want to cythonize it.

    import itertools

    import numpy as np
    from sklearn.preprocessing import OrdinalEncoder
    from sklearn.utils.multiclass import unique_labels


    class ValueDifferenceMetric:
        """Class implementing the Value Difference Metric.

        This metric computes the distance between samples containing
        only nominal categorical features.

        Parameters
        ----------
        r : int, default=2
            The exponent used when accumulating the distance. `r=1`
            corresponds to Manhattan and `r=2` corresponds to
            Euclidean distance.

        Attributes
        ----------
        classes_ : ndarray of shape (n_classes,)
            The classes.
        distance_per_class_ : dict
            Precomputed distance matrix for each feature and each class.
        """

        def __init__(self, r=2):
            self.r = r

        def fit(self, X, y):
            """Compute the necessary statistics from the training set.

            Parameters
            ----------
            X : ndarray of shape (n_samples, n_features)
                The input data.
            y : ndarray of shape (n_samples,)
                The target.

            Returns
            -------
            self
            """
            self.encoder_ = OrdinalEncoder(dtype=np.int32)
            self.classes_ = unique_labels(y)
            X_encoded = self.encoder_.fit_transform(X)
            n_features = X_encoded.shape[1]

            # compute the category counts per feature per class
            counts_per_class = {
                klass: [
                    np.bincount(
                        X_encoded[y == klass, feature_idx],
                        minlength=len(self.encoder_.categories_[feature_idx])
                    ).astype(np.float64)
                    for feature_idx in range(n_features)
                ]
                for klass in self.classes_
            }

            # compute the total category counts per feature
            total_counts = [
                np.zeros_like(counts_per_class[self.classes_[0]][feature_idx])
                for feature_idx in range(n_features)
            ]
            for feature_idx in range(n_features):
                for klass in self.classes_:
                    total_counts[feature_idx] += (
                        counts_per_class[klass][feature_idx]
                    )

            # turn the counts into P(class | category), per feature
            for feature_idx in range(n_features):
                for klass in self.classes_:
                    counts_per_class[klass][feature_idx] /= (
                        total_counts[feature_idx]
                    )

            # precompute the distance matrices for all category pairs
            self.distance_per_class_ = {
                klass: [
                    np.abs(
                        np.subtract.outer(
                            counts_per_class[klass][feature_idx],
                            counts_per_class[klass][feature_idx]
                        )
                    )
                    for feature_idx in range(n_features)
                ]
                for klass in self.classes_
            }

            return self

        def pairwise(self, X):
            """Compute the VDM distance pairwise.

            Parameters
            ----------
            X : ndarray of shape (n_samples, n_features)
                The input data.

            Returns
            -------
            distance_matrix : ndarray of shape (n_samples, n_samples)
                The VDM pairwise distance.
            """
            X_encoded = self.encoder_.transform(X)
            n_samples = X_encoded.shape[0]
            distance_matrix = np.zeros(
                shape=(n_samples, n_samples), dtype=np.float64
            )
            # fill the upper triangle and mirror it to the lower one
            for i, j in itertools.product(range(n_samples), repeat=2):
                if i < j:
                    distance_matrix[i, j] = self._vdm(
                        X_encoded[i], X_encoded[j]
                    )
                    distance_matrix[j, i] = distance_matrix[i, j]
            return distance_matrix

        def _vdm(self, x, y):
            """Compute the VDM distance between 2 samples.

            Parameters
            ----------
            x, y : ndarray of shape (n_features,)
                Samples from which to compute the distance.

            Returns
            -------
            distance : float
                The VDM distance.
            """
            n_features, distance = len(x), 0
            for feature_idx in range(n_features):
                # sum the per-class differences, then apply the exponent
                distance += sum([
                    distance_matrix[feature_idx][
                        x[feature_idx], y[feature_idx]
                    ]
                    for distance_matrix in self.distance_per_class_.values()
                ]) ** self.r
            return distance

chkoar · 2021-02-12T21:58:12Z

> We might want to cythonize it.

@glemaitre did you time it?

glemaitre · 2021-02-12T22:29:49Z

> @glemaitre did you time it?

Nope, but the for loop on the itertools.product is already a good hint :)

I have to write out a few more equations, but I think that we can replace each entry of X_encoded by the corresponding probabilities and then call cdist from SciPy to compute this part.

In this case, there is no need for Cython. Then we need to precompute the distances for X and pass them to the kNN of scikit-learn (a lot of stuff that I have never done in practice) :)
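
The idea in this comment can be sketched as follows: swap each encoded category for its row of class probabilities, after which the per-feature term is an ordinary Manhattan distance between probability vectors. The sketch below uses plain NumPy broadcasting in place of a SciPy distance routine, and the probability table `proba` is a made-up illustration rather than output from the code in this thread.

```python
import numpy as np

# Hypothetical table for one feature: row = category, column = class,
# entry = P(class | category).
proba = np.array([
    [1.0, 0.0],      # category 0
    [1 / 3, 2 / 3],  # category 1
    [0.0, 1.0],      # category 2
])

# Ordinal-encoded values of that feature for four samples.
x_encoded = np.array([0, 1, 2, 0])

# Fancy indexing swaps each category for its probability vector,
# giving an array of shape (n_samples, n_classes).
p = proba[x_encoded]

# Pairwise Manhattan distance between probability vectors:
# (n, 1, c) - (1, n, c) -> (n, n, c), then sum over the class axis.
delta = np.abs(p[:, np.newaxis, :] - p[np.newaxis, :, :]).sum(axis=2)

r = 2
feature_distance = delta ** r  # this feature's contribution to the VDM
```

Summing `feature_distance` over all features gives the full pairwise matrix without a Python loop over sample pairs.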

glemaitre · 2021-02-12T23:58:40Z

OK, this is the 20x faster implementation:

    import numpy as np
    from scipy.spatial import distance_matrix
    from sklearn.preprocessing import OrdinalEncoder
    from sklearn.utils.multiclass import unique_labels


    class ValueDifferenceMetric:
        """Class implementing the Value Difference Metric.

        This metric computes the distance between samples containing
        only nominal categorical features.

        Parameters
        ----------
        r : int, default=1
            The exponent used when accumulating the distance. `r=1`
            corresponds to Manhattan and `r=2` corresponds to
            Euclidean distance.

        Attributes
        ----------
        classes_ : ndarray of shape (n_classes,)
            The classes.
        proba_per_class_ : list of ndarray of shape (n_categories, n_classes)
            List of length `n_features` containing the probabilities of
            each class given a category.
        """

        def __init__(self, r=1):
            self.r = r

        def fit(self, X, y):
            """Compute the necessary statistics from the training set.

            Parameters
            ----------
            X : ndarray of shape (n_samples, n_features)
                The input data.
            y : ndarray of shape (n_samples,)
                The target.

            Returns
            -------
            self
            """
            self.encoder_ = OrdinalEncoder(dtype=np.int32)
            self.classes_ = unique_labels(y)
            X_encoded = self.encoder_.fit_transform(X)
            n_features = X_encoded.shape[1]

            # list of length n_features of ndarray (n_categories, n_classes)
            counts_per_class = [
                np.transpose(
                    [
                        np.bincount(
                            X_encoded[y == klass, feature_idx],
                            minlength=len(
                                self.encoder_.categories_[feature_idx]
                            )
                        )
                        for klass in self.classes_
                    ]
                )
                for feature_idx in range(n_features)
            ]

            # list of length n_features of ndarray (n_categories, n_classes);
            # each row is normalized to sum to one over the classes
            proba_per_class = [
                (counts_per_class[feature_idx] /
                 counts_per_class[feature_idx].sum(axis=1)[:, np.newaxis])
                for feature_idx in range(n_features)
            ]
            self.proba_per_class_ = proba_per_class

            return self

        def pairwise(self, X1, X2=None):
            """Compute the VDM distance pairwise.

            Parameters
            ----------
            X1 : ndarray of shape (n_samples_X1, n_features)
                The input data.
            X2 : ndarray of shape (n_samples_X2, n_features), default=None
                The input data. If None, distances are computed between
                the samples of `X1`.

            Returns
            -------
            distance_matrix : ndarray of shape (n_samples_X1, n_samples_X2)
                The VDM pairwise distance.
            """
            X1_encoded = self.encoder_.transform(X1)
            n_samples_X1, n_features = X1_encoded.shape
            if X2 is not None:
                X2_encoded = self.encoder_.transform(X2)
                n_samples_X2 = X2_encoded.shape[0]
            else:
                n_samples_X2 = n_samples_X1

            distance = np.zeros(shape=(n_samples_X1, n_samples_X2))
            for feature_idx in range(n_features):
                # replace each category by its vector of class probabilities
                proba_feature_X1 = self.proba_per_class_[feature_idx][
                    X1_encoded[:, feature_idx]
                ]
                if X2 is not None:
                    proba_feature_X2 = self.proba_per_class_[feature_idx][
                        X2_encoded[:, feature_idx]
                    ]
                else:
                    proba_feature_X2 = proba_feature_X1
                # Manhattan distance between probability vectors
                distance += distance_matrix(
                    proba_feature_X1, proba_feature_X2, p=1
                ) ** self.r
            return distance

glemaitre · 2021-02-13T00:31:01Z

And the use case within the SMOTE should be something like:

    from sklearn.neighbors import NearestNeighbors

    metric = ValueDifferenceMetric(r=2)
    metric.fit(X, y)
    # distances between the fitted samples: (n_samples, n_samples)
    X_dist_fit = metric.pairwise(X)
    # distances between the fitted samples and the queries X2
    X_query = metric.pairwise(X, X2)
    nn = NearestNeighbors(metric="precomputed")
    nn.fit(X_dist_fit, y)
    nn.kneighbors(X_query.T, return_distance=False)
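
With `metric="precomputed"`, scikit-learn expects the query matrix to have one row per query, i.e. shape (n_queries, n_indexed), which is presumably why the snippet transposes `X_query`. The NumPy-only stand-in below illustrates that shape convention with a made-up distance matrix. It is a sketch of the lookup, not the scikit-learn implementation.

```python
import numpy as np

# Hypothetical distances with shape (n_indexed, n_queries), i.e. the
# layout produced by pairwise(X, X2) for 4 fitted samples and 2 queries.
dist_fit_to_query = np.array([
    [0.0, 3.0],
    [1.0, 2.0],
    [4.0, 0.5],
    [2.0, 1.0],
])

# The neighbour search wants one row per query: (n_queries, n_indexed).
dist_query = dist_fit_to_query.T

k = 2
# Indices of the k closest fitted samples for each query, nearest first.
neighbors = np.argsort(dist_query, axis=1)[:, :k]
```

For the first query the closest fitted samples are 0 and 1, and for the second they are 2 and 3.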

glemaitre · 2021-02-13T00:32:36Z

The only issue is that we have some precomputed distance matrices of shape (n_samples, n_samples) that could potentially be quite large. But I don't think we can do better without cythonizing some code.
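
To put a rough number on that concern: a dense float64 matrix needs 8 bytes per entry, so its size grows quadratically with the number of samples. A back-of-the-envelope helper (the function name is illustrative, not from the PR):

```python
def dense_distance_matrix_bytes(n_rows, n_cols, itemsize=8):
    """Memory needed for a dense (n_rows, n_cols) matrix of float64."""
    return n_rows * n_cols * itemsize


for n_samples in (1_000, 10_000, 100_000):
    size_gb = dense_distance_matrix_bytes(n_samples, n_samples) / 1e9
    print(f"{n_samples:>7} samples -> {size_gb:g} GB")
```

At 10,000 samples the matrix takes 0.8 GB, and at 100,000 samples it already takes 80 GB.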

Add SMOTE for pure categorical data

d1ac5bd

Fix PEP issues

e17ba76

chkoar changed the title ~~Add SMOTE for pure categorical data~~ [WIP] Add SMOTE for pure categorical data May 5, 2019

Fix failing tests

5f1ca14

Fix doctest error

c2fc4da

Change order of counter keys in docstring

9db1071

ThomasKluiters added 2 commits May 5, 2019 22:43

Rephrase docstring

c2fe8c2

Refactor SMOTEN and SMOTENC to be more unified

bc96223

ThomasKluiters force-pushed the smote-pure-categorical branch from 7914e0c to bc96223 Compare May 7, 2019 14:56

ThomasKluiters changed the title ~~[WIP] Add SMOTE for pure categorical data~~ [MRG] Add SMOTE for pure categorical data May 7, 2019

ThomasKluiters force-pushed the smote-pure-categorical branch from db477d2 to db0f17d Compare May 7, 2019 20:40

Add documentation for SMOTEN

c7e7036

ThomasKluiters force-pushed the smote-pure-categorical branch from db0f17d to c7e7036 Compare May 7, 2019 20:41

glemaitre added this to the 0.5 milestone Jun 11, 2019

chkoar changed the title ~~[MRG] Add SMOTE for pure categorical data~~ [MRG] SMOTE for pure categorical data Jun 12, 2019

chkoar changed the title ~~[MRG] SMOTE for pure categorical data~~ [MRG] ENH: SMOTE for pure categorical data Jun 12, 2019

glemaitre self-requested a review June 12, 2019 21:36

glemaitre force-pushed the master branch from f20d4d8 to f1bc189 Compare June 28, 2019 12:06

glemaitre force-pushed the master branch 4 times, most recently from eae6c6b to ffdde80 Compare June 28, 2019 13:52

glemaitre force-pushed the master branch from 65132db to 68123d0 Compare November 8, 2019 22:54

chkoar changed the title ~~[MRG] ENH: SMOTE for pure categorical data~~ [WIP] ENH: SMOTE for pure categorical data Nov 13, 2019

glemaitre modified the milestones: 0.5, 0.7, 0.6 Nov 17, 2019

glemaitre modified the milestones: 0.6, 0.7 Dec 5, 2019

chkoar mentioned this pull request Apr 16, 2020

SMOTENC fails when all features are categorical #562

Closed

chkoar force-pushed the master branch from 4a201cd to 0eb9033 Compare June 20, 2020 02:58

glemaitre modified the milestones: 0.7, 0.8 Nov 26, 2020

glemaitre mentioned this pull request Feb 15, 2021

FEA implements SMOTEN to handle nominal categorical features #802

Merged

glemaitre closed this in #802 Feb 15, 2021
