Conversation

ThomasKluiters

Reference Issue

What does this implement/fix? Explain your changes.

I've seen some issues regarding the need for SMOTE with just categorical data.

Therefore I've added a class SMOTEN which operates on purely categorical data.

Any other comments?

This is not the final version, but I'd love some feedback on the code. I understand there might be more elegant ways to implement said class, as there is some code duplication going on between SMOTENC and SMOTEN - I can look into this issue.

@pep8speaks

pep8speaks commented May 5, 2019

Hello @ThomasKluiters! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 953:80: E501 line too long (86 > 79 characters)
Line 1215:80: E501 line too long (81 > 79 characters)
Line 1248:1: W391 blank line at end of file

Comment last updated at 2019-05-07 20:41:38 UTC
@chkoar
Member

chkoar commented May 5, 2019

Thanks for this. CI is failing, though.
I suppose that we need an example with this (or to include it in an existing one) and some user guide documentation as well.

@chkoar chkoar changed the title Add SMOTE for pure categorical data [WIP] Add SMOTE for pure categorical data May 5, 2019
@ThomasKluiters
Author

Thanks for this. CI is failing, though.
I suppose that we need an example with this (or to include it in an existing one) and some user guide documentation as well.

Will do! Do you think a user guide will suffice for this to be merged, or is more refactoring required?

@chkoar
Member

chkoar commented May 5, 2019

As you said

there is some code duplication going on between SMOTENC and SMOTEN - I can look into this issue.

I think that we are not in a hurry.

@ThomasKluiters
Author

As you said

there is some code duplication going on between SMOTENC and SMOTEN - I can look into this issue.

I think that we are not in a hurry.

Yeah, of course! Just making sure what I should be doing :)

@codecov

codecov bot commented May 5, 2019

Codecov Report

Merging #565 into master will decrease coverage by 1.07%.
The diff coverage is 99.23%.

Impacted file tree graph

```diff
@@            Coverage Diff             @@
##           master     #565      +/-   ##
==========================================
- Coverage   97.93%   96.85%    -1.08%
==========================================
  Files          83       85        +2
  Lines        4784     5278      +494
==========================================
+ Hits         4685     5112      +427
- Misses         99      166       +67
```
| Impacted Files | Coverage Δ |
| --- | --- |
| imblearn/over_sampling/_smote.py | 97.75% <100%> (+0.54%) ⬆️ |
| imblearn/over_sampling/__init__.py | 100% <100%> (ø) ⬆️ |
| imblearn/utils/estimator_checks.py | 96.74% <100%> (+7.85%) ⬆️ |
| imblearn/over_sampling/tests/test_smote_n.py | 99.02% <99.02%> (ø) |
| imblearn/keras/tests/test_generator.py | 8.92% <0%> (-91.08%) ⬇️ |
| imblearn/tensorflow/_generator.py | 34.28% <0%> (-65.72%) ⬇️ |
| imblearn/keras/_generator.py | 40.35% <0%> (-58.16%) ⬇️ |
| imblearn/datasets/_zenodo.py | 93.75% <0%> (-2.92%) ⬇️ |
| imblearn/ensemble/_weight_boosting.py | 96.8% <0%> (-0.9%) ⬇️ |
| imblearn/ensemble/_forest.py | 98.09% <0%> (-0.07%) ⬇️ |

... and 64 more

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 68123d0...c7e7036. Read the comment docs.

@ThomasKluiters ThomasKluiters force-pushed the smote-pure-categorical branch from 7914e0c to bc96223 Compare May 7, 2019 14:56
@ThomasKluiters ThomasKluiters changed the title [WIP] Add SMOTE for pure categorical data [MRG] Add SMOTE for pure categorical data May 7, 2019
@ThomasKluiters ThomasKluiters force-pushed the smote-pure-categorical branch from db477d2 to db0f17d Compare May 7, 2019 20:40
@ThomasKluiters ThomasKluiters force-pushed the smote-pure-categorical branch from db0f17d to c7e7036 Compare May 7, 2019 20:41
@ThomasKluiters
Author

@chkoar could you take a look at my PR? :)

@glemaitre
Member

We will look at this before the release of 0.5

@glemaitre glemaitre added this to the 0.5 milestone Jun 11, 2019
@chkoar chkoar changed the title [MRG] Add SMOTE for pure categorical data [MRG] SMOTE for pure categorical data Jun 12, 2019
@chkoar chkoar changed the title [MRG] SMOTE for pure categorical data [MRG] ENH: SMOTE for pure categorical data Jun 12, 2019
@glemaitre glemaitre self-requested a review June 12, 2019 21:36
@glemaitre
Member

glemaitre commented Jun 12, 2019

I think that we should implement the VDM distance as in Chawla et al. instead of reusing the SMOTE-NC implementation.
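
For reference, this is my reading of the VDM from Chawla et al. (following Cost & Salzberg), so treat the notation as a paraphrase rather than a quote. For one categorical feature, the difference between two values is a sum over classes of the gap between their conditional class frequencies, and the distance between two samples accumulates these per-feature differences:

```latex
% delta between two values V_1, V_2 of a single feature;
% n = number of classes, C_{ji} = count of value V_j among class i,
% C_j = total count of value V_j, k a small constant (often k = 1)
\delta(V_1, V_2) = \sum_{i=1}^{n}
    \left| \frac{C_{1i}}{C_1} - \frac{C_{2i}}{C_2} \right|^{k}

% distance between two samples x and z, accumulated over features
% with an exponent r
\Delta(x, z) = \sum_{f=1}^{n_{\text{features}}} \delta(x_f, z_f)^{r}
```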

@ThomasKluiters
Author

I think that we should implement the VDM distance as in Chawla et al. instead of reusing the SMOTE-NC implementation.

Alright! I can do that - any other things you'd like to see change?

@glemaitre
Member

Alright! I can do that - any other things you'd like to see change?

I see that you introduced the parameter kind='regular'. We actually dropped those. I think that it is better to inherit directly from BaseSMOTE instead of SMOTENC.
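
To make the suggestion concrete, here is a rough sketch of what inheriting from BaseSMOTE could look like. The constructor parameters are an assumption mirroring the other SMOTE variants, and `_fit_resample` is only a placeholder, not a working implementation:

```python
from imblearn.over_sampling._smote import BaseSMOTE


class SMOTEN(BaseSMOTE):
    """SMOTE variant for data made exclusively of categorical features."""

    def __init__(self, sampling_strategy="auto", random_state=None,
                 k_neighbors=5, n_jobs=1):
        # note: no `kind` parameter, since it was dropped from the SMOTE variants
        super().__init__(
            sampling_strategy=sampling_strategy,
            random_state=random_state,
            k_neighbors=k_neighbors,
            n_jobs=n_jobs,
        )

    def _fit_resample(self, X, y):
        # find nearest neighbors with a categorical metric (e.g. VDM) and
        # generate new samples from the neighbors' category values
        raise NotImplementedError
```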

@ThomasKluiters
Author

Alright! I can do that - any other things you'd like to see change?

I see that you introduced the parameter kind='regular'. We actually dropped those. I think that it is better to inherit directly from BaseSMOTE instead of SMOTENC.

Alright, I will start working on these proposed changes. I'm a bit busy with other contributions.

@glemaitre glemaitre force-pushed the master branch 4 times, most recently from eae6c6b to ffdde80 Compare June 28, 2019 13:52
@chkoar chkoar changed the title [MRG] ENH: SMOTE for pure categorical data [WIP] ENH: SMOTE for pure categorical data Nov 13, 2019
@glemaitre glemaitre modified the milestones: 0.5, 0.7, 0.6 Nov 17, 2019
@glemaitre glemaitre modified the milestones: 0.6, 0.7 Dec 5, 2019
@glemaitre glemaitre modified the milestones: 0.7, 0.8 Nov 26, 2020
@glemaitre
Member

I am putting an implementation of the VDM distance below. I don't know how long it would take in practice.
We might want to cythonize it.

```python
import itertools

import numpy as np
from sklearn.preprocessing import OrdinalEncoder
from sklearn.utils.multiclass import unique_labels


class ValueDifferenceMetric:
    """Class implementing the Value Difference Metric.

    This metric computes the distance between samples containing
    only nominal categorical features.

    Parameters
    ----------
    r : int, default=2
        The exponent used when accumulating the distance. `r=1`
        corresponds to Manhattan and `r=2` corresponds to
        Euclidean distance.

    Attributes
    ----------
    classes_ : ndarray of shape (n_classes,)
        The classes.

    distance_per_class_ : dict
        Precomputed distance matrix for each feature and each class.
    """

    def __init__(self, r=2):
        self.r = r

    def fit(self, X, y):
        """Compute the necessary statistics from the training set.

        Parameters
        ----------
        X : ndarray of shape (n_samples, n_features)
            The input data.

        y : ndarray of shape (n_samples,)
            The target.

        Returns
        -------
        self
        """
        self.encoder_ = OrdinalEncoder(dtype=np.int32)
        self.classes_ = unique_labels(y)
        X_encoded = self.encoder_.fit_transform(X)
        n_features = X_encoded.shape[1]

        # compute the category counts per feature per class
        counts_per_class = {
            klass: [
                np.bincount(
                    X_encoded[y == klass, feature_idx],
                    minlength=len(self.encoder_.categories_[feature_idx]),
                ).astype(np.float64)
                for feature_idx in range(n_features)
            ]
            for klass in self.classes_
        }

        # compute the total category counts per feature
        total_counts = [
            np.zeros_like(counts_per_class[self.classes_[0]][feature_idx])
            for feature_idx in range(n_features)
        ]
        for feature_idx in range(n_features):
            for klass in self.classes_:
                total_counts[feature_idx] += counts_per_class[klass][feature_idx]

        # compute the category probabilities per feature per class
        for feature_idx in range(n_features):
            for klass in self.classes_:
                counts_per_class[klass][feature_idx] /= total_counts[feature_idx]

        # precompute the distance matrices for all category combinations
        self.distance_per_class_ = {
            klass: [
                np.abs(
                    np.subtract.outer(
                        counts_per_class[klass][feature_idx],
                        counts_per_class[klass][feature_idx],
                    )
                )
                for feature_idx in range(n_features)
            ]
            for klass in self.classes_
        }

        return self

    def pairwise(self, X):
        """Compute the VDM distance pairwise.

        Parameters
        ----------
        X : ndarray of shape (n_samples, n_features)
            The input data.

        Returns
        -------
        distance_matrix : ndarray of shape (n_samples, n_samples)
            The VDM pairwise distance.
        """
        X_encoded = self.encoder_.transform(X)
        n_samples = X_encoded.shape[0]
        distance_matrix = np.zeros((n_samples, n_samples), dtype=np.float64)
        for i, j in itertools.product(range(n_samples), repeat=2):
            if i < j:
                distance_matrix[i, j] = self._vdm(X_encoded[i], X_encoded[j])
                distance_matrix[j, i] = distance_matrix[i, j]
        return distance_matrix

    def _vdm(self, x, y):
        """Compute the VDM distance between 2 samples.

        Parameters
        ----------
        x, y : ndarray of shape (n_features,)
            Samples from which to compute the distance.

        Returns
        -------
        distance : float
            The VDM distance.
        """
        n_features, distance = len(x), 0
        for feature_idx in range(n_features):
            distance += sum(
                distance_matrix[feature_idx][x[feature_idx], y[feature_idx]]
                for distance_matrix in self.distance_per_class_.values()
            ) ** self.r
        return distance
```
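
As a quick sanity check, the class above can be exercised on a tiny made-up dataset (the data here is purely illustrative):

```python
import numpy as np

# two categorical features, two classes; the color feature perfectly
# separates the classes while the letter feature is uninformative
X = np.array(
    [["green", "A"], ["green", "B"], ["red", "A"], ["red", "B"]],
    dtype=object,
)
y = np.array([0, 0, 1, 1])

vdm = ValueDifferenceMetric(r=2).fit(X, y)
# symmetric (4, 4) matrix with a zero diagonal; samples differing only
# in the letter feature end up at distance 0, samples differing in
# color are the furthest apart
print(vdm.pairwise(X))
```
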
@chkoar
Member

chkoar commented Feb 12, 2021

We might want to cythonize it.

@glemaitre did you time it?

@glemaitre
Member

glemaitre commented Feb 12, 2021

@glemaitre did you time it?

Nope, but the for loop over itertools.product is already a good hint :)

I have to write a bit more equations, but I think that we can replace the entries of X_encoded by the corresponding rows of distance_per_class_ and then call cdist from SciPy to compute this part.

In this case, no need for Cython. Then we need to precompute the distances for X and pass them to the kNN of scikit-learn (a lot of stuff I never did in practice :))
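
A minimal sketch of that idea for a single feature, with made-up numbers; `proba` stands in for the per-category conditional probabilities that `fit` would compute:

```python
import numpy as np
from scipy.spatial.distance import cdist

# per-category probability vectors for one feature:
# 3 categories x 2 classes, each row summing to 1
proba = np.array([[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]])
# ordinally encoded values of that feature for 4 samples
codes = np.array([0, 2, 1, 0])

# replace each encoded value by its probability vector and let cdist
# compute this feature's contribution to the VDM in a single call
vectors = proba[codes]
feature_distance = cdist(vectors, vectors, metric="cityblock")
```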

@glemaitre
Member

glemaitre commented Feb 12, 2021

OK, here is an implementation that is about 20x faster:

```python
import numpy as np
from scipy.spatial import distance_matrix
from sklearn.preprocessing import OrdinalEncoder
from sklearn.utils.multiclass import unique_labels


class ValueDifferenceMetric:
    """Class implementing the Value Difference Metric.

    This metric computes the distance between samples containing
    only nominal categorical features.

    Parameters
    ----------
    r : int, default=2
        The exponent used when accumulating the distance. `r=1`
        corresponds to Manhattan and `r=2` corresponds to
        Euclidean distance.

    Attributes
    ----------
    classes_ : ndarray of shape (n_classes,)
        The classes.

    proba_per_class_ : list of ndarray of shape (n_categories, n_classes)
        List of length `n_features` containing the conditional
        probabilities for each category given a class.
    """

    def __init__(self, r=2):
        self.r = r

    def fit(self, X, y):
        """Compute the necessary statistics from the training set.

        Parameters
        ----------
        X : ndarray of shape (n_samples, n_features)
            The input data.

        y : ndarray of shape (n_samples,)
            The target.

        Returns
        -------
        self
        """
        self.encoder_ = OrdinalEncoder(dtype=np.int32)
        self.classes_ = unique_labels(y)
        X_encoded = self.encoder_.fit_transform(X)
        n_features = X_encoded.shape[1]

        # list of length n_features of ndarray (n_categories, n_classes)
        counts_per_class = [
            np.transpose(
                [
                    np.bincount(
                        X_encoded[y == klass, feature_idx],
                        minlength=len(self.encoder_.categories_[feature_idx]),
                    )
                    for klass in self.classes_
                ]
            )
            for feature_idx in range(n_features)
        ]

        # normalize the counts row-wise to get, for each category, the
        # probabilities conditioned on each class
        self.proba_per_class_ = [
            counts_per_class[feature_idx]
            / counts_per_class[feature_idx].sum(axis=1)[:, np.newaxis]
            for feature_idx in range(n_features)
        ]

        return self

    def pairwise(self, X1, X2=None):
        """Compute the VDM distance pairwise.

        Parameters
        ----------
        X1 : ndarray of shape (n_samples_X1, n_features)
            The input data.

        X2 : ndarray of shape (n_samples_X2, n_features), default=None
            The input data. If `None`, compute the distances between
            the samples of `X1`.

        Returns
        -------
        distance_matrix : ndarray of shape (n_samples_X1, n_samples_X2)
            The VDM pairwise distance.
        """
        X1_encoded = self.encoder_.transform(X1)
        n_samples_X1, n_features = X1_encoded.shape
        if X2 is not None:
            X2_encoded = self.encoder_.transform(X2)
            n_samples_X2 = X2_encoded.shape[0]
        else:
            n_samples_X2 = n_samples_X1

        distance = np.zeros(shape=(n_samples_X1, n_samples_X2))
        for feature_idx in range(n_features):
            # replace each encoded category by its probability vector
            proba_feature_X1 = self.proba_per_class_[feature_idx][
                X1_encoded[:, feature_idx]
            ]
            if X2 is not None:
                proba_feature_X2 = self.proba_per_class_[feature_idx][
                    X2_encoded[:, feature_idx]
                ]
            else:
                proba_feature_X2 = proba_feature_X1
            distance += distance_matrix(
                proba_feature_X1, proba_feature_X2, p=1
            ) ** self.r
        return distance
```
@glemaitre
Member

And the usage within SMOTE should be something like:

```python
from sklearn.neighbors import NearestNeighbors

metric = ValueDifferenceMetric(r=2)
metric.fit(X, y)
X_dist_fit = metric.pairwise(X)
X_query = metric.pairwise(X, X2)

nn = NearestNeighbors(metric="precomputed")
nn.fit(X_dist_fit, y)
# with a precomputed metric, kneighbors expects a matrix of shape
# (n_queries, n_fit_samples), hence the transpose
nn.kneighbors(X_query.T, return_distance=False)
```
@glemaitre
Member

The only issue is that we have some precomputed distance matrices of shape (n_samples, n_samples) that could potentially be quite large. But I don't think we can do better without cythonizing some code.
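
For a rough sense of scale: a float64 distance matrix for 20,000 samples takes 20,000² × 8 bytes, i.e. around 3.2 GB, so the precomputed approach is only comfortable for small-to-medium datasets.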
