[WIP] ENH: SMOTE for pure categorical data #565
Hello @ThomasKluiters! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:
Comment last updated at 2019-05-07 20:41:38 UTC |
Thanks for this. The CI is failing, though. |
Will do! Do you think a user guide will suffice for this to be merged, or is more refactoring required? |
As you said
I think that we are not in a hurry. |
Yeah, of course! Just making sure what I should be doing :) |
Codecov Report
    @@            Coverage Diff             @@
    ##           master     #565      +/-   ##
    ==========================================
    - Coverage   97.93%   96.85%   -1.08%
    ==========================================
      Files          83       85       +2
      Lines        4784     5278     +494
    ==========================================
    + Hits         4685     5112     +427
    - Misses         99      166      +67
Continue to review full report at Codecov.
|
@chkoar could you look into my PR :) |
We will look at this before the release of 0.5 |
I think that we should implement the VDM distance as in Chawla et al. instead of reusing the SMOTE-NC implementation. |
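For reference, a sketch of the VDM as I recall it from the SMOTE-N section of Chawla et al. The symbols below follow that paper as best I remember and should be checked against it: $C_1$ is the number of occurrences of value $V_1$, $C_{1i}$ is the number of those occurrences in class $i$, $n$ is the number of classes, $N$ is the number of features, and $k$ is a constant usually set to 1. The per-sample weights from the paper are assumed to be 1 and are left out.

```latex
\delta(V_1, V_2) = \sum_{i=1}^{n} \left| \frac{C_{1i}}{C_1} - \frac{C_{2i}}{C_2} \right|^{k}

\Delta(X, Y) = \sum_{j=1}^{N} \delta(x_j, y_j)^{r}
```

With $r = 1$ this accumulates like a Manhattan distance and with $r = 2$ like a (squared) Euclidean distance.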
Alright! I can do that - any other things you'd like to see change? |
I see that you introduced the parameter |
Alright, I will start working on these proposed changes. I'm a bit busy with other contributions. |
I am putting an implementation of the VDM distance here. I don't know how long it would take in practice.

    import itertools

    import numpy as np
    from sklearn.preprocessing import OrdinalEncoder
    from sklearn.utils.multiclass import unique_labels


    class ValueDifferenceMetric:
        """Class implementing the Value Difference Metric.

        This metric computes the distance between samples containing only
        nominal categorical features.

        Parameters
        ----------
        r : int, default=2
            The exponent used when accumulating the distance. `r=1`
            corresponds to Manhattan and `r=2` corresponds to Euclidean
            distance.

        Attributes
        ----------
        classes_ : ndarray of shape (n_classes,)
            The classes.

        distance_per_class_ : dict
            Precomputed distance matrix for each feature and each class.
        """

        def __init__(self, r=2):
            self.r = r

        def fit(self, X, y):
            """Compute the necessary statistics from the training set.

            Parameters
            ----------
            X : ndarray of shape (n_samples, n_features)
                The input data.

            y : ndarray of shape (n_samples,)
                The target.

            Returns
            -------
            self
            """
            self.encoder_ = OrdinalEncoder(dtype=np.int32)
            self.classes_ = unique_labels(y)
            X_encoded = self.encoder_.fit_transform(X)
            n_features = X_encoded.shape[1]

            # compute the category counts per feature per class
            counts_per_class = {
                klass: [
                    np.bincount(
                        X_encoded[y == klass, feature_idx],
                        minlength=len(self.encoder_.categories_[feature_idx])
                    ).astype(np.float64)
                    for feature_idx in range(n_features)
                ]
                for klass in self.classes_
            }

            # compute the total category counts per feature
            total_counts = [
                np.zeros_like(counts_per_class[self.classes_[0]][feature_idx])
                for feature_idx in range(n_features)
            ]
            for feature_idx in range(n_features):
                for klass in self.classes_:
                    total_counts[feature_idx] += counts_per_class[klass][feature_idx]

            # compute the category probabilities per feature per class
            for feature_idx in range(n_features):
                for klass in self.classes_:
                    counts_per_class[klass][feature_idx] /= total_counts[feature_idx]

            # compute the precomputed distance matrices for all combinations
            self.distance_per_class_ = {
                klass: [
                    np.abs(
                        np.subtract.outer(
                            counts_per_class[klass][feature_idx],
                            counts_per_class[klass][feature_idx]
                        )
                    )
                    for feature_idx in range(n_features)
                ]
                for klass in self.classes_
            }
            return self

        def pairwise(self, X):
            """Compute the VDM distance pairwise.

            Parameters
            ----------
            X : ndarray of shape (n_samples, n_features)
                The input data.

            Returns
            -------
            distance_matrix : ndarray of shape (n_samples, n_samples)
                The VDM pairwise distance.
            """
            X_encoded = self.encoder_.transform(X)
            n_samples = X_encoded.shape[0]
            distance_matrix = np.zeros(
                shape=(n_samples, n_samples), dtype=np.float64
            )
            # only the upper triangle is computed; the matrix is symmetric
            for i, j in itertools.product(range(n_samples), repeat=2):
                if i < j:
                    distance_matrix[i, j] = self._vdm(
                        X_encoded[i], X_encoded[j]
                    )
                    distance_matrix[j, i] = distance_matrix[i, j]
            return distance_matrix

        def _vdm(self, x, y):
            """Compute the VDM distance between 2 samples.

            Parameters
            ----------
            x, y : ndarray of shape (n_features,)
                Samples from which to compute the distance.

            Returns
            -------
            distance : float
                The VDM distance.
            """
            n_features, distance = len(x), 0
            for feature_idx in range(n_features):
                distance += sum([
                    distance_matrix[feature_idx][x[feature_idx], y[feature_idx]]
                    for distance_matrix in self.distance_per_class_.values()
                ]) ** self.r
            return distance
|
@glemaitre did you time it? |
@glemaitre did you time it? Nope, but the I have to write out a few more equations, but I think that we can replace the entry of In this case, there is no need for Cython. Then we need to precompute the distances for X and pass them to the kNN of scikit-learn (a lot of stuff I never did in practice) :) |
OK, this is the 20x faster implementation:

    import numpy as np
    from scipy.spatial import distance_matrix
    from sklearn.preprocessing import OrdinalEncoder
    from sklearn.utils.multiclass import unique_labels


    class ValueDifferenceMetric:
        """Class implementing the Value Difference Metric.

        This metric computes the distance between samples containing only
        nominal categorical features.

        Parameters
        ----------
        r : int, default=1
            The exponent used when accumulating the distance. `r=1`
            corresponds to Manhattan and `r=2` corresponds to Euclidean
            distance.

        Attributes
        ----------
        classes_ : ndarray of shape (n_classes,)
            The classes.

        proba_per_class_ : list of ndarray of shape (n_categories, n_classes)
            List of length `n_features` containing the conditional
            probability of each class given a category.
        """

        def __init__(self, r=1):
            self.r = r

        def fit(self, X, y):
            """Compute the necessary statistics from the training set.

            Parameters
            ----------
            X : ndarray of shape (n_samples, n_features)
                The input data.

            y : ndarray of shape (n_samples,)
                The target.

            Returns
            -------
            self
            """
            self.encoder_ = OrdinalEncoder(dtype=np.int32)
            self.classes_ = unique_labels(y)
            X_encoded = self.encoder_.fit_transform(X)
            n_features = X_encoded.shape[1]

            # list of length n_features of ndarray (n_categories, n_classes)
            counts_per_class = [
                np.transpose(
                    [
                        np.bincount(
                            X_encoded[y == klass, feature_idx],
                            minlength=len(self.encoder_.categories_[feature_idx])
                        )
                        for klass in self.classes_
                    ]
                )
                for feature_idx in range(n_features)
            ]

            # list of length n_features of ndarray (n_categories, n_classes);
            # each row is normalized by the total count of its category
            proba_per_class = [
                (counts_per_class[feature_idx] /
                 counts_per_class[feature_idx].sum(axis=1)[:, np.newaxis])
                for feature_idx in range(n_features)
            ]
            self.proba_per_class_ = proba_per_class
            return self

        def pairwise(self, X1, X2=None):
            """Compute the VDM distance pairwise.

            Parameters
            ----------
            X1 : ndarray of shape (n_samples_X1, n_features)
                The input data.

            X2 : ndarray of shape (n_samples_X2, n_features), default=None
                The input data. If None, distances are computed between the
                samples of `X1`.

            Returns
            -------
            distance_matrix : ndarray of shape (n_samples_X1, n_samples_X2)
                The VDM pairwise distance.
            """
            X1_encoded = self.encoder_.transform(X1)
            # read the shape of X1 first: it is needed when X2 is None
            n_samples_X1, n_features = X1_encoded.shape
            if X2 is not None:
                X2_encoded = self.encoder_.transform(X2)
                n_samples_X2 = X2_encoded.shape[0]
            else:
                n_samples_X2 = n_samples_X1

            distance = np.zeros(shape=(n_samples_X1, n_samples_X2))
            for feature_idx in range(n_features):
                proba_feature_X1 = self.proba_per_class_[feature_idx][
                    X1_encoded[:, feature_idx]
                ]
                if X2 is not None:
                    proba_feature_X2 = self.proba_per_class_[feature_idx][
                        X2_encoded[:, feature_idx]
                    ]
                else:
                    proba_feature_X2 = proba_feature_X1
                distance += distance_matrix(
                    proba_feature_X1, proba_feature_X2, p=1) ** self.r
            return distance
|
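To make the quantity being computed concrete, here is a dependency-free sketch on a toy dataset. It is not the PR's code: the helper names `class_probabilities` and `vdm` and the toy data are made up for illustration. It assumes the same convention as the class in this thread, i.e. sum the absolute differences of P(class | category) over the classes, raise that to `r`, and add up over the features.

```python
from collections import Counter, defaultdict


def class_probabilities(column, y):
    """Return {category: {class: P(class | category)}} for one feature."""
    counts = defaultdict(Counter)
    for value, label in zip(column, y):
        counts[value][label] += 1
    classes = sorted(set(y))
    return {
        value: {c: per_class[c] / sum(per_class.values()) for c in classes}
        for value, per_class in counts.items()
    }


def vdm(a, b, tables, r=2):
    """VDM between two samples: sum over features of delta ** r."""
    total = 0.0
    for feature_idx, table in enumerate(tables):
        pa, pb = table[a[feature_idx]], table[b[feature_idx]]
        # delta: absolute difference of class probabilities, summed over classes
        delta = sum(abs(pa[c] - pb[c]) for c in pa)
        total += delta ** r
    return total


# hypothetical toy data: two nominal features, binary target
X = [("red", "small"), ("red", "large"), ("blue", "small"), ("blue", "large")]
y = [0, 0, 0, 1]

tables = [class_probabilities([row[f] for row in X], y) for f in range(2)]
d_same = vdm(X[0], X[0], tables)   # identical samples
d_diff = vdm(X[0], X[3], tables)   # samples differing on both features
```

Here "red" and "small" only occur with class 0, while "blue" and "large" are split evenly, so each feature contributes a delta of 1 between `X[0]` and `X[3]`.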
And the use case within the SMOTE should be something like:

    from sklearn.neighbors import NearestNeighbors

    metric = ValueDifferenceMetric(r=2)
    metric.fit(X, y)
    X_dist_fit = metric.pairwise(X)
    # X2 holds the query samples
    X_query = metric.pairwise(X, X2)

    nn = NearestNeighbors(metric="precomputed")
    nn.fit(X_dist_fit, y)
    # kneighbors expects shape (n_queries, n_indexed), hence the transpose
    nn.kneighbors(X_query.T, return_distance=False)
|
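The shape convention for `metric="precomputed"` is easy to get wrong, so here is a small standalone sketch. It uses toy 1-D points, with the absolute difference standing in for the VDM distance (an assumption made only to keep the example short). `fit` takes the square matrix between the indexed samples and `kneighbors` takes a matrix of shape `(n_queries, n_indexed)`.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# toy 1-D points; |a - b| stands in for the VDM distance
X_fit = np.array([0.0, 1.0, 3.0, 7.0])
X_query = np.array([0.9, 6.0])

# square matrix between the fitted samples: shape (n_fit, n_fit)
dist_fit = np.abs(X_fit[:, None] - X_fit[None, :])
# rectangular matrix, queries in rows: shape (n_query, n_fit)
dist_query = np.abs(X_query[:, None] - X_fit[None, :])

nn = NearestNeighbors(n_neighbors=2, metric="precomputed")
nn.fit(dist_fit)

# indices into X_fit of the 2 closest fitted samples for each query
neighbors = nn.kneighbors(dist_query, return_distance=False)
```

Each row of `neighbors` holds indices into the fitted set, ordered from closest to farthest.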
The only issue is that we have some precomputed distance matrices of |
Reference Issue
What does this implement/fix? Explain your changes.
I've seen some issues regarding the need for SMOTE with just categorical data.
Therefore I've added a class SMOTEN which operates on purely categorical data.
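For readers unfamiliar with the idea, here is a rough sketch of how a purely categorical synthetic sample can be built. It follows my understanding of SMOTE-N in Chawla et al., where each feature of the new sample is the majority vote among the sample and its nearest neighbours. The function name `majority_vote_sample` and the toy data are hypothetical and this is not necessarily how `SMOTEN` in this PR does it.

```python
from collections import Counter


def majority_vote_sample(sample, neighbors):
    """Build one synthetic sample by per-feature majority vote."""
    candidates = [sample] + list(neighbors)
    synthetic = []
    for feature_idx in range(len(sample)):
        values = [row[feature_idx] for row in candidates]
        # keep the most frequent category; ties are not handled specially
        synthetic.append(Counter(values).most_common(1)[0][0])
    return tuple(synthetic)


# hypothetical minority sample and its nearest neighbours
sample = ("red", "small", "round")
neighbors = [
    ("red", "large", "square"),
    ("blue", "large", "round"),
    ("red", "large", "round"),
]
new_sample = majority_vote_sample(sample, neighbors)
```

The neighbours themselves would come from a nearest-neighbours search under a categorical distance such as the VDM.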
Any other comments?
This is not the final version, but I'd love some feedback on the code. I understand there might be more elegant ways to implement this class, since there is some code duplication between SMOTENC and SMOTEN. I can look into this.