Conversation

@dvro
Member

@dvro dvro commented May 29, 2016

Added the Instance Hardness Threshold under-sampling method.

  • unbalanced_dataset/under_sampling/instance_hardness_threshold.py
  • example/under-sampling/plot_instance_hardness_threshold.py

[Image: iht]

The higher the threshold, the more samples of the majority class are removed.
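
As a minimal sketch of the thresholding step only (not the code in this PR): assume each sample already has an estimated probability of belonging to its true class, for example from cross-validated predictions. The helper name `iht_mask` and the toy arrays are hypothetical and only illustrate how a higher threshold removes more majority samples.

```python
import numpy as np


def iht_mask(probabilities, y, min_class, threshold):
    """Boolean mask of the samples to keep (hypothetical helper).

    Minority samples are always kept; a majority sample is kept only if
    the estimated probability of its true class is >= threshold.
    """
    probabilities = np.asarray(probabilities, dtype=float)
    y = np.asarray(y)
    return np.logical_or(probabilities >= threshold, y == min_class)


# Toy data (made up): class 1 is the minority class.
y = np.array([0, 0, 0, 0, 0, 0, 1, 1])
probabilities = np.array([0.9, 0.8, 0.6, 0.4, 0.2, 0.1, 0.3, 0.7])

kept_low = iht_mask(probabilities, y, min_class=1, threshold=0.3)
kept_high = iht_mask(probabilities, y, min_class=1, threshold=0.7)
```

With these made-up values, raising the threshold from 0.3 to 0.7 drops two more majority samples while both minority samples stay.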

@glemaitre glemaitre changed the title Instance hardness [WIP] Instance hardness May 30, 2016
@glemaitre
Member

I saw some discrepancies with the PEP8 standard. Could you check for that?

threshold : float, optional (default=0.3)
Threshold to be used when excluding samples (0.01 to 0.99).

mode: str, optional (default='maj')
Member

I would rename mode to kind_sel. We already use this keyword elsewhere in the API for an almost identical purpose; I think it is in the ENN.

Member Author

done.

@glemaitre
Member

yep you can push your changes

# Fit and transform x to visualise inside a 2D feature space
X_vis = pca.fit_transform(X)

# Apply the random under-sampling
Member

You have some blank lines here. You should probably remove them.

@glemaitre
Member

@dvro Do you think that we could replace the parameter threshold with the parameter ratio? In fact, ratio could determine the probability threshold to apply in order to obtain the desired ratio between the minority and majority classes.

In this case I would also drop the option of removing instances from the minority class.

@fmfn
Collaborator

fmfn commented May 30, 2016

"Do you think that we could replace the parameter threshold by the parameter ratio."

I came here to say the same thing. Can we make the API consistent? ratio is the name used so far; would it make sense to rename threshold?

@glemaitre
Member

@fmfn I agree but I would even enforce the same behaviour for this parameter: number of minority samples over number of majority samples.

@fmfn
Collaborator

fmfn commented May 30, 2016

@glemaitre Yes! Much better indeed.

@dvro
Member Author

dvro commented May 31, 2016

I can rename it to ratio, but IMO ratio = #maj / #min (or the other way around), whereas threshold is the probability threshold below which samples are removed. A user might think that setting ratio=0.5 would produce 2*X samples of class A and X samples of class B, when that is not the case. (I don't know if I am being clear about this.)

What do you guys think?
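
To illustrate the concern with made-up numbers: the sketch below applies the same probability threshold to two toy datasets and computes the resulting minority-over-majority ratio. The helper `resulting_ratio` and both probability arrays are hypothetical; the point is only that a fixed threshold does not pin down the class ratio.

```python
import numpy as np


def resulting_ratio(probabilities, y, min_class, threshold):
    """Minority count over the remaining majority count (hypothetical helper)."""
    probabilities = np.asarray(probabilities, dtype=float)
    y = np.asarray(y)
    is_min = y == min_class
    n_min = np.count_nonzero(is_min)
    # Majority samples survive only if their probability reaches the threshold.
    n_maj_kept = np.count_nonzero(np.logical_and(~is_min, probabilities >= threshold))
    return n_min / n_maj_kept if n_maj_kept else float("inf")


y = np.array([0, 0, 0, 0, 1, 1])
easy = np.array([0.9, 0.9, 0.8, 0.8, 0.6, 0.6])  # majority samples are easy
hard = np.array([0.9, 0.4, 0.3, 0.2, 0.6, 0.6])  # majority samples are hard

ratio_easy = resulting_ratio(easy, y, min_class=1, threshold=0.5)
ratio_hard = resulting_ratio(hard, y, min_class=1, threshold=0.5)
```

Here the same threshold of 0.5 keeps four majority samples in the first case and one in the second, so the output ratio depends on the data and not just on the parameter value.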

@fmfn
Collaborator

fmfn commented May 31, 2016

I agree that simply renaming it would cause confusion. I suppose what I meant was: is there a transformation that can be applied to threshold that would allow us to translate it to a ratio?

@glemaitre
Member

glemaitre commented May 31, 2016

For this case I would scan all the probability levels and pick the one for which the ratio is nearest to what we want. Roughly, without trying it:

ratios = np.zeros(100, )
probs = np.linspace(0., 1., 100)
for i, p in enumerate(probs):
    ratios[i] = self.stats_c_[self.min_c_] / np.count_nonzero(
        np.logical_or(probabilities >= p, y == self.min_c_))
ratios = np.abs(ratios - self.ratio)
threshold = probs[ratios.argmin()]

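
A standalone, runnable take on the same idea might look like the sketch below. It is not the PR's implementation: the function name `threshold_for_ratio`, its parameters, and the toy data are assumptions. One deliberate difference from the rough snippet is that the denominator counts only the majority samples that survive, so that the ratio matches "number of minority samples over number of majority samples".

```python
import numpy as np


def threshold_for_ratio(probabilities, y, min_class, ratio, n_steps=100):
    """Pick the probability threshold whose resulting ratio is closest to `ratio`.

    Hypothetical helper: ratio is n_minority / n_majority_kept.
    """
    probabilities = np.asarray(probabilities, dtype=float)
    y = np.asarray(y)
    is_maj = y != min_class
    n_min = np.count_nonzero(~is_maj)

    probs = np.linspace(0.0, 1.0, n_steps)
    # Thresholds that remove every majority sample keep an infinite ratio.
    ratios = np.full(n_steps, np.inf)
    for i, p in enumerate(probs):
        n_maj_kept = np.count_nonzero(is_maj & (probabilities >= p))
        if n_maj_kept > 0:
            ratios[i] = n_min / n_maj_kept

    # argmin returns the first (lowest) threshold among ties.
    return probs[np.abs(ratios - ratio).argmin()]


# Toy data (made up): class 1 is the minority class.
y = np.array([0, 0, 0, 0, 0, 0, 1, 1])
probabilities = np.array([0.9, 0.8, 0.6, 0.4, 0.2, 0.1, 0.3, 0.7])

t = threshold_for_ratio(probabilities, y, min_class=1, ratio=1.0)
n_maj_kept = np.count_nonzero((y == 0) & (probabilities >= t))
```

On this toy data a target ratio of 1.0 selects a threshold between 0.6 and 0.8, which leaves two majority samples against the two minority samples.
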
@dvro dvro closed this Jun 17, 2016
3 participants