Merged
Changes from 2 commits
47 changes: 35 additions & 12 deletions doc/under_sampling.rst
@@ -6,8 +6,25 @@ Under-sampling

.. currentmodule:: imblearn.under_sampling

You can refer to
:ref:`sphx_glr_auto_examples_under-sampling_plot_comparison_under_sampling.py`.
One way of handling imbalanced datasets is to reduce the number of observations from
the majority class or classes. The most well known algorithm in this group is random
Member

the majority class or classes

I think it is a bit weird to say majority classes since usually we would refer to a single class. What about saying something like:

"reduce the number of observations from all classes but the minority class (i.e. the one with the least number of observations)."

undersampling, where samples from the majority classes are removed at random.

But there are many other algorithms to help us reduce the number of observations in the
dataset. These algorithms can be grouped based on their undersampling strategy into:

- Prototype generation methods.
- Prototype selection methods.

And within the latter, we find:

- Controlled undersampling
- Cleaning methods

We will discuss the different algorithms throughout this document.

Refer to :ref:`sphx_glr_auto_examples_under-sampling_plot_comparison_under_sampling.py`
for a comparison of the different undersampling methodologies.
Member

I don't think that we need this sentence, because Sphinx will use the title, which already states "Compare under-sampling samplers".

Contributor Author

I'm not sure I understand this suggestion, but let me try a fix.

Contributor Author

OK, I got it.


.. _cluster_centroids:

@@ -16,7 +33,7 @@ Prototype generation

Given an original data set :math:`S`, prototype generation algorithms will
generate a new set :math:`S'` where :math:`|S'| < |S|` and :math:`S' \not\subset
S`. In other words, prototype generation technique will reduce the number of
S`. In other words, prototype generation techniques will reduce the number of
samples in the targeted classes but the remaining samples are generated --- and
not selected --- from the original set.
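
The definition above can be illustrated with a minimal sketch in plain Python. It is not how any imbalanced-learn sampler is implemented: the helper ``generate_prototypes``, its parameters, and the toy data are assumptions made for illustration. The majority class is replaced by the means of contiguous chunks of its sorted values, so the resulting prototypes are new points that do not appear in the original set.

```python
from statistics import mean

# Toy 1-D data set S as (feature, label) pairs; label 0 is the majority class.
S = [(1.0, 0), (2.0, 0), (10.0, 0), (11.0, 0), (5.0, 1), (6.0, 1)]


def generate_prototypes(samples, target_label, n_prototypes):
    """Hypothetical helper: replace the target class with chunk means."""
    values = sorted(x for x, y in samples if y == target_label)
    chunk = len(values) // n_prototypes
    prototypes = []
    for i in range(n_prototypes):
        # The last chunk absorbs any remainder.
        end = (i + 1) * chunk if i < n_prototypes - 1 else len(values)
        prototypes.append((mean(values[i * chunk:end]), target_label))
    # Samples from the other classes are left untouched.
    untouched = [(x, y) for x, y in samples if y != target_label]
    return prototypes + untouched


S_prime = generate_prototypes(S, target_label=0, n_prototypes=2)

print(len(S_prime) < len(S))      # |S'| < |S|
print(set(S_prime) <= set(S))     # False: S' is not a subset of S
```

Here the two generated majority prototypes are averages of existing samples, which is why the second check reports that the new set is not contained in the original one.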

@@ -61,16 +78,22 @@ original one.
Prototype selection
===================

On the contrary to prototype generation algorithms, prototype selection
algorithms will select samples from the original set :math:`S`. Therefore,
:math:`S'` is defined such as :math:`|S'| < |S|` and :math:`S' \subset S`.
Prototype selection algorithms will select samples from the original set :math:`S`,
generating a dataset :math:`S'`, where :math:`|S'| < |S|` and :math:`S' \subset S`. In
other words, :math:`S'` is a subset of :math:`S`.

Prototype selection algorithms can be divided into two groups: (i) controlled
under-sampling techniques and (ii) cleaning under-sampling techniques.

Controlled under-sampling methods reduce the number of observations in the majority
class or classes to an arbitrary number of samples specified by the user. Typically,
they reduce the number of observations to the number of samples observed in the
minority class.
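
As a rough illustration of controlled under-sampling, the following plain-Python sketch randomly keeps as many samples per class as the minority class contains. The function ``random_under_sample``, its ``seed`` parameter, and the toy data are hypothetical and are not the library's API; the sketch only shows that the output size is fixed in advance and that every retained sample comes from the original set.

```python
import random
from collections import Counter


def random_under_sample(samples, seed=0):
    """Hypothetical helper: keep n_min random samples from every class."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in samples:
        by_class.setdefault(y, []).append((x, y))
    # Target size per class: the number of samples in the smallest class.
    n_min = min(len(group) for group in by_class.values())
    selected = []
    for group in by_class.values():
        selected.extend(rng.sample(group, n_min))
    return selected


# Eight majority samples (label 0) and two minority samples (label 1).
S = [(float(i), 0) for i in range(8)] + [(100.0, 1), (101.0, 1)]
S_prime = random_under_sample(S)
counts = Counter(y for _, y in S_prime)

print(counts[0] == counts[1])    # classes are balanced
print(set(S_prime) <= set(S))    # True: S' is a subset of S
```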

In addition, these algorithms can be divided into two groups: (i) the
controlled under-sampling techniques and (ii) the cleaning under-sampling
techniques. The first group of methods allows for an under-sampling strategy in
which the number of samples in :math:`S'` is specified by the user. By
contrast, cleaning under-sampling techniques do not allow this specification
and are meant for cleaning the feature space.
In contrast, cleaning under-sampling techniques "clean" the feature space by removing
either "noisy" or "too easy to classify" observations, depending on the method. The
final number of observations in each class varies with the cleaning method and can't be
specified by the user.
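
To make the contrast concrete, here is a toy cleaning rule in plain Python. It is an assumption made for illustration and does not reproduce any specific imbalanced-learn sampler: a majority sample is dropped when its nearest neighbour belongs to another class. The helper ``clean_majority`` and the data are hypothetical. Note that the number of removed samples depends on the data, not on a value chosen by the user.

```python
def clean_majority(samples, majority_label):
    """Hypothetical helper: drop majority samples whose nearest neighbour
    (in 1-D feature space) carries a different label."""
    kept = []
    for i, (x, y) in enumerate(samples):
        if y == majority_label:
            others = [s for j, s in enumerate(samples) if j != i]
            _, neighbour_label = min(others, key=lambda s: abs(s[0] - x))
            if neighbour_label != majority_label:
                continue  # treated as "noisy": sits next to another class
        kept.append((x, y))
    return kept


# Label 0 is the majority class; the sample at 7.6 lies close to the minority.
S = [(1.0, 0), (2.0, 0), (3.0, 0), (4.0, 0), (7.6, 0), (8.0, 1), (9.0, 1)]
S_prime = clean_majority(S, majority_label=0)

print(len(S) - len(S_prime))     # number of samples removed by the rule
print(set(S_prime) <= set(S))    # True: cleaning only selects from S
```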

.. _controlled_under_sampling:
