Commit 1e23805

DOC: Correct confusing docs of n_values in OneHotEncoder
1 parent 9fe25df commit 1e23805

2 files changed: +16 -2 lines changed

doc/modules/preprocessing.rst

Lines changed: 12 additions & 0 deletions
@@ -411,6 +411,18 @@ Then we fit the estimator, and transform a data point.
 In the result, the first two numbers encode the gender, the next set of three
 numbers the continent and the last four the web browser.
 
+Note that, if there is a possibility that the training data might have missing categorical
+features, one has to explicitly set ``n_values``. For example,
+
+    >>> enc = preprocessing.OneHotEncoder(n_values=[2, 3, 4])
+    >>> # Note that there are missing categorical values for the 2nd and 3rd
+    >>> # features
+    >>> enc.fit([[1, 2, 3], [0, 2, 0]]) # doctest: +ELLIPSIS
+    OneHotEncoder(categorical_features='all', dtype=<... 'numpy.float64'>,
+           handle_unknown='error', n_values=[2, 3, 4], sparse=True)
+    >>> enc.transform([[1, 0, 0]]).toarray()
+    array([[ 0., 1., 1., 0., 0., 1., 0., 0., 0.]])
+
 See :ref:`dict_feature_extraction` for categorical features that are represented
 as a dict, not as integers.
 
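The column layout in the added doctest can be checked by hand. The following is a minimal sketch in plain Python, not scikit-learn's implementation: `one_hot_row` is a hypothetical helper, and it assumes each feature `i` gets a block of `n_values[i]` columns, with the blocks laid out one after another.

```python
def one_hot_row(row, n_values):
    """Encode one row of integer categories with fixed per-feature sizes.

    Hypothetical helper for illustration only; it mirrors the layout
    described in the docs, not scikit-learn's actual code.
    """
    encoded = []
    for value, size in zip(row, n_values):
        # Each feature value must lie in range(size).
        if value not in range(size):
            raise ValueError(f"value {value} is outside range({size})")
        block = [0.0] * size   # one column per possible category
        block[value] = 1.0     # mark the observed category
        encoded.extend(block)
    return encoded


n_values = [2, 3, 4]
encoded = one_hot_row([1, 0, 0], n_values)
print(encoded)  # [0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
```

The result has `sum(n_values) == 9` columns regardless of which category values happened to appear in the training data, which is the reason for setting ``n_values`` explicitly.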

sklearn/preprocessing/data.py

Lines changed: 4 additions & 2 deletions
@@ -1650,8 +1650,10 @@ class OneHotEncoder(BaseEstimator, TransformerMixin):
 Number of values per feature.
 
 - 'auto' : determine value range from training data.
-- int : maximum value for all features.
-- array : maximum value per feature.
+- int : number of categorical values per feature.
+        Each feature value should be in ``range(n_values)``
+- array : ``n_values[i]`` is the number of categorical values in
+          ``X[:, i]``. Each feature value should be in ``range(n_values[i])``
 
 categorical_features: "all" or array of indices or mask
     Specify what features are treated as categorical.
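The difference between the old wording ("maximum value") and the corrected wording ("number of categorical values") is an off-by-one: a feature whose largest value is 3 needs ``n_values`` of 4. Here is a minimal sketch of that rule in plain Python, assuming only the ``range(n_values[i])`` constraint stated in the corrected docstring. The helpers `expand_n_values` and `out_of_range` are hypothetical and not part of scikit-learn.

```python
def expand_n_values(n_values, n_features):
    """Turn an int or a sequence into one size per feature (hypothetical helper)."""
    if isinstance(n_values, int):
        # A single int applies the same number of categories to every feature.
        return [n_values] * n_features
    sizes = list(n_values)
    if len(sizes) != n_features:
        raise ValueError("n_values must have one entry per feature")
    return sizes


def out_of_range(X, n_values):
    """List (row, column, value) for entries outside range(n_values[column])."""
    sizes = expand_n_values(n_values, len(X[0]))
    return [
        (i, j, value)
        for i, row in enumerate(X)
        for j, value in enumerate(row)
        if value not in range(sizes[j])
    ]


X = [[1, 2, 3], [0, 2, 0]]
int_sizes = expand_n_values(4, 3)           # [4, 4, 4]
bad_with_array = out_of_range(X, [2, 3, 4])  # [] : every value fits
bad_with_small_int = out_of_range(X, 3)      # [(0, 2, 3)] : 3 is not in range(3)
```

Reading ``n_values=3`` as "maximum value 3" would wrongly accept the entry flagged in the last line, which is the confusion the docstring change addresses.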
