diff --git a/doc/modules/classes.rst b/doc/modules/classes.rst
Lines changed: 1 addition & 0 deletions
diff --git a/doc/modules/linear_model.rst b/doc/modules/linear_model.rst
Lines changed: 77 additions & 7 deletions
diff --git a/doc/whats_new.rst b/doc/whats_new.rst
Lines changed: 3 additions & 0 deletions
diff --git a/examples/linear_model/plot_huber_vs_ridge.py b/examples/linear_model/plot_huber_vs_ridge.py
Lines changed: 65 additions & 0 deletions
diff --git a/examples/linear_model/plot_robust_fit.py b/examples/linear_model/plot_robust_fit.py
Lines changed: 16 additions & 8 deletions
diff --git a/sklearn/linear_model/__init__.py b/sklearn/linear_model/__init__.py
Lines changed: 2 additions & 1 deletion
@@ -689,6 +689,7 @@ Kernels:
  linear_model.BayesianRidge
  linear_model.ElasticNet
  linear_model.ElasticNetCV
+ linear_model.HuberRegressor
  linear_model.Lars
  linear_model.LarsCV
  linear_model.Lasso
 
@@ -902,15 +902,24 @@ in these settings.
 
 .. topic:: **Trade-offs: which estimator?**
 
- Scikit-learn provides 2 robust regression estimators:
- :ref:`RANSAC <ransac_regression>` and
- :ref:`Theil Sen <theil_sen_regression>`
+ Scikit-learn provides 3 robust regression estimators:
+ :ref:`RANSAC <ransac_regression>`,
+ :ref:`Theil Sen <theil_sen_regression>` and
+ :ref:`HuberRegressor <huber_regression>`
 
- * :ref:`RANSAC <ransac_regression>` is faster, and scales much better
- with the number of samples
+ * :ref:`HuberRegressor <huber_regression>` should be faster than
+ :ref:`RANSAC <ransac_regression>` and :ref:`Theil Sen <theil_sen_regression>`
+ unless the number of samples is very large, i.e. ``n_samples`` >> ``n_features``.
+ This is because :ref:`RANSAC <ransac_regression>` and :ref:`Theil Sen <theil_sen_regression>`
+ fit on smaller subsets of the data. However, both :ref:`Theil Sen <theil_sen_regression>`
+ and :ref:`RANSAC <ransac_regression>` are unlikely to be as robust as
+ :ref:`HuberRegressor <huber_regression>` for the default parameters.
 
- * :ref:`RANSAC <ransac_regression>` will deal better with large
- outliers in the y direction (most common situation)
+ * :ref:`RANSAC <ransac_regression>` is faster than :ref:`Theil Sen <theil_sen_regression>`
+ and scales much better with the number of samples
+
+ * :ref:`RANSAC <ransac_regression>` will deal better with large
+ outliers in the y direction (most common situation)
 
  * :ref:`Theil Sen <theil_sen_regression>` will cope better with
  medium-size outliers in the X direction, but this property will
@@ -1050,6 +1059,67 @@ considering only a random subset of all possible combinations.
 
  .. [#f2] T. Kärkkäinen and S. Äyrämö: `On Computation of Spatial Median for Robust Data Mining. <http://users.jyu.fi/~samiayr/pdf/ayramo_eurogen05.pdf>`_
 
+.. _huber_regression:
+
+Huber Regression
+----------------
+
+The :class:`HuberRegressor` differs from :class:`Ridge` because it applies a
+linear loss to samples that are classified as outliers.
+A sample is classified as an inlier if the absolute error of that sample is
+less than a certain threshold. It differs from :class:`TheilSenRegressor`
+and :class:`RANSACRegressor` because it does not ignore the effect of the outliers
+but gives them a smaller weight.
+
+.. figure:: ../auto_examples/linear_model/images/plot_huber_vs_ridge_001.png
+ :target: ../auto_examples/linear_model/plot_huber_vs_ridge.html
+ :align: center
+ :scale: 50%
+
+The loss function that :class:`HuberRegressor` minimizes is given by
+
+.. math::
+
+ \min_{w, \sigma} {\sum_{i=1}^n\left(\sigma + H_m\left(\frac{X_{i}w - y_{i}}{\sigma}\right)\sigma\right) + \alpha {||w||_2}^2}
+
+where
+
+.. math::
+
+ H_m(z) = \begin{cases}
+ z^2, & \text{if } |z| < \epsilon, \\
+ 2\epsilon|z| - \epsilon^2, & \text{otherwise}
+ \end{cases}
+
+It is advised to set the parameter ``epsilon`` to 1.35, which gives about 95% statistical
+efficiency when the errors are normally distributed.
+
+Notes
+-----
+The :class:`HuberRegressor` differs from using :class:`SGDRegressor` with loss set to ``huber``
+in the following ways.
+
+- :class:`HuberRegressor` is scaling invariant. Once ``epsilon`` is set, scaling ``X`` and ``y``
+ down or up by different values produces the same robustness to outliers as before.
+ With :class:`SGDRegressor`, by contrast, ``epsilon`` has to be set again when ``X`` and ``y`` are
+ scaled.
+
+- :class:`HuberRegressor` should be more efficient to use on data with a small number of
+ samples, while :class:`SGDRegressor` needs a number of passes over the training data to
+ produce the same robustness.
+
+.. topic:: Examples:
+
+ * :ref:`example_linear_model_plot_huber_vs_ridge.py`
+
+.. topic:: References:
+
+ .. [#f1] Peter J. Huber, Elvezio M. Ronchetti: Robust Statistics, Concomitant scale estimates, pg 172
+
+Also, this estimator is different from the R implementation of Robust Regression
+(http://www.ats.ucla.edu/stat/r/dae/rreg.htm) because the R implementation performs
+weighted least squares, with the weight given to each sample depending on how much its
+residual exceeds a certain threshold.
+
 .. _polynomial_regression:
 
 Polynomial regression: extending linear models with basis functions
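
The loss defined in the Huber Regression section above can be sketched in a few lines of plain Python. This is only an illustration of the formula as written in the docs, not the code in `sklearn/linear_model/huber.py`. The names `huber_h` and `huber_objective` and the toy data are hypothetical, and the intercept is left out because the formula has none.

```python
def huber_h(z, epsilon=1.35):
    """H_m(z) as documented: quadratic for |z| < epsilon, linear otherwise."""
    if abs(z) < epsilon:
        return z * z
    return 2.0 * epsilon * abs(z) - epsilon * epsilon


def huber_objective(w, sigma, X, y, epsilon=1.35, alpha=0.0):
    """Sum of sigma + H_m((X_i w - y_i) / sigma) * sigma, plus alpha * ||w||_2^2."""
    total = 0.0
    for row, target in zip(X, y):
        residual = sum(x_j * w_j for x_j, w_j in zip(row, w)) - target
        total += sigma + huber_h(residual / sigma, epsilon) * sigma
    return total + alpha * sum(w_j * w_j for w_j in w)


# Toy data (made up for illustration): the last sample is a gross outlier.
X = [[0.0], [1.0], [2.0], [3.0]]
y = [0.1, 0.9, 2.1, 30.0]
w, sigma = [1.0], 0.5

base = huber_objective(w, sigma, X, y)

# Rescale y, w and sigma by the same factor c, with alpha = 0.
c = 10.0
scaled = huber_objective([c * w[0]], c * sigma, X, [c * t for t in y])
print(base, scaled / c)
```

Since the residuals enter only through their ratio to `sigma`, rescaling `y`, `w` and `sigma` by the same factor rescales the objective by that factor when `alpha` is 0. That is one way to read the scale-invariance note: for a fixed `epsilon`, the split between inliers and outliers does not change.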
 
@@ -35,6 +35,9 @@ New features
  - Added new supervised learning algorithm: :ref:`Multi-layer Perceptron <multilayer_perceptron>`
  (`#3204 <https://github.com/scikit-learn/scikit-learn/pull/3204>`_) by `Issam H. Laradji`_
 
+ - Added :class:`linear_model.HuberRegressor`, a linear model robust to outliers.
+ (`#5291 <https://github.com/scikit-learn/scikit-learn/pull/5291>`_) by `Manoj Kumar`_.
+
 Enhancements
 ............
 
 
@@ -0,0 +1,65 @@
+"""
+=======================================================
+HuberRegressor vs Ridge on dataset with strong outliers
+=======================================================
+
+Fit Ridge and HuberRegressor on a dataset with outliers.
+
+The example shows that the predictions of Ridge are strongly influenced
+by the outliers present in the dataset. The Huber regressor is less
+influenced by the outliers since the model uses the linear loss for these.
+As the parameter epsilon is increased for the Huber regressor, the decision
+function approaches that of Ridge.
+"""
+
+# Authors: Manoj Kumar mks542@nyu.edu
+# License: BSD 3 clause
+
+print(__doc__)
+
+import numpy as np
+import matplotlib.pyplot as plt
+
+from sklearn.datasets import make_regression
+from sklearn.linear_model import HuberRegressor, Ridge
+
+# Generate toy data.
+rng = np.random.RandomState(0)
+X, y = make_regression(n_samples=20, n_features=1, random_state=0, noise=4.0,
+ bias=100.0)
+
+# Add four strong outliers to the dataset.
+X_outliers = rng.normal(0, 0.5, size=(4, 1))
+y_outliers = rng.normal(0, 2.0, size=4)
+X_outliers[:2, :] += X.max() + X.mean() / 4.
+X_outliers[2:, :] += X.min() - X.mean() / 4.
+y_outliers[:2] += y.min() - y.mean() / 4.
+y_outliers[2:] += y.max() + y.mean() / 4.
+X = np.vstack((X, X_outliers))
+y = np.concatenate((y, y_outliers))
+plt.plot(X, y, 'b.')
+
+# Fit the huber regressor over a series of epsilon values.
+colors = ['r-', 'b-', 'y-', 'm-']
+
+x = np.linspace(X.min(), X.max(), 7)
+epsilon_values = [1.35, 1.5, 1.75, 1.9]
+for k, epsilon in enumerate(epsilon_values):
+    huber = HuberRegressor(fit_intercept=True, alpha=0.0, max_iter=100,
+                           epsilon=epsilon)
+    huber.fit(X, y)
+    y_huber = huber.coef_ * x + huber.intercept_
+    plt.plot(x, y_huber, colors[k], label="huber loss, %s" % epsilon)
+
+# Fit a ridge regressor to compare it to huber regressor.
+ridge = Ridge(fit_intercept=True, alpha=0.0, random_state=0, normalize=True)
+ridge.fit(X, y)
+coef_ridge = ridge.coef_
+y_ridge = ridge.coef_ * x + ridge.intercept_
+plt.plot(x, y_ridge, 'g-', label="ridge regression")
+
+plt.title("Comparison of HuberRegressor vs Ridge")
+plt.xlabel("X")
+plt.ylabel("y")
+plt.legend(loc=0)
+plt.show()
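
The two claims in the docstring above (less influence from outliers, and convergence towards the least-squares fit as epsilon grows) can also be checked numerically without plotting. The sketch below assumes a scikit-learn build that ships `HuberRegressor` with an `outliers_` attribute, as released versions do. The synthetic data, the constants, and the use of `LinearRegression` as the unregularised baseline (equivalent in intent to `Ridge` with `alpha=0.0`) are choices made for this sketch, not part of the example file.

```python
import numpy as np
from sklearn.linear_model import HuberRegressor, LinearRegression

# Clean linear data with known slope and intercept (illustrative values).
rng = np.random.RandomState(0)
X = rng.uniform(-3.0, 3.0, size=(60, 1))
true_slope, true_intercept = 3.0, 2.0
y = true_slope * X[:, 0] + true_intercept + rng.normal(0.0, 0.5, size=60)

# Corrupt the six samples with the largest x by pushing y far down.
outlier_idx = np.argsort(X[:, 0])[-6:]
y[outlier_idx] -= 40.0

huber = HuberRegressor(epsilon=1.35, alpha=0.0).fit(X, y)
ols = LinearRegression().fit(X, y)

# A very large epsilon treats every sample as an inlier.
huber_wide = HuberRegressor(epsilon=100.0, alpha=0.0).fit(X, y)

huber_error = abs(huber.coef_[0] - true_slope)
ols_error = abs(ols.coef_[0] - true_slope)
print("slope error  huber: %.3f  ols: %.3f" % (huber_error, ols_error))
print("samples flagged as outliers:", int(huber.outliers_.sum()))
print("slope with epsilon=100: %.3f  ols slope: %.3f"
      % (huber_wide.coef_[0], ols.coef_[0]))
```

On data like this, the Huber slope should stay much closer to the true slope than the least-squares slope does, and the wide-epsilon fit should land close to the least-squares one. Exact numbers depend on the optimizer tolerance and the scikit-learn version.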
@@ -22,14 +22,20 @@
 - RANSAC is good for strong outliers in the y direction
 
 - TheilSen is good for small outliers, both in direction X and y, but has
- a break point above which it performs worst than OLS.
+ a break point above which it performs worse than OLS.
+
+- The scores of HuberRegressor should not be compared directly to those of
+ TheilSen and RANSAC because it does not attempt to filter out the outliers
+ completely, but only lessens their effect.
 
 """
 
 from matplotlib import pyplot as plt
 import numpy as np
 
-from sklearn import linear_model, metrics
+from sklearn.linear_model import (
+ LinearRegression, TheilSenRegressor, RANSACRegressor, HuberRegressor)
+from sklearn.metrics import mean_squared_error
 from sklearn.preprocessing import PolynomialFeatures
 from sklearn.pipeline import make_pipeline
 
@@ -56,12 +62,14 @@
 X_errors_large = X.copy()
 X_errors_large[::3] = 10
 
-estimators = [('OLS', linear_model.LinearRegression()),
- ('Theil-Sen', linear_model.TheilSenRegressor(random_state=42)),
- ('RANSAC', linear_model.RANSACRegressor(random_state=42)), ]
-colors = {'OLS': 'turquoise', 'Theil-Sen': 'gold', 'RANSAC': 'lightgreen'}
-linestyle = {'OLS': '-', 'Theil-Sen': '-.', 'RANSAC': '--'}
+estimators = [('OLS', LinearRegression()),
+ ('Theil-Sen', TheilSenRegressor(random_state=42)),
+ ('RANSAC', RANSACRegressor(random_state=42)),
+ ('HuberRegressor', HuberRegressor())]
+colors = {'OLS': 'turquoise', 'Theil-Sen': 'gold', 'RANSAC': 'lightgreen', 'HuberRegressor': 'black'}
+linestyle = {'OLS': '-', 'Theil-Sen': '-.', 'RANSAC': '--', 'HuberRegressor': '--'}
 lw = 3
+
 x_plot = np.linspace(X.min(), X.max())
 for title, this_X, this_y in [
  ('Modeling Errors Only', X, y),
@@ -75,7 +83,7 @@
  for name, estimator in estimators:
  model = make_pipeline(PolynomialFeatures(3), estimator)
  model.fit(this_X, this_y)
- mse = metrics.mean_squared_error(model.predict(X_test), y_test)
+ mse = mean_squared_error(model.predict(X_test), y_test)
  y_plot = model.predict(x_plot[:, np.newaxis])
  plt.plot(x_plot, y_plot, color=colors[name], linestyle=linestyle[name],
  linewidth=lw, label='%s: error = %.3f' % (name, mse))
 
@@ -18,6 +18,7 @@
  lasso_path, enet_path, MultiTaskLasso,
  MultiTaskElasticNet, MultiTaskElasticNetCV,
  MultiTaskLassoCV)
+from .huber import HuberRegressor
 from .sgd_fast import Hinge, Log, ModifiedHuber, SquaredLoss, Huber
 from .stochastic_gradient import SGDClassifier, SGDRegressor
 from .ridge import (Ridge, RidgeCV, RidgeClassifier, RidgeClassifierCV,
@@ -39,7 +40,7 @@
  'ElasticNet',
  'ElasticNetCV',
  'Hinge',
- 'Huber',
+ 'HuberRegressor',
  'Lars',
  'LarsCV',
  'Lasso',