Gradient Boosted Regression Trees in scikit-learn
Peter Prettenhofer (@pprett), DataRobot
Gilles Louppe (@glouppe), Université de Liège, Belgium
Motivation
Outline 1 Basics 2 Gradient Boosting 3 Gradient Boosting in Scikit-learn 4 Case Study: California housing
About us Peter • @pprett • Python & ML ∼ 6 years • sklearn dev since 2010 Gilles • @glouppe • PhD student (Liège, Belgium) • sklearn dev since 2011 • Chief tree hugger
Outline 1 Basics 2 Gradient Boosting 3 Gradient Boosting in Scikit-learn 4 Case Study: California housing
Machine Learning 101 • Data comes as... • A set of examples {(x_i, y_i) | 0 ≤ i < n_samples}, with • Feature vector x ∈ R^{n_features}, and • Response y ∈ R (regression) or y ∈ {−1, 1} (classification) • Goal is to... • Find a function ŷ = f(x) • Such that the error L(y, ŷ) on new (unseen) x is minimal
Classification and Regression Trees [Breiman et al., 1984] [Figure: regression tree on the California housing data, splitting on MedInc, AveRooms, and AveOccup] sklearn.tree.DecisionTreeClassifier|Regressor
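For reference, a minimal sketch of fitting a single regression tree like the one pictured above, on synthetic data rather than the California features (illustrative only):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# toy 1-D regression problem (not the California housing data)
rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 10, size=(200, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=200)

# a shallow tree yields a piecewise-constant approximation of y = f(x)
tree = DecisionTreeRegressor(max_depth=3).fit(X, y)
y_pred = tree.predict(X)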
Function approximation with Regression Trees [Figure: ground truth vs. regression trees of depth 1, 3, and 20] Deprecated • Nowadays seldom used alone • Ensembles: Random Forest, Bagging, or Boosting (see sklearn.ensemble)
Outline 1 Basics 2 Gradient Boosting 3 Gradient Boosting in Scikit-learn 4 Case Study: California housing
Gradient Boosted Regression Trees Advantages • Handles heterogeneous data (features measured on different scales) • Supports different loss functions (e.g. Huber) • Automatically detects (non-linear) feature interactions Disadvantages • Requires careful tuning • Slow to train (but fast to predict) • Cannot extrapolate
Boosting: AdaBoost [Y. Freund & R. Schapire, 1995] • Ensemble: each member is an expert on the errors of its predecessor • Iteratively re-weights training examples based on errors [Figure: decision boundary after successive AdaBoost iterations on a 2-D toy problem] sklearn.ensemble.AdaBoostClassifier|Regressor Huge success • Viola-Jones face detector (2001) • Freund & Schapire won the Gödel Prize 2003
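A minimal usage sketch (the dataset choice and the n_estimators value are illustrative, not from the slides):

from sklearn.ensemble import AdaBoostClassifier
from sklearn.datasets import make_hastie_10_2

X, y = make_hastie_10_2(n_samples=4000)

# the default weak learner is a depth-1 decision tree (a "stump");
# each round re-weights the examples the previous rounds got wrong
est = AdaBoostClassifier(n_estimators=200).fit(X, y)
print(est.score(X, y))  # training accuracy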
Gradient Boosting [J. Friedman, 1999] Statistical view on boosting ⇒ generalization of boosting to arbitrary loss functions Residual fitting [Figure: ground truth ≈ tree 1 + tree 2 + tree 3] sklearn.ensemble.GradientBoostingClassifier|Regressor
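The residual-fitting idea can be hand-rolled in a few lines; this is a sketch under squared loss (GradientBoostingRegressor does the same thing internally, adding shrinkage and support for other losses):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 10, size=(200, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=200)

# start from a constant prediction, then repeatedly fit a small tree to the residuals
pred = np.full_like(y, y.mean())
for _ in range(3):
    residual = y - pred                                  # negative gradient of squared loss
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residual)
    pred += tree.predict(X)                              # add the new tree's correction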
Functional Gradient Descent Least Squares Regression • Squared loss: L(y_i, f(x_i)) = (y_i − f(x_i))^2 • The residual ∼ the (negative) gradient −∂L(y_i, f(x_i)) / ∂f(x_i) Steepest Descent • Regression trees approximate the (negative) gradient • Each tree is a successive gradient descent step [Figure: regression losses (squared, absolute, Huber) as a function of y − f(x); classification losses (zero-one, log, exponential) as a function of y · f(x)]
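To make the "residual ∼ negative gradient" correspondence concrete, a tiny numeric sketch (the values are made up):

import numpy as np

y = np.array([3.0, -1.0, 2.0])   # targets
f = np.array([2.5,  0.0, 2.0])   # current model predictions

# pseudo-residuals (negative gradients) the next tree would be fit to
neg_grad_squared = y - f             # squared loss, up to a factor of 2
neg_grad_absolute = np.sign(y - f)   # absolute loss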
Outline 1 Basics 2 Gradient Boosting 3 Gradient Boosting in Scikit-learn 4 Case Study: California housing
GBRT in scikit-learn How to use it

>>> from sklearn.ensemble import GradientBoostingClassifier
>>> from sklearn.datasets import make_hastie_10_2
>>> X, y = make_hastie_10_2(n_samples=10000)
>>> est = GradientBoostingClassifier(n_estimators=200, max_depth=3)
>>> est.fit(X, y)
...
>>> # get predictions
>>> pred = est.predict(X)
>>> est.predict_proba(X)[0]  # class probabilities
array([ 0.67,  0.33])

Implementation • Written in pure Python/Numpy (easy to extend) • Builds on top of sklearn.tree.DecisionTreeRegressor (Cython) • Custom node splitter that uses pre-sorting (better for shallow trees)
Example

from sklearn.ensemble import GradientBoostingRegressor
import matplotlib.pyplot as plt

est = GradientBoostingRegressor(n_estimators=2000, max_depth=1).fit(X, y)
for pred in est.staged_predict(X):
    plt.plot(X[:, 0], pred, color='r', alpha=0.1)

[Figure: staged GBRT (d=1) predictions sweeping from high bias / low variance to low bias / high variance, shown against the ground truth, RT d=1, and RT d=3]
Model complexity & Overfitting

test_score = np.empty(len(est.estimators_))
for i, pred in enumerate(est.staged_predict(X_test)):
    test_score[i] = est.loss_(y_test, pred)
plt.plot(np.arange(n_estimators) + 1, test_score, label='Test')
plt.plot(np.arange(n_estimators) + 1, est.train_score_, label='Train')

[Figure: train and test error vs. n_estimators; the test curve flattens at its lowest error while the train-test gap keeps growing]

Regularization GBRT provides a number of knobs to control overfitting • Tree structure • Shrinkage • Stochastic Gradient Boosting
Regularization: Tree structure • max_depth controls the degree of feature interactions • Use min_samples_leaf to ensure a sufficient number of samples per leaf
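A sketch of setting these two knobs (the parameter values and the X_train/y_train names are placeholders, not recommendations):

from sklearn.ensemble import GradientBoostingRegressor

# max_depth bounds the order of feature interactions a single tree can express;
# min_samples_leaf keeps leaf estimates from being based on too few samples
est = GradientBoostingRegressor(n_estimators=1000,
                                max_depth=4,
                                min_samples_leaf=9)
est.fit(X_train, y_train)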
Regularization: Shrinkage • Slow down learning by shrinking tree predictions with 0 < learning_rate <= 1 • A lower learning_rate requires a higher n_estimators [Figure: with learning_rate=0.1 the test error drops lower than without shrinkage, but more trees are required]
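A sketch of the usual trade-off (values are illustrative; X_train/y_train are assumed from context):

from sklearn.ensemble import GradientBoostingRegressor

# smaller learning_rate => stronger shrinkage => more trees needed for the same fit
est = GradientBoostingRegressor(n_estimators=3000,
                                learning_rate=0.05,
                                max_depth=4)
est.fit(X_train, y_train)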
Regularization: Stochastic Gradient Boosting • Samples: each tree is fit on a random subset of the training set (subsample) • Features: each split considers a random subset of the features (max_features) • Improved accuracy and reduced runtime [Figure: subsample=0.5 combined with learning_rate=0.1 reaches an even lower test error; subsampling alone does poorly]
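A sketch combining shrinkage with row and column subsampling (parameter values are illustrative, not recommendations):

from sklearn.ensemble import GradientBoostingRegressor

# subsample < 1.0: each tree sees a random fraction of the training rows;
# max_features < 1.0: each split considers a random fraction of the features
est = GradientBoostingRegressor(n_estimators=3000,
                                learning_rate=0.1,
                                subsample=0.5,
                                max_features=0.3)
est.fit(X_train, y_train)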
Hyperparameter tuning 1. Set n_estimators as high as possible (e.g. 3000) 2. Tune the other hyperparameters via grid search:

from sklearn.grid_search import GridSearchCV  # sklearn.model_selection in newer releases

param_grid = {'learning_rate': [0.1, 0.05, 0.02, 0.01],
              'max_depth': [4, 6],
              'min_samples_leaf': [3, 5, 9, 17],
              'max_features': [1.0, 0.3, 0.1]}
est = GradientBoostingRegressor(n_estimators=3000)
gs_cv = GridSearchCV(est, param_grid).fit(X, y)
gs_cv.best_params_  # best hyperparameter setting

3. Finally, set n_estimators even higher and tune learning_rate.
Outline 1 Basics 2 Gradient Boosting 3 Gradient Boosting in Scikit-learn 4 Case Study: California housing
Case Study California Housing dataset • Predict log(medianHouseValue) • Block groups in the 1990 census • 20,640 groups with 8 features (median income, median age, lat, lon, ...) • Evaluation: mean absolute error on an 80/20 split Challenges • Heterogeneous features • Non-linear interactions
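A sketch of how the dataset might be loaded and split for this evaluation, assuming sklearn's fetch_california_housing loader (the slides do not show this step):

import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split  # sklearn.cross_validation in old releases

data = fetch_california_housing()
X, y = data.data, np.log(data.target)   # predict log(median house value)

# 80/20 split for the mean-absolute-error evaluation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)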
Predictive accuracy & runtime

        Train time [s]   Test time [ms]   MAE
Mean    -                -                0.4635
Ridge   0.006            0.11             0.2756
SVR     28.0             2000.00          0.1888
RF      26.3             605.00           0.1620
GBRT    192.0            439.00           0.1438

[Figure: GBRT train and test error as a function of n_estimators]
Model interpretation Which features are important?

>>> est.feature_importances_
array([ 0.01,  0.38, ...])

[Figure: relative feature importances, from lowest (HouseAge, Population, AveBedrms, Latitude, AveOccup, Longitude, AveRooms) to highest (MedInc)]
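A plotting sketch for the importance chart above (assumes est is the fitted model and names holds the feature names, as elsewhere in the slides):

import numpy as np
import matplotlib.pyplot as plt

importances = est.feature_importances_
order = np.argsort(importances)           # ascending, least important first

plt.barh(np.arange(len(order)), importances[order])
plt.yticks(np.arange(len(order)), np.array(names)[order])
plt.xlabel('Relative importance')
plt.show()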
Model interpretation What is the effect of a feature on the response?

from sklearn.ensemble import partial_dependence as pd

features = ['MedInc', 'AveOccup', 'HouseAge', 'AveRooms', ('AveOccup', 'HouseAge')]
fig, axs = pd.plot_partial_dependence(est, X_train, features, feature_names=names)

[Figure: partial dependence of house value on the non-location features (MedInc, AveOccup, HouseAge, AveRooms) and the two-way AveOccup/HouseAge interaction, for the California housing dataset]
Model interpretation Automatically detects spatial effects [Figure: two-way partial dependence of median house value on longitude and latitude]
Summary • Flexible non-parametric classification and regression technique • Applicable to a variety of problems • Solid, battle-worn implementation in scikit-learn
Thanks! Questions?
Benchmarks [Figure: error, train time, and test time of gbm vs. sklearn-0.15 across datasets (Arcene, Boston, California, Covtype, Example 10.2, Expedia, Madelon, Solar, Spam, YahooLTRC, bioresp)]
Tips & Tricks 1 Input layout Use dtype=np.float32 to avoid memory copies, and Fortran layout for a slight runtime benefit:

X = np.asfortranarray(X, dtype=np.float32)
Tips & Tricks 2 Feature interactions GBRT automatically detects feature interactions, but explicit interaction features often help. Trees required to approximate X1 − X2: 10 (left) vs. 1000 (right). [Figure: surface of x − y approximated with 10 trees (left) and 1000 trees (right)]
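A minimal sketch of adding such an explicit interaction/difference feature (the column indices are illustrative):

import numpy as np

# append an explicit x1 - x2 column so a single split can capture the difference directly
X_aug = np.hstack([X, (X[:, 0] - X[:, 1]).reshape(-1, 1)])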
Tips & Tricks 3 Categorical variables Sklearn requires categorical variables to be encoded as numbers. Tree-based methods work well with ordinal encoding:

import numpy as np
import pandas as pd

df = pd.DataFrame(data={'icao': ['CRJ2', 'A380', 'B737', 'B737']})
# ordinal encoding
df_enc = pd.DataFrame(data={'icao': np.unique(df.icao, return_inverse=True)[1]})
X = np.asfortranarray(df_enc.values, dtype=np.float32)