Introduction of XGBoost
XGBoost XGBoost is the machine learning method based on “boosting algorithm” This method uses the many decision tree We don’t explain “decision tree”, please refer following URL: https://en.wikipedia.org/wiki/Decision_tree 2
What is "Boosting"?
Boosting algorithm(1) For a given dataset with n examples and m features, the explanatory variables 𝒙𝑖 are defined : Define the 𝑖 − 𝑡ℎ objective variable 𝑦𝑖; i = 1,2, ⋯ , 𝑛 4 𝒙𝑖 = 𝑥𝑖1, 𝑥𝑖2, ⋯ , 𝑥𝑖𝑚
Boosting algorithm(2) Define the output of 𝑡 − 𝑡ℎ decision tree: 𝑦𝑖 (𝑡) The error 𝜖𝑖 (1) between first decision tree’s output and objective variable 𝑦𝑖 is following: 5 𝜖𝑖 (1) = 𝑦𝑖 (1) − 𝑦𝑖
Boosting algorithm(3) The second decision tree predict 𝜖𝑖 (1) Define the second decision tree’s output 𝜖𝑖 (2) The predicted value 𝑦𝑖 (𝟐) is following: 6 𝑦𝑖 2 = 𝑦𝑖 1 + 𝜖𝑖 2
Boosting algorithm(4) The error 𝜖𝑖 (2) between predicted value using two decision tree and objective value is following: 7 𝜖𝑖 (2) = 𝑦𝑖 − 𝑦𝑖 2 = 𝑦𝑖 − 𝑦𝑖 1 + 𝜖𝑖 (2)
Boosting algorithm(5) Define the third decision tree’s predicted value 𝜖𝑖 (2) The predicted value 𝑦𝑖 (3) using three decision trees is: 8 𝑦𝑖 3 = 𝑦𝑖 2 + 𝜖𝑖 3 = 𝑦𝑖 1 + 𝜖𝑖 2 + 𝜖𝑖 3
Boosting algorithm(6): Constructing a new model using the information of the models learned so far is "Boosting". * Note: using the error as the objective variable is not what defines boosting; it is only one example of it.
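To make the residual-fitting idea above concrete, here is a minimal sketch using plain regression trees from scikit-learn (assumed to be available). It only illustrates the loop described in these slides, namely fit a tree, fit the next tree to the current errors, and sum the outputs; it deliberately omits everything that makes XGBoost itself different (regularization, second-order gradients, shrinkage, and so on).

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def boosted_predict(trees, X):
        # y_hat^(t) is the sum of every tree's output so far
        return np.sum([tree.predict(X) for tree in trees], axis=0)

    def fit_residual_boosting(X, y, n_trees=3, max_depth=2):
        trees = []
        residual = y.astype(float).copy()             # the first tree is fit to y itself
        for _ in range(n_trees):
            tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residual)
            trees.append(tree)
            residual = y - boosted_predict(trees, X)  # eps^(t) = y - y_hat^(t): target of the next tree
        return trees

    # Usage: trees = fit_residual_boosting(X, y); y_hat = boosted_predict(trees, X)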
What is "XGBoost"?
XGBoost XGBoost has been shown to give state-of-the-art results on many standard classification benchmarks More than half of the methods won by the Kaggle competition use XGBoost 11
XGBoost algorithm(1) Define the 𝑘 − 𝑡ℎ decision tree: 𝑓𝑘 The predicted value when boosting K times is as follows: 12 y𝑖 = 𝑘=1 𝐾 𝑓𝑘 𝑥𝑖
XGBoost algorithm(2) Define the loss function: Our purpose is minimize the following objective 13 𝑙 𝑦𝑖, 𝑦𝑖 ℒ 𝜙 = 𝑖=1 𝐼 𝑙 𝑦𝑖, 𝑦𝑖
XGBoost algorithm(3) Deformation of formula 14 min 𝑓𝑡 ℒ 𝑓𝑡 = min 𝑓𝑡 𝑖=1 𝐼 𝑙 𝑦𝑖, 𝑦𝑖 (𝑡) = min 𝑓𝑡 𝑖=1 𝐼 𝑙 𝑦𝑖, 𝑘=1 𝑡 𝑓𝑘 𝑥𝑖 = min 𝑓𝑡 𝑖=1 𝐼 𝑙 𝑦𝑖, 𝑦𝑖 (𝑡−1) + 𝑓𝑡 𝑥𝑖
XGBoost algorithm(4) Define the “penalizes function”: 𝛾 and 𝜆 is the hyper parameters 𝑇 is number of tree node 𝑤 is the vector of the nodes 15 𝛺 𝑓 = 𝛾𝑇 + 1 2 𝜆 𝒘 2
XGBoost algorithm(5) If we add the “penalizes function” to loss, it helps to smooth the final learnt weight to avoid over-fitting So, Our new purpose is minimize the following objective function ℒ 𝜙 : 16 ℒ 𝜙 = 𝑖=1 𝐼 𝑙 𝑦𝑖, 𝑦𝑖 + 𝑘=1 𝐾 𝛺 𝑓𝑘
XGBoost algorithm(6) Minimizing ℒ 𝜙 is same to minimizing all ℒ(𝑡) : 17 min 𝑓𝑡 ℒ(𝑡) = min 𝑓𝑡 𝑖=1 𝐼 𝑙 𝑦𝑖, 𝑦𝑖 (𝑡−1) + 𝑓𝑡 𝑥𝑖 + Ω 𝑓𝑡
XGBoost algorithm(7) Second-order approximation can be used to quickly optimize the objective : 18 ℒ(𝑡) = 𝑖=1 𝐼 𝑙 𝑦𝑖, 𝑦𝑖 (𝑡−1) + 𝑓𝑡 𝑥𝑖 + Ω 𝑓𝑡 Taylor expansion ℒ(𝑡) ≅ 𝑖=1 𝐼 𝑙 𝑦𝑖, 𝑦𝑖 (𝑡−1) + 𝑔𝑖 𝑓𝑡 𝑥𝑖 + 1 2 ℎ𝑖 𝑓𝑡 2 𝑥𝑖 + Ω 𝑓𝑡 𝑔𝑖 = 𝜕 𝑦 𝑡−1 𝑙 𝑦𝑖, 𝑦 𝑡−1 ℎ𝑖 = 𝜕 𝑦 𝑡−1 2 𝑙 𝑦𝑖, 𝑦 𝑡−1
XGBoost algorithm(8) We can remove the constant terms to obtain the following simplified objective at step 𝑡: 19 ℒ(𝑡) = 𝑖=1 𝐼 𝑔𝑖 𝑓𝑡 𝑥𝑖 + 1 2 ℎ𝑖 𝑓𝑡 2 𝑥𝑖 + Ω 𝑓𝑡 ℒ(𝑡) = 𝑖=1 𝐼 𝑙 𝑦𝑖, 𝑦𝑖 (𝑡−1) + 𝑔𝑖 𝑓𝑡 𝑥𝑖 + 1 2 ℎ𝑖 𝑓𝑡 2 𝑥𝑖 + Ω 𝑓𝑡
XGBoost algorithm(9) Define 𝐼𝑗 as the instance set of leaf j 20 Leaf 1 Leaf 2 Leaf 3Leaf 4 𝐼4
XGBoost algorithm(11) Deformation of formula 21 min 𝑓𝑡 ℒ(𝑡) = min 𝑓𝑡 𝑖=1 𝐼 𝑔𝑖 𝑓𝑡 𝑥𝑖 + 1 2 ℎ𝑖 𝑓𝑡 2 𝑥𝑖 + Ω 𝑓𝑡 = min 𝑓𝑡 𝑖=1 𝐼 𝑔𝑖 𝑓𝑡 𝑥𝑖 + 1 2 ℎ𝑖 𝑓𝑡 2 𝑥𝑖 + 𝛾𝑇 + 1 2 𝜆 𝑤 2 = min 𝑓𝑡 𝑗=1 𝑇 𝑖∈𝐼 𝑗 𝑔𝑖 𝑤𝑗 + 1 2 𝑖∈𝐼 𝑗 ℎ𝑖 + 𝜆 𝑤𝑗 2 + 𝛾𝑇 Quadratic function of w
XGBoost algorithm(12) We can solve the quadratic function ℒ(𝑡) on 𝑤𝑗 22 𝑤ℎ𝑒𝑟𝑒 𝑑ℒ(𝑡) 𝑑𝑤𝑗 = 0 𝑤𝑗 = − 𝑖∈𝐼 𝑗 𝑔𝑖 𝑖∈𝐼 𝑗 ℎ𝑖 + 𝜆
XGBoost algorithm(13) Remember 𝑔𝑖 and ℎ𝑖 is the inclination of loss function and they can calculate with the output of (𝑡 − 1)𝑡ℎ tree and 𝑦𝑖 So, we can calculate 𝑤𝑗 and minimizing ℒ 𝜙 23 𝑔𝑖 = 𝜕 𝑦 𝑡−1 𝑙 𝑦𝑖, 𝑦 𝑡−1 ℎ𝑖 = 𝜕 𝑦 𝑡−1 2 𝑙 𝑦𝑖, 𝑦 𝑡−1
How to split the node
How to split the node I told you how to minimize the loss. One of the key problems in tree learning is to find the best split. 25
XGBoost algorithm(14) Substitute 𝑤𝑗 for ℒ(𝑡) : In this equation, if 𝑖∈𝐼 𝑗 𝑔 𝑖 2 𝑖∈𝐼 𝑗 ℎ 𝑖+𝜆 in each node become bigger, ℒ(𝑡) become smaller. 26 ℒ(𝑡) = − 1 2 𝑗=1 𝑇 𝑖∈𝐼 𝑗 𝑔𝑖 2 𝑖∈𝐼 𝑗 ℎ𝑖 + 𝜆 + 𝛾𝑇
XGBoost algorithm(15) Compare the 𝑖∈𝐼 𝑗 𝑔 𝑖 2 𝑖∈𝐼 𝑗 ℎ 𝑖+𝜆 in before split node and after split node Define the objective function before split node : ℒ 𝑏𝑒𝑓𝑜𝑟𝑒 (𝑡) Define the objective function after split node : ℒ 𝑎𝑓𝑡𝑒𝑟 (𝑡) 27
XGBoost algorithm(16): Let $I_s$ be the instance set of the node before the split, and $I_L$, $I_R$ the instance sets of its left and right children after the split ($I_s = I_L \cup I_R$). Then:
$\mathcal{L}^{(t)}_{\mathit{before}} = -\frac{1}{2} \sum_{j \neq s} \dfrac{\left(\sum_{i \in I_j} g_i\right)^2}{\sum_{i \in I_j} h_i + \lambda} - \frac{1}{2} \dfrac{\left(\sum_{i \in I_s} g_i\right)^2}{\sum_{i \in I_s} h_i + \lambda} + \gamma T$
$\mathcal{L}^{(t)}_{\mathit{after}} = -\frac{1}{2} \sum_{j \neq s} \dfrac{\left(\sum_{i \in I_j} g_i\right)^2}{\sum_{i \in I_j} h_i + \lambda} - \frac{1}{2} \dfrac{\left(\sum_{i \in I_L} g_i\right)^2}{\sum_{i \in I_L} h_i + \lambda} - \frac{1}{2} \dfrac{\left(\sum_{i \in I_R} g_i\right)^2}{\sum_{i \in I_R} h_i + \lambda} + \gamma (T + 1)$
(the split replaces one leaf with two, so the $\gamma T$ term grows by $\gamma$).
XGBoost algorithm(17) After maximizing ℒ 𝑏𝑒𝑓𝑜𝑟𝑒 (𝑡) − ℒ 𝑎𝑓𝑡𝑒𝑟 (𝑡) we can get the minimizing ℒ(𝑡) 29
