Gradient Tree Boosting: XGBoost vs. LightGBM vs. CatBoost (Part 1)

Beverly Wang
4 min read · Apr 18, 2020



XGBoost, LightGBM and CatBoost are among the most common algorithms used in competitions. To understand their differences, we will split this topic into three parts: Part 1 covers the mathematics of Gradient Tree Boosting, Part 2 compares XGBoost, LightGBM and CatBoost, and Part 3 walks through code and exercises for each algorithm.

Let’s start with Part 1 today.

XGBoost, LightGBM and CatBoost are competing implementations of Gradient Tree Boosting. To understand how their model training differs, we first need to know how Gradient Tree Boosting works. It is assumed that you have some knowledge of regression trees (CART) and of boosting. After reading the passage below, you will know:

1. Mathematical expression of a regression tree (CART)

2. Objective function of Gradient Tree Boosting

3. Criterion to split nodes in Gradient Tree Boosting

1. Mathematical Expression of a Regression Tree (CART)
Figure 1. Regression Tree (Edited from Tianqi’s Paper)

Figure 1 shows an example of a regression tree. Going through the tree, each observation ends up in one and only one leaf of the tree, and each leaf is assigned a score.

How do we represent this tree mathematically?

We use a function f(·) to represent the tree, where the input is the feature vector of the i-th observation and the output is the score of the leaf that the i-th observation ends up in. Thus, the Mathematical Expression of a regression tree is:
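A sketch of this expression, using the notation of the XGBoost paper (Reference 2), where d is the number of features and T the number of leaves:

f(x_i) = w_{q(x_i)}, \quad q: \mathbb{R}^d \to \{1, 2, \dots, T\}, \quad w \in \mathbb{R}^T

Here q maps the feature vector x_i of the i-th observation to the index of the leaf it falls into, and w_{q(x_i)} is the score assigned to that leaf.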

2. Objective Function of Gradient Tree Boosting

Figure 2. Tree Ensemble Model of Two Trees (Obtained from Tianqi’s Paper)

Following the logic of boosting, Gradient Tree Boosting builds predictors sequentially, where each subsequent predictor learns from the mistakes of the previous one. (See Reference 1 to learn more about the boosting algorithm.) As shown in Figure 2, the final prediction is the sum of the leaf scores produced by an ensemble of trees:
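In the notation of Reference 2, with K trees in the ensemble:

\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \qquad f_k \in \mathcal{F}

where each f_k is a regression tree as defined above and \mathcal{F} is the space of such trees.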

The Objective Function of Gradient Tree Boosting is:
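Following Reference 2, the objective combines a training loss l(·) with a regularization term \Omega(·) that penalizes the complexity of each tree:

Obj = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k), \qquad \Omega(f) = \gamma T + \tfrac{1}{2} \lambda \lVert w \rVert^2

where T is the number of leaves of a tree, w its vector of leaf scores, and \gamma, \lambda are regularization hyperparameters.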

The aim of Gradient Tree Boosting is to minimize the objective function.

3. Criterion to Split Nodes in Gradient Tree Boosting

Given that the objective function is convex (as l(·) and \Omega(·) are both convex functions), a greedy algorithm can be used to optimize it: start from a single leaf and iteratively split a leaf node of the tree so that each split maximally reduces the objective function.

Thus, the Criterion to Split Nodes in Gradient Tree Boosting is to obtain the maximum reduction in the objective function. This actually involves two steps:

  • Greedily add a tree
  • Greedily split a node

Before we talk about the optimal objective value from adding one more tree, let's briefly review how Gradient Tree Boosting trains a prediction model. It makes predictions sequentially, where each subsequent predictor learns from the mistakes of the previous one. Figure 3 shows the logic of the Gradient Tree Boosting algorithm: the input data is fit by Tree 1 (the 1st iteration), the corresponding residuals are then fit by Tree 2 (the 2nd iteration), and so on, until the residuals are fit by Tree t (the t-th iteration).

Figure 3. Algorithm of Gradient Tree Boosting
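To make this sequential logic concrete, here is a minimal sketch in Python of gradient boosting with squared loss, where each new tree is fit to the residuals of the current ensemble. The function names are illustrative, and scikit-learn's DecisionTreeRegressor is used only as a convenient CART implementation; this is the residual-fitting loop of Figure 3, not the exact algorithm implemented by XGBoost, LightGBM or CatBoost (which also use the second-order information and regularization derived below).

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_boosted_trees(X, y, n_trees=100, learning_rate=0.1, max_depth=3):
    """Fit trees sequentially: each new tree is trained on the residuals
    of the predictions made by the trees fitted so far."""
    base_score = float(np.mean(y))            # initial constant prediction
    prediction = np.full(len(y), base_score)
    trees = []
    for _ in range(n_trees):                  # iteration t fits Tree t
        residuals = y - prediction            # mistakes of the previous trees
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)
        prediction += learning_rate * tree.predict(X)
        trees.append(tree)
    return base_score, trees

def predict_boosted_trees(base_score, trees, X, learning_rate=0.1):
    """Final prediction: base score plus the (shrunken) sum of all tree outputs.
    Use the same learning_rate that was used for fitting."""
    out = np.full(X.shape[0], base_score)
    for tree in trees:
        out += learning_rate * tree.predict(X)
    return out
```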

The change in the objective function by adding one more tree, e.g. Tree t, is:
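Writing \hat{y}_i^{(t-1)} for the prediction of the first t−1 trees, the objective at iteration t (following Reference 2) is:

Obj^{(t)} = \sum_{i=1}^{n} l\big(y_i,\, \hat{y}_i^{(t-1)} + f_t(x_i)\big) + \Omega(f_t) + \text{constant}

where the constant collects the terms contributed by the already-fixed Trees 1, …, t−1.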

Using Taylor expansion, the change in the objective function can be further approximated by:
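Keeping only the terms that depend on f_t, the second-order approximation reads:

Obj^{(t)} \approx \sum_{i=1}^{n} \Big[ g_i f_t(x_i) + \tfrac{1}{2} h_i f_t^2(x_i) \Big] + \Omega(f_t)

with g_i = \partial_{\hat{y}^{(t-1)}} l(y_i, \hat{y}_i^{(t-1)}) and h_i = \partial^2_{\hat{y}^{(t-1)}} l(y_i, \hat{y}_i^{(t-1)}) the first and second derivatives of the loss with respect to the previous prediction.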

(If you are interested, a sketch of how this equation is derived is given below.)
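Treating f_t(x_i) as a small increment around \hat{y}_i^{(t-1)} and expanding the loss to second order,

l\big(y_i,\, \hat{y}_i^{(t-1)} + f_t(x_i)\big) \approx l\big(y_i, \hat{y}_i^{(t-1)}\big) + g_i f_t(x_i) + \tfrac{1}{2} h_i f_t^2(x_i).

Summing over i, the term l(y_i, \hat{y}_i^{(t-1)}) does not depend on Tree t and can be dropped as a constant, which yields the approximation above.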

Since g_i and h_i are fully determined by the previous t−1 iterations, they are constants when fitting Tree t, and the optimal value of the change in the objective function is:
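A sketch of this step, following the leaf-wise grouping in Reference 2: let I_j be the set of observations that fall into leaf j of Tree t, and define G_j = \sum_{i \in I_j} g_i and H_j = \sum_{i \in I_j} h_i. Setting the derivative with respect to each leaf score w_j to zero gives the optimal scores w_j^* = -G_j / (H_j + \lambda), and substituting them back yields

Obj^{(t)*} = -\tfrac{1}{2} \sum_{j=1}^{T} \frac{G_j^2}{H_j + \lambda} + \gamma T \qquad (5)

The more negative this value, the larger the reduction in the objective achieved by Tree t.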

Equation (5) gives the maximum reduction in the objective function from adding a tree but, most importantly, it can also be used as an impurity score to evaluate node splitting. In Figure 3, if we further split node A in Tree t into a left and a right child, the optimal value in Equation (5) changes by the following amount:
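With G_L, H_L and G_R, H_R denoting the sums of g_i and h_i over the observations sent to the left and right children of node A (so that G_L + G_R and H_L + H_R are the sums over node A before the split), the reduction in the objective from the split is

Gain = \tfrac{1}{2} \left[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda} \right] - \gamma \qquad (7)

where \gamma is the penalty for introducing one additional leaf.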

Equation (7) is used to evaluate whether or not to split a node: the split is only worth making when the gain is positive, i.e. when the reduction in loss outweighs the complexity penalty \gamma of adding a leaf.

Reference:

  1. https://medium.com/mlreview/gradient-boosting-from-scratch-1e317ae4587d
  2. Chen, T., & Guestrin, C. (2016). XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 785–794).
