Gradient Boosting: Intuition behind the algorithm

Shubham Baghel · Analytics Vidhya · May 16, 2020

If you are an ML enthusiast, you have probably heard of gradient boosting. If not, no problem; you have landed at the right place.

Gradient boosting in a nutshell

Gradient boosting is a very popular and widely used technique in ML. It is a boosting technique that can be applied to both classification and regression problems.

Boosting is a method for creating an ensemble. It starts by fitting an initial model (e.g. a tree or linear regression) to the data. Then a second model is built that focuses on accurately predicting the cases where the first model performs poorly. The combination of these two models is expected to be better than either model alone. Then you repeat this process of boosting many times. Each successive model attempts to correct for the shortcomings of the combined boosted ensemble of all previous models.

The main intuition behind the algorithm is that the best possible next model, when combined with previous models, minimizes the overall prediction error. The key idea is to set the target outcomes for this next model to minimize the error. Let’s understand this with an example of data for the regression problem.

Data

In this problem, we want to predict Salary (Target) based on Experience and Degree (Independent Variables) of a candidate.

In the case of regression, we first create our base model, which is simply the average of all the actual output values.

In our case it would be (50 + 70 + 80 + 100) / 4 = 75.

So this base model will output 75k for every prediction.
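As a quick sketch in Python (using the four salary values above, in thousands), the base model is nothing more than the mean of the target column:

```python
# Salaries (in thousands) from the example above.
salaries = [50, 70, 80, 100]

# The base model for regression is simply the mean of the targets.
base_prediction = sum(salaries) / len(salaries)
print(base_prediction)  # 75.0
```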

Next, we calculate the pseudo residuals, which are simply:

actual (Salary) - Predicted

(Figure: the data table after one step, with the pseudo residual column R1 added.)
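Continuing the sketch, the first pseudo residuals R1 are just the actual values minus the base prediction of 75:

```python
# Pseudo residuals R1: how far off the base model is for each row.
residuals_r1 = [actual - base_prediction for actual in salaries]
print(residuals_r1)  # [-25.0, -5.0, 5.0, 25.0]
```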

In the next step, we create a decision tree by taking the independent variables (Experience, Degree) as input and the residuals R1 as output.

After this step we have two models: the base model and the tree we trained on the residuals R1. Now we can make a prediction by adding the outputs of both models. Let’s look at the predicted value for (Experience = 2 AND Degree = GRADUATE):

Base Model + M(R1) = 75 + (-25) = 50 (equal to the actual value)
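Here is a minimal sketch of this step with scikit-learn. Only the row (Experience = 2, GRADUATE, Salary = 50) comes from the example above; the other feature rows are hypothetical, and Degree is encoded as a number purely for illustration:

```python
from sklearn.tree import DecisionTreeRegressor

base_prediction = 75.0                    # output of the base model
residuals_r1 = [-25.0, -5.0, 5.0, 25.0]   # pseudo residuals from above

# Hypothetical feature matrix [Experience, Degree] with Degree encoded as
# 0 = GRADUATE, 1 = POST-GRADUATE. Only the first row is taken from the article.
X = [[2, 0], [4, 0], [5, 1], [8, 1]]

# Weak learner M(R1): a small tree trained to predict the residuals.
m_r1 = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, residuals_r1)

# Combined prediction for (Experience = 2, Degree = GRADUATE).
pred = base_prediction + m_r1.predict([[2, 0]])[0]
print(pred)  # 50.0 -- it reproduces the actual value exactly
```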

As we can see, the predicted value is exactly equal to the actual value, which means the model is overfitting; in other words, it has low bias and high variance.

To overcome this problem, the algorithm uses a parameter alpha (written @ below), which is called the learning rate. Its value lies in (0, 1). Applying this, our next prediction (assuming @ = 0.1) would be:

Base Model + (@) * M(R1) = 75 + (0.1)(-25) = 72.5 (Actual = 50)
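Continuing the same sketch, the only change is that the tree’s contribution is shrunk by the learning rate:

```python
# Shrink the residual tree's contribution so no single tree dominates.
learning_rate = 0.1
pred = base_prediction + learning_rate * m_r1.predict([[2, 0]])[0]
print(pred)  # 75 + 0.1 * (-25) = 72.5
```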

In the next step, we again calculate the residuals, fit another tree on them, and then predict using all the weak learners together. Let’s look at the next step below.

From this state, our prediction would be:

Base Model + (@) * M(R1) + (@) * M(R2) = 75 + (0.1)(-25) + (0.1)(-23)= 70.2 (Actual = 50)
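In code, the second round fits another tree on the new residuals and adds its shrunken contribution on top. Continuing the sketch above; the exact numbers differ slightly from the article’s figure because the feature rows here are hypothetical:

```python
# New residuals R2 after the first shrunken update.
current_pred = [base_prediction + learning_rate * p for p in m_r1.predict(X)]
residuals_r2 = [y - p for y, p in zip(salaries, current_pred)]

# Weak learner M(R2): another tree fit on the new residuals.
m_r2 = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, residuals_r2)

# Prediction after two boosting rounds for (Experience = 2, GRADUATE).
pred = (base_prediction
        + learning_rate * m_r1.predict([[2, 0]])[0]
        + learning_rate * m_r2.predict([[2, 0]])[0])
print(pred)  # 70.25 here, close to the 70.2 shown above
```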

You can observe that as we go further and add more weak learners, the residuals keep decreasing and the predicted value approaches the actual value, which is exactly what we want. To generalize this, we can write the equation:

F(x) = h0(x) + @1 * h1(x) + @2 * h2(x) + … + @n * hn(x)
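Putting the whole loop together, here is a minimal gradient boosting sketch for the squared-error case with a fixed learning rate and decision trees as the weak learners (scikit-learn’s GradientBoostingRegressor is the production version of the same idea):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_fit(X, y, n_estimators=100, learning_rate=0.1, max_depth=2):
    """Fit F(x) = h0 + lr*h1(x) + ... + lr*hn(x) for squared-error loss."""
    X, y = np.asarray(X, dtype=float), np.asarray(y, dtype=float)
    h0 = y.mean()                              # base model: mean of the targets
    trees, pred = [], np.full(len(y), h0)
    for _ in range(n_estimators):
        residuals = y - pred                   # pseudo residuals
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residuals)
        trees.append(tree)
        pred += learning_rate * tree.predict(X)
    return h0, trees

def gradient_boost_predict(X, h0, trees, learning_rate=0.1):
    X = np.asarray(X, dtype=float)
    pred = np.full(len(X), h0)
    for tree in trees:
        pred += learning_rate * tree.predict(X)
    return pred
```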

I hope this clarifies the intuition behind the gradient boosting algorithm. We can further explore the mathematical details to go more in depth.

If you liked the article, a clap would be highly appreciated.
