Gradient Boosting

Some people use term bagging and boosting, some use parallel and sequential ensembles. I have written a post on AdaBoost previously. So we are familiar with boosting idea.

In AdaBoost we had a formula for generating labels (label/sample weight in loss function to be precise) for next iteration. In gradient boosting we fit gradient of loss function with respect to predictions.

Prof Balaraman notes

  • Most of this post contains info from [0]. However Prof. Balaraman offers intuitive perspective in one of his lectures. [3]
  • My handwritten note on that is available at [2]
  • We can write decision tree as equation where regions and value in the region are parameters.
  • Two steps of gradient boosting
    1. Fit gradient of loss function using regression trees (least square loss)
    2. Compute values in region using loss defined for application (ndcg for ranking, log-loss for classifier anything)
  • Whatever is application loss, we fit regression tree using least square
  • Parametric gradient descent can be seen as series of summation starting from initial guess.
  • Related paper [4]

Basis Functions

  • Basis functions
    • If we want to fit non linear curve with simple linear regression, we will create polynomial features
    • It is not limited to regression though
  • Adaptive Basis function
    • Instead of we defining above functions, we learn them
    • This can be framed as optimisation problem and can be solved by gradient descent/newton’s method for some hypothesis space
  • How to include Tree’s in base hypothesis space
    • Something which we can not differentiate
    • Even if we fix parameters to split on output is still not continuous
    • So it is difficult

FSAM

  • FSAM
    • Forward Stage-wise Additive Modelling
    • AdaBoost is example of this, another example is L2 boosting
    • We can say weak function is step direction and coefficient is step size
  • L2 Boosting also example of FSAM
    • We are fitting residual
    • To prevent overfitting we can reduce step size
    • If v*h(x) is also in H, we can remove v. (closed under rescaling)
    • h(x) can be decision stump we can keep going as long as residuals exist (It will certainly overfit after few iterations)
  • Adaboost
    • It works for classification problem
    • When the loss function in FSAM is exponential loss of margin, it reduces adaboost
    • margin = y*f(x)
      • We had encountered that in SVM
  • Disadvantages of AdaBoost
    • Gives too much penalty for negative margin, look at the curve of exponential above
    • Mislabel training data
    • Sometimes there is noise in world
    • Does not work when there is high baye’s error
    • Also it seems AdaBoost formulation is very much tied to classification
      • We probably needs to tweak it to apply it to ranking problem
      • But gradient boosting can be applied to any problem (loss function)
        • Because Tree is alway using L2 loss
Slide from [1]

Functional Gradient Descent

  • Gradient is pseudo residuals
    • For L2 loss it is actually residuals (difference between y and f)
    • Examples where residuals are not just difference between y and f
      • logistic loss
      • ranking using ndcg
  • We fit this gradient(pseudo-residuals) with weak function
    • Regression Tree is most common
    • We find closest approximation to gradient step in base hypothesis space (which is functional space)
    • Here we minimize square loss (regression problem)
  • Earlier tree used to be just stumps
    • That is num of leaf node = 2
    • Now a days xgboost/light gbm takes upto 8/16 leaf nodes as well
  • Note that while calculating gradient the hypothesis function is not involved anywhere
  • How do we calculate gradient over function space
    • In close form expression it might look complicated
    • From numerical computation perspective we need gradient at just one point (of dimension training data size)
    • This is just like taking gradient of weights in linear regression
      • Here size of gradient vector would be number of training examples
    • In most of the method we take gradient of parameters, here we are taking gradient of predictions
  • What should be step size
    • Line search
    • Regularisation parameter (similar to what we use for any gradient based optimisation)
      • This is more common

Reference

[0] https://www.youtube.com/watch?v=fz1H03ZKvLM

[1] https://davidrosenberg.github.io/mlcourse/Archive/2016/Lectures/8b.gradient-boosting.pdf

[2] https://github.com/arcarchit/datastories/blob/master/notes/gradient_boost_iitm.pdf

[3] https://www.youtube.com/watch?v=FJ-MA0EfJ9k

[4] https://statweb.stanford.edu/~jhf/ftp/trebst.pdf

Ensemble, Bagging and Boosting

Ensembling is a method of combining more than one models to generate a final output.(Reference [1])

There are two ways of doing that:

  1. Bagging
  2. Boosting

 

Bagging Boosting
  • We take subset of data and train different models
  • Example Random forest
    • It takes subset of data as well as subset of features
  • Pros of random forest
    • Handles high dimensions
    • Handles missing values
  • Cons of random forest
    • It won’t give precise value regression because final value is mean from subset tress
    • None the less people are using it for regression depending upon domain
  • We train different model on with same data. Each sample is assigned different weight in each iteration
  • Example AdaBoost, XgBoost
  • Pros of XgBoost
    • Supports different loss function
    • Works well with interactions
  • Cons of XbBoost
    • Prone to overfitting
    • Tuning of hyper parameters is critical

 

Pros-cons of bagging vs boosting:

  • Bagging is easy to parallelize and hence training is faster
  • Boosting is more efficient for fixed no of iterations (classifiers)

 

AdaBoost vs XgBoost

Reference : [2]

ada_vs_xg

Quote from Tianqi Chen, one of the developers of XGBoost:

Adaboost and gradboosting [XGBoost] are two different ways to derive boosters. Both are generic. I like gradboosting better because it works for generic loss functions, while adaboost is derived mainly for classification with exponential loss.

 

Reference :

[1] https://towardsdatascience.com/decision-tree-ensembles-bagging-and-boosting-266a8ba60fd9

[2] https://www.packtpub.com/mapt/book/big_data_and_business_intelligence/9781788295758/4/ch04lvl1sec34/comparison-between-adaboosting-versus-gradient-boosting

 

AdaBoost

  • AdaBoost stands for adaptive boosting
  • Boosting is a technique for creating better classifier form several regular/weak/base classifiers
  • Thought:
    • We want to train second classifier in such a way that training data has more samples for the case where first classifier was wrong
    • Third classifier will have more samples where first and second classifier contradicts
  • How do we accomplish above:
    • We assign weights to each sample
      • This weights will be used in error function for each sample
      • Error function would be sum of weights of incorrectly classified samples
      • This weights keep changing in each iteration
    • We also assign wight to each base (weak) classifier (α In below algorithm)
      • Formula for calculating α below is valid for binary classification
        • It will be different for multi class classification
  • When do we stop iteration ?
    • Combined classifier is predicting all sample correctly
    • No more base classifier left
  • Most machine learning algorithms allows to apply weight to data
    • In logistic regression we can multiply weight in error fucntion
    • In decision tree, classification error can be calculated by summing weights of missclassified samples

 

adboost

 

Formulations in above algorithm:

  • Formula of α
    • if ε=0.5 then α=0. It is just a random chance classifier.
    • Else it can positive/negative.
    • ε < 0.5 => negative α => we invert the prediction in weighted average
  • Formula of D_(t+1)(i)
    • y*h is just a sign
      • y and h are  {-1, 1}
      • y*h will be positive for correct prediction
    • Why it is function of ε
      • Random classifier updating weights does not make sense
  • By Z_t we are normalizing weights, which ensures numerical stability

 

 

Adaboost Theorem

  • Error -> 0 as no of iteration -> infinity at certain condition
  • Condition is here is that each learner has error < 0.5
  • Each iteration means adding one weak learner

 

Over-fitting

  • Contrary to weak classifier (decision stumps) we saw that rate of increase test error is less with boosting
  • However no of iteration still should be selected via cross validation for small data-set and validation for large data-set

 

References:

Click to access MIT6_034F10_boosting.pdf

Click to access msri.pdf