gradient | Data Stories

Some people use term bagging and boosting, some use parallel and sequential ensembles. I have written a post on AdaBoost previously. So we are familiar with boosting idea.

In AdaBoost we had a formula for generating labels (label/sample weight in loss function to be precise) for next iteration. In gradient boosting we fit gradient of loss function with respect to predictions.

Prof Balaraman notes

Most of this post contains info from [0]. However Prof. Balaraman offers intuitive perspective in one of his lectures. [3]
My handwritten note on that is available at [2]
We can write decision tree as equation where regions and value in the region are parameters.
Two steps of gradient boosting
1. Fit gradient of loss function using regression trees (least square loss)
2. Compute values in region using loss defined for application (ndcg for ranking, log-loss for classifier anything)
Whatever is application loss, we fit regression tree using least square
Parametric gradient descent can be seen as series of summation starting from initial guess.
Related paper [4]

Basis Functions

Basis functions
- If we want to fit non linear curve with simple linear regression, we will create polynomial features
- It is not limited to regression though
Adaptive Basis function
- Instead of we defining above functions, we learn them
- This can be framed as optimisation problem and can be solved by gradient descent/newton’s method for some hypothesis space
How to include Tree’s in base hypothesis space
- Something which we can not differentiate
- Even if we fix parameters to split on output is still not continuous
- So it is difficult

FSAM

FSAM
- Forward Stage-wise Additive Modelling
- AdaBoost is example of this, another example is L2 boosting
- We can say weak function is step direction and coefficient is step size

L2 Boosting also example of FSAM
- We are fitting residual
- To prevent overfitting we can reduce step size
- If v*h(x) is also in H, we can remove v. (closed under rescaling)
- h(x) can be decision stump we can keep going as long as residuals exist (It will certainly overfit after few iterations)

Adaboost
- It works for classification problem
- When the loss function in FSAM is exponential loss of margin, it reduces adaboost
- margin = y*f(x)
  - We had encountered that in SVM

Disadvantages of AdaBoost
- Gives too much penalty for negative margin, look at the curve of exponential above
- Mislabel training data
- Sometimes there is noise in world
- Does not work when there is high baye’s error
- Also it seems AdaBoost formulation is very much tied to classification
  - We probably needs to tweak it to apply it to ranking problem
  - But gradient boosting can be applied to any problem (loss function)
    - Because Tree is alway using L2 loss

Functional Gradient Descent

Gradient is pseudo residuals
- For L2 loss it is actually residuals (difference between y and f)
- Examples where residuals are not just difference between y and f
  - logistic loss
  - ranking using ndcg
We fit this gradient(pseudo-residuals) with weak function
- Regression Tree is most common
- We find closest approximation to gradient step in base hypothesis space (which is functional space)
- Here we minimize square loss (regression problem)
Earlier tree used to be just stumps
- That is num of leaf node = 2
- Now a days xgboost/light gbm takes upto 8/16 leaf nodes as well
Note that while calculating gradient the hypothesis function is not involved anywhere
How do we calculate gradient over function space
- In close form expression it might look complicated
- From numerical computation perspective we need gradient at just one point (of dimension training data size)
- This is just like taking gradient of weights in linear regression
  - Here size of gradient vector would be number of training examples
- In most of the method we take gradient of parameters, here we are taking gradient of predictions
What should be step size
- Line search
- Regularisation parameter (similar to what we use for any gradient based optimisation)
  - This is more common

Reference

[0] https://www.youtube.com/watch?v=fz1H03ZKvLM

[1] https://davidrosenberg.github.io/mlcourse/Archive/2016/Lectures/8b.gradient-boosting.pdf

[2] https://github.com/arcarchit/datastories/blob/master/notes/gradient_boost_iitm.pdf

[3] https://www.youtube.com/watch?v=FJ-MA0EfJ9k

[4] https://statweb.stanford.edu/~jhf/ftp/trebst.pdf

There are two things

Newton method for finding roots
1. Does not have second order derivative in denominator
Using 1 to solve for minima problem
1. Minima problem is same as finding root of derivative
2. Has second order derivative in denominator
There is a nice derivation starting from taylor series [3]

Newton’s Method

Based on Taylor series expansion
Advantages
- Convergence is rapid in general, quadratic near optimal point
- Insensitive to no of variables
- Performance does not depend on choice of parameter
  - Gradient method depends on learning rate
Disadvantages
- Cost of computing and storing Hessian
- Cost of computing single newton step
  - You need double derivative (Example in note is a simple root finding problem)

Gradient Descent

Very popular method and does not need any write up
Exhibits approximately linear convergence
Advantage
- Very simple to implement
Disadvantage
- Convergence rate depends on number of the Hessian
- Very slow when for large no of variables (say 1000 or more)
- Performance depends on choice of parameters like learning rate

Gradient Descent vs Netwon’s Method

Edit 1 : I believe term step is more suitable here. Gradient step and Newton step. Method is more suitable for interior point methods, active set methods, cutting plane methods and proximal methods.

Newton’s method for finding roots:

In optimization we are essentially finding roots of derivative.

We are approximating function using Taylor series. We are finding minimum of this approximated function. This root will be our new guess. This perspective is used in derivation.

Andrew N.G Lecture [2]

Gradient Descent	Newton
Simpler	Slightly more complex (Requires computing and inverting hessian)
Needs choice of learning rate alpha	No parameters (third point in image is optional )
Needs more iteration	Needs fewer iteration
Each iteration is cheaper O(n) where n is no of features	Each iteration is costly. Hessian is (n+1) * (n+1). Inverting matrix is roughly O(n^3)
Use when no of features are less (n<1000)	Use when (n > 10,000)

Gradient can only give you direction. If you want to know next point (or step size), second order derivative (hessian) is must.

Golden Section Search

Typically applicable for one dimension only
We used it to calculate mobile/tablet adjustments
- As a good practice we had avoided recursion and took at max 20 iteration breaking loop with some criterion
Applicable for strictly unimodal function
Three points that maintain golden ratio (phi)
Following two problems are different :
- Searching in rotated sorted array
- Finding a max on uni-modal function
- Formal has a property that left one end will be smaller than other, which later does not have
  - This property is used in binary search
What is special about golden ration ?
- https://www.quora.com/What-is-so-special-about-the-golden-ratio

Bisection method is okay to find root, but for finding extreme golden section is preferred
- http://www.aip.de/groups/soe/local/numres/bookcpdf/c10-1.pdf
Sample code :

	from math import sqrt
	phi = (1 + sqrt(5))/2
	resphi = 2 – phi

	# a and b are the current bounds; the minimum is between them.
	# c is the center pointer pushed slightly left towards a
	def goldenSectionSearch(f, a, c, b, absolutePrecision):
	if abs(a – b) < absolutePrecision:
	return (a + b)/2
	# Create a new possible center, in the area between c and b, pushed against c
	d = c + resphi*(b – c)
	if f(d) < f(c):
	return goldenSectionSearch(f, c, d, b, absolutePrecision)
	else:
	return goldenSectionSearch(f, d, c, a, absolutePrecision)

	f = lambda x: x**2

	def test_search():
	assert abs(goldenSectionSearch(f, -1, (-1 + resphi*2), 1, 1e-10)) < 1e-10
	print "OK!"

	if __name__ == '__main__':
	test_search()

view raw goldensection.py hosted with ❤ by GitHub

Reference

[0] http://www.ece.mcmaster.ca/~xwu/part4.pdf

*[1] https://www.youtube.com/watch?v=42zJ5xrdOqo

*[2] https://www.youtube.com/watch?v=iwO0JPt59YQ

[3] https://suzyahyah.github.io/calculus/optimization/2018/04/06/Taylor-Series-Newtons-Method.html

Data Stories

Tag gradient

Gradient Boosting

Prof Balaraman notes

Basis Functions

FSAM

Functional Gradient Descent

Reference

Iterative Method for Unconstrained Optimization

Newton’s Method

Gradient Descent

Gradient Descent vs Netwon’s Method

Golden Section Search

Reference