Gradient Boosting

Some people use term bagging and boosting, some use parallel and sequential ensembles. I have written a post on AdaBoost previously. So we are familiar with boosting idea.

In AdaBoost we had a formula for generating labels (label/sample weight in loss function to be precise) for next iteration. In gradient boosting we fit gradient of loss function with respect to predictions.

Prof Balaraman notes

  • Most of this post contains info from [0]. However Prof. Balaraman offers intuitive perspective in one of his lectures. [3]
  • My handwritten note on that is available at [2]
  • We can write decision tree as equation where regions and value in the region are parameters.
  • Two steps of gradient boosting
    1. Fit gradient of loss function using regression trees (least square loss)
    2. Compute values in region using loss defined for application (ndcg for ranking, log-loss for classifier anything)
  • Whatever is application loss, we fit regression tree using least square
  • Parametric gradient descent can be seen as series of summation starting from initial guess.
  • Related paper [4]

Basis Functions

  • Basis functions
    • If we want to fit non linear curve with simple linear regression, we will create polynomial features
    • It is not limited to regression though
  • Adaptive Basis function
    • Instead of we defining above functions, we learn them
    • This can be framed as optimisation problem and can be solved by gradient descent/newton’s method for some hypothesis space
  • How to include Tree’s in base hypothesis space
    • Something which we can not differentiate
    • Even if we fix parameters to split on output is still not continuous
    • So it is difficult

FSAM

  • FSAM
    • Forward Stage-wise Additive Modelling
    • AdaBoost is example of this, another example is L2 boosting
    • We can say weak function is step direction and coefficient is step size
  • L2 Boosting also example of FSAM
    • We are fitting residual
    • To prevent overfitting we can reduce step size
    • If v*h(x) is also in H, we can remove v. (closed under rescaling)
    • h(x) can be decision stump we can keep going as long as residuals exist (It will certainly overfit after few iterations)
  • Adaboost
    • It works for classification problem
    • When the loss function in FSAM is exponential loss of margin, it reduces adaboost
    • margin = y*f(x)
      • We had encountered that in SVM
  • Disadvantages of AdaBoost
    • Gives too much penalty for negative margin, look at the curve of exponential above
    • Mislabel training data
    • Sometimes there is noise in world
    • Does not work when there is high baye’s error
    • Also it seems AdaBoost formulation is very much tied to classification
      • We probably needs to tweak it to apply it to ranking problem
      • But gradient boosting can be applied to any problem (loss function)
        • Because Tree is alway using L2 loss
Slide from [1]

Functional Gradient Descent

  • Gradient is pseudo residuals
    • For L2 loss it is actually residuals (difference between y and f)
    • Examples where residuals are not just difference between y and f
      • logistic loss
      • ranking using ndcg
  • We fit this gradient(pseudo-residuals) with weak function
    • Regression Tree is most common
    • We find closest approximation to gradient step in base hypothesis space (which is functional space)
    • Here we minimize square loss (regression problem)
  • Earlier tree used to be just stumps
    • That is num of leaf node = 2
    • Now a days xgboost/light gbm takes upto 8/16 leaf nodes as well
  • Note that while calculating gradient the hypothesis function is not involved anywhere
  • How do we calculate gradient over function space
    • In close form expression it might look complicated
    • From numerical computation perspective we need gradient at just one point (of dimension training data size)
    • This is just like taking gradient of weights in linear regression
      • Here size of gradient vector would be number of training examples
    • In most of the method we take gradient of parameters, here we are taking gradient of predictions
  • What should be step size
    • Line search
    • Regularisation parameter (similar to what we use for any gradient based optimisation)
      • This is more common

Reference

[0] https://www.youtube.com/watch?v=fz1H03ZKvLM

[1] https://davidrosenberg.github.io/mlcourse/Archive/2016/Lectures/8b.gradient-boosting.pdf

[2] https://github.com/arcarchit/datastories/blob/master/notes/gradient_boost_iitm.pdf

[3] https://www.youtube.com/watch?v=FJ-MA0EfJ9k

[4] https://statweb.stanford.edu/~jhf/ftp/trebst.pdf

log-linear and log-log regression

In linear regression taking log is popular way to make relationship linear. We can take log of either response or predictor or both. This gives us four classes [0]

log_llog

Interpretations of β

  • Linear Model
    • the coefficient β gives us directly the change in Y for a one-unit change in X
  • Linear Log Model
    • β is the expected change in Y when X is multiplied by e (natural log)
  • Log Linear Model
    • Each 1-unit increase in X multiplies the expected value of Y by e β
  • Log Log Model
    • multiplying X by e will multiply expected value of Y by e βˆ

I have coded notebook to see the curves for all four at [1]. 

A note on Normalisation

  • Suppose you need to normalise a data to bring it between 0 and 1. This will be feed into some linear function (say ranking function without supervised response y). If you data is exponentially distributed instead of dividing by max, you can try log(sample)/log(max).  This is like feature transformation by taking log and then normalising it.
    • That’s about normalising variable which seems to have exponential distribution
  • Also when you check correlation with response variable and you see plots like [1] you know which transformation to take.
References :

[0] https://kenbenoit.net/assets/courses/ME104/logmodels2.pdf
[1] https://github.com/arcarchit/datastories/blob/master/notebooks/log_linear_models.ipynb

Oversampling and Under-sampling

When data is class-imbalanced there is a tendency to predict majority class. One way to tackle this would be apply more weight to minority classes in cost function. Another way is oversampling and under-smapling.

  • Over-sampling makes duplicate copies of minority classes
  • Under sampling randomly removes some samples from majority class
    • This should be used with caution
    • We need to check once that we still remain with enough sample for a given no of features
  • Practically we might want to over sample some classes and under-sample others.

 

Cross validation

  • Validation set should be taken out from original data[1]
    • We can do the sampling just before training only on training data

 

Reference

[0] : https://www.kaggle.com/residentmario/undersampling-and-oversampling-imbalanced-data

[1] : https://www.marcoaltini.com/blog/dealing-with-imbalanced-data-undersampling-oversampling-and-proper-cross-validation

 

Generative and Discriminative Models

Introduction

In machine learning, there are two broad categories of models: generative models and discriminative models. In this blog post, we will discuss some popular examples of each and their capabilities.

Generative Models

  1. Naive Bayes Classifier
  2. Hidden Markov Models
  3. Latent Dirichlet Allocation
  4. Boltzmann Machine
  5. Gaussian Mixture Model ( Unsupervised clustering)

Generative Models Explanation

Generative models allow us to generate datasets for specific classes, labels, or clusters. They estimate the joint probability distribution P(X, y). An example of this is the Naive Bayes Classifier, where we have a probability associated with each (X, y) pair. For classification, we predict y based on the highest probability given x.

𝑎𝑟𝑔𝑚𝑎𝑥𝑦𝑃(𝑌=𝑦|𝑋=𝑥)=
𝑎𝑟𝑔𝑚𝑎𝑥𝑦𝑃(𝑌=𝑦,𝑋=𝑥)/𝑃(𝑋=𝑥) = 𝑎𝑟𝑔𝑚𝑎𝑥𝑦𝑃(𝑌=𝑦,𝑋=𝑥)/𝑃(𝑋=𝑥)

Generative models offer more than just prediction. They can be used to:

  • Impute missing data
  • Compress datasets
  • Generate unseen data

Discriminative Models

  1. Logistic Regression
  2. Support Vector Machines
  3. Decision Trees

Discriminative Models Explanation

Discriminative models focus on learning the boundaries that separate different classes directly. They do not model the entire joint probability distribution but instead estimate the conditional probability P(y|x). Examples of discriminative models include Logistic Regression, Support Vector Machines, and Decision Trees.

Which ML algorithm to use when?

We generally find this question while starting up new project or we want to compare some algorithms to discriminate them (discrimination helps understand things better sometimes).

Although there is no definitive answer to this, I am writing here summaries from some posts. [0]

 

machine-learning-cheet-sheet

 

Factors to consider

  • Accuracy
    • Most of people focus on accuracy but it practically it is not the only
  • Training time
    • Naive Bayes and logistic regression are much faster than boosting or neural nets
  • Linearity
    • LR and SVM are suitable when classes are linearly seperatble
      • Of course SVM bypasses it via kernel trick but still not as much complex decision boundary as nueral nets
    • Despite the risk of non linearity in data linear algorithms tends to work well in practice and are often used as starting point
  • Number of parameters
    • Parameters does affect training time and accuracy
    • More parameters helps learning complex function, however it requires more data to prevent over-fitting
  • No of features
    • When data point are not enough for no of features (text, NLP) SVM works well

 

Notes

  • Try out linear/logistic regression, SVM first when you most dependent variables are numeric.
  • SVM
    • SVM suites more when no of data points are less for given no of features.
    • SVM is linear classifier only. It just uses kernel trick to project linearly inseparable data on high dimension.
    • SVM is solved by mathematical optimization problem unlike nueral nets. Hence tends to be bit faster.
  • What is the difference between LR and SVM?
    • LR has linear decision boundary while SVM can have non linear decision boundary.
  • Reinforcement learning
    • Analyses and optimized behavior of agent, (via feedback from environment)
    • They try to discover different actions to maximize reward
    • Trial-error and delayed reward distinguishes reinforcement learning from other ML algorithms

 

References

[0] : https://blogs.sas.com/content/subconsciousmusings/2017/04/12/machine-learning-algorithm-use/#prettyPhoto

[1] : https://docs.microsoft.com/en-us/azure/machine-learning/studio/algorithm-choice

 

Naive Bayes Classifier

  • There are two things[1]
    • Probability model
    • Classification model

Probability Model

A probability model is an extension of Bayes’ rule. It makes two assumptions:

  1. Independence of Features: This assumption assumes that all features are independent of each other. However, it does not hold true in many cases. For example, having higher temperature does not necessarily imply higher humidity.
  2. Equal Weight of Features: This assumption assumes that all features have equal importance or weight in the model.
b1
b2

Classification Model

The classification model involves the following steps:

  1. Probability of Each Class: P(y) represents the probability of each class based on the training set.
  2. Probability Estimation of Feature Values: The goal is to estimate the probability distribution of each feature value given a specific class, denoted as P(x_i|y). For discrete features, this can be achieved through simple probability calculations, such as multinomial Naive Bayes. For continuous features, Gaussian distributions can be used. In the case of count data, multinomial distributions are suitable.
  3. Parameter Estimation: Parameter estimation is performed for each combination of class and feature.
  4. Scikit-learn and Distribution Types: Scikit-learn library provides implementations of Gaussian Naive Bayes, Bernoulli Naive Bayes, and multinomial Naive Bayes classifiers. These classifiers refer to the distribution of features. It is important to note that different features can follow different distributions. Therefore, customization of the distribution based on the application may be necessary.
b3
b4

Advantages

  • Fast and Easy Implementation: Naive Bayes classifiers are known for their simplicity and efficiency in implementation.
  • Acceptable Classification Performance: While Naive Bayes classifiers may not always accurately predict probabilities, their classification performance is generally satisfactory.

Disadvantage

  • Independence Assumption: The assumption of feature independence does not hold true in all scenarios, which can affect the model’s accuracy.

Reference

[0] https://towardsdatascience.com/naive-bayes-classifier-81d512f50a7c

[1] https://en.wikipedia.org/wiki/Naive_Bayes_classifier

Linear Regression Derivations

  • Suppose you have just two parameters a and b and have n observation
    • y = a + b*x
    • a and b can be found by simple optimization problem [1]
      • By setting partial derivative to 0
      • br2derivation

 

Simpler R2 formula is available at https://datastoriesweb.wordpress.com/2017/01/15/interpreting-statistical-values/

 

Matrix closed form Solution

gradient                           closed_form

 

Gradient Descent

  • We assume some initial values for parameters
  • We calculate gradient for each parameter and move in the negative direction of it by step size
  • descent

 

I have coded gradient descent here at [2]

 

 

References :

[1] http://seismo.berkeley.edu/~kirchner/eps_120/Toolkits/Toolkit_10.pdf

[2] https://github.com/arcarchit/datastories/blob/master/gradient_descent.ipynb

Bayesian Learning – Quick Summary

We had used Bayesian learning for house price prediction project, notebook is available at [0]. Purpose of this blog is to have quick summary of concept involved.

  • Bayesian learning allows us to have distribution for parameters rather than point estimate.
  • We keep sampling values of parameters say 10000 times. Histogram of last 6000 sample represent approximate posterior of parameter.
  • Every parameters, latent variables, output have distribution associated with them.
  • Top parameters in the hierarchy will have prior associated with them.
    • We sample for these top parameters
    • Calculate corresponding value for all latent variable coming down the tree
    • Finally at output calculate likelihood against observed data
  • Different MCMC algorithms differs in two steps:
    • How to jump to next to next sample
    • How to decide whether next sample is acceptable or not
  • Metropolis algorithm:
    • Jumps considering normal distribution at previous parameter value and some fixed standard deviation
    • Acceptance test : random(0,1) < ratio of (Pnew/Pold)
      • When Pnew is higher we will definitely accept
      • Else probability of acceptance depends on how large Pold is
      • If Pold = 2 * Pnew it is 50 %
    • Pnew = liklihood_new * prior_new
    • Pold = liklihood_old * prior_old
  • Example of other MCMC algorithms:
    • Metropolis-Hastings
    • The Gibbs Sampler
    • Hamiltonian MCMC
    • The No-U-Turn Sampler (NUTS)

Reference

[0] : https://github.com/arcarchit/House-Prices-Advanced-Regression-Techniques/blob/master/notebooks/Bayesian%2BMaster.ipynb

[1] : http://twiecki.github.io/blog/2015/11/10/mcmc-sampling/

[2] : https://www.quantstart.com/articles/Bayesian-Inference-of-a-Binomial-Proportion-The-Analytical-Approach

Overfitting in Decision Trees

We had previously written about classification trees and regression trees.

In this blog we will mainly write about preventing over fitting in decision trees.

Handling missing values for decision tree is described here.

Over-fitting

  • For decision trees as depth increases training error always go towards zero.
  • Occam’s razor philosophy
    • Suppose there exists two explanation for an occurrence
    • Then less complex explanation is generally correct
    • You have headache and stomachache.
    • Disease D1 explains headache, D2 explains stomachache, while D3 explains both
    • Instead of diagnosing patient with both D1 and D2, saying it D3 would be more accurate in general
  • So when you have two trees with same validation error, choose one with lesser complexity
  • There are two approaches:
    • Early Stopping
    • Pruning

Early Stopping

There can be three stopping condition

  1. Pre define max_depth
    1. It can be tuned via cross validation, which increases training time
    2. Does not work well when you have less data
  2. Stop if decrease in classification error is less than some threshold
    1. It can become tricky as in case of XOR
    2. XOR is very famous case in classification literature 
  3. Stop if very few point are left in the node
    1. This works pretty well in practice
    2. Stop if node has say 10 to 100 sample

Pruning

  • Pruning generally is better solution than early stopping
  • Here we build the entire tree first and than remove certain node
  • Criterion for removing nodes:
    • cost(tree) = error(tree) + λ*L(tree)
    • L(tree) = no of leaf nodes
  • if cost(deep_tree) > cost(small_tree) then don’t split that node
  • Algorithm:
    • Start from bottom up the tree and traverse up and test one node each time
    • Calculate cost
    • Make decision to prune it or not

References:

Course by University of Washington

https://www.coursera.org/learn/ml-classification

Handling missing values in Decision Tree

We had previously written about classification trees and regression trees.

In this blog we will mainly write about handling missing values specifically for decision trees.

Handling over-fitting for decision tree is described here.

Missing Values

Missing value can cause issues at :

  1. Training time
  2. Prediction time
    1. What if while prediction sample does not have value for some feature

Purification by Skipping

  • This approach involves skipping rows or entire features that contain missing values.
  • However, this method comes with a drawback as it results in the loss of valuable information.
  • It is important to note that skipping missing values during training will not resolve the issue during prediction time

Imputation

  • Imputation refers to the process of filling in missing values with estimated or imputed values.
  • It provides a means to address missing values during both training and prediction.
  • A dictionary can be created to store imputed values for each feature, which can then be utilized during prediction.
  • For categorical features, a common imputation method is to use the mode, while for numeric features, the average or median is often employed.
  • Advanced techniques, such as the expectation maximization algorithm, can also be utilized for imputation.

However, it’s essential to be aware that imputation can introduce systematic biases into the data. For example, let’s consider a scenario where people from Washington tend to not provide their age, while the rest of the United States does. This could lead to a systematic bias in the imputed values, as the imputation process may assign ages based on data from other regions.

Algo specific technique

In decision tree algorithms, there are specific techniques to handle missing values. At each node in the decision tree, we need to determine which branch the missing value will take. It’s possible for the same feature to have different values at different nodes. To address this, we can make a tweak in the decision tree algorithm:

  1. During feature selection, consider assigning the missing value to each possible branch.
  2. Evaluate the effect of assigning the missing value to each branch by measuring the reduction in classification error.
  3. Choose the feature and branch that result in the highest reduction in classification error when the missing value is assigned.

However, what if we encounter a node during prediction where we didn’t have any missing values during training? In such cases, we can store the direction with the most number of samples at that node as the default direction for missing values. Let’s call this direction “max_sample.”

If we encounter missing values while training the decision tree, we can override the default “max_sample” direction with the observed direction for the missing values.

By incorporating these techniques, we can effectively handle missing values in decision trees and make informed decisions based on the available data.

References:

Course by University of Washington

https://www.coursera.org/learn/ml-classification