Ensemble, Bagging and Boosting

Ensembling is a method of combining more than one models to generate a final output.(Reference [1])

There are two ways of doing that:

  1. Bagging
  2. Boosting

 

Bagging Boosting
  • We take subset of data and train different models
  • Example Random forest
    • It takes subset of data as well as subset of features
  • Pros of random forest
    • Handles high dimensions
    • Handles missing values
  • Cons of random forest
    • It won’t give precise value regression because final value is mean from subset tress
    • None the less people are using it for regression depending upon domain
  • We train different model on with same data. Each sample is assigned different weight in each iteration
  • Example AdaBoost, XgBoost
  • Pros of XgBoost
    • Supports different loss function
    • Works well with interactions
  • Cons of XbBoost
    • Prone to overfitting
    • Tuning of hyper parameters is critical

 

Pros-cons of bagging vs boosting:

  • Bagging is easy to parallelize and hence training is faster
  • Boosting is more efficient for fixed no of iterations (classifiers)

 

AdaBoost vs XgBoost

Reference : [2]

ada_vs_xg

Quote from Tianqi Chen, one of the developers of XGBoost:

Adaboost and gradboosting [XGBoost] are two different ways to derive boosters. Both are generic. I like gradboosting better because it works for generic loss functions, while adaboost is derived mainly for classification with exponential loss.

 

Reference :

[1] https://towardsdatascience.com/decision-tree-ensembles-bagging-and-boosting-266a8ba60fd9

[2] https://www.packtpub.com/mapt/book/big_data_and_business_intelligence/9781788295758/4/ch04lvl1sec34/comparison-between-adaboosting-versus-gradient-boosting

 

AdaBoost

  • AdaBoost stands for adaptive boosting
  • Boosting is a technique for creating better classifier form several regular/weak/base classifiers
  • Thought:
    • We want to train second classifier in such a way that training data has more samples for the case where first classifier was wrong
    • Third classifier will have more samples where first and second classifier contradicts
  • How do we accomplish above:
    • We assign weights to each sample
      • This weights will be used in error function for each sample
      • Error function would be sum of weights of incorrectly classified samples
      • This weights keep changing in each iteration
    • We also assign wight to each base (weak) classifier (α In below algorithm)
      • Formula for calculating α below is valid for binary classification
        • It will be different for multi class classification
  • When do we stop iteration ?
    • Combined classifier is predicting all sample correctly
    • No more base classifier left
  • Most machine learning algorithms allows to apply weight to data
    • In logistic regression we can multiply weight in error fucntion
    • In decision tree, classification error can be calculated by summing weights of missclassified samples

 

adboost

 

Formulations in above algorithm:

  • Formula of α
    • if ε=0.5 then α=0. It is just a random chance classifier.
    • Else it can positive/negative.
    • ε < 0.5 => negative α => we invert the prediction in weighted average
  • Formula of D_(t+1)(i)
    • y*h is just a sign
      • y and h are  {-1, 1}
      • y*h will be positive for correct prediction
    • Why it is function of ε
      • Random classifier updating weights does not make sense
  • By Z_t we are normalizing weights, which ensures numerical stability

 

 

Adaboost Theorem

  • Error -> 0 as no of iteration -> infinity at certain condition
  • Condition is here is that each learner has error < 0.5
  • Each iteration means adding one weak learner

 

Over-fitting

  • Contrary to weak classifier (decision stumps) we saw that rate of increase test error is less with boosting
  • However no of iteration still should be selected via cross validation for small data-set and validation for large data-set

 

References:

Click to access MIT6_034F10_boosting.pdf

Click to access msri.pdf

 

Regression Trees

Figure below shows how decision tree creates rectangle in predictor space:

As we have describe classification tree, the only difference here is how to come up with output instead of label and how to define a metric other than entropy/gini.

For output it uses the mean of leaves.

For metric CART methodology uses simple SSE.

  • SSE = ∑ (y_i – y1)² + ∑(y_i – y2)²
    • where y1 and y2 are means of two newly crated groups.
    • It is simple sum of squares, not standard deviation 
    • While deriving R2 in linear recession we were concerned with sum of squares
    • Ward linkage criterion in hierarchical clustering also uses difference in sum of squares
  • diff = SSE_before – SSE_after
    • We choose a predictor which minimizes SSE the most.
    • diff is highest

My hand-coded regression tree is available at : https://github.com/arcarchit/datastories/tree/master/regression_tree

Reference

Applied predictive modeling by Max Kuhn and Kjell Johnson

Classification Trees

How classification Tree creates Rectangle in Predictor Space

Criterion for Choosing the Splitting Predictor

When deciding which predictor to split on, we need to consider different types of predictors: continuous, binary, and categorical. In this post, we will focus on continuous and binary predictors.

For categorical predictors, one approach is to use the “one vs rest” strategy. We evaluate each category separately and observe which one reduces the randomness the most when split.

In the case of a continuous predictor, we first sort the values and then select the midpoint as the split point.

To keep our decision tree simple, it’s ideal to have a small tree size. Therefore, at each step, we should choose the split that leads to the purest child nodes.

Two commonly used criteria for measuring impurity are Gini and Entropy. It’s important to note that these criteria are not directly used to select the predictor to split on. Instead, we calculate the difference in impurity before and after splitting.

When computing Gini or entropy after splitting, we apply weights to both splits to account for their proportions.

Multi-class

Decision tree can naturally handle multi class problem as entropy and gini can be calculated for multi-class.

Entropy

  • Formula:
    • H = ∑ -p * log(p)
  • About :
    • Randomness
    • Information Carries
    • Highest when p=0.5
    • Higher Entropy => Higher Randomness => Carries more information
    • After splitting we want entropy to reduce
  • Initial :
    • n0 positive and m0 negative sample
    • g0 = (n+m) total samples
  • After Splitting
    • Group 1:
      • n1 positive
      • m1 negative
      • g1 = (n1 + m1) total
    • Group 2:
      • n2 positive
      • m2 negative
      • g2 = (n2 + m2) total
  • Before Entropy :
    • H_before =  -(n0/g0) log(n0/g0) – (m0/g0)log(m0/g0)
  • After Entropy :
    • H1 = -(n1/g1) log(n1/g1) – (m1/g1)log(m1/g1)
    • H2 = -(n2/g2) log(n2/g2) – (m2/g2)log(m2/g2)
    • H_after = (g1/g0) * H1 + (g2/g0)*H2
  • diff = H_before – H_after
  • Select a predictor for which diff is highest and split on it
    • Which means we are selecting a predictor which reduces randomness more
    • Which also mean we are selecting a predictor which reduces information more
      • So we are selecting a variable which carries more information
        • More important feature

Gini

  • Gini for 2 class :
    • G = p1 * (1-p1) + p2 * (1 – p2)
    • Using (p1 + p2 = 1) we can derive that G = 2*p1*p2
  • Gini in general:
    • G = ∑ p * (1 – p) = 1 – ∑ p²
  • Like entropy gini is maximum when p1 = p2 = 0.5
    • In this case nodes are least pure
  • Gini is a measure of impurity which we want to reduce by splitting on predictor
  • diff = G_before – G_after
    • We will select a predictor for which diff is highest

Entropy vs Gini

  • Given a choice gini will be better as computing log is costlier.
  • Both of them are better than “classification error as metric” [1]
    • Which would be selecting majority class for calculating classification error
    • It is useful while pruning but not while growing the tree

ROC Curve

  • ROC curve can be constructed easily for classifier which outputs ranking.
  • On the contrary decision tree outputs label
  • However to get a ROC we can use workaround. Each leaf node will have some positive and negative samples and we can calculate probability as (# positive / # total).
  • Then we can change the threshold and plot ROC.

Regularization

Regularization in decision tree will be explained in separate blog in detail, but here is the list of candidate techniques.

  1. limit max. depth of trees
  2. Cost complexity pruning
  3. ensembles / bag more than just 1 tree
  4. set stricter stopping criterion on when to split a node further (e.g. min gain, number of samples etc.)

Reference

[0] : Applied predictive modeling by Max Kuhn and Kjell Johnson

[1] : https://github.com/rasbt/python-machine-learning-book/blob/master/faq/decision-tree-binary.md

On Clustering

K-mean is probably most popular algorithm and most taught algorithms in academia. However it has got many limitation and listing some of them here:

  • You need to specify value of k
  • Can cluster non-clustered data
  • Sensitive to scale
  • Even on perfect data sets, it can get stuck in a local minimum
  • Means are continuous
  • Hidden assumption: SSE is worth minimizing
  • K-means serves more as quantification

 

In Hierarchical clustering you don’t need to specify values of k, you can sample any level from the tree it build either by top down or bottom up approach. Such a tree is called Dendrogram.

Scikit also supports variety of clustering algorithms including DBSCAN and lists which one suits when. http://scikit-learn.org/stable/modules/clustering.html

 

References:

https://stats.stackexchange.com/questions/133656/how-to-understand-the-drawbacks-of-k-means

 

No Free Lunch (NFL)

When averaged across all possible situations, every algorithm performs equally well 🙂

Imagine you have two different classification problems: one involves identifying fraudulent credit card transactions, and the other involves diagnosing medical conditions based on patient symptoms. These two problems have distinct characteristics and requirements.

Now, let’s say you have Algorithm A, which performs exceptionally well on the credit card fraud detection problem. It has been extensively optimized and fine-tuned for this specific problem domain, resulting in high accuracy and low false positive rates.

On the other hand, you have Algorithm B, which excels in diagnosing medical conditions. It has been trained on a vast dataset of patient records and has been shown to provide accurate diagnoses with minimal errors.

If you were to switch the algorithms and apply Algorithm A to the medical diagnosis problem and Algorithm B to the credit card fraud detection problem, you may encounter subpar performance. Algorithm A, which was designed specifically for fraud detection, might struggle to correctly identify medical conditions based on symptoms. Similarly, Algorithm B, which was trained on medical data, may not be effective at detecting credit card fraud.

This example illustrates the essence of the No Free Lunch theorem. While Algorithm A and Algorithm B excel in their respective problem domains, they do not have universal superiority. Each algorithm is designed and optimized for a specific problem, taking into account its unique characteristics and requirements.

The NFL theorem reminds us that when approaching a new problem, it is essential to carefully analyze its specific characteristics, data distribution, and requirements. It is necessary to evaluate and select an algorithm or approach that aligns well with those specific factors, rather than assuming that a single algorithm can provide optimal performance across all problem domains.

In the context of the No Free Lunch (NFL) theorem, an algorithm refers to a specific computational method or procedure used to solve a problem or perform a task. Algorithms can be mathematical, statistical, heuristic, or machine learning-based, among others. Example being Logistic regression, CNN, RNN etc.

In summary, the No Free Lunch theorem emphasizes the importance of considering problem-specific factors and carefully selecting algorithms or approaches that suit the unique requirements of each problem. Different problems may require different algorithms, and there is no universally superior algorithm that can perform optimally for all possible problems.

http://www.no-free-lunch.org/

http://scholar.google.com/scholar?q=”no+free+lunch”+Wolpert&#8221;

Support Vector Machines

Maximum margin classifiers

  • Also known as optimal separating hyperplane
  • Margin is the distance between hyperplane and closest training data point
  • We want to select a hyperplane for which this distance is maximum
  • Once we identify optimal separating hyper plane there can be many equidistance training points with the shortest distance from hyperplane
  • Such point are called support vectors
    • These points support the hyperplane in a sense that if they are moved slightly optimal hyperplane will also move

Training

svm1
  • Equation 9.10 ensures that left hand side of equation 9.11 gives perpendicular distance from the hyperplane
  • Equation assumes y has two values (+1) and (-1)

Support Vector classifier

  • Maximum margin classifier does not work when supporting hyperplane does not exist.
  • Support vector classifier relaxes optimization objective to get that work
  • Unlike maximum margin classifier this one is less prone to overfit as well
    • The formal one is very sensitive to change in single observation
  • Also know as soft margin classifier
svm2
  • ε variable allows training point to be on wrong side of margin
    • If ε > 0 it is on the wrong side of margin
    • If ε > 1 it is on the wrong side of hyperplane
  • Parameter C is the budget that constraints how many points are allowed on wrong side of hyperplane
  • C is selected with cross validation and controls bias variance trade off
  • Point that lies directly on the margin or on the wrong side of margin for their class are called support vectors
    • Because these points affects the choice of hyperplane
    • And this is the property which makes it robust to outliers
      • LDA calculates mean of all the observation
      • However LR is less sensitivity to outliers
  • Computation note – when we try to solve above optimization problem with lagrange multiplier we found that it depends on dot product of training samples
    • This will be very important when we discuss support vector machine in next section
    • Dot product is only with support vector, both while training and while solving [1]
svm3.PNG

Support vector machine

  • Above two classifier does not work when desired decision boundary is not linear
  • One solution is to create polynomial features (as we generally do for LR)
  • But fundamental problem with this approach is that how many and which terms you should create
  • Also creating large number of feature raises computational problem
  • For the case of SVM, that fact that it involved only dot product of observation allows us to perform kernel trick.svm4svm5svm6svm7
  • Kernel acts as similarity function
  • Above equation makes it clear that we are not calculating(and storing) higher order polynomial still taking the advantage of it
  • Second one is polynomial kernel and last one is radial kernel
  • This video shows visualization of kernel trick

Multiclass SVM

 Hinge Loss

  • Loss is 0 for w*x which are not support vectors
  • Look at equation 9.14 and 9.15 above
    • Looking at its property episilon, it relates to hinge loss
    • Yet to check it in SVM lagrange solution though
  • In hinge loss also we would predict +1 if w*x > 0 else -1
  • Why not keep boundary at 0
    • This can make w a zero to make positive training example right
    • Zero parameter vector would mean while prediction we would always get 0
      • Because prediction is w*x
      • This makes it difficult to assign labels during prediction
    • We keep boundary away from 0 to make zero we don’t get 0 parameter vector

References

[0] : An Introduction to Statistical Learning – http://www-bcf.usc.edu/~gareth/ISL/

[1] : https://towardsdatascience.com/understanding-support-vector-machine-part-2-kernel-trick-mercers-theorem-e1e6848c6c4d

Comparing LDA and LR

 

Few Points:

  • LR model probability with logistic function
  • LDA models probability with multivariate Gaussian function
  • LR find maximum likelihood solution
  • LDA find maximum a posterior using Bayes’ formula

When classes are well separated

When the classes are well-separated, the parameter estimates for logistic regression are surprisingly unstable. Coefficients may go to infinity. LDA doesn’t suffer from this problem.

LR gets unstable in the case of perfect separation

If there are covariate values that can predict the binary outcome perfectly then the algorithm of logistic regression, i.e. Fisher scoring, does not even converge.

If you are using R or SAS you will get a warning that probabilities of zero and one were computed and that the algorithm has crashed.

This is the extreme case of perfect separation but even if the data are only separated to a great degree and not perfectly, the maximum likelihood estimator might not exist and even if it does exist, the estimates are not reliable.

The resulting fit is not good at all.

Math behind LR

629

For example suppose y = 0 for x=0 and y=1 for x = 1. To maximize the likelihood of the observed data, the “S”-shaped logistic regression curve has to model h(Θ) as 0 and 1. This will lead β to reach infinite, which causes the instability.

 

Few terms:

Complete Separation – when x completely predicts both zero and 1

Quasi-Complete separation – when x completely predicts either 0 or 1

When can LDA fail

It can fail if either the between or within covariance matrix(Sigma) is singular but that is a rather rare instance.

7

In fact, If there is complete or quasi-complete separation then all the better because the discriminant is more likely to be successful.

 

LDA is popular when we have more than two response classes, because it also provides low-dimensional views of the data.

In the post on LDA, QDA we had said that LDA is generalization of Fisher’s discriminant analysis (which involves project data on lower dimension to that achieves maximum separation).

 

LDA may result in information loss

The low-dimensional representation has a problem that it can result in loss of information. This is less of a problem when the data are linearly separable but if they are not the loss of information might be substantial and the classifier will perform poorly

Another assumption of LDA that it assumes equal covariance matrix for all classes, in which case we might go for QDA. Blog post on LDA, QDA list more consideration about the same.

 

References

https://stats.stackexchange.com/questions/188416/discriminant-analysis-vs-logistic-regression#

Click to access separation.pdf

Click to access Classification_HDD.pdf

On LDA, QDA

{In practice it is used more for classification than for regression}

This resemble gaussian mixture models in that you git one gaussian for each class. 
Don't forget one important difference though. LDA is supervised, Mixture models are unsupervised.

Linear Discriminant Analysis (LDA)

In logistic regression (LR), we estimate the posterior probability directly. In LDA we estimate likelihood and then use Bayes theorem. Calculating posterior using bayes theorem is easy in case of classification because hypothesis space is limited.

1234

Equation 2 computes probability of class k given x. This is a posterior instead of just point estimates.

Equation 4 is derived from equation 3 only. Probability(k) would be highest for the class for which Delta(k) will be highest.

LDA estimates mean and variance from data and uses equation 4 for classification.

5

We also need to estimate π_k, which I think would be n_k/N.

 

Assumptions made:

  • f(x) is normal
  • Variance(sigma) is same for all classes

 

When more than one predictor, we go for multivariate gaussian

67

Some comparisons

  • Compare this with mixture models, where there is a responsibility vector for each sample
    • There labels are not available (unsupervised learning) and hence is solved by EM (Expectation Maximization)
  • Compare this with naive bayes, there assumption is each feature is independent
    • Here we have parameter for each (class, feature), there we have parameter for each feature
    • Also here f captures probability of class (k) given x, there after bayes rules we calculate probability of x given class k
    • Hence the name naive bayes
    • Here we have joint distribution (multivariate gaussian, there it is independent distribution for each features)
    • Both LDA and navie bayes try to calculate posterior while logistic regression maximizes likelihood function

 

Quadratic Descriminant Analysis (QDA)

Unlike LDA, QDA assumes that each class has its own covariance matrix. It is called quadratic because below function is quadratic of x.

8

When to use LDA, QDA

  • This is related to bias variance trade-off
  • For p predict and k classes
    • LDA estimates k*p parameters
    • QDA estimates additional k*p*(p+1)/2 parameters
  • So LDA has much lower variance and classifier built can suffer from high bias
  • LDA should be used when number of training sample are less, because we want to avoid high variance problem
  • QDA has high variance, so it should be used when number of training samples are more
    • Another scenario would the case when common covariance matrix among K classes is untenable

 

A note on Fisher’s Linear Discriminant Analysis

  • It is simply LDA in case of two classes.
  • We can derive this similarity mathematically.
  • In literature we found it from the perspective that it project data on a line which achieves maximum separation
  • We can state without loss of generality that LDA also provides low dimensional view on data

 

Math

  • We want to project 2-D data on a line which
    • maximizes the difference between projected mean
    • minimizes within class variance
  • Such a direction (w) can be found by maximizing fisher criterion (J)

fisher1fisher2fisher3

fisher4fisher5fisher6fisher7