Ensemble, Bagging and Boosting

July 23, 2018October 25, 2020Archit Vora Leave a comment

Ensembling is a method of combining more than one models to generate a final output.(Reference [1])

There are two ways of doing that:

Bagging
Boosting

Bagging	Boosting
We take subset of data and train different models Example Random forest It takes subset of data as well as subset of features Pros of random forest Handles high dimensions Handles missing values Cons of random forest It won’t give precise value regression because final value is mean from subset tress None the less people are using it for regression depending upon domain	We train different model on with same data. Each sample is assigned different weight in each iteration Example AdaBoost, XgBoost Pros of XgBoost Supports different loss function Works well with interactions Cons of XbBoost Prone to overfitting Tuning of hyper parameters is critical

Pros-cons of bagging vs boosting:

Bagging is easy to parallelize and hence training is faster
Boosting is more efficient for fixed no of iterations (classifiers)

AdaBoost vs XgBoost

Reference : [2]

ada_vs_xg

Quote from Tianqi Chen, one of the developers of XGBoost:

Adaboost and gradboosting [XGBoost] are two different ways to derive boosters. Both are generic. I like gradboosting better because it works for generic loss functions, while adaboost is derived mainly for classification with exponential loss.

Reference :

[1] https://towardsdatascience.com/decision-tree-ensembles-bagging-and-boosting-266a8ba60fd9

[2] https://www.packtpub.com/mapt/book/big_data_and_business_intelligence/9781788295758/4/ch04lvl1sec34/comparison-between-adaboosting-versus-gradient-boosting

AdaBoost

July 23, 2018October 25, 2020Archit Vora 2 Comments

AdaBoost stands for adaptive boosting
Boosting is a technique for creating better classifier form several regular/weak/base classifiers
Thought:
- We want to train second classifier in such a way that training data has more samples for the case where first classifier was wrong
- Third classifier will have more samples where first and second classifier contradicts
How do we accomplish above:
- We assign weights to each sample
  - This weights will be used in error function for each sample
  - Error function would be sum of weights of incorrectly classified samples
  - This weights keep changing in each iteration
- We also assign wight to each base (weak) classifier (α In below algorithm)
  - Formula for calculating α below is valid for binary classification
    - It will be different for multi class classification
When do we stop iteration ?
- Combined classifier is predicting all sample correctly
- No more base classifier left
Most machine learning algorithms allows to apply weight to data
- In logistic regression we can multiply weight in error fucntion
- In decision tree, classification error can be calculated by summing weights of missclassified samples

adboost

Formulations in above algorithm:

Formula of α
- if ε=0.5 then α=0. It is just a random chance classifier.
- Else it can positive/negative.
- ε < 0.5 => negative α => we invert the prediction in weighted average
Formula of D_(t+1)(i)
- y*h is just a sign
  - y and h are {-1, 1}
  - y*h will be positive for correct prediction
- Why it is function of ε
  - Random classifier updating weights does not make sense
By Z_t we are normalizing weights, which ensures numerical stability

Adaboost Theorem

Error -> 0 as no of iteration -> infinity at certain condition
Condition is here is that each learner has error < 0.5
Each iteration means adding one weak learner

Over-fitting

Contrary to weak classifier (decision stumps) we saw that rate of increase test error is less with boosting
However no of iteration still should be selected via cross validation for small data-set and validation for large data-set

References:

Click to access MIT6_034F10_boosting.pdf

Click to access msri.pdf

Regression Trees

May 1, 2018May 21, 2023Archit Vora 2 Comments

Figure below shows how decision tree creates rectangle in predictor space:

As we have describe classification tree, the only difference here is how to come up with output instead of label and how to define a metric other than entropy/gini.

For output it uses the mean of leaves.

For metric CART methodology uses simple SSE.

SSE = ∑ (y_i – y1)² + ∑(y_i – y2)²
- where y1 and y2 are means of two newly crated groups.
- It is simple sum of squares, not standard deviation
- While deriving R2 in linear recession we were concerned with sum of squares
- Ward linkage criterion in hierarchical clustering also uses difference in sum of squares
diff = SSE_before – SSE_after
- We choose a predictor which minimizes SSE the most.
- diff is highest

My hand-coded regression tree is available at : https://github.com/arcarchit/datastories/tree/master/regression_tree

Reference

Applied predictive modeling by Max Kuhn and Kjell Johnson

Classification Trees

May 1, 2018May 21, 2023Archit Vora 3 Comments

How classification Tree creates Rectangle in Predictor Space

Criterion for Choosing the Splitting Predictor

When deciding which predictor to split on, we need to consider different types of predictors: continuous, binary, and categorical. In this post, we will focus on continuous and binary predictors.

For categorical predictors, one approach is to use the “one vs rest” strategy. We evaluate each category separately and observe which one reduces the randomness the most when split.

In the case of a continuous predictor, we first sort the values and then select the midpoint as the split point.

To keep our decision tree simple, it’s ideal to have a small tree size. Therefore, at each step, we should choose the split that leads to the purest child nodes.

Two commonly used criteria for measuring impurity are Gini and Entropy. It’s important to note that these criteria are not directly used to select the predictor to split on. Instead, we calculate the difference in impurity before and after splitting.

When computing Gini or entropy after splitting, we apply weights to both splits to account for their proportions.

Multi-class

Decision tree can naturally handle multi class problem as entropy and gini can be calculated for multi-class.

Entropy

Formula:
- H = ∑ -p * log(p)
About :
- Randomness
- Information Carries
- Highest when p=0.5
- Higher Entropy => Higher Randomness => Carries more information
- After splitting we want entropy to reduce

Initial :
- n0 positive and m0 negative sample
- g0 = (n+m) total samples
After Splitting
- Group 1:
  - n1 positive
  - m1 negative
  - g1 = (n1 + m1) total
- Group 2:
  - n2 positive
  - m2 negative
  - g2 = (n2 + m2) total

Before Entropy :
- H_before = -(n0/g0) log(n0/g0) – (m0/g0)log(m0/g0)
After Entropy :
- H1 = -(n1/g1) log(n1/g1) – (m1/g1)log(m1/g1)
- H2 = -(n2/g2) log(n2/g2) – (m2/g2)log(m2/g2)
- H_after = (g1/g0) * H1 + (g2/g0)*H2
diff = H_before – H_after
Select a predictor for which diff is highest and split on it
- Which means we are selecting a predictor which reduces randomness more
- Which also mean we are selecting a predictor which reduces information more
  - So we are selecting a variable which carries more information
    - More important feature

Gini

Gini for 2 class :
- G = p1 * (1-p1) + p2 * (1 – p2)
- Using (p1 + p2 = 1) we can derive that G = 2*p1*p2
Gini in general:
- G = ∑ p * (1 – p) = 1 – ∑ p²
Like entropy gini is maximum when p1 = p2 = 0.5
- In this case nodes are least pure
Gini is a measure of impurity which we want to reduce by splitting on predictor
diff = G_before – G_after
- We will select a predictor for which diff is highest

Entropy vs Gini

Given a choice gini will be better as computing log is costlier.
Both of them are better than “classification error as metric” [1]
- Which would be selecting majority class for calculating classification error
- It is useful while pruning but not while growing the tree

ROC Curve

ROC curve can be constructed easily for classifier which outputs ranking.
- For example logistic recession outputs probability where we can change the threshold.
On the contrary decision tree outputs label
However to get a ROC we can use workaround. Each leaf node will have some positive and negative samples and we can calculate probability as (# positive / # total).
Then we can change the threshold and plot ROC.

Regularization

Regularization in decision tree will be explained in separate blog in detail, but here is the list of candidate techniques.

limit max. depth of trees
Cost complexity pruning
ensembles / bag more than just 1 tree
set stricter stopping criterion on when to split a node further (e.g. min gain, number of samples etc.)

Reference

[0] : Applied predictive modeling by Max Kuhn and Kjell Johnson

[1] : https://github.com/rasbt/python-machine-learning-book/blob/master/faq/decision-tree-binary.md

On Clustering

February 1, 2018November 14, 2018Archit Vora Leave a comment

K-mean is probably most popular algorithm and most taught algorithms in academia. However it has got many limitation and listing some of them here:

You need to specify value of k
Can cluster non-clustered data
Sensitive to scale
Even on perfect data sets, it can get stuck in a local minimum
Means are continuous
Hidden assumption: SSE is worth minimizing
K-means serves more as quantification

In Hierarchical clustering you don’t need to specify values of k, you can sample any level from the tree it build either by top down or bottom up approach. Such a tree is called Dendrogram.

Scikit also supports variety of clustering algorithms including DBSCAN and lists which one suits when. http://scikit-learn.org/stable/modules/clustering.html

References:

https://stats.stackexchange.com/questions/133656/how-to-understand-the-drawbacks-of-k-means

No Free Lunch (NFL)

February 1, 2018May 18, 2023Archit Vora Leave a comment

When averaged across all possible situations, every algorithm performs equally well 🙂

Imagine you have two different classification problems: one involves identifying fraudulent credit card transactions, and the other involves diagnosing medical conditions based on patient symptoms. These two problems have distinct characteristics and requirements.

Now, let’s say you have Algorithm A, which performs exceptionally well on the credit card fraud detection problem. It has been extensively optimized and fine-tuned for this specific problem domain, resulting in high accuracy and low false positive rates.

On the other hand, you have Algorithm B, which excels in diagnosing medical conditions. It has been trained on a vast dataset of patient records and has been shown to provide accurate diagnoses with minimal errors.

If you were to switch the algorithms and apply Algorithm A to the medical diagnosis problem and Algorithm B to the credit card fraud detection problem, you may encounter subpar performance. Algorithm A, which was designed specifically for fraud detection, might struggle to correctly identify medical conditions based on symptoms. Similarly, Algorithm B, which was trained on medical data, may not be effective at detecting credit card fraud.

This example illustrates the essence of the No Free Lunch theorem. While Algorithm A and Algorithm B excel in their respective problem domains, they do not have universal superiority. Each algorithm is designed and optimized for a specific problem, taking into account its unique characteristics and requirements.

The NFL theorem reminds us that when approaching a new problem, it is essential to carefully analyze its specific characteristics, data distribution, and requirements. It is necessary to evaluate and select an algorithm or approach that aligns well with those specific factors, rather than assuming that a single algorithm can provide optimal performance across all problem domains.

In the context of the No Free Lunch (NFL) theorem, an algorithm refers to a specific computational method or procedure used to solve a problem or perform a task. Algorithms can be mathematical, statistical, heuristic, or machine learning-based, among others. Example being Logistic regression, CNN, RNN etc.

In summary, the No Free Lunch theorem emphasizes the importance of considering problem-specific factors and carefully selecting algorithms or approaches that suit the unique requirements of each problem. Different problems may require different algorithms, and there is no universally superior algorithm that can perform optimally for all possible problems.

http://www.no-free-lunch.org/

http://scholar.google.com/scholar?q=”no+free+lunch”+Wolpert”

SVM Solution Lagrange

November 1, 2017October 25, 2020Archit Vora Leave a comment

Reference :

Click to access cs229-notes3.pdf

Support Vector Machines

June 17, 2017May 17, 2023Archit Vora 2 Comments

Maximum margin classifiers

Also known as optimal separating hyperplane
Margin is the distance between hyperplane and closest training data point
We want to select a hyperplane for which this distance is maximum
Once we identify optimal separating hyper plane there can be many equidistance training points with the shortest distance from hyperplane
Such point are called support vectors
- These points support the hyperplane in a sense that if they are moved slightly optimal hyperplane will also move

Training

Equation 9.10 ensures that left hand side of equation 9.11 gives perpendicular distance from the hyperplane
Equation assumes y has two values (+1) and (-1)

Support Vector classifier

Maximum margin classifier does not work when supporting hyperplane does not exist.
Support vector classifier relaxes optimization objective to get that work
Unlike maximum margin classifier this one is less prone to overfit as well
- The formal one is very sensitive to change in single observation
Also know as soft margin classifier

ε variable allows training point to be on wrong side of margin
- If ε > 0 it is on the wrong side of margin
- If ε > 1 it is on the wrong side of hyperplane
Parameter C is the budget that constraints how many points are allowed on wrong side of hyperplane
C is selected with cross validation and controls bias variance trade off
Point that lies directly on the margin or on the wrong side of margin for their class are called support vectors
- Because these points affects the choice of hyperplane
- And this is the property which makes it robust to outliers
  - LDA calculates mean of all the observation
  - However LR is less sensitivity to outliers
Computation note – when we try to solve above optimization problem with lagrange multiplier we found that it depends on dot product of training samples
- This will be very important when we discuss support vector machine in next section
- Dot product is only with support vector, both while training and while solving [1]

Support vector machine

Above two classifier does not work when desired decision boundary is not linear

One solution is to create polynomial features (as we generally do for LR)
But fundamental problem with this approach is that how many and which terms you should create
Also creating large number of feature raises computational problem

For the case of SVM, that fact that it involved only dot product of observation allows us to perform kernel trick.
Kernel acts as similarity function
Above equation makes it clear that we are not calculating(and storing) higher order polynomial still taking the advantage of it
Second one is polynomial kernel and last one is radial kernel
This video shows visualization of kernel trick

Multiclass SVM

One vs all and one vs rest is used for multiclass classification using SVM

Hinge Loss

Loss is 0 for w*x which are not support vectors
Look at equation 9.14 and 9.15 above
- Looking at its property episilon, it relates to hinge loss
- Yet to check it in SVM lagrange solution though
In hinge loss also we would predict +1 if w*x > 0 else -1
Why not keep boundary at 0
- This can make w a zero to make positive training example right
- Zero parameter vector would mean while prediction we would always get 0
  - Because prediction is w*x
  - This makes it difficult to assign labels during prediction
- We keep boundary away from 0 to make zero we don’t get 0 parameter vector

References

[0] : An Introduction to Statistical Learning – http://www-bcf.usc.edu/~gareth/ISL/

[1] : https://towardsdatascience.com/understanding-support-vector-machine-part-2-kernel-trick-mercers-theorem-e1e6848c6c4d

Comparing LDA and LR

June 16, 2017April 17, 2018Archit Vora Leave a comment

Few Points:

LR model probability with logistic function
LDA models probability with multivariate Gaussian function
LR find maximum likelihood solution
LDA find maximum a posterior using Bayes’ formula

When classes are well separated

When the classes are well-separated, the parameter estimates for logistic regression are surprisingly unstable. Coefficients may go to infinity. LDA doesn’t suffer from this problem.

LR gets unstable in the case of perfect separation

If there are covariate values that can predict the binary outcome perfectly then the algorithm of logistic regression, i.e. Fisher scoring, does not even converge.

If you are using R or SAS you will get a warning that probabilities of zero and one were computed and that the algorithm has crashed.

This is the extreme case of perfect separation but even if the data are only separated to a great degree and not perfectly, the maximum likelihood estimator might not exist and even if it does exist, the estimates are not reliable.

The resulting fit is not good at all.

Math behind LR

For example suppose y = 0 for x=0 and y=1 for x = 1. To maximize the likelihood of the observed data, the “S”-shaped logistic regression curve has to model h(Θ) as 0 and 1. This will lead β to reach infinite, which causes the instability.

Few terms:

Complete Separation – when x completely predicts both zero and 1

Quasi-Complete separation – when x completely predicts either 0 or 1

When can LDA fail

It can fail if either the between or within covariance matrix(Sigma) is singular but that is a rather rare instance.

In fact, If there is complete or quasi-complete separation then all the better because the discriminant is more likely to be successful.

LDA is popular when we have more than two response classes, because it also provides low-dimensional views of the data.

In the post on LDA, QDA we had said that LDA is generalization of Fisher’s discriminant analysis (which involves project data on lower dimension to that achieves maximum separation).

LDA may result in information loss

The low-dimensional representation has a problem that it can result in loss of information. This is less of a problem when the data are linearly separable but if they are not the loss of information might be substantial and the classifier will perform poorly

Another assumption of LDA that it assumes equal covariance matrix for all classes, in which case we might go for QDA. Blog post on LDA, QDA list more consideration about the same.

References

https://stats.stackexchange.com/questions/188416/discriminant-analysis-vs-logistic-regression#

Click to access separation.pdf

Click to access Classification_HDD.pdf

On LDA, QDA

June 16, 2017May 5, 2020Archit Vora 2 Comments

{In practice it is used more for classification than for regression}

This resemble gaussian mixture models in that you git one gaussian for each class. 
Don't forget one important difference though. LDA is supervised, Mixture models are unsupervised.

Linear Discriminant Analysis (LDA)

In logistic regression (LR), we estimate the posterior probability directly. In LDA we estimate likelihood and then use Bayes theorem. Calculating posterior using bayes theorem is easy in case of classification because hypothesis space is limited.

Equation 2 computes probability of class k given x. This is a posterior instead of just point estimates.

Equation 4 is derived from equation 3 only. Probability(k) would be highest for the class for which Delta(k) will be highest.

LDA estimates mean and variance from data and uses equation 4 for classification.

We also need to estimate π_k, which I think would be n_k/N.

Assumptions made:

f(x) is normal
Variance(sigma) is same for all classes

When more than one predictor, we go for multivariate gaussian

Some comparisons

Compare this with mixture models, where there is a responsibility vector for each sample
- There labels are not available (unsupervised learning) and hence is solved by EM (Expectation Maximization)
Compare this with naive bayes, there assumption is each feature is independent
- Here we have parameter for each (class, feature), there we have parameter for each feature
- Also here f captures probability of class (k) given x, there after bayes rules we calculate probability of x given class k
- Hence the name naive bayes
- Here we have joint distribution (multivariate gaussian, there it is independent distribution for each features)
- Both LDA and navie bayes try to calculate posterior while logistic regression maximizes likelihood function

Quadratic Descriminant Analysis (QDA)

Unlike LDA, QDA assumes that each class has its own covariance matrix. It is called quadratic because below function is quadratic of x.

When to use LDA, QDA

This is related to bias variance trade-off
For p predict and k classes
- LDA estimates k*p parameters
- QDA estimates additional k*p*(p+1)/2 parameters
So LDA has much lower variance and classifier built can suffer from high bias
LDA should be used when number of training sample are less, because we want to avoid high variance problem
QDA has high variance, so it should be used when number of training samples are more
- Another scenario would the case when common covariance matrix among K classes is untenable

A note on Fisher’s Linear Discriminant Analysis

It is simply LDA in case of two classes.
We can derive this similarity mathematically.
In literature we found it from the perspective that it project data on a line which achieves maximum separation
We can state without loss of generality that LDA also provides low dimensional view on data

Math

We want to project 2-D data on a line which
- maximizes the difference between projected mean
- minimizes within class variance
Such a direction can be found by maximizing fisher criterion (J)

fisher1 fisher2

Data Stories

Category Machine Learning

Ensemble, Bagging and Boosting

AdaBoost vs XgBoost

AdaBoost

Regression Trees

Reference

Classification Trees

Criterion for Choosing the Splitting Predictor

Multi-class

Entropy

Gini

Entropy vs Gini

ROC Curve

Regularization

Reference

On Clustering

No Free Lunch (NFL)

SVM Solution Lagrange

Support Vector Machines

Maximum margin classifiers

Support Vector classifier

Support vector machine

Multiclass SVM

Hinge Loss

References

Comparing LDA and LR

When classes are well separated

On LDA, QDA

Linear Discriminant Analysis (LDA)

Assumptions made:

When more than one predictor, we go for multivariate gaussian

Some comparisons

Quadratic Descriminant Analysis (QDA)

When to use LDA, QDA

A note on Fisher’s Linear Discriminant Analysis