Oversampling and Under-sampling

January 21, 2019 | October 25, 2020 | Archit Vora

When data is class-imbalanced, a classifier tends to predict the majority class. One way to tackle this is to give more weight to the minority classes in the cost function. Another way is over-sampling and under-sampling.

Over-sampling makes duplicate copies of samples from the minority classes
Under-sampling randomly removes some samples from the majority class
- This should be used with caution
- We need to check that we are still left with enough samples for the given number of features
In practice we might want to over-sample some classes and under-sample others.
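As a rough illustration of the idea above, here is a minimal sketch of random over- and under-sampling using only the Python standard library. The function name `resample_to_size`, the `target_sizes` argument, and the toy data are assumptions made for this example, not part of any library.

```python
import random
from collections import Counter, defaultdict

def resample_to_size(rows, labels, target_sizes, seed=0):
    """Randomly over- or under-sample each class to the size in target_sizes.

    Classes missing from target_sizes keep their original size.
    (Hypothetical helper written for this illustration.)
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)

    out_rows, out_labels = [], []
    for label, members in by_class.items():
        target = target_sizes.get(label, len(members))
        if target <= len(members):
            # Under-sampling: keep a random subset, without replacement.
            chosen = rng.sample(members, target)
        else:
            # Over-sampling: keep every original, then add random duplicates.
            chosen = members + rng.choices(members, k=target - len(members))
        out_rows.extend(chosen)
        out_labels.extend([label] * target)
    return out_rows, out_labels

# Toy data: 90 majority ("neg") samples and 10 minority ("pos") samples.
rows = list(range(100))
labels = ["neg"] * 90 + ["pos"] * 10

# Under-sample "neg" to 40 and over-sample "pos" to 40.
new_rows, new_labels = resample_to_size(rows, labels, {"neg": 40, "pos": 40})
class_counts = Counter(new_labels)
print(class_counts)
```

Over-sampling here only duplicates existing rows, so it adds no new information; it simply changes how much each class contributes during training.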

Cross validation

The validation set should be taken out of the original data before any resampling [1]
- We then apply the sampling only to the training data, just before training
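The ordering described above can be sketched as follows: split first, then over-sample only the training part. This is a simplified sketch that uses a plain random split (not a stratified one) and a hypothetical helper named `split_then_oversample`.

```python
import random
from collections import Counter

def split_then_oversample(rows, labels, val_fraction=0.2, seed=0):
    """Hold out a validation set first, then balance only the training set.

    (Hypothetical helper written for this illustration.)
    """
    rng = random.Random(seed)
    indices = list(range(len(rows)))
    rng.shuffle(indices)
    n_val = int(len(indices) * val_fraction)
    val_idx, train_idx = indices[:n_val], indices[n_val:]

    # The validation data is left exactly as drawn from the original data.
    val = [(rows[i], labels[i]) for i in val_idx]
    train = [(rows[i], labels[i]) for i in train_idx]

    # Over-sample each training class up to the size of the largest class.
    counts = Counter(label for _, label in train)
    largest = max(counts.values())
    balanced = list(train)
    for label, count in counts.items():
        members = [pair for pair in train if pair[1] == label]
        balanced.extend(rng.choices(members, k=largest - count))
    return balanced, val

rows = list(range(100))
labels = ["neg"] * 90 + ["pos"] * 10
balanced_train, val_set = split_then_oversample(rows, labels)
print(Counter(label for _, label in balanced_train))
print(Counter(label for _, label in val_set))
```

Because the duplicates are created after the split, no copy of a validation row can end up in the training data, which is the leak that resampling before the split would cause.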

Reference

[0] : https://www.kaggle.com/residentmario/undersampling-and-oversampling-imbalanced-data

[1] : https://www.marcoaltini.com/blog/dealing-with-imbalanced-data-undersampling-oversampling-and-proper-cross-validation

Naive Bayes Classifier

December 4, 2018 | May 17, 2023 | Archit Vora

A Naive Bayes classifier has two parts [1]:
- Probability model
- Classification model

Probability Model

A probability model is an extension of Bayes’ rule. It makes two assumptions:

Independence of Features: This assumption treats all features as independent of each other given the class. For example, it treats a higher temperature as saying nothing about humidity. In many real cases this does not hold.
Equal Weight of Features: This assumption assumes that all features have equal importance or weight in the model.

Classification Model

The classification model involves the following steps:

Probability of Each Class: P(y) represents the probability of each class based on the training set.
Probability Estimation of Feature Values: The goal is to estimate the probability distribution of each feature value given a specific class, denoted as P(x_i|y). For discrete features, this can be estimated from simple frequency counts. For continuous features, a Gaussian distribution can be used. For count data, a multinomial distribution is suitable.
Parameter Estimation: Parameter estimation is performed for each combination of class and feature.
Scikit-learn and Distribution Types: Scikit-learn library provides implementations of Gaussian Naive Bayes, Bernoulli Naive Bayes, and multinomial Naive Bayes classifiers. These classifiers refer to the distribution of features. It is important to note that different features can follow different distributions. Therefore, customization of the distribution based on the application may be necessary.
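To make the steps above concrete, here is a small from-scratch sketch of the Gaussian variant: it estimates P(y) and a mean and variance for every class and feature combination, then picks the class with the highest log posterior. It is an illustration only, not scikit-learn's implementation; the function names, the variance floor of 1e-9, and the toy data are assumptions.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(X, y):
    """Estimate P(y) and a per-class, per-feature mean and variance."""
    by_class = defaultdict(list)
    for features, label in zip(X, y):
        by_class[label].append(features)

    model = {}
    for label, rows in by_class.items():
        prior = len(rows) / len(X)  # P(y) from training-set frequencies
        params = []
        for column in zip(*rows):  # one column per feature
            mean = sum(column) / len(column)
            var = sum((v - mean) ** 2 for v in column) / len(column)
            params.append((mean, var + 1e-9))  # small floor avoids division by zero
        model[label] = (prior, params)
    return model

def predict_gaussian_nb(model, features):
    """Return the class maximizing log P(y) + sum_i log P(x_i | y)."""
    best_label, best_score = None, -math.inf
    for label, (prior, params) in model.items():
        score = math.log(prior)
        for value, (mean, var) in zip(features, params):
            # Log of the Gaussian density for this feature value.
            score += -0.5 * math.log(2 * math.pi * var) - (value - mean) ** 2 / (2 * var)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy data with two continuous features and two classes.
X = [[1.0, 2.1], [1.2, 1.9], [0.8, 2.0], [3.0, 4.1], [3.2, 3.9], [2.9, 4.0]]
y = ["a", "a", "a", "b", "b", "b"]
model = fit_gaussian_nb(X, y)
prediction_a = predict_gaussian_nb(model, [1.1, 2.0])
prediction_b = predict_gaussian_nb(model, [3.1, 4.0])
print(prediction_a, prediction_b)
```

Summing per-feature log densities is exactly where the independence assumption enters: each feature contributes on its own, with no interaction terms.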

Advantages

Fast and Easy Implementation: Naive Bayes classifiers are known for their simplicity and efficiency in implementation.
Acceptable Classification Performance: While Naive Bayes classifiers may not always accurately predict probabilities, their classification performance is generally satisfactory.

Disadvantage

Independence Assumption: The assumption of feature independence does not hold true in all scenarios, which can affect the model’s accuracy.

Reference

[0] https://towardsdatascience.com/naive-bayes-classifier-81d512f50a7c

[1] https://en.wikipedia.org/wiki/Naive_Bayes_classifier

On Classification Accuracy – 2

October 1, 2018 | October 25, 2020 | Archit Vora

We have already talked about classification accuracy in an earlier post. I just want to add a few more things after finishing a course. This post is an extension of that one with some practical considerations.

We are claiming that accuracy may not always be a good measure. When you are building an automated machine learning system, you must be able to trust it.

Case Study

You want to show positive reviews on your website.
Say 90% of the reviews in your dataset are negative.
A classifier can achieve 90% accuracy by predicting all of them as negative.
But what you are actually interested in is finding the remaining 10% and displaying them on your website.

Precision = Did I show something negative?

Recall = How good am I at finding positive reviews?
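The case study can be checked with a few lines of Python. This sketch builds the 90% negative dataset and scores a classifier that predicts "neg" for everything. The helper name `evaluate` is made up for this example, and returning 0.0 when precision is undefined (nothing predicted positive) is a convention chosen here.

```python
def evaluate(y_true, y_pred, positive="pos"):
    """Return accuracy, precision and recall for the given positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    return {
        "accuracy": correct / len(pairs),
        # Of the reviews we showed, what fraction was really positive?
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        # Of the really positive reviews, what fraction did we find?
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }

y_true = ["neg"] * 90 + ["pos"] * 10
all_negative = ["neg"] * 100  # classifier that always predicts "neg"
result = evaluate(y_true, all_negative)
print(result)
```

The accuracy is 0.9, yet recall is 0.0: the classifier never finds a single review that could be displayed.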

Analogy with Optimist and Pessimist

The optimist labels every (or almost every) review as positive
- Very good recall, but poor precision
The pessimist labels every (or almost every) review as negative
- Bad recall, good precision

Trade-off

The trade-off is made while scoring, not while training
We can assign labels based on predicted probabilities
A decision tree gives a probability from the number of positive and negative samples at the leaf node
Logistic regression, of course, gives a probability directly
We can change the threshold to trade off between precision and recall
Positive when prob > 1 => Pessimist
Positive when prob > 0 => Optimist
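The effect of moving the threshold can be sketched on a handful of made-up scores. The data and the function name `precision_recall_at` are assumptions for illustration; precision is reported as `None` when nothing is predicted positive, since it is undefined in that case.

```python
def precision_recall_at(scores, labels, threshold):
    """Predict positive when score > threshold; return (precision, recall)."""
    predicted = [s > threshold for s in scores]
    tp = sum(1 for p, l in zip(predicted, labels) if p and l == 1)
    fp = sum(1 for p, l in zip(predicted, labels) if p and l == 0)
    fn = sum(1 for p, l in zip(predicted, labels) if not p and l == 1)
    precision = tp / (tp + fp) if (tp + fp) else None  # undefined with no positives
    recall = tp / (tp + fn) if (tp + fn) else None
    return precision, recall

# Made-up predicted probabilities and true labels (1 = positive review).
scores = [0.95, 0.9, 0.8, 0.6, 0.55, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [1, 1, 0, 1, 0, 1, 0, 0, 0, 0]

optimist = precision_recall_at(scores, labels, 0.0)    # everything is positive
strict = precision_recall_at(scores, labels, 0.85)     # only very confident ones
pessimist = precision_recall_at(scores, labels, 1.0)   # nothing is positive
print(optimist, strict, pessimist)
```

On this toy data, raising the threshold from 0.0 to 0.85 moves precision from 0.4 to 1.0 while recall drops from 1.0 to 0.5, and the same trained model is used throughout.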

A single number is not always useful

I am not a great fan of single numbers like the F1 score and AUC
You cannot always choose a classifier just by AUC, because the ROC curves might intersect
- An intersection means that one classifier is better over some range of operating points and the other is better elsewhere
- But if they don't intersect, we choose the one with the higher AUC
From a business perspective we should be clear about whether we want more precision or more recall
Another practical metric they talked about was precision at k
- Say I want to display 5 reviews on my website
- What is the precision among the 5 reviews I have chosen?
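Precision at k can be sketched in a few lines: rank the reviews by predicted score, keep the top k, and measure the fraction that is truly positive. The scores and labels below are made up for illustration, and `precision_at_k` is a name chosen for this example.

```python
def precision_at_k(scores, labels, k):
    """Fraction of true positives among the k highest-scored items."""
    if k <= 0:
        raise ValueError("k must be positive")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    top_k = ranked[:k]
    return sum(label for _, label in top_k) / len(top_k)

# Made-up predicted probabilities and true labels (1 = positive review).
scores = [0.95, 0.9, 0.8, 0.6, 0.55, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [1, 1, 0, 1, 0, 1, 0, 0, 0, 0]

p_at_5 = precision_at_k(scores, labels, 5)
p_at_2 = precision_at_k(scores, labels, 2)
print(p_at_5, p_at_2)
```

Here 3 of the 5 top-ranked reviews are positive, so precision at 5 is 0.6, regardless of how the classifier behaves on the rest of the list.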

Classification Trees

May 1, 2018 | May 21, 2023 | Archit Vora

[Figure: how a classification tree creates rectangles in the predictor space]

Criterion for Choosing the Splitting Predictor

When deciding which predictor to split on, we need to consider different types of predictors: continuous, binary, and categorical. In this post, we will focus on continuous and binary predictors.

For categorical predictors, one approach is to use the “one vs rest” strategy. We evaluate each category separately and observe which one reduces the randomness the most when split.

In the case of a continuous predictor, we first sort the values and then consider the midpoints between adjacent values as candidate split points, choosing the one that gives the purest child nodes.

To keep our decision tree simple, it’s ideal to have a small tree size. Therefore, at each step, we should choose the split that leads to the purest child nodes.

Two commonly used criteria for measuring impurity are Gini and Entropy. It’s important to note that these criteria are not directly used to select the predictor to split on. Instead, we calculate the difference in impurity before and after splitting.

When computing Gini or entropy after splitting, we apply weights to both splits to account for their proportions.
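The split search for a continuous predictor can be sketched as below. This is a simplified illustration, assuming candidate thresholds are the midpoints between adjacent sorted values and using Gini as the impurity; the function names and toy data are made up, and real tree libraries use more efficient implementations.

```python
def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum of squared proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(values, labels):
    """Return (threshold, impurity reduction) for the best binary split."""
    pairs = sorted(zip(values, labels), key=lambda pair: pair[0])
    xs = [v for v, _ in pairs]
    ys = [l for _, l in pairs]
    before = gini(ys)
    best_threshold, best_diff = None, 0.0
    for i in range(1, len(xs)):
        if xs[i] == xs[i - 1]:
            continue  # no threshold can separate equal values
        threshold = (xs[i - 1] + xs[i]) / 2
        left, right = ys[:i], ys[i:]
        # Weight each child's impurity by its share of the samples.
        after = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        diff = before - after
        if diff > best_diff:
            best_threshold, best_diff = threshold, diff
    return best_threshold, best_diff

values = [2.0, 3.0, 10.0, 11.0, 12.0]
labels = [0, 0, 1, 1, 1]
threshold, diff = best_split(values, labels)
print(threshold, diff)
```

On this toy data the split at 6.5 separates the classes perfectly, so the impurity reduction equals the parent's Gini of 0.48.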

Multi-class

A decision tree can naturally handle multi-class problems, because entropy and Gini can both be calculated for more than two classes.

Entropy

Formula:
- H = ∑ -p * log(p)
About :
- Randomness
- Information carried
- Highest when p = 0.5 (for two classes)
- Higher entropy => higher randomness => carries more information
- After splitting we want entropy to reduce

Initial :
- n0 positive and m0 negative samples
- g0 = (n0 + m0) total samples
After Splitting
- Group 1:
  - n1 positive
  - m1 negative
  - g1 = (n1 + m1) total
- Group 2:
  - n2 positive
  - m2 negative
  - g2 = (n2 + m2) total

Entropy before :
- H_before = -(n0/g0) log(n0/g0) - (m0/g0) log(m0/g0)
Entropy after :
- H1 = -(n1/g1) log(n1/g1) - (m1/g1) log(m1/g1)
- H2 = -(n2/g2) log(n2/g2) - (m2/g2) log(m2/g2)
- H_after = (g1/g0) * H1 + (g2/g0) * H2
diff = H_before - H_after
Select the predictor for which diff is highest and split on it
- This means we are selecting the predictor which reduces randomness the most
- This also means we are selecting the predictor which reduces information the most
  - So we are selecting the variable which carries the most information
    - A more important feature
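The calculation above can be sketched directly from the class counts. The source does not state the base of the logarithm; base 2 is assumed here, and the counts in the example are made up.

```python
import math

def entropy(counts):
    """Entropy (base 2, an assumption) of a node given its class counts."""
    total = sum(counts)
    h = 0.0
    for c in counts:
        if c > 0:  # treat 0 * log(0) as 0
            p = c / total
            h -= p * math.log2(p)
    return h

def entropy_diff(parent, children):
    """H_before - H_after, with each child weighted by its share g_i / g0."""
    g0 = sum(parent)
    h_after = sum(sum(child) / g0 * entropy(child) for child in children)
    return entropy(parent) - h_after

# Parent: n0 = 5 positive, m0 = 5 negative.
# Group 1: n1 = 4, m1 = 1. Group 2: n2 = 1, m2 = 4.
h_before = entropy([5, 5])
diff = entropy_diff([5, 5], [[4, 1], [1, 4]])
perfect_diff = entropy_diff([5, 5], [[5, 0], [0, 5]])
print(h_before, diff, perfect_diff)
```

A perfectly separating split removes all of the parent's entropy, while the imperfect split above removes only part of it.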

Gini

Gini for 2 classes :
- G = p1 * (1 - p1) + p2 * (1 - p2)
- Using (p1 + p2 = 1) we can derive that G = 2 * p1 * p2
Gini in general :
- G = ∑ p * (1 - p) = 1 - ∑ p²
Like entropy, Gini is maximum when p1 = p2 = 0.5
- In this case the nodes are least pure
Gini is a measure of impurity which we want to reduce by splitting on a predictor
diff = G_before - G_after
- We will select the predictor for which diff is highest
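As a quick numeric check of the formulas above, the sketch below computes Gini from class proportions and confirms that the two-class form equals 2 * p1 * p2. The function name `gini` and the test values are chosen for this example.

```python
def gini(probabilities):
    """Gini impurity from class proportions: 1 - sum of p squared."""
    return 1.0 - sum(p * p for p in probabilities)

# For two classes, the general formula matches 2 * p1 * p2.
two_class_checks = []
for p1 in [0.0, 0.1, 0.25, 0.5, 0.9, 1.0]:
    p2 = 1.0 - p1
    two_class_checks.append(abs(gini([p1, p2]) - 2 * p1 * p2) < 1e-12)

max_two_class = gini([0.5, 0.5])          # least pure two-class node
pure_node = gini([1.0, 0.0])              # perfectly pure node
three_class = gini([1 / 3, 1 / 3, 1 / 3]) # also works for more than two classes
print(all(two_class_checks), max_two_class, pure_node, three_class)
```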

Entropy vs Gini

Given a choice, Gini is preferable because computing a logarithm is costlier.
Both of them are better than using classification error as the splitting criterion [1]
- Classification error is computed by assigning the majority class to each node and counting the misclassified samples
- It is useful while pruning but not while growing the tree

ROC Curve

A ROC curve can be constructed easily for a classifier which outputs a ranking.
- For example, logistic regression outputs a probability, and we can change the threshold applied to it.
In contrast, a decision tree outputs a label.
However, to get a ROC curve we can use a workaround. Each leaf node will have some positive and negative samples, and we can calculate a probability as (# positive / # total).
Then we can change the threshold and plot the ROC curve.
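The workaround can be sketched as follows: give every sample the positive fraction of its leaf as a score, then sweep the threshold over the distinct scores. The three leaves and their counts are made up for illustration, and `roc_points` is a name chosen for this example.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) pairs, one per threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, l in zip(scores, labels) if s >= threshold and l == 1)
        fp = sum(1 for s, l in zip(scores, labels) if s >= threshold and l == 0)
        points.append((fp / negatives, tp / positives))
    return points

# Hypothetical leaves as (positive count, negative count).
leaves = [(8, 2), (3, 3), (1, 9)]
scores, labels = [], []
for pos, neg in leaves:
    leaf_probability = pos / (pos + neg)  # (# positive / # total) at the leaf
    scores += [leaf_probability] * (pos + neg)
    labels += [1] * pos + [0] * neg

roc = roc_points(scores, labels)
print(roc)
```

A tree with few leaves yields only a few distinct scores, so its ROC curve has only a few points, one per leaf plus the origin.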

Regularization

Regularization in decision trees will be explained in detail in a separate blog post, but here is a list of candidate techniques.

Limit the maximum depth of the tree
Cost complexity pruning
Ensembles / bagging more than one tree
Set a stricter stopping criterion for when to split a node further (e.g. minimum gain, minimum number of samples, etc.)

Reference

[0] : Applied predictive modeling by Max Kuhn and Kjell Johnson

[1] : https://github.com/rasbt/python-machine-learning-book/blob/master/faq/decision-tree-binary.md

Classification – One vs Rest and One vs One

June 11, 2017 | May 21, 2023 | Archit Vora

In the blog post on Cost Function And Hypothesis for LR we noted that LR (Logistic Regression) inherently models binary classification. Here we will describe two approaches used to extend it for multiclass classification.

The One vs Rest approach takes one class as positive and all the rest as negative, and trains a classifier. So for data with n classes it trains n classifiers. In the scoring phase, each of the n classifiers predicts the probability of its own class, and the class with the highest probability is selected.

One vs One considers each pair of classes and trains a classifier on the subset of data containing those two classes. So it trains a total of n*(n-1)/2 classifiers. During the classification phase each classifier predicts one class. (This is in contrast to one vs rest, where each classifier predicts a probability.) The class which has been predicted most often is the answer.

Example

For example consider four class problem having classes A, B, C, and D.

One vs Rest

Models: classifier_A, classifier_B, classifier_C and classifier_D
During prediction, these are the probabilities we get:
- classifier_A = 40% = prob(class A)
- classifier_B = 30%
- classifier_C = 60%
- classifier_D = 50%
We assign it to class C, which has the highest probability
Note that the summation might not come out to 1
- Prob(class A) + Prob(class B) + Prob(class C) + Prob(class D) = 180% here
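The one vs rest decision rule on the four probabilities listed above can be sketched as a simple argmax. The dictionary name `ovr_probabilities` and the helper `predict_one_vs_rest` are made up for this example.

```python
def predict_one_vs_rest(probabilities):
    """Return the class whose own classifier reports the highest probability."""
    return max(probabilities, key=probabilities.get)

# Output of classifier_A .. classifier_D from the example.
ovr_probabilities = {"A": 0.40, "B": 0.30, "C": 0.60, "D": 0.50}

ovr_prediction = predict_one_vs_rest(ovr_probabilities)
ovr_total = sum(ovr_probabilities.values())
print(ovr_prediction, ovr_total)
```

The highest value belongs to classifier_C, and the four values add up to 1.8 rather than 1, because each classifier is trained independently of the others.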

One vs One

We train a total of six classifiers, each on the subset of data containing the two classes involved
- classifier_AB
- classifier_AC
- classifier_AD
- classifier_BC
- classifier_BD
- classifier_CD
And during classification
- classifier_AB assigns class A
- classifier_AC assigns class A
- classifier_AD assigns class A
- classifier_BC assigns class B
- classifier_BD assigns class D
- classifier_CD assigns class C
We assign it to class A
How do we estimate Prob(class A) ?
- Somewhat complicated
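The one vs one voting in this example can be sketched with the standard library. The pairwise outcomes below are copied from the list above; the variable names are chosen for this illustration.

```python
from collections import Counter
from itertools import combinations

classes = ["A", "B", "C", "D"]

# Every unordered pair of classes gets its own classifier: n * (n - 1) / 2 pairs.
pairs = list(combinations(classes, 2))

# Class predicted by each pairwise classifier in the example.
pairwise_votes = {
    ("A", "B"): "A",
    ("A", "C"): "A",
    ("A", "D"): "A",
    ("B", "C"): "B",
    ("B", "D"): "D",
    ("C", "D"): "C",
}

tally = Counter(pairwise_votes[pair] for pair in pairs)
ovo_prediction, ovo_votes = tally.most_common(1)[0]
print(pairs)
print(tally, ovo_prediction)
```

Class A wins with three of the six votes. Note that `most_common` breaks ties by insertion order, which is an arbitrary choice in this sketch, and that the vote share is not a calibrated probability.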

More notes

One vs rest trains fewer classifiers, so it is faster overall and is usually preferred
A single classifier in one vs one uses only a subset of the data, so each individual classifier trains faster in one vs one
One vs one is less prone to imbalance in the dataset (dominance of particular classes)

Inconsistency

What if two classes get an equal number of votes in the one vs one case?
What if the probabilities are almost equal in the one vs rest case?
We will discuss this issue in further blog posts
