multiclass | Data Stories

Hyperparameter Tuning

When the number of hyperparameters is large, random search works better than grid search.

Grid search is more useful when the number of hyperparameters is small, as it is more systematic.

Not all hyperparameters are equally important.
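A minimal sketch of why random search helps when only some hyperparameters matter. The two hyperparameters, their ranges, and the trial budget of nine are made up for illustration; the point is only to count how many distinct values of a single hyperparameter each strategy tries.

```python
import itertools
import random

def grid_points(values_a, values_b):
    """All combinations of the candidate values (grid search)."""
    return list(itertools.product(values_a, values_b))

def random_points(n_trials, low, high, seed=0):
    """n_trials points sampled uniformly from [low, high] x [low, high]."""
    rng = random.Random(seed)
    return [(rng.uniform(low, high), rng.uniform(low, high))
            for _ in range(n_trials)]

# Same budget for both strategies: 9 trials (hypothetical values).
grid = grid_points([0.1, 0.5, 0.9], [0.1, 0.5, 0.9])
rand = random_points(9, 0.0, 1.0)

# How many distinct values of the first hyperparameter were tried?
grid_distinct = len({a for a, _ in grid})
rand_distinct = len({a for a, _ in rand})
print(grid_distinct, rand_distinct)
```

With the same nine trials, the grid tries only three distinct values of each hyperparameter, while random sampling tries a new value in every trial. If the first hyperparameter is the important one, random search explores it far more densely.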

Choosing between babysitting one model (“Panda” strategy) or training multiple models in parallel (“Caviar”) depends on the computational resources available.

Batch Normalization

Normalizing input features speeds up learning by making the contours more circular.

In practice, the pre-activation z[2] is normalized rather than the activation a[2].

β and γ are introduced to allow for non-zero mean and non-unit variance.

For example, with a sigmoid activation you may want a larger variance so that the values spread beyond the near-linear region around zero and better exploit the non-linearity.

Different β and γ values are used for each layer.

In deep learning frameworks, batch normalization is often a single flag.

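When batch normalization is applied to z, the bias term (b) has no effect and can be eliminated: subtracting the mini-batch mean cancels any constant added to z, and β takes over the role of the shift.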
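The normalization step and the bias cancellation can be sketched in a few lines. This is an illustration for a single hidden unit over one mini-batch, not a framework implementation; the function name batch_norm, the eps constant, and the numbers are assumptions made for the example.

```python
import math

def batch_norm(z, gamma, beta, eps=1e-8):
    """Normalize one unit's pre-activations over a mini-batch,
    then scale by gamma and shift by beta."""
    m = len(z)
    mu = sum(z) / m
    var = sum((v - mu) ** 2 for v in z) / m
    z_norm = [(v - mu) / math.sqrt(var + eps) for v in z]
    return [gamma * v + beta for v in z_norm]

z = [0.5, -1.2, 3.0, 0.7]   # made-up pre-activations for 4 examples
b = 10.0                    # a bias added to every example

without_bias = batch_norm(z, gamma=2.0, beta=0.5)
with_bias = batch_norm([v + b for v in z], gamma=2.0, beta=0.5)

# The two outputs agree up to floating-point error: the bias is
# removed when the mini-batch mean is subtracted.
max_diff = max(abs(p - q) for p, q in zip(without_bias, with_bias))
output_mean = sum(without_bias) / len(without_bias)
print(max_diff, output_mean)
```

The mean of the output equals β (0.5 here), which is why β can replace the bias.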

Batch normalization speeds up training, limits distribution shifts of activation, provides more consistent data to later layers, and has a slight regularization effect.

Because the mean and variance are computed separately on each mini-batch, the scaling introduces noise similar to dropout, challenging later layers to not depend on a single feature.

Larger mini-batch sizes reduce this regularization effect, but batch normalization is not primarily used for regularization.

At scoring time there may be no mini-batch to compute µ and σ from. Instead, estimate them for each layer with exponentially weighted averages of the values seen across mini-batches during training.

These averages are running averages and do not require much memory.

Use the above values for scoring.
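A rough sketch of how the running averages could be tracked and then used at scoring time, for a single unit. The momentum of 0.9, the starting values, and the function names are assumptions chosen for illustration; frameworks use their own defaults.

```python
import math

def batch_stats(z):
    """Mean and variance of one mini-batch."""
    m = len(z)
    mu = sum(z) / m
    var = sum((v - mu) ** 2 for v in z) / m
    return mu, var

def update_running(running, batch_value, momentum=0.9):
    """Exponentially weighted average: keep `momentum` of the old value."""
    return momentum * running + (1.0 - momentum) * batch_value

# Made-up mini-batches of pre-activations seen during training.
mini_batches = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [0.0, 1.0, 2.0]]

running_mu, running_var = 0.0, 1.0   # assumed starting values
for z in mini_batches:
    mu, var = batch_stats(z)
    running_mu = update_running(running_mu, mu)
    running_var = update_running(running_var, var)

def batch_norm_scoring(z_value, gamma, beta, eps=1e-8):
    """Normalize a single example with the stored running statistics."""
    z_norm = (z_value - running_mu) / math.sqrt(running_var + eps)
    return gamma * z_norm + beta

print(running_mu, running_var, batch_norm_scoring(2.0, gamma=1.0, beta=0.0))
```

Only two numbers per unit are stored, whatever the number of mini-batches, which is why the memory cost is small.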

Softmax Regression

Softmax regression is a generalization of logistic regression.

The output vector has dimensions (C, 1) and uses the “Softmax activation” function.

Softmax activation involves taking exponentials and normalizing the values.

When C = 2, softmax reduces to logistic regression.
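The softmax activation and its reduction to logistic regression for C = 2 can be checked numerically. The scores below are made up; subtracting the maximum score is a common numerical-stability trick and does not change the result.

```python
import math

def softmax(z):
    """Exponentiate each score and normalize so the outputs sum to 1."""
    shift = max(z)                        # avoids overflow in exp
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

probs = softmax([2.0, 1.0, 0.1, -1.0])    # C = 4
two_class = softmax([1.5, 0.3])           # C = 2

# With two classes, the first softmax output equals the sigmoid
# of the difference between the two scores.
print(sum(probs), two_class[0], sigmoid(1.5 - 0.3))
```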

The loss function remains the same: cross-entropy loss.

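The label vector y is one-hot: only the true class has the value 1. The loss therefore reduces to the negative log of the probability predicted for the true class, so minimizing it is maximum likelihood estimation.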

The gradient of the last layer is dz = ŷ – y.
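As a sanity check, the gradient dz = ŷ – y can be compared against a numerical derivative of the cross-entropy loss. This is a small sketch with made-up scores and a one-hot label; the helper names are hypothetical.

```python
import math

def softmax(z):
    shift = max(z)
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(y, y_hat):
    """Loss = -sum_j y_j * log(y_hat_j)."""
    return -sum(t * math.log(p) for t, p in zip(y, y_hat))

z = [2.0, 1.0, 0.1]   # made-up scores of the last layer
y = [0, 1, 0]         # one-hot label: the true class is the second one

y_hat = softmax(z)
analytic = [p - t for p, t in zip(y_hat, y)]   # dz = y_hat - y

def numeric_grad(z, y, i, h=1e-6):
    """Central difference of the loss with respect to z[i]."""
    up, down = list(z), list(z)
    up[i] += h
    down[i] -= h
    return (cross_entropy(y, softmax(up))
            - cross_entropy(y, softmax(down))) / (2 * h)

numeric = [numeric_grad(z, y, i) for i in range(len(z))]
print(analytic)
print(numeric)
```

Because y is one-hot, the loss here is simply -log(ŷ) for the true class.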

In the blog post on Cost Function And Hypothesis for LR we noted that LR (Logistic Regression) inherently models binary classification. Here we will describe two approaches used to extend it for multiclass classification.

The One vs Rest approach takes one class as positive and all the rest as negative and trains a classifier. So for data with n classes it trains n classifiers. In the scoring phase, each of the n classifiers predicts the probability of its own class, and the class with the highest probability is selected.

One vs One considers each pair of classes and trains a classifier on the subset of data containing those two classes. So it trains a total of n*(n-1)/2 classifiers. During the classification phase each classifier predicts one class. (This is in contrast to one vs rest, where each classifier predicts a probability.) The class that has been predicted most often is the answer.

Example

For example, consider a four-class problem with classes A, B, C, and D.

One vs Rest

We train four models: classifier_A, classifier_B, classifier_C and classifier_D
During prediction, these are the probabilities we get:
- classifier_A = 40% = prob(class A)
- classifier_B = 30%
- classifier_C = 60%
- classifier_D = 50%
We assign it class C, which has the highest probability
Note that the probabilities might not sum to 1
- Prob(class A) + Prob(class B) + Prob(class C) + Prob(class D) = 180% here
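The one vs rest decision for this example can be written as a simple argmax over the per-class probabilities. The dictionary below just restates the four numbers from the example; no real classifiers are trained.

```python
# Probability reported by each one-vs-rest classifier for its own class.
scores = {"A": 0.40, "B": 0.30, "C": 0.60, "D": 0.50}

# Pick the class whose classifier is most confident.
predicted = max(scores, key=scores.get)

# The classifiers are trained independently, so the scores
# are not required to sum to 1.
total = sum(scores.values())
print(predicted, total)
```

The highest score belongs to classifier_C, so the example is assigned to class C.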

One vs One

We train a total of six classifiers, each on the subset of data containing the two classes involved
- classifier_AB
- classifier_AC
- classifier_AD
- classifier_BC
- classifier_BD
- classifier_CD
And during classification
- classifier_AB assigns class A
- classifier_AC assigns class A
- classifier_AD assigns class A
- classifier_BC assigns class B
- classifier_BD assigns class D
- classifier_CD assigns class C
We assign it to class A, which received the most votes (three)
How do we estimate Prob(class A)?
- Somewhat complicated
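The one vs one vote count for this example can be sketched as follows. The winners dictionary restates the six decisions listed above; in a real system each entry would come from a trained pairwise classifier.

```python
from collections import Counter
from itertools import combinations

classes = ["A", "B", "C", "D"]
pairs = list(combinations(classes, 2))   # n*(n-1)/2 pairs

# Class chosen by each pairwise classifier (taken from the example).
winners = {
    ("A", "B"): "A",
    ("A", "C"): "A",
    ("A", "D"): "A",
    ("B", "C"): "B",
    ("B", "D"): "D",
    ("C", "D"): "C",
}

votes = Counter(winners[pair] for pair in pairs)
predicted, n_votes = votes.most_common(1)[0]
print(len(pairs), predicted, n_votes)
```

Class A wins with three of the six votes. Note that Counter.most_common orders equal counts by first occurrence, so a tie would be broken arbitrarily unless an explicit rule is added.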

More notes

One vs rest trains fewer classifiers, so it is faster overall and is usually preferred
Each classifier in one vs one uses only a subset of the data, so an individual classifier trains faster in one vs one
One vs one is less prone to imbalance in the dataset (dominance of particular classes)

Inconsistency

What if two classes get an equal number of votes in the one vs one case?
What if the probabilities are almost equal in the one vs rest case?
We will discuss this issue in further blog posts
