NN : Batch Norm and Softmax Regression

January 15, 2019 | May 17, 2023 | Archit Vora

This post is a lecture summary of Week 3 of the deep learning course by Andrew Ng, available at https://www.coursera.org/learn/deep-neural-network/home/welcome.

Hyperparameter Tuning

When the number of hyperparameters is large, random search is better than grid search.
Grid search is more useful when the number of hyperparameters is small, as it is more systematic.
Not all hyperparameters are equally important.
Choosing between babysitting one model (“Panda” strategy) or training multiple models in parallel (“Caviar”) depends on the computational resources available.
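One way to see why random search helps when hyperparameters are not equally important is to count how many distinct values of the important one each strategy tries. The sketch below is only an illustration: the two hyperparameters "a" and "b", their ranges, and the budget of 25 models are assumptions, not values from the course.

```python
import random

def grid_points(values_a, values_b):
    """All combinations of two hyperparameter value lists (grid search)."""
    return [(a, b) for a in values_a for b in values_b]

def random_points(n, range_a, range_b, seed=0):
    """n points drawn uniformly at random from two ranges (random search)."""
    rng = random.Random(seed)
    return [(rng.uniform(*range_a), rng.uniform(*range_b)) for _ in range(n)]

# Hypothetical setup: hyperparameter "a" matters a lot, "b" barely matters.
grid = grid_points([0.1, 0.3, 0.5, 0.7, 0.9], [0.1, 0.3, 0.5, 0.7, 0.9])
rand = random_points(25, (0.0, 1.0), (0.0, 1.0))

# Both strategies train 25 models, but the grid only ever tries 5 values of "a".
distinct_a_grid = len({a for a, _ in grid})
distinct_a_rand = len({a for a, _ in rand})
print(distinct_a_grid, distinct_a_rand)
```

With the same budget, the random points explore far more values of the hyperparameter that matters, which is the argument for random search when you do not know in advance which one that is.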

Batch Normalization

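Normalizing input features speeds up learning by making the contours of the cost function more circular; batch normalization applies the same idea to the hidden layers.
In practice, z[2] is normalized instead of a[2], that is, normalization is applied before the activation function.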
β and γ are learnable parameters that allow the normalized values to have a non-zero mean and non-unit variance.
- For example, with a sigmoid activation you may want a larger variance to better exploit the non-linearity.
Different β and γ values are used for each layer.
In deep learning frameworks, batch normalization is often a single flag.
When using batch normalization, the bias term (b) has no effect and can be eliminated: subtracting the mean cancels any constant added to z, and β takes over the role of the bias.
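The points above can be sketched for a single hidden unit over one mini-batch. This is a minimal illustration, not a framework implementation: the z values, γ, β, and the small constant eps (added for numerical stability) are all assumed for the example.

```python
import math

def batch_norm(z, gamma, beta, eps=1e-8):
    """Normalize one unit's z values over a mini-batch, then scale and shift."""
    m = len(z)
    mu = sum(z) / m
    var = sum((zi - mu) ** 2 for zi in z) / m
    z_norm = [(zi - mu) / math.sqrt(var + eps) for zi in z]
    return [gamma * zn + beta for zn in z_norm]

# Hypothetical z values for one hidden unit over a mini-batch of 4 examples.
z = [0.5, -1.2, 3.0, 0.7]
b = 10.0  # an arbitrary bias term

without_bias = batch_norm(z, gamma=2.0, beta=0.5)
with_bias = batch_norm([zi + b for zi in z], gamma=2.0, beta=0.5)

# Adding b shifts the mean by b, so subtracting the mean cancels it.
same = all(abs(p - q) < 1e-9 for p, q in zip(without_bias, with_bias))
print(same)

# After scaling and shifting, the mean of the output is beta.
out_mean = sum(without_bias) / len(without_bias)
print(abs(out_mean - 0.5) < 1e-9)
```

The outputs with and without b match, which is why the bias can be dropped, and the output mean equals β, which is how β and γ set the mean and variance the next layer sees.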

Batch normalization speeds up training, limits shifts in the distribution of activations, provides more consistent data to later layers, and has a slight regularization effect.
Each mini-batch is scaled by its own mean and variance, which adds noise similar to dropout and pushes later layers not to depend on any single feature.
Larger mini-batch sizes reduce this noise and therefore the regularization effect; batch normalization is not primarily used for regularization.
At scoring (test) time there may be no mini-batch to compute statistics from, so µ and σ² are estimated during training with exponentially weighted averages across mini-batches, separately for each layer.
These averages are running averages and do not require much memory.
Use these values for scoring.
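The running averages can be sketched as follows for one unit. The momentum of 0.9, the initial values, and the mini-batches are assumptions chosen for illustration; frameworks use their own defaults.

```python
def update_running(running, batch_value, momentum=0.9):
    """One exponentially weighted average step: keep `momentum` of the old value."""
    return momentum * running + (1.0 - momentum) * batch_value

def batch_stats(z):
    """Mean and variance of one unit's z values over a mini-batch."""
    mu = sum(z) / len(z)
    var = sum((zi - mu) ** 2 for zi in z) / len(z)
    return mu, var

# Hypothetical mini-batches of z values for one unit, seen during training.
mini_batches = [[1.0, 3.0], [2.0, 4.0], [0.0, 2.0]]

running_mu, running_var = 0.0, 1.0  # assumed initial values
for z in mini_batches:
    mu, var = batch_stats(z)
    running_mu = update_running(running_mu, mu)
    running_var = update_running(running_var, var)

def batch_norm_at_test_time(z_single, gamma, beta, eps=1e-8):
    """Normalize a single example with the stored running statistics."""
    z_norm = (z_single - running_mu) / (running_var + eps) ** 0.5
    return gamma * z_norm + beta

print(batch_norm_at_test_time(2.0, gamma=2.0, beta=0.5))
```

Only two numbers per unit are stored, which is why the memory cost is small, and a single example can be scored without needing a mini-batch.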

Softmax Regression

Softmax regression is a generalization of logistic regression.
The output vector has dimensions (C, 1), where C is the number of classes, and uses the “Softmax activation” function.
Softmax activation involves taking exponentials and normalizing the values.
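A minimal sketch of the softmax activation for one example, assuming a last layer with C = 4 and made-up z values. Subtracting the maximum before exponentiating is a common numerical-stability trick added here; it does not change the result.

```python
import math

def softmax(z):
    """Exponentiate each entry of z, then divide by the sum so the outputs add up to 1."""
    shift = max(z)  # subtracting the max avoids overflow without changing the output
    t = [math.exp(zi - shift) for zi in z]
    total = sum(t)
    return [ti / total for ti in t]

z = [5.0, 2.0, -1.0, 3.0]  # hypothetical z values of the last layer, C = 4
a = softmax(z)
print([round(ai, 3) for ai in a])  # [0.842, 0.042, 0.002, 0.114]
```

The outputs are positive and sum to 1, so they can be read as class probabilities.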

[Figure: softmax layer]

When C = 2, softmax reduces to logistic regression.
The loss function remains the same: cross-entropy loss.
The label y is one-hot: only the true class has a value of 1, so the loss reduces to –log ŷ for that class, and minimizing it is maximum likelihood estimation.
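The two claims above can be checked numerically. In this sketch the label, the predicted probabilities, and the z values are made up for illustration.

```python
import math

def softmax(z):
    """Softmax activation for one example."""
    shift = max(z)
    t = [math.exp(zi - shift) for zi in z]
    total = sum(t)
    return [ti / total for ti in t]

def cross_entropy(y, y_hat):
    """Loss -sum_j y_j * log(y_hat_j) for one example with a one-hot label y."""
    return -sum(yj * math.log(yhj) for yj, yhj in zip(y, y_hat) if yj != 0)

y = [0, 1, 0, 0]              # one-hot label: the true class is index 1
y_hat = [0.3, 0.2, 0.1, 0.4]  # hypothetical softmax output
loss = cross_entropy(y, y_hat)

# Only the true-class term survives, so the loss is -log(0.2).
print(abs(loss - (-math.log(0.2))) < 1e-12)

# With C = 2, the first softmax output equals the logistic sigmoid of z1 - z2.
z2 = [1.5, -0.5]
sigmoid = 1.0 / (1.0 + math.exp(-(z2[0] - z2[1])))
print(abs(softmax(z2)[0] - sigmoid) < 1e-12)
```

Making the loss small means making the predicted probability of the true class large, which is the maximum likelihood view.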

[Figure: loss function]

For the last layer, the gradient of the loss with respect to z is dz = ŷ – y.
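This formula can be verified with a numerical gradient check. The sketch below compares ŷ – y with central finite differences of the cross-entropy loss; the z values, the label, and the step size h are assumptions for the example.

```python
import math

def softmax(z):
    """Softmax activation for one example."""
    shift = max(z)
    t = [math.exp(zi - shift) for zi in z]
    total = sum(t)
    return [ti / total for ti in t]

def loss(z, y):
    """Cross-entropy loss of softmax(z) against a one-hot label y."""
    y_hat = softmax(z)
    return -sum(yj * math.log(yhj) for yj, yhj in zip(y, y_hat) if yj != 0)

z = [5.0, 2.0, -1.0, 3.0]  # hypothetical last-layer z values
y = [0, 1, 0, 0]           # one-hot label

# Analytic gradient from the formula dz = y_hat - y.
analytic = [yh - yj for yh, yj in zip(softmax(z), y)]

# Numerical gradient: nudge each z_j up and down by h and difference the loss.
h = 1e-6
numeric = []
for j in range(len(z)):
    z_plus = list(z)
    z_minus = list(z)
    z_plus[j] += h
    z_minus[j] -= h
    numeric.append((loss(z_plus, y) - loss(z_minus, y)) / (2 * h))

max_diff = max(abs(a - n) for a, n in zip(analytic, numeric))
print(max_diff < 1e-6)
```

The two gradients agree to within the finite-difference error, which supports using ŷ – y as the starting point of backpropagation.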

[Figure: backprop]
