Softmax and cross entropy Loss

Softmax:

  • To explain softmax, Andrew Ng uses the terms “hard-max” and “soft-max.”
  • Softmax calculates the output probabilities of various classes using the formula: y_pred = exp(z_i) / sum_over_i ( exp(z_i) ).
  • Softmax outputs the probability distribution of the classes.
  • In hardmax, we assign one class as 1 and the others as 0.

Cross Entropy:

  • Cross entropy is a loss function commonly used in classification tasks.
  • The loss is calculated using the formula: Loss = - sum [y_actual * log(y_pred)].
  • For example, if the actual class is [1, 0, 0, 0, 0]:
    • y_pred_1 = [0.1, 0.5, 0.1, 0.1, 0.2]
    • y_pred_2 = [0.1, 0.6, 0.1, 0.1, 0.1]
  • The loss will be the same for y_pred_1 and y_pred_2.
  • This is a key feature of multiclass log loss: it rewards or penalizes the probabilities of correct classes only, and the value is independent of how the remaining probability is split between incorrect classes. [0]
  • Cross entropy is same as loss function of logistic regression, it is just that there are two classes.
8

References:

NN : Batch Norm and Softmax Regression

This post is a lecture summary of Week 3 of the deep learning course by Andrew N. G, available at https://www.coursera.org/learn/deep-neural-network/home/welcome.

Hyperparameter Tuning

  • When the number of parameters is large, random search is better than grid search.
  • Grid search is more useful when the number of parameters is small, as it is more systematic.
  • Not all hyperparameters are equally important.
  • Choosing between babysitting one model (“Panda” strategy) or training multiple models in parallel (“Caviar”) depends on the computational resources available.

Batch Normalization

  • Normalizing input features speeds up learning by making the contours more circular.
  • In practice, z[2] is normalized instead of a[2].
  • β and γ are introduced to allow for non-zero mean and non-unit variance.
    • Suppose you have sigmoid activation and you want larger variance to better exploit non-linearity
  • Different β and γ values are used for each layer.
  • In deep learning frameworks, batch normalization is often a single flag.
  • When using normalization, the bias term (b) has no effect and can be eliminated.

screen shot 2019-01-15 at 11.53.57 am

screen shot 2019-01-15 at 12.06.40 pm

  • Batch normalization speeds up training, limits distribution shifts of activation, provides more consistent data to later layers, and has a slight regularization effect.
  • Mean-variance scaling with different β and γ values for each mini-batch introduces noise similar to dropout, challenging later layers to not depend on a single feature.
  • Larger mini-batch sizes result in smaller regularization, but batch normalization is not primarily used for regularization.
  • During scoring, calculate µ and σ using exponentially weighted averages across mini-batches for each layer during training.
  • These averages are running averages and do not require much memory.
  • Use the above values for scoring.

Softmax Regression

  • Softmax regression is a generalization of logistic regression.
  • The output vector has dimensions (C, 1) and uses the “Softmax activation” function.
  • Softmax activation involves taking exponentials and normalizing the values.
softmax_layer
  • When C = 2, softmax reduces to logistic regression.
  • The loss function remains the same: cross-entropy loss.
  • Only one class will have an actual value of 1, following the maximum likelihood function.
loss_function
  • The gradient of the last layer is dz = ŷ – y.
backprop