Softmax and cross entropy Loss

February 28, 2019 | May 17, 2023 | Archit Vora

Softmax:

To explain softmax, Andrew Ng uses the terms “hard-max” and “soft-max.”
Softmax calculates the output probabilities of the various classes using the formula: y_pred_i = exp(z_i) / sum_over_j ( exp(z_j) ).
Softmax outputs the probability distribution of the classes.
In hardmax, we assign one class as 1 and the others as 0.
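
The contrast between the two can be sketched in a few lines of Python. This is a minimal illustration, not code from the course: the function names and the example scores are made up, and subtracting the maximum score is a common stability trick that does not change the result.

```python
import math

def softmax(z):
    """Return softmax probabilities for a list of scores z."""
    m = max(z)  # subtract the max for numerical stability; the result is unchanged
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def hardmax(z):
    """Return 1 for the largest score and 0 for all the others."""
    top = z.index(max(z))
    return [1 if i == top else 0 for i in range(len(z))]

scores = [2.0, 1.0, 0.1]  # hypothetical scores for three classes
probs = softmax(scores)   # a probability distribution that sums to 1
hard = hardmax(scores)    # a one-hot vector picking the top class
```

Softmax keeps the ranking of the scores but spreads probability over all classes, while hardmax keeps only the winner.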

Cross Entropy:

Cross entropy is a loss function commonly used in classification tasks.
The loss is calculated using the formula: Loss = - sum [y_actual * log(y_pred)].
For example, if the actual class is [1, 0, 0, 0, 0]:
- y_pred_1 = [0.1, 0.5, 0.1, 0.1, 0.2]
- y_pred_2 = [0.1, 0.6, 0.1, 0.1, 0.1]
The loss will be the same for y_pred_1 and y_pred_2.
This is a key feature of multiclass log loss: it rewards or penalizes the probabilities of correct classes only, and the value is independent of how the remaining probability is split between incorrect classes. [0]
With only two classes, cross entropy is the same as the loss function of logistic regression (binary log loss).
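
The example above can be checked numerically. The sketch below assumes the natural logarithm; the helper names are illustrative.

```python
import math

def cross_entropy(y_actual, y_pred):
    """Loss = - sum [y_actual * log(y_pred)]; zero-weight terms are skipped."""
    return -sum(a * math.log(p) for a, p in zip(y_actual, y_pred) if a > 0)

def binary_log_loss(y, p):
    """Logistic regression loss for a single example with label y in {0, 1}."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

y_actual = [1, 0, 0, 0, 0]
y_pred_1 = [0.1, 0.5, 0.1, 0.1, 0.2]
y_pred_2 = [0.1, 0.6, 0.1, 0.1, 0.1]

loss_1 = cross_entropy(y_actual, y_pred_1)
loss_2 = cross_entropy(y_actual, y_pred_2)  # same as loss_1: both equal -log(0.1)

# Two-class case: cross entropy matches the logistic regression loss.
two_class = cross_entropy([1, 0], [0.7, 0.3])
logistic = binary_log_loss(1, 0.7)
```

Only the probability assigned to the correct class (0.1 in both predictions) enters the loss, which is why the two values agree.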

References:

[0]: Stack Exchange: Cross entropy loss explanation

NN : Batch Norm and Softmax Regression

January 15, 2019 | May 17, 2023 | Archit Vora

This post is a lecture summary of Week 3 of the deep learning course by Andrew Ng, available at https://www.coursera.org/learn/deep-neural-network/home/welcome.

Hyperparameter Tuning

When the number of parameters is large, random search is better than grid search.
Grid search is more useful when the number of parameters is small, as it is more systematic.
Not all hyperparameters are equally important.
Choosing between babysitting one model (“Panda” strategy) or training multiple models in parallel (“Caviar”) depends on the computational resources available.
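
One way to see why random search helps is to count how many distinct values of each hyperparameter get tried for the same budget. The sketch below is only an illustration: the two hyperparameters, their ranges, and the log-scale sampling of the learning rate are assumptions, not taken from the notes above.

```python
import random

random.seed(0)
budget = 9  # number of models we can afford to train

# Grid search: 3 x 3 combinations, so only 3 distinct learning rates are tried.
grid_lr = [0.001, 0.01, 0.1]
grid_units = [50, 100, 150]
grid_trials = [(lr, units) for lr in grid_lr for units in grid_units]

# Random search: every trial draws a fresh value for each hyperparameter.
# The learning rate is sampled on a log scale between 1e-3 and 1e-1 (assumption).
random_trials = [
    (10 ** random.uniform(-3, -1), random.randint(50, 150))
    for _ in range(budget)
]

distinct_lr_grid = len({lr for lr, _ in grid_trials})
distinct_lr_random = len({lr for lr, _ in random_trials})
```

If the learning rate turns out to matter much more than the number of units, the random trials have explored it far more densely than the grid.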

Batch Normalization

Normalizing input features speeds up learning by making the contours more circular.
In practice, z[2] is normalized instead of a[2].
β and γ are introduced to allow for non-zero mean and non-unit variance.
- Suppose you have sigmoid activation and you want larger variance to better exploit non-linearity
Different β and γ values are used for each layer.
In deep learning frameworks, batch normalization is often a single flag.
When using batch normalization, the bias term (b) has no effect and can be eliminated: subtracting the mean cancels any constant added to z, and β takes over its role.
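
A minimal sketch of the normalization step for a single unit, assuming the usual formulation (normalize z over the mini-batch, then scale by γ and shift by β). The values of z, γ, β and b are made up. It also shows that adding a bias b before normalization changes nothing.

```python
import math

def batch_norm(z_batch, gamma, beta, eps=1e-8):
    """Normalize one unit's z values over a mini-batch, then scale and shift."""
    m = len(z_batch)
    mu = sum(z_batch) / m
    var = sum((z - mu) ** 2 for z in z_batch) / m
    z_norm = [(z - mu) / math.sqrt(var + eps) for z in z_batch]
    return [gamma * zn + beta for zn in z_norm]

z = [0.5, 1.5, -1.0, 3.0]  # hypothetical pre-activations for 4 examples
b = 10.0                   # hypothetical bias term

out_without_bias = batch_norm(z, gamma=2.0, beta=0.5)
out_with_bias = batch_norm([v + b for v in z], gamma=2.0, beta=0.5)

# After normalization the mean of the output is beta, whatever b was.
mean_out = sum(out_without_bias) / len(out_without_bias)
```

The two outputs agree up to floating-point rounding, which is why b can be dropped from layers that use batch normalization.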

Batch normalization speeds up training, limits distribution shifts of activation, provides more consistent data to later layers, and has a slight regularization effect.
Because the mean and variance are computed on each mini-batch rather than on the whole dataset, the scaling adds noise similar to dropout, challenging later layers to not depend on a single feature.
Larger mini-batch sizes result in a smaller regularization effect, but batch normalization is not primarily used for regularization.
For scoring, estimate µ and σ² for each layer as exponentially weighted averages across the mini-batches seen during training.
These averages are running averages and do not require much memory.
Use the above values for scoring.
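
A sketch of how the running estimates could be tracked and then used for scoring. The decay of 0.9, the choice to start from the first mini-batch's statistics, and the toy mini-batches are assumptions made for illustration.

```python
import math

def batch_stats(z_batch):
    """Return the mean and variance of one unit's z values over a mini-batch."""
    m = len(z_batch)
    mu = sum(z_batch) / m
    var = sum((z - mu) ** 2 for z in z_batch) / m
    return mu, var

def update_running(running, new_value, decay=0.9):
    """Exponentially weighted average: only the previous value is stored."""
    return decay * running + (1 - decay) * new_value

mini_batches = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [0.0, 2.0, 4.0]]

# During training: start from the first mini-batch, then keep updating.
running_mu, running_var = batch_stats(mini_batches[0])
for batch in mini_batches[1:]:
    mu, var = batch_stats(batch)
    running_mu = update_running(running_mu, mu)
    running_var = update_running(running_var, var)

def normalize_for_scoring(z, gamma=1.0, beta=0.0, eps=1e-8):
    """At scoring time a single example is normalized with the running values."""
    z_norm = (z - running_mu) / math.sqrt(running_var + eps)
    return gamma * z_norm + beta
```

Only two numbers per unit are kept, which is why the running averages need little memory.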

Softmax Regression

Softmax regression is a generalization of logistic regression.
The output vector has dimensions (C, 1) and uses the “Softmax activation” function.
Softmax activation involves taking exponentials and normalizing the values.

When C = 2, softmax reduces to logistic regression.
The loss function remains the same: cross-entropy loss.
Only one class has an actual value of 1, so minimizing the loss amounts to maximizing the predicted probability of the correct class (maximum likelihood).
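
The C = 2 case can be verified directly: the softmax probability of the first class equals a sigmoid applied to the difference of the two scores. The scores below are arbitrary.

```python
import math

def softmax(z):
    """Return softmax probabilities for a list of scores z."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

z1, z2 = 1.3, -0.4                # hypothetical scores for the two classes
p_softmax = softmax([z1, z2])[0]  # probability of class 1 under softmax
p_sigmoid = sigmoid(z1 - z2)      # logistic regression on the score difference
```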

The gradient of the last layer is dz = ŷ – y.
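
The formula dz = ŷ – y can be sanity-checked against a numerical gradient. This is a sketch with made-up scores and a one-hot label; the step size h is an arbitrary small value.

```python
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def loss(z, y):
    """Cross-entropy loss of softmax(z) against the one-hot label y."""
    return -sum(t * math.log(p) for t, p in zip(y, softmax(z)))

z = [1.0, 2.0, 0.5]  # hypothetical last-layer scores
y = [0, 1, 0]        # one-hot label

# Analytic gradient from the formula: dz = y_hat - y
analytic = [p - t for p, t in zip(softmax(z), y)]

# Numerical gradient by central differences
h = 1e-6
numeric = []
for i in range(len(z)):
    z_plus, z_minus = z[:], z[:]
    z_plus[i] += h
    z_minus[i] -= h
    numeric.append((loss(z_plus, y) - loss(z_minus, y)) / (2 * h))

max_diff = max(abs(a - n) for a, n in zip(analytic, numeric))
```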

Optimization for NN

January 11, 2019 | May 17, 2023 | Archit Vora

Mini-batch Gradient Descent:

Mini-batch gradient descent exhibits oscillations during descent.
Choosing mini-batch size:
- For small training sets (< 2k samples), it is advisable to use batch gradient descent.
- For larger training sets, mini-batches of sizes such as 64, 128, 256, or 512 are commonly used.
Cross-validation helps in finding the right trade-off.
Batch gradient descent: training time is dominated by the long duration of a single iteration.
Stochastic gradient descent: training time is dominated by the number of iterations required for convergence.
- Vectorization is lost in the case of stochastic gradient descent.
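
Splitting a training set into mini-batches can be sketched as below. Shuffling before the split and the fixed seed are assumptions for the example; the training set is just a list of indices.

```python
import random

def make_mini_batches(examples, batch_size, seed=0):
    """Shuffle the examples and cut them into consecutive mini-batches."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i:i + batch_size] for i in range(0, len(shuffled), batch_size)]

examples = list(range(1000))  # stand-in for 1000 training examples
batches = make_mini_batches(examples, batch_size=64)

num_batches = len(batches)          # 16 iterations per epoch
last_batch_size = len(batches[-1])  # the last batch holds the 40 leftover examples
```

With batch_size equal to the training set size this becomes batch gradient descent, and with batch_size 1 it becomes stochastic gradient descent.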

Exponentially Weighted Moving Averages:

Exponentially weighted moving averages are computed using the formulas:
- Vₜ = 0.9 * Vₜ₋₁ + 0.1 * θₜ
- Vₜ = β * Vₜ₋₁ + (1 – β) * θₜ
With β = 0.9, this averages over roughly the last 1 / (1 – 0.9) = 10 days of temperature.
Bias correction is necessary to eliminate the bias introduced when initializing with V₀ = 0.
The bias-corrected estimate is: Vₜ / (1 – βᵗ), where Vₜ = β * Vₜ₋₁ + (1 – β) * θₜ
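
A small sketch of the moving average with and without bias correction, where the correction divides the running value by (1 - βᵗ). The constant temperature series is made up so that the effect of starting from zero is easy to see.

```python
def ewma(values, beta=0.9, bias_correction=True):
    """Exponentially weighted moving average, starting from v = 0."""
    v = 0.0
    out = []
    for t, theta in enumerate(values, start=1):
        v = beta * v + (1 - beta) * theta
        # Dividing by (1 - beta**t) removes the bias toward the zero start.
        out.append(v / (1 - beta ** t) if bias_correction else v)
    return out

temps = [40.0] * 5  # hypothetical daily temperatures, all equal

raw = ewma(temps, bias_correction=False)  # starts near 4.0 and climbs slowly
corrected = ewma(temps)                   # stays at about 40.0 from day one
```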

Gradient Descent with Momentum:

Gradient descent with momentum enables slower learning on the vertical axis and faster learning on the horizontal axis.
In practice, bias correction is usually skipped for momentum, because after around 10 iterations the moving average has warmed up and the bias is negligible.
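
A minimal sketch of one momentum update, assuming the form in which the velocity is an exponentially weighted average of the gradients. The learning rate, β, and the toy objective f(w) = w² are illustrative choices.

```python
def momentum_step(w, dw, v, alpha=0.1, beta=0.9):
    """One gradient descent step with momentum (no bias correction)."""
    v = beta * v + (1 - beta) * dw  # moving average of the gradients
    w = w - alpha * v               # step along the averaged gradient
    return w, v

# A single step from w = 5 with gradient 10 and zero initial velocity.
w_one, v_one = momentum_step(w=5.0, dw=10.0, v=0.0)

# Minimizing f(w) = w**2, whose gradient is 2 * w.
w, v = 5.0, 0.0
for _ in range(300):
    w, v = momentum_step(w, 2 * w, v)
```

Averaging the gradients damps components that keep changing sign (the oscillating direction) and lets consistent components build up.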

RMSprop:

RMSprop is used to handle situations where some dw values can be large.
Adding epsilon for numerical stability helps prevent division by zero.
Note that the update uses the element-wise square of the gradient, dw^2.
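
The RMSprop update can be sketched as follows, assuming the standard form: keep a moving average of dw^2 and divide the gradient by its square root. The hyperparameter values and the two gradients are made up.

```python
import math

def rmsprop_step(w, dw, s, alpha=0.01, beta=0.9, eps=1e-8):
    """One RMSprop step for a single parameter."""
    s = beta * s + (1 - beta) * dw ** 2        # moving average of squared gradients
    w = w - alpha * dw / (math.sqrt(s) + eps)  # eps guards against division by zero
    return w, s

# A very large and a very small gradient end up with almost the same step size.
w_steep, s_steep = rmsprop_step(w=0.0, dw=100.0, s=0.0)
w_flat, s_flat = rmsprop_step(w=0.0, dw=0.1, s=0.0)
```

Dividing by the root of the averaged dw^2 shrinks the step in directions with large gradients and enlarges it in directions with small ones.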

Adam Optimization:

Adam optimization, short for Adaptive Moment Estimation, is one of the algorithms that works well across domains.
The commonly used default values are β₁ = 0.9, β₂ = 0.999, and ε = 10^-8.
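
Adam combines the momentum and RMSprop ideas, with bias correction on both moving averages. The sketch below follows that standard description for a single parameter; the learning rate of 0.001 and the starting values are assumptions for the example.

```python
import math

def adam_step(w, dw, v, s, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step for a single parameter; t is the 1-based iteration count."""
    v = beta1 * v + (1 - beta1) * dw       # momentum term (first moment)
    s = beta2 * s + (1 - beta2) * dw ** 2  # RMSprop term (second moment)
    v_hat = v / (1 - beta1 ** t)           # bias-corrected first moment
    s_hat = s / (1 - beta2 ** t)           # bias-corrected second moment
    w = w - alpha * v_hat / (math.sqrt(s_hat) + eps)
    return w, v, s

# On the first step the update is about alpha in size, whatever the gradient scale.
w_new, v_new, s_new = adam_step(w=1.0, dw=3.0, v=0.0, s=0.0, t=1)
```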

Learning Rate Decay:

1 epoch refers to one pass through the entire data.
In the case of mini-batches, one epoch can involve multiple iterations.
Several formulas are used for learning rate decay; all of them shrink the learning rate as the epoch number grows.
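
Two common decay schedules are sketched below. The notes do not say which formulas were shown, so the schedules, function names, and constants here are assumptions chosen for illustration.

```python
def inverse_decay(alpha0, decay_rate, epoch_num):
    """alpha = alpha0 / (1 + decay_rate * epoch_num)"""
    return alpha0 / (1 + decay_rate * epoch_num)

def exponential_decay(alpha0, base, epoch_num):
    """alpha = alpha0 * base ** epoch_num, with base slightly below 1"""
    return alpha0 * base ** epoch_num

# Learning rate over the first few epochs, starting from alpha0 = 0.2.
inverse_schedule = [inverse_decay(0.2, 1.0, e) for e in range(5)]
exponential_schedule = [exponential_decay(0.2, 0.95, e) for e in range(5)]
```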

Local Optima:

Most points with zero gradient are not local optima but rather saddle points, especially in high-dimensional spaces.
Local optima are generally not observed due to the high dimensionality.
Plateaus can be problematic, with very small gradients leading to slower learning.

Inverted Dropout

January 11, 2019 | May 17, 2023 | Archit Vora

This post is a lecture summary of the deep learning course by Andrew Ng, available at https://www.coursera.org/learn/deep-neural-network/home/welcome.

During Training:

Neurons are dropped out by setting them to zero.
The remaining activations are scaled up by dividing them by the keep probability.
The expected value of z[4] (as shown in the screenshot below) should not be altered.

During Scoring:

If “inverted dropout” is used, no additional steps are necessary.
Other dropout techniques may require some computations.
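
A minimal sketch of inverted dropout on a list of activations. The keep probability of 0.8, the seed, and the all-ones activations are assumptions chosen so that the effect on the expected value is easy to read.

```python
import random

def inverted_dropout(activations, keep_prob, rng):
    """Zero each activation with probability 1 - keep_prob, scale the rest up."""
    out = []
    for a in activations:
        keep = rng.random() < keep_prob
        # Dividing by keep_prob keeps the expected value of the layer unchanged.
        out.append(a / keep_prob if keep else 0.0)
    return out

rng = random.Random(0)
activations = [1.0] * 100000  # hypothetical activations, all equal to 1
dropped = inverted_dropout(activations, keep_prob=0.8, rng=rng)

mean_before = sum(activations) / len(activations)
mean_after = sum(dropped) / len(dropped)  # close to mean_before
```

Because the scaling already happens during training, scoring simply uses the activations as they are, with no mask and no rescaling.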

Intuition:

Dropping out neurons causes inputs to the unit to be randomly dropped.
This prevents the unit from relying too heavily on a single feature and encourages it to distribute weights across multiple features.
Different layers can have different keep probabilities.

Side Effect:

The cost function is not well defined.
It’s not possible to check if the cost is consistently decreasing every iteration.
So the usual debugging tool, plotting the cost against the iteration number, is harder to use.

Solution:

First, verify that everything is functioning correctly without dropout.
Then, gradually introduce dropout.

———————————————————————–

Other Regularization Techniques

Data augmentation, such as horizontal flipping, random cropping, and transformations.
Early stopping: Stop training at a certain iteration (e.g., 7k instead of 10k) based on the error observed on the development set.
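
Early stopping can be sketched as picking the iteration with the lowest development-set error. The patience rule (stop after the error has failed to improve for a number of checks) and the error values are assumptions for the example.

```python
def early_stopping_iteration(dev_errors, patience=2):
    """Return the index of the best dev error seen before patience runs out."""
    best_err = float("inf")
    best_iter = 0
    for i, err in enumerate(dev_errors):
        if err < best_err:
            best_err, best_iter = err, i
        elif i - best_iter >= patience:
            break  # dev error has not improved for `patience` checks
    return best_iter

# Hypothetical dev-set errors measured at regular checkpoints.
dev_errors = [0.50, 0.40, 0.33, 0.30, 0.31, 0.33, 0.36]
stop_at = early_stopping_iteration(dev_errors)  # checkpoint 3, where the error is 0.30
```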

Downside

Balancing optimization and avoiding overfitting can be challenging.
Mixing both objectives requires careful consideration.

Advantage

Unlike L2 regularization, early stopping does not necessitate trying different lambda values repeatedly.

Negative sampling in word2vec

January 31, 2018 | Archit Vora

In the previous post we talked about the skip-gram model.

https://datastoriesweb.wordpress.com/2018/01/31/word2vec-and-skip-gram-model/

Now let’s say we have 1000 words and 300 hidden units. We shall have 300,000 weights in each of the hidden and output layers, which is too many parameters.

The output label during training is a one-hot vector with 999 zeros and a single 1. We randomly select 5 of the zeros and update the weights for six words only (the 5 zeros and the single 1). The more frequent a word is, the higher the probability of it getting selected. The Google paper mentions an empirical formula for this. This is known as negative sampling.

The above was for the output layer. In the hidden layer, weights are updated only for the input word, irrespective of whether negative sampling is used.
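
A sketch of how the negative words could be drawn. It assumes the selection probability is proportional to the word count raised to the 3/4 power, which is the empirical choice reported in the word2vec paper; the word counts are made up, and sampling is done with replacement for simplicity.

```python
import random

def negative_sampling_probs(word_counts, power=0.75):
    """Probability of picking each word as a negative sample."""
    weights = {w: c ** power for w, c in word_counts.items()}
    total = sum(weights.values())
    return {w: wt / total for w, wt in weights.items()}

def sample_negatives(probs, positive_word, k, rng):
    """Draw k negative words (with replacement), never the positive word."""
    candidates = [w for w in probs if w != positive_word]
    weights = [probs[w] for w in candidates]
    return rng.choices(candidates, weights=weights, k=k)

# Hypothetical corpus counts
word_counts = {"the": 500, "fox": 20, "quick": 10, "brown": 15,
               "jumps": 5, "lazy": 8, "dog": 25}
probs = negative_sampling_probs(word_counts)
negatives = sample_negatives(probs, positive_word="fox", k=5, rng=random.Random(0))

# Output-layer weights touched per training pair, for 1000 words and 300 units.
full_update = 1000 * 300        # 300,000 without negative sampling
sampled_update = (5 + 1) * 300  # 1,800 with 5 negatives plus the positive word
```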

Reference:

http://mccormickml.com/2017/01/11/word2vec-tutorial-part-2-negative-sampling/

Word2Vec and skip gram model

January 31, 2018 | July 21, 2021 | Archit Vora

Skip gram model

The weights of the hidden layer serve as the word vectors.
There is just one hidden layer and one output layer (softmax).
The hidden layer does not have any activation function.
The input is a one-hot vector, and the training label is also a one-hot vector.
- The output of the hidden layer is simply the word vector of the input word.
In the below diagram:
- Size of input (1 x 10000)
- Size of output (1 x 10000)
- Weight of hidden layer (10000 x 300)
- Weight of output layer (300 x 10000)
- So there are too many weights to learn (3 million in each layer); solution: negative sampling
Training pairs are nearby words within a predefined window.
- We can imagine how huge the number of pairs can be.
- Each pair consists of two words, both one-hot encoded.
- We need to know the size of our vocabulary in advance (it is the dimension of the one-hot vector).
The paper Google released was trained on Google News data and used 300-dimensional vectors, which means 300 neurons in the hidden layer. The paper lists this number along with the number of training words and the efficiency.
- Note there is one more parameter, the window size, which was set to 5.
- It means that the 5 words before and after the center word are paired with it as training data.
There is no activation function on the hidden layer neurons, but the output neurons use softmax.
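
Generating the training pairs can be sketched as below. The sentence and the window size of 2 are chosen to keep the output small; the function name is illustrative.

```python
def skipgram_pairs(tokens, window_size):
    """Return (center, context) pairs for every word within the window."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window_size)
        hi = min(len(tokens), i + window_size + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

tokens = "the quick brown fox jumps".split()
pairs = skipgram_pairs(tokens, window_size=2)  # 14 pairs from a 5-word sentence

# Number of weights for a 10000-word vocabulary and 300 hidden neurons.
hidden_weights = 10000 * 300  # 3,000,000 in the hidden layer
output_weights = 300 * 10000  # 3,000,000 in the output layer
```

Even a five-word sentence produces 14 pairs, which gives a feel for how quickly the training set grows on a real corpus.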

Why word2vec

Earlier NLP methods relied on synonyms/hypernyms, which are not fully contextual.
- The earlier representation was mainly a one-hot encoding of each word.
“proficient” is a synonym of “good” only in some contexts.
New words are added every day.
All words are one-hot encoded.
- Even somewhat similar words end up orthogonal.
- The size of the vector becomes too large.

Role of TF-IDF

It is a scoring mechanism
Instead of a plain average of the vectors of all the words in a document, we can take an average weighted by TF-IDF score.
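
A sketch of the difference between the plain and the TF-IDF weighted document vector. The two-dimensional word vectors and the TF-IDF scores are made up for illustration.

```python
def weighted_average(vectors, weights):
    """Average the vectors component-wise, weighting each by its score."""
    total = sum(weights)
    dim = len(vectors[0])
    return [sum(w * v[d] for v, w in zip(vectors, weights)) / total
            for d in range(dim)]

# Hypothetical word vectors and TF-IDF scores for a two-word document
word_vectors = {"deep": [1.0, 0.0], "the": [0.0, 1.0]}
tfidf = {"deep": 3.0, "the": 1.0}
words = ["deep", "the"]

vectors = [word_vectors[w] for w in words]
doc_plain = weighted_average(vectors, [1.0] * len(words))      # [0.5, 0.5]
doc_tfidf = weighted_average(vectors, [tfidf[w] for w in words])  # [0.75, 0.25]
```

The weighted version pulls the document vector toward the informative word and away from the common one.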

There are two more things:

Continuous Bag Of Words
Negative sampling

CBOW

It also takes the average of the context word vectors.
- One argument in its favour is that averaging is valid.
Neither CBOW nor skip-gram adds a non-linearity in the hidden layer.
- The output layer uses softmax.
- The idea is that the word embedding is used to predict the target word.

References :

http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/

http://cs224d.stanford.edu/

http://web.stanford.edu/class/cs224n/syllabus.html

http://nadbordrozd.github.io/blog/2016/05/20/text-classification-with-word2vec/

Derivation of backpropagation

November 25, 2017 | May 10, 2020 | Archit Vora

Quick Summary:

[Image: back_prop]

Detailed Derivation:

Looking at the gradient descent update below, the error is multiplied by the value coming from the previous layer. When an input is larger, its contribution to the error is larger, so its weight needs to change more.
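
This can be illustrated with a single linear unit and squared error, where the gradient for each weight is the error times the corresponding input. The unit, the loss, and all the numbers are assumptions chosen to keep the arithmetic simple.

```python
def weight_gradients(inputs, error):
    """dL/dw_i = error * x_i for a linear unit with L = 0.5 * (y_hat - y)**2."""
    return [error * x for x in inputs]

inputs = [0.1, 2.0]   # one small and one large input
weights = [0.5, 0.5]
target = 0.55

y_hat = sum(w * x for w, x in zip(weights, inputs))  # prediction of the unit
error = y_hat - target
grads = weight_gradients(inputs, error)  # the larger input gets the larger gradient
```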

Derivation Of Backpropagation – 2

November 5, 2017 | November 25, 2017 | Archit Vora

References :

Pattern Recognition and Machine Learning by Bishop [Page no 244]
Andrew Ng’s course by deeplearning.ai
https://sudeepraja.github.io/Neural/

Deep learning taking off

August 21, 2017 | Archit Vora

I recently started Andrew Ng’s specialization on deep learning and found these two interesting points :

One is about how the performance of an algorithm changes with the amount of data. Traditional algorithms plateau, whereas deep neural networks keep improving as more data becomes available.

[Image: whyD]

Also, for small amounts of data, traditional algorithms with good feature engineering may win over neural nets.

The second is that deep learning requires data, computation, and efficient algorithms. Recent years have seen significant advances in algorithms that increase computational efficiency. For example, the switch from sigmoid to ReLU was an algorithmic change that allowed gradient descent to converge faster.

Ref : https://www.coursera.org/learn/neural-networks-deep-learning/home

Data Stories

Category Deep Learning
