From highscalability.com

Highscalability.com has a plethora of case study blogs that I have learned from. Here are some key takeaways:

Deep Learning in Production

  • Batch as late as possible in the processing chain, ideally right before inference on the GPU, while still meeting the response-time Service Level Agreement (SLA). Batching here is opportunistic rather than scheduled: batch whenever requests are available to group.
  • This differs from data pipelines in Hive, which are scheduled to run daily.
  • Gatekeeping: limit the number of in-flight requests to 10 at a time.
  • Suppose the 99th-percentile inference time of GPT is 50 ms. Once a request is accepted, we guarantee a response time of 5 seconds. If it is not accepted, we return HTTP status 429 (“Too Many Requests”). If excessive 429 responses are observed, we can consider spawning new machines.
  • Unsolved problems:
    • Loading input and pre/post-processing tasks consume CPU, while the expensive GPU remains idle during this time.
    • In a pub/sub model, the message injection rate should match the consumption rate.
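
The gatekeeping idea above can be sketched roughly as follows, assuming a threaded server. `Gatekeeper`, `handle`, and `run_inference` are hypothetical names for illustration, not part of any system described in the case study.

```python
import threading

MAX_IN_FLIGHT = 10  # cap taken from the notes above


class Gatekeeper:
    """Admit at most `limit` concurrent requests; reject the rest with 429."""

    def __init__(self, limit=MAX_IN_FLIGHT):
        self._slots = threading.BoundedSemaphore(limit)

    def handle(self, request, run_inference):
        # Non-blocking acquire: if no slot is free, reject immediately
        # instead of queueing, so accepted requests keep a bounded latency.
        if not self._slots.acquire(blocking=False):
            return 429, None
        try:
            return 200, run_inference(request)
        finally:
            self._slots.release()
```

Rejecting early rather than queueing is what makes a response-time guarantee for accepted requests plausible; the rate of 429 responses can then serve as a scaling signal.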

Uber

  • They shared “What I Wish I Knew (WIWIK),” which primarily focuses on their experience with microservices and the potential downsides.
  • During a talk, the last question was about handling the decoupling of microservices. The answer was that they strive to do their best with engineering practices, but sometimes decoupling challenges still occur.

NN : Batch Norm and Softmax Regression

This post is a lecture summary of Week 3 of the deep learning course by Andrew Ng, available at https://www.coursera.org/learn/deep-neural-network/home/welcome.

Hyperparameter Tuning

  • When the number of parameters is large, random search is better than grid search.
  • Grid search is more useful when the number of parameters is small, as it is more systematic.
  • Not all hyperparameters are equally important.
  • Choosing between babysitting one model (“Panda” strategy) or training multiple models in parallel (“Caviar”) depends on the computational resources available.
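
A minimal sketch of random search, under stated assumptions: the hyperparameter names, their ranges, and the log-scale sampling of the learning rate are illustrative choices, and `evaluate` is a hypothetical callback that trains a model and returns a score where higher is better.

```python
import random


def sample_config(rng):
    """Draw one random hyperparameter setting (names and ranges are assumed)."""
    return {
        "learning_rate": 10 ** rng.uniform(-4, 0),  # log-uniform in [1e-4, 1]
        "hidden_units": rng.randint(50, 100),
    }


def random_search(evaluate, n_trials, seed=0):
    """Return the best (config, score) pair found over n_trials random draws."""
    rng = random.Random(seed)
    best_config, best_score = None, float("-inf")
    for _ in range(n_trials):
        config = sample_config(rng)
        score = evaluate(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score
```

Because every trial draws a fresh value for each hyperparameter, the important ones get explored at many distinct values, which a grid with the same number of trials would not do.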

Batch Normalization

  • Normalizing input features speeds up learning by making the contours more circular.
  • In practice, z[2] is normalized instead of a[2].
  • β and γ are introduced to allow for non-zero mean and non-unit variance.
    • For example, with a sigmoid activation you may want a larger variance to better exploit the non-linearity.
  • Different β and γ values are used for each layer.
  • In deep learning frameworks, batch normalization is often a single flag.
  • With batch normalization, the bias term (b) is cancelled by the mean subtraction, so it has no effect and can be eliminated; β takes over its role.

  • Batch normalization speeds up training, limits shifts in the distribution of activations, gives later layers more consistent inputs, and has a slight regularization effect.
  • Each mini-batch is scaled by its own mean and variance (µ and σ²), which adds noise similar to dropout and discourages later layers from depending on a single feature.
  • Larger mini-batch sizes reduce this regularization effect, but batch normalization is not primarily used for regularization.
  • During training, keep exponentially weighted averages of µ and σ² across mini-batches for each layer.
  • These are running averages, so they do not require much memory.
  • During scoring, use these averaged values instead of per-batch statistics.
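
The training and scoring behaviour described above can be sketched in NumPy as follows. This is a simplified forward pass only; the `running` dictionary, the `momentum` parameter name, and the layout (units as rows, examples as columns) are assumptions made for the example.

```python
import numpy as np


def batchnorm_forward(z, gamma, beta, running, training, momentum=0.9, eps=1e-8):
    """Normalize z of shape (units, batch_size); gamma, beta are (units, 1)."""
    if training:
        # Statistics of the current mini-batch.
        mu = z.mean(axis=1, keepdims=True)
        var = z.var(axis=1, keepdims=True)
        # Exponentially weighted running averages, kept for scoring.
        running["mu"] = momentum * running["mu"] + (1 - momentum) * mu
        running["var"] = momentum * running["var"] + (1 - momentum) * var
    else:
        # At scoring time there may be a single example, so reuse the averages.
        mu, var = running["mu"], running["var"]
    z_norm = (z - mu) / np.sqrt(var + eps)
    return gamma * z_norm + beta
```

Adding a constant to every row of `z` leaves the training-mode output unchanged, which is why a separate bias term is redundant.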

Softmax Regression

  • Softmax regression is a generalization of logistic regression.
  • The output vector has dimensions (C, 1) and uses the “Softmax activation” function.
  • Softmax activation involves taking exponentials and normalizing the values: t = e^z (element-wise), then aᵢ = tᵢ / Σⱼ tⱼ.
  • When C = 2, softmax reduces to logistic regression.
  • The loss function remains the cross-entropy loss: L(ŷ, y) = –Σⱼ yⱼ log ŷⱼ.
  • Only the true class has yⱼ = 1, so minimizing the loss maximizes the predicted probability of that class, which is maximum likelihood estimation.
  • The gradient of the last layer is dz = ŷ – y.
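
As a rough NumPy sketch of the softmax layer, its loss, and the output-layer gradient: the function names are made up for this example, and subtracting the column maximum before exponentiating is a common numerical-stability trick that does not change the result.

```python
import numpy as np


def softmax(z):
    """Column-wise softmax for z of shape (C, m)."""
    t = np.exp(z - z.max(axis=0, keepdims=True))  # shift for numerical stability
    return t / t.sum(axis=0, keepdims=True)


def cross_entropy(y_hat, y):
    """Per-example loss for one-hot labels y of shape (C, m)."""
    return -np.sum(y * np.log(y_hat), axis=0)


def output_gradient(y_hat, y):
    """Gradient of the loss with respect to z at the softmax layer."""
    return y_hat - y
```

With C = 2 and the second logit fixed at 0, the first output equals the logistic sigmoid of the first logit, which matches the reduction to logistic regression.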

Optimization for NN

Mini-batch Gradient Descent:

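  • With mini-batch gradient descent the cost oscillates from one iteration to the next instead of decreasing monotonically, because each step sees a different mini-batch.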
  • Choosing mini-batch size:
    • For small training sets (< 2k samples), it is advisable to use batch gradient descent.
    • For larger training sets, mini-batches of sizes such as 64, 128, 256, or 512 are commonly used.
  • Cross-validation helps in finding the right trade-off.
  • Batch gradient descent: training time is dominated by the cost of a single iteration, since every step processes the entire training set.
  • Stochastic gradient descent: training time is dominated by the number of iterations required for convergence.
    • Vectorization is lost in the case of stochastic gradient descent.
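
A minimal sketch of splitting a training set into shuffled mini-batches, assuming examples are stored as columns. The function name and the default size of 64 are illustrative.

```python
import numpy as np


def mini_batches(X, Y, batch_size=64, seed=0):
    """Yield shuffled (X, Y) mini-batches; examples are stored as columns."""
    m = X.shape[1]
    perm = np.random.default_rng(seed).permutation(m)
    for start in range(0, m, batch_size):
        idx = perm[start:start + batch_size]
        # The last batch is smaller when m is not a multiple of batch_size.
        yield X[:, idx], Y[:, idx]
```

Setting `batch_size` to the number of examples gives batch gradient descent, and setting it to 1 gives stochastic gradient descent.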

Exponentially Weighted Moving Averages:

  • Exponentially weighted moving averages are computed using the formulas:
    • Vₜ = 0.9 * Vₜ₋₁ + 0.1 * θₜ
    • Vₜ = β * Vₜ₋₁ + (1 – β) * θₜ
  • With β = 0.9, Vₜ is approximately an average over the last 1 / (1 – β) = 10 days of temperature.
  • Bias correction removes the bias introduced by initializing with V₀ = 0, which makes the early values too small.
  • The bias-corrected estimate is: Vₜ / (1 – βᵗ)
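
The moving average and its bias correction can be sketched in plain Python as follows; the function name `ewma` is chosen for this example.

```python
def ewma(thetas, beta=0.9, bias_correction=True):
    """Exponentially weighted moving average of a sequence, starting from v = 0."""
    v = 0.0
    averages = []
    for t, theta in enumerate(thetas, start=1):
        v = beta * v + (1 - beta) * theta
        # Dividing by (1 - beta**t) compensates for the zero initialization.
        averages.append(v / (1 - beta ** t) if bias_correction else v)
    return averages
```

For a constant input sequence, the corrected average equals that constant from the first step, while the uncorrected one starts near zero and only approaches it gradually.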

Gradient Descent with Momentum:

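  • Gradient descent with momentum damps oscillations: learning is slower on the vertical axis, where gradients alternate in sign and average out, and faster on the horizontal axis, where they consistently point toward the minimum.
  • In practice, bias correction is usually omitted, because the moving average has warmed up after around 10 iterations.
  • The update uses a moving average of the gradients: v = β * v + (1 – β) * dw, then w = w – α * v.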
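
A small sketch of the momentum update, with a toy gradient sequence in which the horizontal component is constant and the vertical component alternates in sign. The learning rate and the gradients are made up for illustration.

```python
import numpy as np


def momentum_step(w, dw, v, alpha=0.01, beta=0.9):
    """One update of gradient descent with momentum (no bias correction)."""
    v = beta * v + (1 - beta) * dw
    w = w - alpha * v
    return w, v


# Toy run: dw[0] is always 1, dw[1] flips between +1 and -1.
w, v = np.array([1.0, 1.0]), np.zeros(2)
for t in range(50):
    dw = np.array([1.0, (-1.0) ** t])
    w, v = momentum_step(w, dw, v)
# v[0] ends up close to 1, while v[1] stays small because the flips cancel.
```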

RMSprop:

  • RMSprop is used to handle situations where some dw values can be large.
  • Adding epsilon for numerical stability helps prevent division by zero.
  • Note the squared gradient dw² in the update: s = β * s + (1 – β) * dw², then w = w – α * dw / (√s + ε).
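
A minimal sketch of one RMSprop step in NumPy. The default values for `alpha` and `beta` are assumptions for the example, not values stated in the lecture notes.

```python
import numpy as np


def rmsprop_step(w, dw, s, alpha=0.001, beta=0.9, eps=1e-8):
    """One RMSprop update; s is the running average of squared gradients."""
    s = beta * s + (1 - beta) * dw ** 2
    # Dividing by sqrt(s) shrinks steps along directions with large gradients;
    # eps guards against division by zero.
    w = w - alpha * dw / (np.sqrt(s) + eps)
    return w, s
```

On the first step, a coordinate with gradient 100 and one with gradient 0.01 move by almost the same amount, which is the rescaling effect the notes describe.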

Adam Optimization:

  • Adam optimization, short for Adaptive Moment Estimation, combines momentum and RMSprop and is one of the algorithms that works well across domains.
  • Commonly used default values are β₁ = 0.9, β₂ = 0.999, and ε = 10^-8.
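
Adam can be sketched as a momentum term and an RMSprop term, each with bias correction. The β₁, β₂, and ε defaults follow the values listed above; the default `alpha` is an assumption, since the learning rate normally needs tuning.

```python
import numpy as np


def adam_step(w, dw, v, s, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update at iteration t (t starts at 1)."""
    v = beta1 * v + (1 - beta1) * dw       # first moment, as in momentum
    s = beta2 * s + (1 - beta2) * dw ** 2  # second moment, as in RMSprop
    v_hat = v / (1 - beta1 ** t)           # bias-corrected first moment
    s_hat = s / (1 - beta2 ** t)           # bias-corrected second moment
    w = w - alpha * v_hat / (np.sqrt(s_hat) + eps)
    return w, v, s
```

On the very first step the corrected moments reduce to dw and dw², so each coordinate moves by roughly `alpha` in the direction opposite to its gradient, regardless of the gradient's scale.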

Learning Rate Decay:

  • 1 epoch refers to one pass through the entire data.
  • In the case of mini-batches, one epoch can involve multiple iterations.
  • Several decay formulas are in use, for example α = α₀ / (1 + decay_rate * epoch_num), exponential decay α = 0.95^epoch_num * α₀, and α = k * α₀ / √epoch_num.
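
A few common decay schedules, sketched as plain functions of the epoch number. The function names and default constants are illustrative choices.

```python
import math


def inverse_decay(alpha0, epoch, decay_rate=1.0):
    """alpha = alpha0 / (1 + decay_rate * epoch)."""
    return alpha0 / (1 + decay_rate * epoch)


def exponential_decay(alpha0, epoch, base=0.95):
    """alpha = base**epoch * alpha0."""
    return alpha0 * base ** epoch


def sqrt_decay(alpha0, epoch, k=1.0):
    """alpha = k * alpha0 / sqrt(epoch), defined for epoch >= 1."""
    return alpha0 * k / math.sqrt(epoch)
```

For example, with α₀ = 0.2 and decay_rate = 1, `inverse_decay` gives 0.1 after the first epoch and 0.05 after the third.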

Local Optima:

  • Most points with zero gradient are not local optima but rather saddle points, especially in high-dimensional spaces.
  • Local optima are generally not observed due to the high dimensionality.
  • Plateaus can be problematic, with very small gradients leading to slower learning.

Inverted Dropout

This post is a lecture summary of the deep learning course by Andrew Ng, available at https://www.coursera.org/learn/deep-neural-network/home/welcome.

During Training:

  • Neurons are dropped out by setting them to zero.
  • The remaining activations are divided by the keep probability.
  • This keeps the expected value of the next layer's input (for example z[4] when dropout is applied to layer 3) unchanged.

During Scoring:

  • If “inverted dropout” is used, no additional steps are necessary.
  • Other dropout techniques may require extra computation at scoring time, such as scaling activations by the keep probability.
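
A minimal sketch of inverted dropout for one layer's activations. The function signature is made up for this example, and `rng` is assumed to be a NumPy random generator.

```python
import numpy as np


def inverted_dropout(a, keep_prob, rng, training=True):
    """Apply inverted dropout to activations a; a no-op when scoring."""
    if not training:
        return a  # nothing to undo at scoring time
    mask = rng.random(a.shape) < keep_prob  # keep each unit with prob keep_prob
    # Dividing by keep_prob keeps the expected activation unchanged.
    return a * mask / keep_prob
```

Because the rescaling happens during training, the scoring path needs no dropout-specific code at all.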

Intuition:

  • Dropping out neurons causes inputs to the unit to be randomly dropped.
  • This prevents the unit from relying too heavily on a single feature and encourages it to distribute weights across multiple features.
  • Different layers can have different keep probabilities.

Side Effect:

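  • The cost function is no longer well defined, because a different random set of units is dropped on every iteration.
  • It’s not possible to check if the cost is consistently decreasing every iteration.
  • This takes away a useful debugging tool: plotting the cost against the number of iterations.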

Solution:

  • First, verify that everything is functioning correctly without dropout.
  • Then, gradually introduce dropout.

———————————————————————–

Other Regularization Techniques

  • Data augmentation, such as horizontal flipping, random cropping, and transformations.
  • Early stopping: Stop training at a certain iteration (e.g., 7k instead of 10k) based on the error observed on the development set.
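
One simple way to pick the stopping point from a series of development-set errors is sketched below. The `patience` rule is an assumption added for this example; the notes only say to stop based on the dev-set error.

```python
def early_stopping_index(dev_errors, patience=3):
    """Return the index of the best dev error seen before stopping.

    Stops once the error has not improved for `patience` evaluations in a row.
    """
    best_idx, best_err = 0, float("inf")
    for i, err in enumerate(dev_errors):
        if err < best_err:
            best_idx, best_err = i, err
        elif i - best_idx >= patience:
            break  # no improvement for `patience` evaluations
    return best_idx
```

The weights saved at the returned index are the ones to keep; later evaluations are never seen once the loop stops.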

Downside

  • Early stopping ties optimizing the cost function to avoiding overfitting, so balancing the two can be challenging.
  • Because both objectives are mixed in one step, they can no longer be tuned independently.

Advantage

  • Unlike L2 regularization, early stopping does not necessitate trying different lambda values repeatedly.

Deep learning taking off

I recently started Andrew Ng’s specialization on deep learning and found these two points interesting:

The first is about how the performance of an algorithm changes with the amount of data. Traditional algorithms plateau as the data grows, whereas deep neural networks keep improving.

Also, with a small amount of data, traditional algorithms with good feature engineering may win over neural nets.

The second is that deep learning requires data, computation, and efficient algorithms. Recent years have seen significant algorithmic advances that improve computational efficiency. For example, switching from sigmoid to ReLU was an algorithmic change that allowed gradient descent to converge faster.
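
To illustrate the sigmoid-to-ReLU point, here is a small NumPy sketch comparing the derivatives of the two activations. The function names are chosen for this example.

```python
import numpy as np


def sigmoid_grad(z):
    """Derivative of the logistic sigmoid, at most 0.25 and near 0 for large |z|."""
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)


def relu_grad(z):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return (z > 0).astype(float)
```

For large positive inputs the sigmoid derivative is close to zero, so gradient descent makes little progress there, while the ReLU derivative stays at 1.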

 

Ref : https://www.coursera.org/learn/neural-networks-deep-learning/home