KL Divergence

November 24, 2019 | May 16, 2023 | Archit Vora

We need to understand these four terms from information theory:

Self Information
Entropy
KL Divergence
Cross Entropy

Self Information

Information conveyed by a single event
The higher the probability of an event, the less information it conveys
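
As a minimal sketch of the inverse relationship above: self-information is usually defined as I(x) = -log P(x). The helper name `self_information` and the example probabilities are illustrative assumptions, not taken from the references.

```python
import math

def self_information(p, base=2):
    """Self-information -log(p) of an event with probability p (bits when base=2)."""
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    return -math.log(p) / math.log(base)

# A likely event conveys little information; a rare event conveys a lot.
likely = self_information(0.5)       # a fair coin flip
rare = self_information(1 / 1024)    # a 1-in-1024 event
print(likely, rare)
```

With base 2 the result is in bits; with the natural log it is in nats.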

Entropy

Self-information describes an individual event; entropy describes a whole distribution
Entropy is the expected information of an event drawn from that distribution
Distributions that are closer to uniform have higher entropy
- Distributions that are nearly deterministic (where the outcome is almost certain) have lower entropy
If the log base is 2, entropy represents the number of bits needed on average to encode symbols drawn from the distribution P. However, this intuition is used more prominently in communication theory than in machine learning.
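
The points above can be checked numerically. This is a small sketch assuming the standard discrete definition H(P) = -sum_x P(x) log P(x); the function name `entropy` and the two example distributions are made up for illustration.

```python
import math

def entropy(dist, base=2):
    """Entropy of a discrete distribution; zero-probability terms contribute 0."""
    return -sum(p * math.log(p) / math.log(base) for p in dist if p > 0)

uniform = [0.25, 0.25, 0.25, 0.25]
nearly_deterministic = [0.97, 0.01, 0.01, 0.01]

print(entropy(uniform))               # log2(4) = 2 bits for four equally likely symbols
print(entropy(nearly_deterministic))  # far below 2 bits
```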


KL Divergence

If we have two distributions P and Q over the same random variable x, KL divergence tells us how different these two distributions are


Extra amount of information (bits in base 2) needed to send a message containing symbols drawn from P when the encoding was designed for Q
KL divergence is always non-negative (it is zero exactly when P and Q are the same distribution)
- It can be greater than 1
- The number of extra bits required to encode information can be greater than 1
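
A minimal sketch of these properties, assuming the standard discrete definition D_KL(P||Q) = sum_x P(x) log(P(x) / Q(x)). The function name `kl_divergence` and the example distributions are illustrative assumptions.

```python
import math

def kl_divergence(p, q, base=2):
    """D_KL(P || Q) for discrete distributions given as aligned lists.

    Returns infinity if Q assigns zero probability to an outcome P can produce.
    """
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue  # a term with P(x) = 0 contributes nothing
        if qi == 0:
            return math.inf
        total += pi * math.log(pi / qi) / math.log(base)
    return total

p = [0.5, 0.5]
q = [0.999, 0.001]

print(kl_divergence(p, p))  # 0.0: no extra bits when the code matches P
print(kl_divergence(p, q))  # well above 1 bit
```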
Example at [1]:
- We have observed a few events, which gives us an observed distribution
- We want to represent it by some standard distribution, say uniform or binomial.
- Which one should we choose?
  - The one for which KL divergence is minimum
  - This is the same as choosing the one for which the extra information is minimum.
- The binomial distribution has a parameter p (the probability of event = 1). Which value of this parameter should we choose?
  - The one for which KL divergence is minimum.
  - It thus becomes an optimization problem
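
The selection procedure above can be sketched as follows. The observed distribution is made-up data (not the data from [1]), and the coarse grid search is just one simple way to minimise the divergence; all function names are hypothetical.

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q) in nats; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def binomial_pmf(n, prob):
    """Probabilities of 0..n successes in n trials with success probability prob."""
    return [math.comb(n, k) * prob**k * (1 - prob)**(n - k) for k in range(n + 1)]

# Hypothetical observed distribution over the counts 0..4 (made-up numbers).
observed = [0.05, 0.20, 0.35, 0.30, 0.10]
n = len(observed) - 1

# Candidate 1: uniform over the same five outcomes.
uniform = [1 / len(observed)] * len(observed)
kl_uniform = kl_divergence(observed, uniform)

# Candidate 2: binomial, with p chosen by a grid search that minimises KL.
grid = [i / 1000 for i in range(1, 1000)]
best_p = min(grid, key=lambda prob: kl_divergence(observed, binomial_pmf(n, prob)))
kl_binomial = kl_divergence(observed, binomial_pmf(n, best_p))

print(best_p, kl_binomial, kl_uniform)
```

For a binomial with fixed n, minimising KL(observed||binomial) over p coincides with matching the mean (p = mean / n), so the grid search here is only to show the optimisation view; a gradient-based optimiser would work the same way for models without a closed form.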
KL divergence is sometimes described as a distance between two distributions
- But it is not symmetric: KL(P||Q) is not the same as KL(Q||P)
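
A quick numerical check of the asymmetry, using two made-up distributions and a hypothetical helper `kl_divergence`:

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q) in bits; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.9, 0.1]
q = [0.5, 0.5]

forward = kl_divergence(p, q)  # P is the true distribution, Q the approximation
reverse = kl_divergence(q, p)  # roles swapped
print(forward, reverse)        # the two values differ
```

Because swapping the arguments changes the value, KL divergence is not a distance metric in the mathematical sense.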
Application:
- One use case is in variational autoencoders (VAE) [2]
  - A VAE generates sample data, like a GAN
  - For that, we want the encoder's output to be more generic (regularised) before it is passed to the decoder
    - We measure this with the KL divergence between the encoder's output distribution and a fixed prior, typically a standard normal distribution
    - We add this term to the final loss function
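
A rough sketch of how such a term might enter the loss, assuming the common VAE setup in which the encoder outputs the mean and log-variance of a diagonal Gaussian and the reference distribution is a standard normal. In that case the KL term has the closed form 0.5 * sum(mu^2 + sigma^2 - 1 - log sigma^2). The function names and the `kl_weight` parameter are hypothetical, and a real model would compute this on tensors in a deep learning framework.

```python
import math

def gaussian_kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var))

def vae_loss(reconstruction_loss, mu, log_var, kl_weight=1.0):
    """Reconstruction term plus a weighted KL regulariser (kl_weight is an assumption)."""
    return reconstruction_loss + kl_weight * gaussian_kl_to_standard_normal(mu, log_var)

# Encoder output that already matches the prior: the KL term is zero.
print(gaussian_kl_to_standard_normal([0.0, 0.0], [0.0, 0.0]))
# Encoder output far from the prior: the KL term is large and gets penalised.
print(gaussian_kl_to_standard_normal([3.0, -2.0], [1.0, -1.0]))
```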

Cross Entropy

KL divergence measures the extra information (bits) needed to encode P with a code optimised for Q
Cross entropy measures the total information needed to encode P with a code optimised for Q
The formula for log-loss is exactly the same (it is also called cross-entropy loss)
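
The relationship between the three quantities can be sketched as H(P, Q) = H(P) + D_KL(P||Q), where H(P, Q) = -sum_x P(x) log Q(x). The distributions below are made-up examples and the function names are hypothetical.

```python
import math

def entropy(p):
    """H(P) in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def kl_divergence(p, q):
    """D_KL(P || Q) in nats; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def cross_entropy(p, q):
    """H(P, Q) = -sum_x P(x) log Q(x), in nats."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.7, 0.2, 0.1]   # "true" distribution
q = [0.5, 0.3, 0.2]   # model distribution

# Total cost = cost of the optimal code for P + extra cost of using Q's code.
print(cross_entropy(p, q), entropy(p) + kl_divergence(p, q))

# Log-loss for one example: a one-hot label against predicted class probabilities.
label = [0.0, 1.0, 0.0]
predicted = [0.1, 0.8, 0.1]
print(cross_entropy(label, predicted))  # equals -log(0.8)
```

Since H(P) does not depend on the model Q, minimising cross entropy with respect to Q is equivalent to minimising KL divergence.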

Related

The KS test is used for goodness of fit
- It is formulated as a hypothesis test and gives p-values
- It is based on the empirical cumulative distribution function (empirical CDF)
KL divergence is differentiable
- It is popular in machine learning loss functions
- It is based on information theory

References:

[0]: Deep Learning by Ian Goodfellow, Yoshua Bengio, and Aaron Courville, http://www.deeplearningbook.org/

[1]: https://www.countbayesie.com/blog/2017/5/9/kullback-leibler-divergence-explained

[2]: https://towardsdatascience.com/intuitively-understanding-variational-autoencoders-1bfe67eb5daf
