KL Divergence

We need to understand these four terms from information theory:

  1. Self Information
  2. Entropy
  3. KL Divergence
  4. Cross Entropy

Self Information

  • Information conveyed by any event
  • The higher the probability of an event, the lower the information it conveys
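
In the standard definition (as in [0]), the self information of an event x with probability P(x) is written as follows. The symbol I(x) is notation assumed here.

```latex
I(x) = -\log P(x)
```

With log base 2 the unit is bits. For example, an event with P(x) = 1/2 carries 1 bit, while a rarer event with P(x) = 1/8 carries 3 bits.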

Entropy

  • Self information describes an individual event; entropy describes a whole distribution
  • Entropy is the expected information of an event drawn from that distribution
  • Distributions that are closer to uniform have higher entropy
    • Distributions that are nearly deterministic (where the outcome is almost certain) have low entropy
  • If the log base is 2, entropy is the average number of bits needed to encode symbols drawn from the distribution P. This intuition is used more prominently in communication theory than in machine learning.
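
As a small illustration of the points above, the sketch below computes the entropy H(P) = -sum over x of P(x) * log2 P(x) for a uniform and a nearly deterministic distribution. The function name `entropy` and the example probabilities are made up for illustration.

```python
import math

def entropy(p):
    """Entropy in bits of a discrete distribution given as a list of probabilities."""
    # Outcomes with zero probability contribute nothing (0 * log 0 is taken as 0).
    return -sum(px * math.log2(px) for px in p if px > 0)

uniform = [0.25, 0.25, 0.25, 0.25]             # every outcome equally likely
near_deterministic = [0.97, 0.01, 0.01, 0.01]  # one outcome almost certain

print(entropy(uniform))             # 2.0 bits, the maximum for four outcomes
print(entropy(near_deterministic))  # much lower, roughly 0.24 bits
```

The uniform case needs 2 bits per symbol on average, which matches the intuition that four equally likely symbols need a 2-bit code.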


KL Divergence

  • If we have two distributions P and Q over the same random variable x, KL divergence tells us how different these two distributions are
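
For discrete distributions, the standard definition (as in [0]) is the following. The notation D_KL is an assumption here; the text elsewhere writes it as KL.

```latex
D_{KL}(P \,\|\, Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)}
                   = \mathbb{E}_{x \sim P}\left[ \log P(x) - \log Q(x) \right]
```

The sum is finite only when Q(x) > 0 for every x with P(x) > 0.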


  • It is the extra amount of information (bits in base 2) needed to send a message containing symbols drawn from P when the encoding was designed for Q
  • KL divergence is always non-negative, and it is zero only when P and Q are the same distribution
    • It can be greater than 1
    • The extra bits required to encode the information can be greater than 1
  • Example from [1]:
    • We have observed a few events and we have the observed distribution
    • We want to represent it by some standard distribution, say uniform or binomial
    • Which one should we choose?
      • The one for which the KL divergence is minimum
      • This is the same as saying the one for which the extra information is minimum
    • The binomial distribution has a parameter p (the probability of event = 1). Which value of this parameter should we choose?
      • The one for which the KL divergence is minimum
      • It thus becomes an optimization problem
  • KL divergence is sometimes described as a distance between two distributions
    • But it is not symmetric: KL(P||Q) is not the same as KL(Q||P), so it is not a true distance metric
  • Application:
    • One use case is in variational autoencoders (VAE) [2]
      • A VAE generates sample data, like a GAN
      • For that, we want the output of the encoder to be more generic before giving it to the decoder
        • We measure this with the KL divergence between the encoder's output distribution and a standard normal distribution
        • We add it to the final loss function
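
The properties listed above, and the idea of the example from [1], can be sketched in a few lines. This is a minimal sketch and not the code from [1]: the observed distribution, the helper names `kl_divergence` and `binomial_pmf`, and the grid search are all assumptions made for illustration.

```python
import math

def kl_divergence(p, q):
    """KL(P||Q) in bits for two discrete distributions over the same outcomes."""
    # Assumes q[i] > 0 wherever p[i] > 0; outcomes with p[i] == 0 contribute 0.
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def binomial_pmf(n, prob):
    """Probabilities of 0..n successes in n trials with success probability prob."""
    return [math.comb(n, k) * prob**k * (1 - prob) ** (n - k) for k in range(n + 1)]

# Hypothetical observed distribution over the outcomes 0, 1, 2, 3, 4.
observed = [0.05, 0.20, 0.40, 0.25, 0.10]
uniform = [0.2] * 5

# Not symmetric: swapping the arguments gives a different value.
kl_obs_uni = kl_divergence(observed, uniform)
kl_uni_obs = kl_divergence(uniform, observed)
print(kl_obs_uni, kl_uni_obs)

# Can exceed 1 bit when Q gives low probability to the likely outcomes of P.
kl_large = kl_divergence([0.9, 0.1], [0.1, 0.9])
print(kl_large)  # roughly 2.54 bits

# Choosing the binomial parameter p is an optimization problem;
# a coarse grid search over p is enough for this sketch.
candidates = [i / 100 for i in range(1, 100)]
best_p = min(candidates, key=lambda prob: kl_divergence(observed, binomial_pmf(4, prob)))
kl_binomial = kl_divergence(observed, binomial_pmf(4, best_p))
print(best_p, kl_binomial)

# Pick the candidate family with the smaller divergence from the observed data.
print("binomial" if kl_binomial < kl_obs_uni else "uniform")
```

In practice the grid search would be replaced by a proper optimizer. It is used here only because it makes the "minimize the KL divergence over p" step explicit.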

Cross Entropy

  • KL divergence measures the extra information (bits) needed to encode P with a code optimised for Q
  • Cross entropy measures the total information needed to encode P with a code optimised for Q
  • The formula for log loss is exactly the same (it is also called cross-entropy loss)
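
Put together, the bullets above say that cross entropy is entropy plus KL divergence: H(P, Q) = H(P) + KL(P||Q), where H(P, Q) = -sum over x of P(x) * log Q(x). The sketch below checks this numerically. The distributions and function names are hypothetical and chosen only for illustration.

```python
import math

def entropy(p):
    """H(P) in bits."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def kl_divergence(p, q):
    """KL(P||Q) in bits; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def cross_entropy(p, q):
    """H(P, Q) = -sum_x P(x) * log2 Q(x), in bits."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.7, 0.2, 0.1]  # hypothetical true distribution
q = [0.5, 0.3, 0.2]  # hypothetical model distribution

# Total bits = bits inherent to P + extra bits from using a code built for Q.
print(cross_entropy(p, q), entropy(p) + kl_divergence(p, q))

# With a one-hot P (a class label), cross entropy reduces to -log Q(true class),
# which is the log loss for a single example (in bits here; ML libraries
# usually use the natural log instead).
one_hot = [0, 1, 0]
print(cross_entropy(one_hot, q), -math.log2(q[1]))
```

Since H(P) does not depend on Q, minimizing cross entropy with respect to Q is the same as minimizing KL(P||Q).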


Related

  • The KS test is used for goodness of fit
    • It is formulated as a hypothesis test and gives p-values
    • It is based on the empirical cumulative distribution function (empirical CDF)
  • KL divergence is differentiable
    • It is popular in machine learning loss functions
    • It is based on information theory

References:

[0]: Deep Learning by Ian Goodfellow, http://www.deeplearningbook.org/

[1]: https://www.countbayesie.com/blog/2017/5/9/kullback-leibler-divergence-explained

[2]: https://towardsdatascience.com/intuitively-understanding-variational-autoencoders-1bfe67eb5daf
