Naive Bayes Classifier

December 4, 2018 | May 17, 2023 | Archit Vora

A Naive Bayes classifier consists of two things [1]:
- Probability model
- Classification model

Probability Model

The probability model is an application of Bayes’ rule. It makes two assumptions:

Independence of Features: All features are assumed to be independent of each other given the class. For example, the model assumes that a higher temperature says nothing about humidity. In real data this assumption often does not hold.
Equal Weight of Features: All features are assumed to have equal importance or weight in the model.

Classification Model

The classification model involves the following steps:

Probability of Each Class: P(y) is the probability of each class, estimated from the training set.
Probability Estimation of Feature Values: The goal is to estimate the probability distribution of each feature value given a specific class, denoted P(x_i|y). For discrete features, simple frequency counts are enough. For continuous features, a Gaussian distribution can be used. For count data, a multinomial distribution is suitable.
Parameter Estimation: Parameters are estimated separately for each combination of class and feature.
Scikit-learn and Distribution Types: Scikit-learn provides Gaussian, Bernoulli, and multinomial Naive Bayes classifiers. The names refer to the distribution assumed for the features. Different features can follow different distributions, so the distribution may need to be customized for the application.
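
The steps above can be sketched from scratch for the Gaussian case. This is a minimal illustration, not scikit-learn's implementation: the function names, the toy temperature/humidity data, and the small variance floor are all assumptions made for this example. Prediction picks the class with the largest log P(y) + sum of log P(x_i|y).

```python
import math
from collections import defaultdict

def fit_gaussian_nb(X, y):
    """Estimate P(y) and a per-class, per-feature mean and variance."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        prior = len(rows) / len(X)
        columns = list(zip(*rows))
        means = [sum(col) / len(rows) for col in columns]
        # The 1e-9 floor keeps variances strictly positive (assumption of this sketch).
        variances = [
            sum((v - m) ** 2 for v in col) / len(rows) + 1e-9
            for col, m in zip(columns, means)
        ]
        model[label] = (prior, means, variances)
    return model

def log_gaussian(x, mean, var):
    """Log density of a univariate Gaussian."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)

def predict(model, row):
    """Return the class maximizing log P(y) + sum_i log P(x_i | y)."""
    scores = {
        label: math.log(prior)
        + sum(log_gaussian(x, m, v) for x, m, v in zip(row, means, variances))
        for label, (prior, means, variances) in model.items()
    }
    return max(scores, key=scores.get)

# Toy data invented for illustration: [temperature, humidity] -> label
X = [[30.0, 80.0], [32.0, 85.0], [31.0, 78.0],
     [15.0, 40.0], [17.0, 45.0], [14.0, 42.0]]
y = ["rain", "rain", "rain", "dry", "dry", "dry"]
model = fit_gaussian_nb(X, y)
```

Working in log space avoids underflow when many small per-feature probabilities are multiplied together.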

Advantages

Fast and Easy Implementation: Naive Bayes classifiers are known for their simplicity and efficiency in implementation.
Acceptable Classification Performance: While Naive Bayes classifiers may not always accurately predict probabilities, their classification performance is generally satisfactory.

Disadvantage

Independence Assumption: The assumption of feature independence does not hold true in all scenarios, which can affect the model’s accuracy.

Reference

[0] https://towardsdatascience.com/naive-bayes-classifier-81d512f50a7c

[1] https://en.wikipedia.org/wiki/Naive_Bayes_classifier

Exponential Family

September 18, 2018 | Archit Vora

Here is the basic concept:

[Image: simple_exponential]

θ are the parameters and X is the data; both can be multidimensional.
We want to restrict the terms inside the exponential to the form θ*X.

Formal Definition:

[Image: formal_defination]

The η and T functions also handle the case where the dimensions of θ and X do not match.
g(θ) from the basic concept above has been moved inside the exponential as B(θ).
- Roughly speaking, it serves as the normalization factor.
h(x) serves as a base distribution, and the exponential term transforms this base distribution.
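
The formula images are not reproduced here, so the following is only a sketch of one common way to write both forms. It assumes the convention in which B(θ) enters the exponent with a minus sign; the original images may use a different sign or layout. The Bernoulli distribution is included as a worked instance.

```latex
% Basic concept (assumed form): the normalizer g(\theta) sits outside the exponential
p(x \mid \theta) = h(x)\, g(\theta)\, \exp\!\big(\theta^{\top} x\big)

% Formal definition (assumed sign convention), with g(\theta) = \exp(-B(\theta))
p(x \mid \theta) = h(x)\, \exp\!\big(\eta(\theta)^{\top} T(x) - B(\theta)\big)

% Worked example: Bernoulli with success probability \mu, x \in \{0, 1\}
p(x \mid \mu) = \mu^{x} (1-\mu)^{1-x}
             = \exp\!\Big(x \log\tfrac{\mu}{1-\mu} + \log(1-\mu)\Big)
% so that h(x) = 1,\; T(x) = x,\; \eta(\mu) = \log\tfrac{\mu}{1-\mu},\; B(\mu) = -\log(1-\mu)
```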

[Image: examples]

Thompson Sampling

September 2, 2017 | January 1, 2021 | Archit Vora

Thompson sampling is one approach to the multi-armed bandit problem and to the exploration-exploitation dilemma faced in reinforcement learning. It is also known as posterior sampling.

The challenge in solving such a problem is that we might end up pulling the same arm again and again. The Bayesian approach helps resolve this dilemma by starting from a prior with fairly high variance, so that arms with little data still get sampled.

Here is the code for a two-armed bandit. One arm has a success probability of 40% (bandit 0) and the other 25% (bandit 1).

We use a beta distribution to decide which arm to pull. The beta distribution has two parameters, alpha and beta. Higher values of alpha pull the distribution towards 1, and higher values of beta pull it towards 0. The beta distribution is always confined between 0 and 1.

Training works as follows: for each feedback we receive, we increment alpha by 1 if it was a success, or beta by 1 if it was a failure. To choose an arm, we draw a random sample from the distribution of each arm and select the arm with the highest sampled value.

	import scipy.stats as stats
	from matplotlib import pyplot as plt


	class BetaThompson:
	    """Thompson sampling agent keeping one Beta(a, b) posterior per arm."""

	    def __init__(self, num_bandits, prior_a, prior_b):
	        self.num_bandits = num_bandits
	        # Copy the priors so the caller's lists are not mutated by learn().
	        self.a = list(prior_a)
	        self.b = list(prior_b)

	    def learn(self, bandit, success):
	        # Conjugate update: a success increments alpha, a failure increments beta.
	        if success:
	            self.a[bandit] += 1
	        else:
	            self.b[bandit] += 1

	    def suggest(self):
	        # Draw one sample from each arm's posterior and pick the largest.
	        sampled_prob = []
	        for i in range(self.num_bandits):
	            dist = stats.beta(self.a[i], self.b[i])
	            sampled_prob.append(dist.rvs())
	        return sampled_prob.index(max(sampled_prob))


	class Gamble:
	    """Simulated bandits; binom_mean[i] is the success probability of arm i."""

	    def __init__(self, num_bandits, binom_mean):
	        self.num_bandits = num_bandits
	        self.binom_mean = binom_mean

	    def gamble(self, bandit_no):
	        # A binomial with n=1 is a single Bernoulli trial returning 0 or 1.
	        dist = stats.binom(n=1, p=self.binom_mean[bandit_no])
	        return bool(dist.rvs() > 0.5)


	def main():
	    g = Gamble(2, [0.40, 0.25])
	    bot = BetaThompson(2, [1, 1], [1, 1])  # Beta(1, 1) is a uniform prior
	    choice = []
	    for i in range(5000):
	        suggestion = bot.suggest()
	        result = g.gamble(suggestion)
	        bot.learn(suggestion, result)
	        choice.append(suggestion)

	    # Plot which arm was chosen at each draw.
	    plt.figure(figsize=(12, 5))
	    plt.plot(choice, 'b.')
	    plt.ylim([-0.5, 1.5])
	    plt.title('Choice made (Bandit 0-0.40 prob and bandit 1-0.25 prob)')
	    plt.xlabel('Draw #')
	    plt.ylabel('Bandit (0 or 1)')
	    plt.savefig('choice.png')


	if __name__ == "__main__":
	    main()


Here are the simulation results. Initially both arms are pulled frequently, but gradually arm 1 is pulled less and less, though it never drops straight to zero.

Math Behind Increasing alpha and beta

In one line: with a beta prior and a Bernoulli likelihood, the posterior is again a beta distribution. In other words, the beta distribution is a conjugate prior for the Bernoulli likelihood. A list of various conjugate priors is available at [1].
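
The conjugacy statement can be spelled out in a short derivation. Here θ denotes the unknown success probability of an arm and x ∈ {0, 1} a single observed outcome; these symbols are introduced only for this sketch.

```latex
% Prior: \theta \sim \mathrm{Beta}(\alpha, \beta)
p(\theta) \propto \theta^{\alpha - 1} (1 - \theta)^{\beta - 1}

% Bernoulli likelihood of one outcome x \in \{0, 1\}
p(x \mid \theta) = \theta^{x} (1 - \theta)^{1 - x}

% Posterior: multiply and collect exponents
p(\theta \mid x) \propto \theta^{\alpha + x - 1} (1 - \theta)^{\beta + (1 - x) - 1}
\;\;\Longrightarrow\;\; \theta \mid x \sim \mathrm{Beta}(\alpha + x,\; \beta + 1 - x)
```

A success (x = 1) therefore adds 1 to alpha and a failure (x = 0) adds 1 to beta, which is exactly the update rule used during training.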

The PDF of the beta distribution is simple once you think in terms of the effect of alpha and beta.
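
To make the effect of alpha and beta concrete, here is a small sketch using only the standard library. The helper names `beta_pdf` and `beta_mean` are chosen for this example and are not from any library.

```python
import math

def beta_pdf(x, alpha, beta):
    """Density of Beta(alpha, beta) at x, for 0 < x < 1."""
    norm = math.gamma(alpha + beta) / (math.gamma(alpha) * math.gamma(beta))
    return norm * x ** (alpha - 1) * (1 - x) ** (beta - 1)

def beta_mean(alpha, beta):
    """Mean of Beta(alpha, beta); larger alpha moves it towards 1."""
    return alpha / (alpha + beta)

uniform_density = beta_pdf(0.5, 1, 1)   # Beta(1, 1) is flat on [0, 1]
mean_after_wins = beta_mean(8, 2)       # mostly successes: mean near 1
mean_after_losses = beta_mean(2, 8)     # mostly failures: mean near 0
```

As alpha + beta grows the density concentrates around the mean, which is why a well-explored arm produces samples close to its observed success rate.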

Look at note below, formula of beta distribution is not complex, actually it is very similar to

References

[1] : https://ocw.mit.edu/courses/mathematics/18-05-introduction-to-probability-and-statistics-spring-2014/readings/MIT18_05S14_Reading15a.pdf

On multivariate Gaussian

June 23, 2017 | May 16, 2023 | Archit Vora

Formulas

Formula for multivariate gaussian distribution

Formula of univariate gaussian distribution
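
The formula images are not reproduced here. The standard densities, in the notation of [0] (mean μ, covariance Σ, dimension n), are:

```latex
% Multivariate Gaussian, x \in \mathbb{R}^{n}
p(x; \mu, \Sigma) = \frac{1}{(2\pi)^{n/2} \, |\Sigma|^{1/2}}
    \exp\!\Big( -\tfrac{1}{2} (x - \mu)^{\top} \Sigma^{-1} (x - \mu) \Big)

% Univariate Gaussian
p(x; \mu, \sigma^{2}) = \frac{1}{\sqrt{2\pi}\, \sigma}
    \exp\!\Big( -\frac{(x - \mu)^{2}}{2 \sigma^{2}} \Big)
```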

Notes:

There is a normalization constant in both equations.
Σ being positive definite ensures that the quadratic bowl in the exponent opens downwards.
σ² being positive likewise ensures that the parabola opens downwards.

On Covariance Matrix

Definition of covariance between two vectors:

When we have more than two variables, we present the covariances in matrix form. The covariance matrix will look like

The above is very similar to how we compute the variance in 1-D: σ² = E[(x − μ)²].
The formula of the multivariate Gaussian distribution demands that Σ be non-singular (invertible). Since every covariance matrix is symmetric positive semidefinite, this in turn means Σ must be symmetric positive definite.
For some data these demands might not be met, for example when one variable is an exact linear combination of the others.
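
As a rough illustration, the sketch below computes a sample covariance matrix in plain Python and shows a case where the invertibility requirement fails. The function name, the n − 1 denominator, and the toy data (second column is exactly twice the first) are choices made for this example.

```python
def covariance_matrix(rows):
    """Sample covariance matrix (n - 1 denominator); rows are observations."""
    n = len(rows)
    cols = list(zip(*rows))
    means = [sum(c) / n for c in cols]
    return [
        [
            sum((a - ma) * (b - mb) for a, b in zip(ca, cb)) / (n - 1)
            for cb, mb in zip(cols, means)
        ]
        for ca, ma in zip(cols, means)
    ]

# Toy data: the second variable is exactly 2x the first.
data = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
cov = covariance_matrix(data)

# A zero determinant means the matrix is singular, so the Gaussian
# density (which needs the inverse of the covariance) is undefined.
det = cov[0][0] * cov[1][1] - cov[0][1] * cov[1][0]
```

The matrix is symmetric and positive semidefinite, as every covariance matrix is, but it is not positive definite.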

Side Note

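Covariance is a directional measure: its sign tells whether two variables move together or in opposite directions.
Correlation is a scaled measure.
- We normalise the covariance by the standard deviations of the two variables.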

Derivations

Following derivations are available at [0]:

We can prove [0] that when the covariance matrix is diagonal (i.e. the variables are independent), the multivariate Gaussian density is simply the product of the univariate Gaussian densities of each variable.
It was derived that the isocontours (figure 1) are elliptical and that each axis length is proportional to the standard deviation of the corresponding variable.
The isocontours remain ellipses (rotated) when the covariance matrix is not diagonal, and become ellipsoids for dimension n > 2.
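
The diagonal-covariance result can be checked numerically. The sketch below is limited to two dimensions and uses hand-picked numbers; the function names are made up for this example.

```python
import math

def univariate_pdf(x, mu, var):
    """Density of a 1-D Gaussian with mean mu and variance var."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def bivariate_pdf(x, mu, cov):
    """Density of a 2-D Gaussian; cov is a 2x2 nested list."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    inv = [[d / det, -b / det], [-c / det, a / det]]
    dx = [x[0] - mu[0], x[1] - mu[1]]
    quad = sum(dx[i] * inv[i][j] * dx[j] for i in range(2) for j in range(2))
    # For n = 2 the normalizer (2*pi)^(n/2) * |cov|^(1/2) is 2*pi*sqrt(det).
    return math.exp(-0.5 * quad) / (2 * math.pi * math.sqrt(det))

x, mu = [1.0, -0.5], [0.5, 0.0]
diag_cov = [[2.0, 0.0], [0.0, 0.5]]

joint = bivariate_pdf(x, mu, diag_cov)
product = univariate_pdf(x[0], mu[0], 2.0) * univariate_pdf(x[1], mu[1], 0.5)
```

With a non-zero off-diagonal entry the two values differ, because the joint density no longer factorizes.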

Notes and example of bivariate Gaussian

https://github.com/arcarchit/datastories/blob/master/notes/bivariant_gaussian.pdf

The first part above says that a bivariate distribution can be generated from two standard normal variables z ~ N(0,1).

Any given k-variate Gaussian can be represented as a linear combination of k standard normal variables. One simple way to find the coefficients is the Cholesky decomposition. Theorem 1 below states the same thing.

This has a reference from [1].
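
A minimal sketch of this construction follows, using only the standard library. The function names and the example covariance matrix are assumptions for illustration; the covariance is assumed to be symmetric positive definite.

```python
import math
import random

def cholesky(cov):
    """Lower-triangular L with L @ L^T == cov (cov symmetric positive definite)."""
    n = len(cov)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(cov[i][i] - s)
            else:
                L[i][j] = (cov[i][j] - s) / L[j][j]
    return L

def sample_gaussian(mu, cov, rng=random):
    """Draw x = mu + L z, where z holds independent N(0, 1) samples."""
    L = cholesky(cov)
    z = [rng.gauss(0.0, 1.0) for _ in mu]
    return [m + sum(L[i][k] * z[k] for k in range(i + 1)) for i, m in enumerate(mu)]

cov = [[4.0, 2.0], [2.0, 3.0]]
L = cholesky(cov)
sample = sample_gaussian([0.0, 0.0], cov, rng=random.Random(0))
```

Each row of L holds the coefficients of the linear combination of standard normals for one variable.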

Linear Transformation Interpretation

This was proved in two steps [0]:

Step-1 : Factorizing covariance matrix

Step-2 : Change of variables, which we apply to density function

On Practical Example

Height, weight, and waist size of men in the US. (Of course weight cannot be negative, so it is only approximately normal.)

References

[0] http://cs229.stanford.edu/section/gaussians.pdf

[1] https://www2.stat.duke.edu/courses/Spring12/sta104.1/Lectures/Lec22.pdf

Probability Rules and Tricky Questions

June 21, 2017 | June 13, 2026 | Archit Vora

Introduction:

Understanding probability rules and solving tricky probability questions can be challenging. In this blog post, we will explore key probability rules and discuss solutions to some intriguing questions.

Probability Rules:

Joint Distribution: The probability of events X and Y occurring together is denoted p(X, Y) and is known as the joint distribution.
Conditional Distribution: The probability of event X given event Y is denoted p(X|Y) and is known as the conditional distribution.
Marginal Distribution: The probability of event X, with event Y marginalized out, is denoted p(X) and is known as the marginal distribution.

Operations:

Making Conditional Distribution: To obtain the conditional distribution, normalization is required.
Marginalization: Marginalization does not require normalization.

Note: The conditional distribution cannot be obtained from the joint distribution through integration alone. After selecting the slice of the joint where Y takes the observed value, we must also divide by the marginal p(Y) so that the result sums to one.

There are just two rules of probability: the sum rule and the product rule. And then there is Bayes' theorem, which can be derived from the product rule and the fact that P(x,y) = P(y,x).
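
Written out for discrete variables, the two rules and the derivation of Bayes' theorem are:

```latex
% Sum rule (marginalization)
p(x) = \sum_{y} p(x, y)

% Product rule
p(x, y) = p(x \mid y)\, p(y)

% Bayes' theorem: apply the product rule both ways and use p(x, y) = p(y, x)
p(y \mid x)\, p(x) = p(x \mid y)\, p(y)
\;\;\Longrightarrow\;\;
p(y \mid x) = \frac{p(x \mid y)\, p(y)}{p(x)}
```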

We might want to look at a table like the one below and calculate the joint and conditional distributions, and marginalize out one of the variables. [1]
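
Since the table image is not reproduced here, the sketch below uses a hypothetical joint table with invented numbers to show both operations. The function names are made up for this example.

```python
from fractions import Fraction as F

# Hypothetical joint distribution p(X, Y); the numbers are invented.
joint = {
    ("x0", "y0"): F(1, 10), ("x0", "y1"): F(3, 10),
    ("x1", "y0"): F(2, 10), ("x1", "y1"): F(4, 10),
}

def marginal_x(joint):
    """Sum rule: add up the joint over every value of Y."""
    out = {}
    for (x, _), p in joint.items():
        out[x] = out.get(x, 0) + p
    return out

def conditional_x_given_y(joint, y):
    """Select the entries with Y = y, then renormalize by p(Y = y)."""
    selected = {x: p for (x, yy), p in joint.items() if yy == y}
    p_y = sum(selected.values())
    return {x: p / p_y for x, p in selected.items()}

p_x = marginal_x(joint)
p_x_given_y0 = conditional_x_given_y(joint, "y0")
```

Marginalization only sums, while conditioning needs the extra division so the selected entries sum to one.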

Probability Tricky Question

These questions are taken from [2]. One key to solving them is to write down the sample space and keep eliminating choices. Don’t conclude in a hurry.

Q1 : A man comes up to you on the street and says: I have two children. At least one of them is a boy. What is the probability that the other child is also a boy?

Q2 : I have two kids, what are the odds I have 2 boys?

Q3 : A man comes up to you on the street and says: I have two children. The older one is a boy. What is the probability that the other child is also a boy?

Q4 : A man comes up to you on the street and says: I have two children. One is the boy standing here next to me. What is the probability that the other child is also a boy?

Q5 : A man comes up to you on the street and says: I have two children. One of them is a boy who was born in the summer. What is the probability that the other child is also a boy? (There are four seasons: spring, summer, fall, winter) [0]

Ans1 : (1/3)

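Over all two-child families, P(one boy and one girl) = 1/2 and P(BB) = 1/4. Conditioning on at least one boy leaves probability 3/4, so P(BB | at least one boy) = (1/4)/(3/4) = 1/3.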

Ans2 : (1/4)

Ans3 : (1/2)

Ans4 : (1/2)

Ans5 : (7/15) [0]

Compare Q1 and Q5: the probability increases from 1/3 to 7/15. Being a boy born in summer is a rare thing. If that rare thing has occurred, it is more likely that the family has two boys, since two boys give two chances for it to happen.
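
Writing down the sample space can be automated. The sketch below enumerates equally likely two-child families and counts; it assumes boys and girls, and the four seasons, are equally likely and independent. Q4 is not modelled separately because, like Q3, it identifies one specific child.

```python
from fractions import Fraction
from itertools import product

def prob_both_boys(children, condition):
    """P(both boys | condition) over equally likely (older, younger) families."""
    families = [f for f in product(children, repeat=2) if condition(f)]
    both = [f for f in families if all(child[0] == "B" for child in f)]
    return Fraction(len(both), len(families))

sexes = [("B",), ("G",)]
q1 = prob_both_boys(sexes, lambda f: any(c[0] == "B" for c in f))  # at least one boy
q2 = prob_both_boys(sexes, lambda f: True)                          # no information
q3 = prob_both_boys(sexes, lambda f: f[0][0] == "B")                # older child is a boy

seasons = ["spring", "summer", "fall", "winter"]
kids = [(sex, season) for sex in "BG" for season in seasons]
q5 = prob_both_boys(kids, lambda f: any(c == ("B", "summer") for c in f))
```

For Q5 there are 64 equally likely families, 15 of which contain a summer-born boy, and 7 of those have two boys.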

A bag contains (x) one rupee coins and (y) 50 paise coins. Four coins are taken from the bag and put away.
If a coin is now taken at random from the bag, what is the probability that it is a one rupee coin?

The answer is x/(x+y). It stays the same whether we first remove 1, 2, 3, 4, or 5 coins, because we don’t know which coins were withdrawn. Averaging over all possibilities for the removed coins gives back the original proportion. [4]
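
The averaging argument can be checked exactly. The sketch below sums over how many one rupee coins were among those removed (a hypergeometric distribution); the function name is made up, and it assumes more coins are in the bag than are removed.

```python
from fractions import Fraction
from math import comb

def prob_rupee_after_removal(x, y, removed=4):
    """P(next coin is one rupee) after `removed` unseen coins are taken out."""
    total = x + y
    result = Fraction(0)
    for k in range(removed + 1):  # k = one rupee coins among the removed ones
        # Probability that exactly k of the removed coins are one rupee coins.
        p_k = Fraction(comb(x, k) * comb(y, removed - k), comb(total, removed))
        # Given k, the bag holds x - k one rupee coins out of total - removed.
        result += p_k * Fraction(x - k, total - removed)
    return result

answer = prob_rupee_after_removal(5, 7)
```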

The probability of a car passing a certain intersection in a 20 minute window is 0.9. What is the probability of a car passing the intersection in a 5 minute window? (Assuming a constant probability throughout)

Ans : 0.4377 [5]
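
One way to reach this number, assuming the four 5 minute windows are independent and identical: no car passes in 20 minutes only if no car passes in each of the four windows, so (1 − p)^4 = 0.1. The function name below is made up for this sketch.

```python
def prob_in_subwindow(p_full, full_minutes, sub_minutes):
    """P(at least one car) in a shorter window, given it for the full window."""
    p_none_full = 1.0 - p_full
    # Independent, identical sub-windows: P(none in full) = P(none in sub) ** ratio.
    p_none_sub = p_none_full ** (sub_minutes / full_minutes)
    return 1.0 - p_none_sub

p5 = prob_in_subwindow(0.9, 20, 5)
```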

Independent Events

Mutually exclusive events (each with non-zero probability) are dependent events.
For independent events, P(A|B) = P(A).
For mutually exclusive events, if we know B has occurred, A will never occur.

If two random variables, X and Y, are independent, they satisfy the following conditions.

P(x|y) = P(x), for all values of X and Y.

P(X, Y) = P(x ∩ y) = P(x) * P(y), for all values of X and Y.

Here is an example from [6]. The answer is that X and Y are independent, while A and B are not.
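
The product condition is easy to test on a joint table. The sketch below uses two hypothetical tables with invented numbers (not the example from [6]); the second encodes two mutually exclusive events as 0/1 indicators.

```python
from fractions import Fraction as F

def is_independent(joint):
    """True if p(x, y) == p(x) * p(y) for every pair of values."""
    xs = {x for x, _ in joint}
    ys = {y for _, y in joint}
    p_x = {x: sum(p for (a, _), p in joint.items() if a == x) for x in xs}
    p_y = {y: sum(p for (_, b), p in joint.items() if b == y) for y in ys}
    return all(joint.get((x, y), 0) == p_x[x] * p_y[y] for x in xs for y in ys)

# Built as p(x) * p(y) with p(x) = (1/4, 3/4) and p(y) = (1/3, 2/3).
independent_joint = {
    (0, 0): F(1, 12), (0, 1): F(2, 12),
    (1, 0): F(3, 12), (1, 1): F(6, 12),
}

# Mutually exclusive events: both can never happen together.
exclusive_joint = {
    (0, 0): F(0), (0, 1): F(1, 2),
    (1, 0): F(1, 2), (1, 1): F(0),
}
```

Exact fractions are used so the equality check is not affected by floating point rounding.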

Further reading :

[0] https://math.stackexchange.com/questions/198713/why-is-the-probability-of-having-2-boys-7-15

[1] https://www.coursera.org/learn/probabilistic-graphical-models/lecture/slSLb/distributions

[2] http://adit.io/posts/2017-12-05-A-Mind-Boggling-Probability-Problem.html

[3] https://en.wikipedia.org/wiki/Boy_or_Girl_paradox

[4] http://moorthythanu.blogspot.com/2016/03/probability-of-getting-one-rupee-coin.html

[5] https://math.stackexchange.com/questions/1016268/probability-of-crossing-a-point-in-a-given-time-window

[6] https://stattrek.com/random-variable/independence.aspx

Data Stories

Tag probability
