Time series week 1

  • Plotting in R
  • Linear regression properly fitted or not
    • Residue are important thing to observed
    • Q-Q plots for normality test
    • Residues over time
      • Zoomed in residues over time
  • Hypothesis test
    • One, two sided t test
    • Confidence interval
      • Where we think mean lies
      • If it dose not contain 0 we tend to reject null hypothesis (Very broad statement, but I think you got the concept)
  • Correlation function
    • Which quarter data false

 

Ref : https://www.coursera.org/learn/practical-time-series-analysis/home/welcome

 

 

 

Parametric and Nonparametric tests

We rarely heard of nonparametric tests while reading standard statistical books. However there are some scenarios where they should be used instead of parametric tests. [1] has beautiful blog about it, I am putting just a summary from that.

 

Different Tests

Table below displays various tests, I have verified that all of these tests are available in python stats package.

tests_1

When to Use Parametric Tests

  • Parametric tests can perform well with skewed and nonnormal distributions
    • It is important to follow guidance in the sample size of data as shown in table below
  • Parametric tests can perform well when the spread/variance of each group is different
  • It has Statistical power

tests_2

Reasons to Use Nonparametric Tests

  • Your area of study is better represented by the median
    • Income distribution is skewed and median is more useful than mean
    • Few billionaires can boost up the mean significantly
  • You have a very small sample size
    • Even less than what is mentioned in table above
  • You have ordinal data, ranked data, or outliers that you can’t remove

 

 

References:

[1] : http://blog.minitab.com/blog/adventures-in-statistics-2/choosing-between-a-nonparametric-test-and-a-parametric-test

On multivariate Gaussian

Formulas

Formula for multivariate gaussian distribution

g1

Formula of univariate gaussian distribution

g2

Notes:

  • There is normality constant in both equations
  • Σ being a positive definite ensure quadratic bowl is downwards
  • σ2 also being positive ensure that parabola is downwards

On Covariance Matrix

Definition of covariance between two vectors:

g3

When we have more than two variable we present them in matrix form. So covariance matrix will look like

g4

  • Above is very similar to how we compute sigma^2 in 1-D = (x – mu)^2
  • Formula of multivariate gaussian distribution demands Σ to be singular and symmetric positive semidefinite, which in terms means sigma will be symmetric positive semidefinite.
  • For some data above demands might not meet

Side Note

  • Covariance is directional measure
  • Correlation is scaled measure
    • We normalise by individual variance

Derivations

Following derivations  are available at [0]:

  • We can prove[0] that when covariance matrix is diagonal (i.e there is variables are independent) multivariate gaussian distribution is simply multiplication of single gaussian distribution of each variable.
  • It was derived that shape of isocontours (figure 1) is elliptical and axis length is proportional to individual variance of that variable
  • Above is true even when covariance matrix is not diagonal and for dimension n>2 (ellipsoids)

g5

Notes and example of bi-variant Gaussian

https://github.com/arcarchit/datastories/blob/master/notes/bivariant_gaussian.pdf

First part above says that bi-variant destitution can be generated from two standard normal distribution z = N(0,1).

For any given k-variant Gaussian we can represent it as linear combination of k standard normal distribution. One simpler way to find these coefficient is Cholesky decomposition. Theorem 1 below stats the same thing.

This has a reference from [1].

Linear Transformation Interpretation

g6

This was proved in two steps [0]:

Step-1 : Factorizing covariance matrix

g7

Step-2 : Change of variables, which we apply to density function

g8

On Practical Example

Height, wight and waist size of men in US (Of course it weight can be negative, so it is approximately normal)

References

[0] http://cs229.stanford.edu/section/gaussians.pdf

[1] https://www2.stat.duke.edu/courses/Spring12/sta104.1/Lectures/Lec22.pdf

Probability Rules and Tricky Questions

Introduction:

Understanding probability rules and solving tricky probability questions can be challenging. In this blog post, we will explore key probability rules and discuss solutions to some intriguing questions.

Probability Rules:

  1. Joint Distribution: The probability of events X and Y occurring together is denoted as p(X, Y) and is known as the joint distribution.
  2. Conditional Distribution: The probability of event X given event Y is denoted as p(X/Y) and is known as the conditional distribution.
  3. Marginal Distribution: The probability of event X, with event Y marginalized out, is denoted as p(X) and is known as the marginal distribution.

Operations:

  • Making Conditional Distribution: To obtain the conditional distribution, normalization is required.
  • Marginalization: Marginalization does not require normalization.

Note: It is not possible to derive the conditional distribution from the joint distribution solely through integration. There is no direct relationship between them.

There are just two rules for probability. Sum rule and product rules. And then there is Bayes theorem. Bayes theorem can be derived from product rule and the fact that P(x,y) = P(y,x)

p1
p2
p3

We might want to look at a table like below and calculate joint and conditional distribution and marginalized out one of the variable. [1]

prob

Probability Tricky Question

This questions are taken from [2]. One key to solve this question is write down the sample space and keep eliminating choices. Don’t conclude in hurry.

Q1 :  A man comes up to you on the street and says: I have two children. At least one of them is a boy. What is the probability that the other child is also a boy?

Q2 : I have two kids, what are the odds I have 2 boys?

Q3 : A man comes up to you on the street and says: I have two children. The older one is a boy. What is the probability that the other child is also a boy?

Q4 : A man comes up to you on the street and says: I have two children. One is the boy standing here next to me. What is the probability that the other child is also a boy?

Q5 : Q. A man comes up to you on the street and says: I have two children. One of them is a boy who was born in the summer. What is the probability that the other child is also a boy? (There are four seasons : spring, summer, fall, winter)[0]

Ans1 : (1/3)

  • P(BG) is 1/2 and p(BB) = 1/4 in the universe

Ans2 : (1/4)

Ans3 : (1/2)

Ans4 : (1/2)

Ans5 : (7/15) [0]

Compare Q1 and Q5. Odd increases. Being born in summer is rare thing. If that rare thing has occurred there are higher chances of having two boys.

A bag contains (x) one rupee coins and (y) 50 paise coins. Four coins are taken from the bag and put away.
If a coin is now taken at random from the bag, what is the probability that it is a one rupee coin?

Ans is x/(x+y). It will remain same if we take either 1/2/3/4/5 coins because we don’t know which coin has been withdrawn. It is like trying out all possibilities and when we sum, it would come out as 1 only. [4]

The probability of a car passing a certain intersection in a 20 minute windows is 0.9. What is the probability of a car passing the intersection in a 5 minute window? (Assuming a constant probability throughout)

Ans : 0.4377 [5]

Independent Events

  • Mutually exclusive events means dependent event
  • For independent event = P(A/B) = P(A)
  • For mutually exclusive event if we know B has occurred, A will never occur.
independent_event

If two random variables, X and Y, are independent, they satisfy the following conditions.

  • P(x|y) = P(x), for all values of X and Y.
  • P(X, Y) = P(x y) = P(x) * P(y), for all values of X and Y.

Here is an example from [6]. Ans is that X and Y are independent, A and B are not.

prob_independent
prod_inde

Further reading : 

[0] : https://math.stackexchange.com/questions/198713/why-is-the-probability-of-having-2-boys-7-15

[1]:https://www.coursera.org/learn/probabilistic-graphical-models/lecture/slSLb/distributions

[2] : http://adit.io/posts/2017-12-05-A-Mind-Boggling-Probability-Problem.html

[3] : https://en.wikipedia.org/wiki/Boy_or_Girl_paradox

[4] : http://moorthythanu.blogspot.com/2016/03/probability-of-getting-one-rupee-coin.html

[5] : https://math.stackexchange.com/questions/1016268/probability-of-crossing-a-point-in-a-given-time-window

[6] : https://stattrek.com/random-variable/independence.aspx

Hypothesis and T-Distribution

We calculate the t-score using hypothesis data, which also provides us with the degrees of freedom. This value is then supplied to a function that gives us the probability of the hypothesis being true.

The t-test can be seen as a ratio, similar to a signal-to-noise ratio. The numerator allows us to center it around zero, while the denominator represents the standard error of the mean (SEM) calculated as s/sqrt(n), where s is the standard deviation of the samples.

The t-score indicates how many SEM the current mean is away from the mean given in the hypothesis. If it is far away, it suggests a low probability of the null-hypothesis-mean being true, leading us to reject the null hypothesis.

In engineering, we typically assume that the mean and standard deviation are given and true, and we compute the probability of observing the sample. However, in hypothesis testing with a small number of samples, we are testing whether the given mean is true or not.

To address this, we need a distribution that adjusts itself based on the number of observations, widening when there are fewer samples. The t-distribution serves this purpose, as it is dependent on the sample size.

There are different types of t-tests:

  • One-sample t-test: Compares the mean of a sample with a known population mean.
    • Discussion so far is for one sample test
  • Two-sample t-test: Compares the means of two independent groups.
    • To compare means of two independent groups
    • Scores of student who get 8 hour sleep vs four hour sleep
    • Question we want to answer is are there any significant difference in there scores?
    • In one sample test (In numerator of t-score) we are comparing sample mean with population mean
    • In two sample test it compares means of two independently drawn sample
    • And in denominator as well SEM formula is modified    
    • Example
      • A/B testing on e-commerce site where you compare CTR before and after
        • This is two sample because you don’t have standard value of CTR before the feature
        • Even you will see some difference in AA test
  • Paired t-test: Compares the means of two conditions using the same samples.
    • This is essentially a one-sample t-test on the differences between values at two conditions.
    • Same samples are used in two different conditions
    • 10 people before medication and same 10 people after medication
      • We want to check if medication has any effect
    • Different time points are used for market calculation
    • This essentially is a one sample T-test on the differences of value at two different conditions
    • Example
      • Interleaving test in e-commence search system
      • For each search page you will assign some score to control and variant
  • One-sided t-test: Tests a hypothesis in one direction (e.g., weight of dairy milk is less than 100g).
  • Two-sided t-test: Tests a hypothesis in both directions (e.g., weight of dairy milk is not equal to 100g).

P-values represent the probability of finding the observed or more extreme results when the null hypothesis is true. It is described in terms of rejecting the null hypothesis when it is actually true, but it is not a direct probability of this state.

For further examples and details, you can refer to the following link: Example Link

Q – Q Plots

update :

Q-Q plots can be used to test goodness of fit for any distribution. There are p-p plots and other method as well.

https://stats.stackexchange.com/questions/132652/how-to-determine-which-distribution-fits-my-data-best

There is also a P-P plot.

https://en.wikipedia.org/wiki/P%E2%80%93P_plot


Since few days I was coming across to Q-Q plots very often and thought to learn more about it.

Full form is Quartile-Quantile plot.

Many a times we want our data to be normal, this is because we normality is an assumption behind many statistical models. Now how to test normality. Wikipedia has an article about this which lists many method, one of them is Q-Q plots.

Here is how to create Q-Q plot manually (This steps will show the theory behind it):

  1. Sort your samples (Call it Y). Let n be no of sample. n = len(Y)
  2. Find n values(quartiles) from standard normal distribution to divide it in (n+1) equal areas
    1. Standard normal distribution is a distribution with mean = 0 and standard deviation = 1
  3. Call above X
  4. Plot Y against X
  5. For normal distribution it would be approximately straight line
    1. However considering probability, outliers and no of smaples have role to play

Here is the example code and plots in pythons:

maxresdefaultSnLhAm

Reference :

  1. You tube video 

Probability Distribution

We have learned various probability distribution during high school and  engineering courses. However at times we forget them, so here I am providing simple practical scenarios for each distribution with no theories involved.

Bernoulli Distribution

  • When the random variable has just two outcomes
  • Probability of Drug/Medicine will be approved by government is p = 0.65
    • Probability that it will not approve is 0.35
  • Below formula works when we have probability available, in real life we estimate them from data :
    • Mean = p
    • Variance (Sigma Square) = p*(1-p)
  • Parameters : p
  • Probability evaluation P(x|params) = p if x = 1, (1-p) if x = 0
  • MLE : p = n/N, where n = no of time 1 observed , N = no of experiments
  • MLE = Maximum Likelihood Estimation

Binomial Distribution

  • When you perform the Bernoulli experiment multiple times and want to see how many times certain outcome appears.
  • For example you flip a coin(fair/biased) 10 time and probability that head will appear for x (1, 2, …..10) times.
  • Another more practical example :
    • Suppose oil price can increase by 3 bucks or decreased by 1 buck each day
    • Probability of increasing p = 0.65, and that of decreasing = 0.35
    • What price can we expect after three days
    • Note (Increase, Increase, Decrease) and (Increase, Decrease, Increase) will give same price.
      • (2,1) success -> 2 success 1 failure
  • From another point of view it count no of successes in an experiment :
    • No of patient responding to treatment
    • Binary classification problem (Does not seem correct now, it should be Bernoulli, we take logit and sigmoid)
  • Below formula works when we have probability available, in real life we estimate them from data :
    • n = no of times experience is performed
    • Mean = n*p
    • Variance (Sigma Square) = n*p*(1-p)
  • Example of binomial used in modeling :
  • Parameters : n, p
  • Probability evaluation P(x|params) = nCx * p^x * (1-p)^(1-x)
  • MLE
    • n = no of samples = N
    • p = n/N where n = no of successes
    • Interestingly MLE for binomial and multinomial distribution is very simple

Continue reading “Probability Distribution”

Interpreting Statistical Values

In this post, we will explore the values in the summary(model) output in R and understand their significance.

Here is a screenshot illustrating the summary:

rsummary

Significance of Residue

  • We desire our residues to be normally distributed and centered around zero.
  • It’s similar to aiming at the bullseye on a dartboard.
    • If the residues are biased in one direction, there is room for improvement.
    • If the residues are equally biased in all directions, we can attempt to reduce the standard deviation.
    • Irreducible error should be observed in all directions simultaneously.
  • Residues quantile provides an initial insight into symmetry.
  • R also provides the standard deviation of residuals, known as RSE (residue standard error).

The Relationship between t-value and p-value in the Coefficient Section

  • The values test if a variable has a relationship with the output.
  • This is a preset statistical question (null hypothesis) that cannot be changed.
  • If the coefficient is zero, it does not contribute; otherwise, it does.
  • The t-value indicates the number of standard deviations the mean is away from zero.
  • A larger t-value signifies a more significant variable.

Calculating p-values

  • Incorrect thinking: Taking samples from a larger population.
  • Each sample yields a different coefficient, which can be zero for some samples.
  • The variance of the estimated parameter can be mathematically derived using (X^T * X)-1 with σ2.
  • σ2 can be obtained from the residue error.
  • Bayesian view helps appreciate the distribution of coefficients rather than point estimation.
    • P-values can be calculated naturally using the T-distribution, as there are no assumptions.
  • In the R result display, we have a mean and standard deviation.
    • The coefficient is a probabilistic variable centered at the mean (Estimate in R summary).
    • The mean is t standard deviations away from zero.
    • The p-value represents the probability of observing a coefficient beyond t standard deviations from the mean.

formulas

Role of R^2

  • R^2 indicates how much of the variance is explained by the model. Refer to formulas above for a better understanding.
  • R^2 has an advantage over RSE as it always falls between 0 and 1.

Determining a Good Value of R^2

  • A good value of R^2 depends on the problem setting.
  • When we make perfect predictions, RSS = 0 and hence R^2 = 1
  • In physics, if we are confident the data follows a linear model, R^2 close to 1 is desirable.
  • In marketing, a small proportion of the variance can be explained by predictors, so R^2 = 0.1 can be realistic.

Difference between Absolute and Adjusted R^2

  • R^2 always increases with the number of variables, while adjusted R^2 decreases if the added variable is not significant.
  • The formula of adjusted R^2 incorporates the number of variables, so when a non-significant variable is added, the result decreases.
  • The formulas below illustrate that RSE may increase while RSS decreases, but they are not directly related to R^2.
adR2.PNG

Significance of F Statistics

  • The F-test determines if a group of variables is jointly significant, whereas the t-test examines the significance of individual variables.
  • F-statistics also have associated p-values.
  • The null hypothesis for the F-test is that the intercept-only model and your model are equal.
  • While R-squared provides an estimate of the relationship strength between the model and response variable, it does not offer a formal hypothesis test. This test is provided by the F-test.

Why Use F Statistics when Individual Coefficient p-values are Available?

  • It may seem that if one coefficient is significant (good p-value), the overall model will also be significant.
  • However, this assumption breaks down when the number of variables with poor p-values is large.

Determining Good Values of F-statistics

  • It depends on the values of n (number of observations in the training set) and p (number of independent variables).
  • When n is large, an F-value slightly greater than 1 is sufficient to reject the null hypothesis.
  • It is advisable to base decisions on corresponding p-values, which consider both n and p.

Degrees of Freedom:

  • Suppose you have two features, x1 and x2, and a target variable y.
  • The line equation is y = a1x1 + a2x2 + a3.
  • In a 3D space, three points define a unique line.
  • With n points, p(2) features, and 1 target, three points will always lie on the line, while (n-p-1) points can deviate from it. This difference represents the degrees of freedom.
  • Degrees of freedom are the difference between n and the number of non-zero coefficients, including the intercept.

Significance Score “***” in the Coefficient Section

  • R indicates the significance of a p-value by displaying stars.
  • The calculation of this value is likely done through bootstrapping.
  • Bootstrapping allows assigning measures of accuracy to sample estimates, such as bias, variance, confidence intervals, or prediction error.
  • In Bayesian inference, parameter distributions are obtained, allowing the calculation of p-values.

References

Found the formula for adjusted R2 here