Power Analysis

When we run an A/B test, we need to know how long to keep it running. Suppose that after 14 days the test is still not conclusive. Should we keep it running in the hope of getting evidence in one more week, or conclude that the treatment effect is not significant? Power analysis helps answer this question.

With it we can build our own guidelines, e.g. 5 M visitors or 14 days.

Relation with Type 1 and Type 2 Error

  • Type 1 error is the probability of wrongly rejecting the null hypothesis
    • We can reduce it by lowering the significance level (alpha) from 5 % to, say, 1 %
  • Type 2 error is the probability of failing to reject the null hypothesis when the alternate hypothesis is true
    • The alternate hypothesis is correct but we could not accept it
    • We failed to reject the null hypothesis
  • Power = 1 – (Type 2 error)
    • Probability that we will reject the null hypothesis when we should

4 Parameters Involved in Power Analysis

  • Alpha (significance level, the threshold the p-value is compared against)
    • When we increase alpha from 5 % to 10 %, power increases
    • We reduce the probability of type 2 error
    • But the probability of type 1 error increases
  • Power
    • We defined it above
    • Typically we set it to 0.8
  • Number of samples, N
    • We want to find the number of samples for which we should keep running the test
    • After which we can conclude it.
  • Effect Size
    • One feature brings a 1 % lift in conversion
    • Another feature brings a 5 % lift in conversion
    • The 2nd test will converge faster: for the same number of samples it has higher power

There is a formula connecting the above 4 parameters. Once we fix 3 of them, we can find the 4th one.
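As a rough illustration of how three parameters fix the fourth, here is a minimal sketch for one special case: comparing two conversion rates with a two-sided z-test under the normal approximation. The function name `sample_size_per_group`, the 10 % baseline conversion rate and the reading of "lift" as an absolute difference are all assumptions made for this example, not something the post specifies.

```python
import math
from statistics import NormalDist


def sample_size_per_group(p_control, p_treatment, alpha=0.05, power=0.8):
    """Visitors needed in each arm of a two-sided, two-proportion z-test."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # critical value for the chosen alpha
    z_power = z.inv_cdf(power)           # quantile matching the desired power
    # Sum of the Bernoulli variances of the two arms.
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    effect = p_treatment - p_control     # absolute lift in conversion rate
    return math.ceil((z_alpha + z_power) ** 2 * variance / effect ** 2)


# Assumed baseline of 10 % conversion (hypothetical).
n_small_lift = sample_size_per_group(0.10, 0.11)   # one-point lift
n_large_lift = sample_size_per_group(0.10, 0.15)   # five-point lift
print(n_small_lift, n_large_lift)
```

The larger lift needs far fewer visitors per arm, which is the sense in which a test with a bigger effect "converges faster".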

How can I know the effect size before the test?

  • In most cases you are expected to have pilot data.
  • If not, another source would be similar experiments in the literature.
  • If none of the above is available, it is wise to do a pilot study.

Types of Power Analysis

There are four types based on which parameter we are trying to find out.

  • A priori analysis
    • compute N given other three
    • Most common analysis
  • Post-hoc power analysis
    • Find power given other three
    • It is wise to do this once the test is completed
    • While calculating N before the test you don’t have an exact estimate of the effect size; after the test you have one
    • This helps you verify whether you had sufficient samples to detect the effect you found
  • Criterion analysis
    • Compute alpha given other three
    • Rarely used
  • Sensitivity Analysis
    • Find effect size given other three
    • It is important when sample size is bounded
      • In online A/B tests this is generally not the case
      • For clinical trials it would be
    • For a fixed sample size you can estimate the smallest effect you would be able to detect
      • If that is larger than any effect you can realistically expect, don’t conduct the experiment
    • This is known as the minimum detectable effect (MDE).
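To make the post-hoc and sensitivity variants concrete, the sketch below computes approximate power for a given sample size and then searches for the MDE by bisection. It again assumes a two-sided two-proportion z-test with the normal approximation. The function names and the example inputs (10 % baseline, 14,749 visitors per arm) are hypothetical.

```python
import math
from statistics import NormalDist


def power_two_proportions(p_control, p_treatment, n_per_group, alpha=0.05):
    """Approximate power of a two-sided two-proportion z-test."""
    z = NormalDist()
    se = math.sqrt((p_control * (1 - p_control)
                    + p_treatment * (1 - p_treatment)) / n_per_group)
    z_alpha = z.inv_cdf(1 - alpha / 2)
    shift = abs(p_treatment - p_control) / se
    # Probability that the statistic lands beyond the critical value
    # (the negligible opposite tail is ignored).
    return z.cdf(shift - z_alpha)


def minimum_detectable_effect(p_control, n_per_group, alpha=0.05, power=0.8):
    """Smallest absolute lift that reaches the target power, by bisection."""
    lo, hi = 0.0, 1.0 - p_control
    for _ in range(60):
        mid = (lo + hi) / 2
        reached = power_two_proportions(p_control, p_control + mid,
                                        n_per_group, alpha)
        if reached < power:
            lo = mid   # lift too small to detect reliably
        else:
            hi = mid   # lift is detectable, try a smaller one
    return hi


achieved_power = power_two_proportions(0.10, 0.11, 14_749)
mde = minimum_detectable_effect(0.10, 14_749)
print(achieved_power, mde)
```

If the returned MDE is well above any lift the feature could plausibly bring, the experiment is unlikely to be worth running at that sample size.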

Calculation of effect size and power analysis formula

  • The general treatment is beyond the scope of this post for now. It will be added in the future.

Central limit theorem

What does the CLT say?

  • The sum of a large number of independent random samples is approximately normally distributed
    • These samples need not come from a normal distribution
  • The sum being normally distributed implies that the mean is also normally distributed
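A small simulation can make this tangible. The sketch below draws repeated samples from a strongly skewed exponential distribution and checks that the sample means behave like a normal variable. The choice of distribution, the sample size of 50 and the seed are arbitrary assumptions for illustration.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

n = 50          # size of each sample
repeats = 5_000 # number of samples drawn

# Exponential(1) is strongly skewed: mean 1, standard deviation 1.
sample_means = [
    statistics.fmean(random.expovariate(1.0) for _ in range(n))
    for _ in range(repeats)
]

center = statistics.fmean(sample_means)
spread = statistics.stdev(sample_means)
standard_error = 1.0 / n ** 0.5   # sigma / sqrt(n)

# Share of sample means within two standard errors of the true mean;
# a normal distribution would put roughly 95 % of them there.
within_two_se = sum(
    abs(m - 1.0) <= 2 * standard_error for m in sample_means
) / repeats
print(center, spread, standard_error, within_two_se)
```

The spread of the sample means should land close to sigma / sqrt(n), even though the individual draws are far from normal.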

Straight facts

  • The central limit theorem helps in getting confidence intervals for parameters
  • As a rule of thumb it works for most distributions (those with finite variance) when n > 30
  • For a normal distribution it works even if n < 30
  • Why do we need to assume a distribution?
    • To make variance estimation stable
    • We want to have just one unknown, the mean
    • We need to test the normality of the samples before applying a t-test

Slide from MIT course : https://ocw.mit.edu/courses/mathematics/18-650-statistics-for-applications-fall-2016/

sigma / sqrt(n) is the standard error of the mean. The CLT says that the standardized sample mean, sqrt(n) * (sample mean - mu) / sigma, converges in distribution to the standard normal distribution N(0,1).

Law of Large Numbers

  • As the sample size grows, the sample mean gets closer to the average of the whole population. This is because a larger sample is more representative of the population.
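As a quick check of this statement, the sketch below rolls a simulated fair die and prints the sample mean for growing sample sizes. The die, the seed and the chosen sizes are assumptions for illustration only.

```python
import random
import statistics

random.seed(1)  # fixed seed so the run is repeatable

population_mean = 3.5  # expected value of a fair six-sided die
rolls = [random.randint(1, 6) for _ in range(100_000)]

# Mean of the first `size` rolls and its distance from the population mean.
for size in (10, 100, 1_000, 10_000, 100_000):
    sample_mean = statistics.fmean(rolls[:size])
    print(size, round(sample_mean, 3),
          round(abs(sample_mean - population_mean), 3))

final_gap = abs(statistics.fmean(rolls) - population_mean)
```

The gap does not have to shrink at every single step, but it tends towards zero as the sample grows.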

Example:

  • During significance testing we calculate the left-hand side of the CLT statement, sqrt(n) * (sample mean - mu) / sigma. For example, when testing the fairness of a coin, that number comes out to be 3.54. For the standard normal, 3*sigma = 3*1 = 3 covers about 99.7 % of the area. We are further away than that, so we can reject the null hypothesis. [1]
  • The thing to understand is that the estimate of the Bernoulli parameter p (the sample mean) is approximately normally distributed.
  • We are not measuring how far the observed mean is from 0.5 in the Bernoulli distribution itself. If we were, we would not have multiplied by sqrt(n).
    • More importantly, a Bernoulli variable can take only two values, 0 and 1, so from that perspective as well it does not make sense.
    • See the equation in the slide in the Central limit theorem section above. The standardized statistic follows the normal distribution N(0,1).
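The calculation can be sketched as follows. The counts (125 heads in 200 tosses) are hypothetical and were picked only because they give a statistic close to the 3.54 quoted above. The post does not state the actual data.

```python
import math
from statistics import NormalDist

# Hypothetical data, not from the post: 125 heads in 200 tosses.
n, heads = 200, 125
p_hat = heads / n   # observed share of heads
p_null = 0.5        # fair coin under H0

# sqrt(n) * (sample mean - mu) / sigma, where sigma = sqrt(p * (1 - p))
# for a Bernoulli variable under the null hypothesis.
z_stat = math.sqrt(n) * (p_hat - p_null) / math.sqrt(p_null * (1 - p_null))

# Two-sided p-value from the standard normal distribution.
p_value = 2 * (1 - NormalDist().cdf(abs(z_stat)))
print(round(z_stat, 2), p_value)
```

Note that the comparison is made on the N(0,1) scale of the standardized statistic, not on the 0/1 scale of the individual tosses.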

References

[0] : Slide from MIT course : https://ocw.mit.edu/courses/mathematics/18-650-statistics-for-applications-fall-2016/

[1] : https://ocw.mit.edu/courses/18-650-statistics-for-applications-fall-2016/resources/mit18_650f16_parametric_ht/

Chi Square Test

The chi-square independence test is a procedure for testing whether two categorical variables are related in some population.

Here is a handwritten example: https://github.com/arcarchit/datastories/blob/master/notes/chi2.pdf

Chi square distribution

  • Chi square distribution
    • It is obtained by squaring samples from a standard normal distribution and summing them [0]
    • The shape of the distribution changes with the degrees of freedom (the number of squared terms summed)
    • When DoF = 1 it is most concentrated around 0
  • It is the distribution of a sum of squares
    • The squares are of the differences between observed and expected values
    • When a die is biased, the sum of squares will be higher and hence more significant
    • When it is fair, the sum will be close to zero
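The construction can be simulated directly. The sketch below builds chi-square draws by summing squared standard normal samples and shows how the distribution shifts away from zero as the degrees of freedom grow. The helper name `chi_square_draw`, the seed and the chosen degrees of freedom are assumptions for illustration.

```python
import random
import statistics

random.seed(2)  # fixed seed so the run is repeatable


def chi_square_draw(dof):
    """One draw: the sum of `dof` squared standard normal samples."""
    return sum(random.gauss(0.0, 1.0) ** 2 for _ in range(dof))


draws_by_dof = {
    dof: [chi_square_draw(dof) for _ in range(20_000)]
    for dof in (1, 3, 6)
}

# For each DoF: the average draw and the share of draws below 1.
for dof, draws in draws_by_dof.items():
    share_below_one = sum(d < 1.0 for d in draws) / len(draws)
    print(dof, round(statistics.fmean(draws), 2), round(share_below_one, 2))
```

The average draw sits near the degrees of freedom, and with DoF = 1 most of the mass lies below 1.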

Chi Square Test for Equality of Proportions


Chi square vs T test

  • When to use which one
  • The t-test is used to compare the means of two groups
  • Chi square is used to check whether the observed counts of categorical data match the counts expected under an assumption

Chi Square for goodness of fit testing

  • Chi Square Goodness of fit
    • Restaurant example
    • H0 = The percentages given by the customer are correct
  • We calculate the expected count for each cell and compute chi^2 as the sum over cells of (observed - expected)^2 / expected
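A minimal sketch of the goodness-of-fit calculation follows. The claimed shares and observed counts are illustrative numbers assumed for this example, not data from the post. The critical value is hard-coded from a standard chi-square table to keep the sketch free of third-party dependencies.

```python
# Illustrative numbers (assumed, not from the post): the share of weekly
# customers claimed for six opening days, and the counts actually observed.
claimed_share = [0.10, 0.10, 0.15, 0.20, 0.30, 0.15]
observed = [30, 14, 34, 45, 57, 20]

total = sum(observed)
expected = [share * total for share in claimed_share]

# Sum over cells of (observed - expected)^2 / expected.
chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
dof = len(observed) - 1

# 11.07 is the standard table value for the 95th percentile of the
# chi-square distribution with 5 degrees of freedom.
critical_value = 11.07
reject_h0 = chi_square > critical_value
print(round(chi_square, 2), dof, reject_h0)
```

With a statistics library available, the hard-coded table value could be replaced by a p-value computed from the chi-square distribution.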

Chi Square for relationship testing

  • H0 : Variables are independent of each other
  • It helps in testing whether two categorical variables are related
  • Calculate the chi-square statistic by summing (observed - expected)^2 / expected over all cells, and check it against the chi-square distribution with the appropriate degrees of freedom
  • Examples
    • Hypothesis testing :
      • H0 = Herb 1, Herb 2 and placebo are the same
      • H0 = Herbs do nothing
      • We can’t say that a herb does nothing
        • We are working on aggregated counts here
        • Whereas ANOVA is about variance
    • Homogeneity testing  :
      • H0 = Left and Right handed people have same preference for arts, science
      • H0 = Preference of arts/science is independent of natural hand left/right
      • H0 = Variables are independent
      • Filling up table
        • P(STEM | right) = P(STEM)
        • x / 60 = 40/100 => x = 40 * 60 / 100 = 24
        • We can also say that the expected value of a cell is the product of its marginals divided by the total
      • Degrees of freedom = (r-1)*(c-1)  = 2 * 1 = 2
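The table-filling step and the resulting statistic can be sketched as below. The observed counts are hypothetical. They were chosen only so that the marginals match the worked example (60 right-handed people, 40 preferring STEM, 100 people in total). The third preference category ("equal") is an assumption that makes the table 3 x 2, in line with the 2 degrees of freedom.

```python
# Hypothetical counts. Rows: STEM, humanities, equal preference.
# Columns: right-handed, left-handed.
observed = [
    [30, 10],
    [20, 20],
    [10, 10],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = sum(row_totals)

# Under independence each expected count is
# (row marginal * column marginal) / total.
expected = [[r * c / total for c in col_totals] for r in row_totals]

chi_square = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
dof = (len(observed) - 1) * (len(observed[0]) - 1)

# 5.991 is the standard table value for the 95th percentile of the
# chi-square distribution with 2 degrees of freedom.
reject_h0 = chi_square > 5.991
print(expected[0][0], round(chi_square, 2), dof, reject_h0)
```

The first expected cell reproduces the 40 * 60 / 100 = 24 from the worked example.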

References

[0] : https://www.khanacademy.org/math/statistics-probability/inference-categorical-data-chi-square-tests#chi-square-goodness-of-fit-tests

[1] : https://www.khanacademy.org/math/statistics-probability/inference-categorical-data-chi-square-tests#chi-square-goodness-of-fit-tests

[2] : https://biology.stackexchange.com/questions/13486/deciding-between-chi-square-and-t-test

[3] : https://fhssrsc.byu.edu/SitePages/ANOVA,%20t-tests,%20Regression,%20and%20Chi%20Square.aspx

Contrast Analysis

  • In Hypothesis and T-Distribution we discussed hypothesis testing, covering the one-sample and two-sample t-tests.
  • Contrast analysis is a more general case of that.
  • It allows us to compare combinations of groups.
  • Why can’t we just combine them to form two groups?
    • We want to preserve the individual identity of each group
      • A group with a large number of samples should not dominate a group with a small number of samples
  • Examples
    • Groups for which the context at test matches the context during learning (i.e., is the same or is simulated by imaging or photography) will perform better than groups with a different or placebo context. [2]
    • Group 1 is different from Group 2-3-4
      • 3 types of smile vs 1 neutral [3]
    • Group 1 and 4 are different from group 2,3

 


  • Intuition behind the formula for the standard error
    • The sum of squares of the combined group is the sum of the individual sums of squares when the groups are independent
  • α should be equal to 1 minus the desired confidence level (e.g. α = 0.10 for 90 %, α = 0.05 for 95 %). It is divided by two because the interval is two-sided
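A contrast test can be sketched as follows, using the usual textbook form: the contrast is a weighted sum of group means, and its standard error pools the within-group variance. This is an assumption about the formulas the post refers to, since they are not reproduced in the text. The scores and weights are hypothetical.

```python
import math
import statistics

# Hypothetical scores for four groups (not from the post).
groups = [
    [4.0, 5.0, 6.0],
    [1.0, 2.0, 3.0],
    [2.0, 3.0, 4.0],
    [3.0, 4.0, 5.0],
]
weights = [3, -1, -1, -1]   # group 1 versus groups 2-3-4
assert sum(weights) == 0    # contrast weights must sum to zero

means = [statistics.fmean(g) for g in groups]
contrast = sum(w * m for w, m in zip(weights, means))

# Pooled within-group variance (mean squared error).
sse = sum(sum((x - m) ** 2 for x in g) for g, m in zip(groups, means))
df_error = sum(len(g) for g in groups) - len(groups)
mse = sse / df_error

# Variances of independent group means add, scaled by squared weights.
standard_error = math.sqrt(
    mse * sum(w ** 2 / len(g) for w, g in zip(weights, groups))
)
t_stat = contrast / standard_error
print(contrast, standard_error, t_stat, df_error)
```

Because each group enters through its own mean and its own sample size, a large group cannot swamp a small one the way it would if the groups were simply merged. The t statistic is compared against a t-distribution with `df_error` degrees of freedom.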

 

 

References

[1] : http://www.youtube.com/watch?v=yq_yTWK4mNs

[2] : https://pdfs.semanticscholar.org/c0ba/1c28b0e120a459820bfb20d430fa442ebd96.pdf

[3] : http://www.onlinestatbook.com/case_studies_rvls/smiles/index.html