Power Analysis

When we run an A/B test, we need to know how long to keep it running. Suppose that after 14 days the test is still not conclusive. Should we keep it running in the hope of getting evidence in one more week, or conclude that the treatment effect is not significant? Power analysis helps answer this question.

We can also build our own guidelines, for example 5 M visitors or 14 days.

Relation with Type 1 and Type 2 Error

  • Type 1 error is the probability of wrongly rejecting the null hypothesis
    • We can reduce it by lowering the significance level (alpha) from 5 % to, say, 1 %
  • Type 2 error is the probability of failing to reject the null hypothesis when the alternate hypothesis is true
    • The alternate hypothesis is correct but we could not detect it
    • We failed to reject the null hypothesis
  • Power = 1 – (Type 2 error)
    • Probability that we reject the null hypothesis when we should

4 Parameters Involved in Power Analysis

  • Alpha (significance level, the p-value threshold)
    • When we increase alpha from 5 % to 10 %, power increases
    • We reduce the probability of type 2 error
    • But the probability of type 1 error increases
  • Power
    • We defined it above
    • Typically we set it to 0.8
  • Number of samples, N
    • We want to find the number of samples for which we should keep the test running
    • After that we can conclude it
  • Effect Size
    • One feature brings a 1 % lift in conversion
    • Another feature brings a 5 % lift in conversion
    • For the same number of samples, the second test has higher power, so it reaches a conclusion faster

There is a formula connecting the above four parameters. Once we fix three of them, we can solve for the fourth.
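
To make this concrete, here is a minimal sketch of one common version of that relationship: the normal approximation for comparing two conversion rates with a two-sided test and equal group sizes. The function name and the example rates are my own assumptions for illustration, and I treat the lift as an absolute difference in conversion rate.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(p_base, p_treat, alpha=0.05, power=0.8):
    """Approximate N per group for a two-sided two-proportion z-test.

    Hypothetical helper: uses the normal approximation with unpooled variance.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the significance level
    z_beta = z.inv_cdf(power)           # quantile matching the desired power
    variance = p_base * (1 - p_base) + p_treat * (1 - p_treat)
    effect = p_treat - p_base           # absolute effect size
    return ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)


# A small lift needs far more visitors than a large lift.
print(sample_size_per_group(0.10, 0.11))
print(sample_size_per_group(0.10, 0.15))
```

Fixing alpha, power and effect size gives N; the same equation can be rearranged to solve for any of the other three.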

How can I know the effect size before the test?

  • In most cases you are expected to have pilot data.
  • If not, another source is similar experiments in the literature.
  • If neither is available, it is wise to run a pilot study.

Types of Power Analysis

There are four types, based on which parameter we are trying to find.

  • A priori analysis
    • Compute N given the other three
    • Most common analysis
  • Post-hoc power analysis
    • Find power given the other three
    • It is wise to do once the test is completed
    • While calculating N before the test you don’t have an exact estimate of the effect size; after the test you have an observed one
    • This helps you verify whether you had enough samples to detect the effect you found
  • Criterion analysis
    • Compute alpha given the other three
    • Rarely used
  • Sensitivity analysis
    • Find effect size given the other three
    • It is important when the sample size is bounded
      • In online A/B tests this is generally not the case
      • For clinical trials it would be
    • For a fixed number of samples you can estimate the smallest effect you could reliably detect
      • If that is larger than any effect you can realistically expect, don’t conduct the experiment
    • This is known as the minimum detectable effect (MDE).
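
As a rough illustration of sensitivity analysis, the sketch below estimates the MDE for a conversion-rate test when the number of samples per group is fixed. It uses a normal approximation and assumes both groups have roughly the baseline variance; the function name and numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def minimum_detectable_effect(p_base, n_per_group, alpha=0.05, power=0.8):
    """Approximate absolute MDE for a two-sided two-proportion test.

    Hypothetical helper: assumes equal group sizes and baseline variance in both groups.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_beta = z.inv_cdf(power)
    # Standard error of the difference between the two observed rates.
    standard_error = sqrt(2 * p_base * (1 - p_base) / n_per_group)
    return (z_alpha + z_beta) * standard_error


# With more samples, smaller effects become detectable.
for n in (1_000, 10_000, 100_000):
    print(n, round(minimum_detectable_effect(0.10, n), 4))
```

If the printed MDE is bigger than any lift you can plausibly expect, the experiment is unlikely to be worth running at that sample size.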

Calculation of effect size and power analysis formula

  • It is beyond the scope of this post right now. I will add it in the future.

Central limit theorem

What does the CLT say?

  • The sum of random samples approaches a normal distribution as the sample size grows
    • These samples need not come from a normal distribution
  • The sum being normally distributed implies that the mean is also normally distributed

Straight facts

  • The central limit theorem helps in getting confidence intervals for parameters
  • As a rule of thumb it works for most distributions with finite variance when n > 30
  • For normally distributed data it works even if n < 30
  • Why do we need to assume a distribution
    • To make variance estimation stable
    • We want to have just one unknown, the mean
    • For small samples we need to check normality of the data before applying a t-test

Slide from MIT course : https://ocw.mit.edu/courses/mathematics/18-650-statistics-for-applications-fall-2016/

sigma / sqrt(n) is the standard error of the mean. We are saying that sqrt(n) * (sample mean - mu) / sigma converges in distribution to the standard normal distribution N(0, 1).
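
A quick way to convince yourself is a small simulation. The sketch below is my own illustration, not taken from the course: it draws samples from an exponential distribution (clearly not normal, with mu = 1 and sigma = 1), standardizes each sample mean as described above, and checks that the results look like N(0, 1).

```python
import random
from math import sqrt
from statistics import fmean, pstdev


def standardized_sample_means(n, repetitions, seed=0):
    """Return sqrt(n) * (sample mean - mu) / sigma for many exponential samples."""
    rng = random.Random(seed)
    mu, sigma = 1.0, 1.0  # mean and standard deviation of Exponential(rate=1)
    results = []
    for _ in range(repetitions):
        sample_mean = fmean(rng.expovariate(1.0) for _ in range(n))
        results.append(sqrt(n) * (sample_mean - mu) / sigma)
    return results


z = standardized_sample_means(n=50, repetitions=2000)
# Both numbers should be close to 0 and 1 respectively.
print(round(fmean(z), 2), round(pstdev(z), 2))
# Share of values within 2 standard deviations, roughly 95 % for a normal.
print(sum(abs(v) < 2 for v in z) / len(z))
```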

Law of Large Numbers

  • As the sample size grows, the sample mean gets closer to the average of the whole population. This is because a larger sample is more representative of the population

Example:

  • During significance testing we calculate the left hand side. For example, when testing the fairness of a coin, that number comes out to be 3.54. For a standard normal, 3*sigma = 3*1 = 3 covers about 99.7 % of the area. We are further away than that, so we can reject the null hypothesis. [1]
  • The thing to understand is that the distribution of the estimate of the Bernoulli parameter (p), i.e. the sample mean, is approximately normal.
  • We are not saying how far the observed mean is from 0.5 in the Bernoulli distribution. If we were doing that we would not have used sqrt(n).
    • Also, more importantly, a Bernoulli variable can take only two values, 0 and 1. From that perspective as well it does not make sense.
    • See the equation in the slide in the central limit theorem section. It is a normal distribution N(0,1).
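
Here is a small sketch of that calculation. The counts (556 heads in 1000 tosses) are hypothetical numbers I picked so that the statistic lands near 3.54; they are not the numbers from the course.

```python
from math import sqrt
from statistics import NormalDist


def coin_z_statistic(heads, tosses, p0=0.5):
    """sqrt(n) * (p_hat - p0) / sqrt(p0 * (1 - p0)), approximately N(0, 1) under H0."""
    p_hat = heads / tosses
    return sqrt(tosses) * (p_hat - p0) / sqrt(p0 * (1 - p0))


z = coin_z_statistic(heads=556, tosses=1000)       # hypothetical data
p_value = 2 * (1 - NormalDist().cdf(abs(z)))       # two-sided p-value
print(round(z, 2), p_value < 0.01)                 # 3.54 True
```

Note that sqrt(n) appears because we measure the distance in units of the standard error of the sample mean, not in units of the Bernoulli standard deviation.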

References

[0] : Slide from MIT course : https://ocw.mit.edu/courses/mathematics/18-650-statistics-for-applications-fall-2016/

[1] : https://ocw.mit.edu/courses/18-650-statistics-for-applications-fall-2016/resources/mit18_650f16_parametric_ht/

Types of Statistical Studies

We study to learn something new. The word “study” in statistics implies conducting an experiment and analyzing data to learn something new, investigate something, or draw confident conclusions.

Studies are prevalent in medical fields, where people study various types of drugs on different demographics, geographies, and health conditions. This blog essentially contains my notes from the Coursera course: “Clinical Research” (https://www.coursera.org/learn/clinical-research/home/welcome).

Types of Studies:

  1. Observational Studies:
    • Case Series Study: Observes and describes subjects without requiring a research hypothesis. Numbers derived from such studies help remove inherent biases. These studies often serve as initial steps for complex studies.
    • Case Control Studies: Compare two or more groups based on the presence or absence of a disease. These studies look at historical data to identify variables that differ between the groups. Confounding, such as smoking in a study on alcohol and heart attacks, needs to be controlled for.
      • Often when you present an analysis, experienced seniors ask what the value of a given feature was in both groups.
    • Cross-Sectional Studies: Conducted in the form of surveys, gathering data at a specific time. For example, a survey sent to optometrists and ophthalmologists to understand their dietary advice to patients. These studies aim to examine current practices and identify areas for improvement. (Reference: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3695797/)
      • Drawbacks
        • Response bias
          • Suppose you are asking questions related to HIV. Positive patients are less likely to answer than negative ones.
          • In a government survey, people in cities are more likely to answer than people in villages
        • Almost impossible to infer causality
          • Since this takes place at a particular point in time, we cannot determine whether the disease outcome followed the exposure or the exposure followed the disease.
    • Cohort Studies: Identify a group of subjects (cohort) and follow them either backward in history (retrospective cohort) or forward in the future (prospective cohort). Computerized data collection has made retrospective cohorts possible. These studies are observational and do not involve controlling variables.
  2. Experimental Studies (Interventional):
    • In these studies, interventions are implemented to reduce bias inherent in observational studies. There is a control group that receives no intervention (sham/placebo).
    • Some Key Terms:
      • Randomization: Every member of the population should have an equal opportunity to be part of the study, and participants should have an equal chance of being assigned to any group.
      • Blinding: Participants are unaware of their assigned groups. If researchers are also unaware of the groups, it is called a double-blind study. Achieving double-blindness is challenging in surgical operations.

Reservoir sampling

Where it is used

  • Suppose you have streaming data and you want to randomly sample from it. You don’t know how many items will be coming in.
  • You have a large set of items to sample from and you want to do it in a single pass

How it is done

  • Suppose you want to sample 1000 items.[1]
  • You take the first 1000 items and put them into the reservoir
  • Next you take the 1001st item with probability 1000/1001 (in general, the i-th item with probability 1000/i)
    • You draw a uniform random number in [0, 1) and if it is less than 1000/1001, you add this item to the reservoir
    • Remember the CDF trick
  • When you add this item, you remove one existing item from the reservoir, chosen uniformly at random
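
The steps above can be sketched in a few lines of Python. This is a minimal version for illustration; the function and parameter names are my own.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Return k items chosen uniformly at random from a stream of unknown length."""
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream, start=1):
        if i <= k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        elif rng.random() < k / i:
            # Keep the i-th item with probability k/i and
            # evict a uniformly chosen existing item.
            reservoir[rng.randrange(k)] = item
    return reservoir


print(reservoir_sample(range(1_000_000), k=5, rng=random.Random(7)))
```

The stream is consumed once and only k items are held in memory at any time.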

Alternative

  • For each item in the stream, generate a random number as its key and put it in a priority queue that keeps only the items with the top keys
  • This is how order by rand() in SQL works [1]
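
A minimal sketch of this alternative, using a min-heap as the priority queue. The names are my own, and this only illustrates the idea; it is not a description of how any particular database implements it.

```python
import heapq
import random


def sample_with_random_keys(stream, k, rng=None):
    """Keep the k items whose random keys are largest."""
    rng = rng or random.Random()
    heap = []  # min-heap of (key, position, item); the smallest key sits at heap[0]
    for position, item in enumerate(stream):
        entry = (rng.random(), position, item)  # position breaks ties between keys
        if len(heap) < k:
            heapq.heappush(heap, entry)
        elif entry[0] > heap[0][0]:
            # New key beats the smallest kept key, so swap it in.
            heapq.heapreplace(heap, entry)
    return [item for _, _, item in heap]


print(sample_with_random_keys("abcdefghijklmnopqrstuvwxyz", k=3, rng=random.Random(1)))
```

Each update costs O(log k), compared with O(1) for the reservoir version, but the random keys make it easy to merge samples taken from separate streams.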

References

[1] https://gregable.com/2007/10/reservoir-sampling.html

[2] https://www.youtube.com/watch?v=Ybra0uGEkpM (proof)

ANOVA Introduction

ANOVA (Analysis of Variance) is a statistical technique used to determine if there are significant differences between the means of two or more groups. It is an extension of the t-test, which is used for comparing means between two groups.

The key idea behind ANOVA is to assess whether the data in different groups come from the same distribution or not. It compares the variance between the group means with the variance within each group: when the between-group variance is large relative to the within-group variance, the groups are unlikely to share a common mean.

The null hypothesis in ANOVA states that the means of all the groups are equal. If the groups are well separated and one group appears to be significantly different from the others, the null hypothesis is rejected.

There are different types of ANOVA designs, including one-way ANOVA and two-way ANOVA.

One-way ANOVA involves comparing the means of a single dependent variable across three or more independent groups. For example, you can use one-way ANOVA to analyze the stress levels of employees before the layoff announcement, after the announcement, and during the layoff period, provided different employees are sampled at each stage so that the groups are independent.

Two-way ANOVA, on the other hand, examines the effects of two independent variables on a dependent variable, including their interaction. For instance, you can use two-way ANOVA to explore the stress levels of men and women before the layoff announcement, after the announcement, and during the layoff period. In two-way ANOVA, there are multiple null hypotheses to test, including whether the stress levels are the same for men and women, whether the stress levels are the same across different time points, and whether there is an interaction effect between gender and time.

The F distribution is utilized in ANOVA, similar to how the t distribution is used in t-tests. The F distribution has different shapes depending on the degrees of freedom for the numerator and denominator. As the degrees of freedom increase, the F distribution becomes more concentrated. It ranges from 0 to infinity, denoted as [0, infinity).

I have coded a notebook for calculating one-way ANOVA manually in Python [1].
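
For readers who just want the gist, here is a short standalone sketch of the manual F statistic calculation. It is not the code from the notebook, and the data are made up for illustration.

```python
from statistics import fmean


def one_way_anova_f(groups):
    """F = (between-group mean square) / (within-group mean square)."""
    all_values = [value for group in groups for value in group]
    grand_mean = fmean(all_values)
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (fmean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean.
    ss_within = sum((value - fmean(g)) ** 2 for g in groups for value in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]  # made-up data
print(one_way_anova_f(groups))  # about 13: between-group variance dominates
```

The resulting F value is then compared against the F distribution with (df_between, df_within) degrees of freedom to get a p-value.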

ANCOVA, MANOVA, MANCOVA

ANCOVA (Analysis of Covariance), MANOVA (Multivariate Analysis of Variance), and MANCOVA (Multivariate Analysis of Covariance) are related techniques.

ANCOVA (Analysis of Covariance) is an extension of ANOVA that incorporates a continuous independent variable, referred to as a covariate. The purpose of ANCOVA is to examine whether the relationship between the dependent variable and the independent variable(s) remains significant after controlling for the effect of the covariate. By including the covariate in the analysis, ANCOVA allows for a more accurate assessment of the impact of the independent variables on the dependent variable.

MANOVA (Multivariate Analysis of Variance) is used when there are two or more dependent variables. It examines whether there are significant differences between groups across the multiple dependent variables. MANOVA allows for the analysis of complex relationships among variables and provides a comprehensive assessment of group differences. It is particularly useful when the dependent variables are related or correlated.

MANCOVA (Multivariate Analysis of Covariance) is an extension of MANOVA that incorporates continuous independent variables (covariates) along with the categorical independent variables. It examines the effects of the categorical independent variables on multiple dependent variables while controlling for the influence of the covariates. It helps to determine whether the relationships between the independent variables and the dependent variables remain significant after accounting for the effects of covariates.

In summary:

  • ANOVA is used to compare means between two or more groups (typically three or more, since two groups reduce to a t-test).
  • ANCOVA extends ANOVA by incorporating a continuous covariate to control for its effect.
  • MANOVA analyzes differences between groups across multiple dependent variables.
  • MANCOVA extends MANOVA by adding continuous covariates alongside the categorical independent variables.

These techniques are widely used in research and can provide valuable insights into group differences and relationships among variables.

References:

[0] : https://www.technologynetworks.com/informatics/articles/one-way-vs-two-way-anova-definition-differences-assumptions-and-hypotheses-306553

[1] : https://github.com/arcarchit/datastories/blob/master/ANOVA.ipynb

[2] : http://www.statsmakemecry.com/smmctheblog/stats-soup-anova-ancova-manova-mancova

Interpretation of Multiple Regression Coefficients

  • In case of simple linear regression with one variable, we interpret the slope coefficient as follows
    • y = b0*x0 + c
    • b0 is the increase in y for a unit increase in x0
  • In case of multiple regression:
    • y = b0*x0 + b1*x1 + c
    • b0 is the increase in y for a unit increase in x0, keeping x1 constant

 

Implication of keeping other variable constant:

  • Consider
    • house_price = b0 * no_of_bedrooms + c       ….. (1)
    • house_price = b0 * no_of_bedrooms + b1*square_feet + c     …..(2)
  • In (1) there is a high chance that b0 will be positive
  • In (2) it can be negative
    • If you increase the number of bedrooms keeping square feet constant, each room will be smaller

 

 

Reference :

Coursera course on regression by the University of Washington : https://www.coursera.org/specializations/machine-learning

 

Grubbs’ Test for Anomaly Detection

Problem Statement:

We receive a time series of count data every day and we want to detect whenever there is a drastic change in this count.

 

Grubbs’ test assumes the input is approximately normally distributed (its critical value is derived from the t-distribution) and finds the outlier, if any, at the required significance level. We remove this outlier and repeat the test. Here is the pseudo code:

 

Grubbs Test(X, p-val=0.05):
    Repeat :
        Z <- zscore(X)
        n <- len(X)
        zi, index <- max(abs(Z)), index(max(abs(Z)))
        if zi > threshold(n, p-val):
            remove X[index]
        else:
            break

 

Traditionally Grubbs’ test has an alternate hypothesis that exactly one outlier is present in the data. Above we modified it to get all possible outliers.
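
Below is a minimal runnable sketch of the repeated test in Python. It is not the original gist. The two-sided critical value follows the formula in the NIST handbook referenced below and relies on SciPy for the t-distribution quantile (an assumption of this sketch); the function names and sample data are my own.

```python
from math import sqrt
from statistics import fmean, stdev


def grubbs_critical_value(n, alpha):
    """Two-sided Grubbs critical value for n points at significance level alpha."""
    from scipy.stats import t  # imported lazily; only this helper needs SciPy
    t_crit = t.ppf(1 - alpha / (2 * n), n - 2)
    return (n - 1) / sqrt(n) * sqrt(t_crit ** 2 / (n - 2 + t_crit ** 2))


def grubbs_outliers(values, alpha=0.05, critical_value=grubbs_critical_value):
    """Repeatedly remove the most extreme point while it fails Grubbs' test."""
    data = list(values)
    outliers = []
    while len(data) >= 3:  # the test needs at least 3 points
        mean, sd = fmean(data), stdev(data)
        if sd == 0:
            break
        # Index of the point furthest from the mean.
        index = max(range(len(data)), key=lambda i: abs(data[i] - mean))
        g = abs(data[index] - mean) / sd
        if g > critical_value(len(data), alpha):
            outliers.append(data.pop(index))
        else:
            break
    return outliers, data
```

A call such as grubbs_outliers([10, 11, 9, 10, 12, 10, 11, 100]) returns the removed points and the remaining data. The critical_value parameter is there so that a different threshold rule can be plugged in.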

 

| Test | H0 | Ha |
|---|---|---|
| Grubbs’ test | There are no outliers in the data set | There is exactly one outlier in the data |
| Tietjen-Moore test | There are no outliers in the data set | There are exactly k outliers |
| Generalized ESD test | There are no outliers in the data set | There are up to r outliers |

 

Above can be extended to two sided tests as well.

 

We followed this in time series based anomaly detection, and the following approaches were considered for preprocessing before applying Grubbs’ test:

  • Raw Count (No processing)
  • Residuals after STL decomposition
  • Residuals after fitting ARIMA

In our case the raw count worked well enough.

 

Reference:

https://www.itl.nist.gov/div898/handbook/eda/section3/eda35h1.htm

Github Gist

 

Time series week 1

  • Plotting in R
  • Checking whether a linear regression is properly fitted
    • Residuals are the important thing to observe
    • Q-Q plots for normality test
    • Residuals over time
      • Zoomed-in residuals over time
  • Hypothesis test
    • One-sided and two-sided t-tests
    • Confidence interval
      • Where we think the mean lies
      • If it does not contain 0 we tend to reject the null hypothesis (very broad statement, but I think you get the concept)
  • Correlation function
    • Which quarter data false

 

Ref : https://www.coursera.org/learn/practical-time-series-analysis/home/welcome

 

 

 

Parametric and Nonparametric tests

We rarely hear of nonparametric tests while reading standard statistics books. However, there are some scenarios where they should be used instead of parametric tests. [1] has a beautiful blog post about it; I am putting just a summary of it here.

 

Different Tests

The table below displays various tests. I have verified that all of these tests are available in the Python stats package.

[Image: tests_1, table of parametric tests and their nonparametric counterparts]

When to Use Parametric Tests

  • Parametric tests can perform well with skewed and nonnormal distributions
    • It is important to follow the sample size guidance shown in the table below
  • Parametric tests can perform well when the spread/variance of each group is different
  • Parametric tests usually have more statistical power than their nonparametric counterparts

[Image: tests_2, table of sample size guidelines for parametric tests]

Reasons to Use Nonparametric Tests

  • Your area of study is better represented by the median
    • Income distribution is skewed, so the median is more useful than the mean
    • A few billionaires can boost the mean significantly
  • You have a very small sample size
    • Even less than what is mentioned in table above
  • You have ordinal data, ranked data, or outliers that you can’t remove
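
The income point is easy to see with a few made-up numbers. The sketch below is only an illustration; the incomes are invented.

```python
from statistics import mean, median

# Invented yearly incomes for a small group of people.
incomes = [30_000, 35_000, 40_000, 45_000, 50_000, 55_000, 60_000]
with_billionaire = incomes + [1_000_000_000]

print(mean(incomes), median(incomes))                    # both 45000
print(mean(with_billionaire), median(with_billionaire))  # mean explodes, median barely moves
```

One extreme value drags the mean far away from what a typical person earns, while the median stays representative. That is the situation where a test about medians is the better fit.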

 

 

References:

[1] : http://blog.minitab.com/blog/adventures-in-statistics-2/choosing-between-a-nonparametric-test-and-a-parametric-test