Chi Square Test

The chi-square test of independence is a procedure for testing whether two categorical variables are related in some population.

Here is a handwritten example: https://github.com/arcarchit/datastories/blob/master/notes/chi2.pdf

Chi square distribution

  • Chi square distribution
    • Obtained by squaring samples from a standard normal distribution and summing them [0]
    • The shape of the distribution changes with the degrees of freedom (the number of squared terms in the sum)
    • When DoF = 1 it is most concentrated around 0
  • The test statistic is a sum of squares
    • Each term is the squared difference between an observed count and its expected value, divided by the expected value
    • When a die is biased the sum of squares will be higher, hence more significant
    • When it is fair the sum will be close to zero
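
The "sum of squared standard normals" construction can be checked with a small simulation. This is only a sketch using Python's standard library; the function name `chi_square_samples`, the seed, and the sample sizes are illustrative choices, not something from the referenced material.

```python
import random
import statistics

def chi_square_samples(dof, n_samples, rng):
    """Draw n_samples values, each the sum of `dof` squared standard normals."""
    return [
        sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(dof))
        for _ in range(n_samples)
    ]

rng = random.Random(0)  # fixed seed so the run is repeatable
for dof in (1, 2, 5):
    samples = chi_square_samples(dof, 20_000, rng)
    # Fraction of the mass close to zero; it shrinks as DoF grows
    below_one = sum(s < 1.0 for s in samples) / len(samples)
    print(f"DoF={dof}: mean={statistics.mean(samples):.2f}, "
          f"fraction below 1 = {below_one:.2f}")
```

The sample mean should land close to the DoF, and the fraction of values below 1 should be largest for DoF = 1, which is the "more concentrated around 0" behaviour noted above.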

Chi Square Test for Equality of Proportions

h1

Chi square vs T test

  • When to use which one
  • A t-test is used to compare the means of two distributions
  • A chi-square test is used to check whether the observed counts of categorical data match what an assumption (the null hypothesis) predicts

Chi Square for goodness of fit testing

  • Chi Square Goodness of fit
    • Restaurant example
    • H0 = Percentage given by customer is correct
  • We calculate the expected count for each cell and then calculate chi^2 as the sum of (observed - expected)^2 / expected over all cells
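
A goodness-of-fit calculation can be sketched as follows. The claimed percentages and observed counts are hypothetical numbers made up for illustration (they are not taken from these notes), and `chi_square_gof` is an illustrative helper name. The critical value 11.070 is the standard chi-square table entry for alpha = 0.05 and DoF = 5.

```python
def chi_square_gof(observed, expected_fractions):
    """Return (chi^2 statistic, degrees of freedom, expected counts)."""
    total = sum(observed)
    expected = [total * p for p in expected_fractions]
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    dof = len(observed) - 1  # one cell is fixed once the total is known
    return stat, dof, expected

# Hypothetical restaurant data: claimed share of customers on each of six
# days, and the counts actually observed (assumed values for illustration)
claimed = [0.10, 0.10, 0.15, 0.20, 0.30, 0.15]
observed = [30, 14, 34, 45, 57, 20]

stat, dof, expected = chi_square_gof(observed, claimed)
CRITICAL_05_DOF5 = 11.070  # chi-square table value, alpha = 0.05, DoF = 5
print(f"chi^2 = {stat:.2f}, DoF = {dof}, reject H0: {stat > CRITICAL_05_DOF5}")
```

With these made-up counts the statistic is slightly above the critical value, so H0 would be rejected at the 5% level.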

Chi Square for relationship testing

  • H0 : Variables are independent of each other
  • It helps test whether two categorical variables are related
  • Calculate the chi-square statistic by summing (observed - expected)^2 / expected over all cells and check it against the chi-square distribution with the appropriate degrees of freedom
  • Examples
    • Hypothesis testing :
      • H0 = Herb 1, Herb 2 and placebo are the same
      • H0 = Herbs do nothing
      • We can’t say the herb does nothing
        • We are working on accumulated (count) data here
        • Whereas ANOVA is about variance
    • Homogeneity testing  :
      • H0 = Left- and right-handed people have the same preference for arts and science
      • H0 = Preference for arts/science is independent of natural hand (left/right)
      • H0 = Variables are independent
      • Filling up table
        • P(STEM | right) = P(STEM)
        • x / 60 = 40/100 => x = 40 * 60 / 100 = 24
        • We can also say that the expected value of a cell is the product of its marginals divided by the total
      • Degrees of freedom = (r-1)*(c-1)  = 2 * 1 = 2
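
The "product of marginals divided by total" rule and the degrees-of-freedom formula can be sketched in a few lines. Only the marginals used above (60 right-handed, 40 STEM, total 100) come from the notes; the individual observed counts, the third "equal" category, and the function names are assumptions for illustration. The p-value line uses the closed form exp(-x/2), which is valid only for DoF = 2.

```python
import math

def expected_counts(table):
    """Expected cell counts under independence: row total * column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    return [[r * c / total for c in col_totals] for r in row_totals]

def chi_square_statistic(table):
    """Return (chi^2 statistic, degrees of freedom) for a contingency table."""
    expected = expected_counts(table)
    stat = sum(
        (o - e) ** 2 / e
        for row_o, row_e in zip(table, expected)
        for o, e in zip(row_o, row_e)
    )
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof

# Rows: STEM, humanities, equal preference; columns: right-handed, left-handed.
# Hypothetical counts chosen so that 60 people are right-handed and 40 prefer STEM.
observed = [[30, 10], [15, 25], [15, 5]]

stat, dof = chi_square_statistic(observed)
p_value = math.exp(-stat / 2)  # survival function of chi-square, DoF = 2 only
print(expected_counts(observed)[0][0])  # STEM and right-handed: 40 * 60 / 100
print(f"chi^2 = {stat:.4f}, DoF = {dof}, p = {p_value:.5f}")
```

The first expected cell reproduces the x = 24 worked out by hand above, and the table shape (3 rows, 2 columns) gives (3-1)*(2-1) = 2 degrees of freedom.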

References

[0] : https://www.khanacademy.org/math/statistics-probability/inference-categorical-data-chi-square-tests#chi-square-goodness-of-fit-tests

[1] : https://www.khanacademy.org/math/statistics-probability/inference-categorical-data-chi-square-tests#chi-square-goodness-of-fit-tests

[2] : https://biology.stackexchange.com/questions/13486/deciding-between-chi-square-and-t-test

[3] : https://fhssrsc.byu.edu/SitePages/ANOVA,%20t-tests,%20Regression,%20and%20Chi%20Square.aspx

ANOVA Introduction

ANOVA (Analysis of Variance) is a statistical technique used to determine if there are significant differences between the means of two or more groups. It is an extension of the t-test, which is used for comparing means between two groups.

The key idea behind ANOVA is to assess whether the data in different groups come from the same distribution or not. It does so by comparing two sources of variation: if the group means truly differ, the variance within each group should be low relative to the variance between the groups.

The null hypothesis in ANOVA states that the means of all the groups are equal. If the groups are well separated, so that at least one group mean differs significantly from the others, the null hypothesis is rejected.

There are different types of ANOVA designs, including one-way ANOVA and two-way ANOVA.

One-way ANOVA involves comparing the means of a single dependent variable across three or more independent groups. For example, you can use one-way ANOVA to analyze the stress levels of employees before the layoff announcement, after the announcement, and during the layoff period.

Two-way ANOVA, on the other hand, examines the effects of two independent variables on a dependent variable, including the interaction between them. For instance, you can use two-way ANOVA to explore the stress levels of men and women before the layoff announcement, after the announcement, and during the layoff period. In two-way ANOVA, there are multiple null hypotheses to test, including whether the stress levels are the same for men and women, whether the stress levels are the same across different time points, and whether there is an interaction effect between gender and time.

The F distribution is used in ANOVA, similar to how the t distribution is used in t-tests. The F distribution has different shapes depending on the degrees of freedom for the numerator and denominator. As the degrees of freedom increase, the F distribution becomes more concentrated around 1. Its support runs from 0 to infinity, denoted as [0, infinity).

I have coded a notebook that calculates one-way ANOVA manually in Python [1].
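
For orientation, the manual calculation can be sketched as below. This is not the code from the notebook in [1]; it is a minimal stdlib-only illustration, and the three groups of values and the name `one_way_anova_f` are made up for the example.

```python
from statistics import mean

def one_way_anova_f(groups):
    """Return (F statistic, df between, df within) for a list of groups."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    # Between-group sum of squares: how far each group mean is from the grand mean
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: spread of values around their own group mean
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Hypothetical measurements for three groups
groups = [[3, 2, 1], [5, 3, 4], [5, 6, 7]]
f_stat, df_b, df_w = one_way_anova_f(groups)
print(f"F = {f_stat:.2f} with ({df_b}, {df_w}) degrees of freedom")
```

A large F means the between-group variance dominates the within-group variance; the value is then compared against the F distribution with (df between, df within) degrees of freedom.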

ANCOVA, MANOVA, MANCOVA

ANCOVA (Analysis of Covariance), MANOVA (Multivariate Analysis of Variance), and MANCOVA (Multivariate Analysis of Covariance) are related techniques.

ANCOVA (Analysis of Covariance) is an extension of ANOVA that incorporates a continuous independent variable, referred to as a covariate. The purpose of ANCOVA is to examine whether the relationship between the dependent variable and the independent variable(s) remains significant after controlling for the effect of the covariate. By including the covariate in the analysis, ANCOVA allows for a more accurate assessment of the impact of the independent variables on the dependent variable.

MANOVA (Multivariate Analysis of Variance) is used when there are two or more dependent variables. It examines whether there are significant differences between groups across the multiple dependent variables. MANOVA allows for the analysis of complex relationships among variables and provides a comprehensive assessment of group differences. It is particularly useful when the dependent variables are related or correlated.

MANCOVA (Multivariate Analysis of Covariance) is an extension of MANOVA that incorporates continuous independent variables along with the categorical independent variables. MANCOVA allows for the examination of the effects of both categorical and continuous independent variables on multiple dependent variables while controlling for the influence of covariates. It helps to determine whether the relationships between the independent variables and the dependent variables remain significant after accounting for the effects of covariates.

In summary:

  • ANOVA is used to compare means between three or more groups.
  • ANCOVA extends ANOVA by incorporating a continuous covariate to control for its effect.
  • MANOVA analyzes differences between groups across multiple dependent variables.
  • MANCOVA extends MANOVA by adding continuous covariates alongside the categorical independent variables.

These techniques are widely used in research and can provide valuable insights into group differences and relationships among variables.

References:

[0] : https://www.technologynetworks.com/informatics/articles/one-way-vs-two-way-anova-definition-differences-assumptions-and-hypotheses-306553

[1] : https://github.com/arcarchit/datastories/blob/master/ANOVA.ipynb

[2] : http://www.statsmakemecry.com/smmctheblog/stats-soup-anova-ancova-manova-mancova

Parametric and Nonparametric tests

Nonparametric tests are rarely covered in standard statistics books. However, there are some scenarios where they should be used instead of parametric tests. [1] has an excellent blog post about this; what follows is just a summary of it.

 

Different Tests

The table below lists various tests. I have verified that all of these tests are available in the Python stats package.

tests_1

When to Use Parametric Tests

  • Parametric tests can perform well with skewed and nonnormal distributions
    • Provided the sample size follows the guidance shown in the table below
  • Parametric tests can perform well when the spread/variance of each group is different
  • Parametric tests usually have more statistical power than nonparametric tests

tests_2

Reasons to Use Nonparametric Tests

  • Your area of study is better represented by the median
    • Income distribution is skewed, so the median is more useful than the mean
    • A few billionaires can raise the mean significantly
  • You have a very small sample size
    • Even smaller than what is mentioned in the table above
  • You have ordinal data, ranked data, or outliers that you can’t remove
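
The income point can be made concrete with a tiny example. The income figures are invented for illustration only.

```python
from statistics import mean, median

# Hypothetical yearly incomes for five people
incomes = [30_000, 35_000, 40_000, 45_000, 50_000]
# The same group after one billionaire joins it
with_billionaire = incomes + [1_000_000_000]

print(f"mean   before: {mean(incomes):>15,.0f}  after: {mean(with_billionaire):>15,.0f}")
print(f"median before: {median(incomes):>15,.0f}  after: {median(with_billionaire):>15,.0f}")
```

The mean jumps by several orders of magnitude while the median barely moves, which is why a test built around the median can describe such data better.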

 

 

References:

[1] : http://blog.minitab.com/blog/adventures-in-statistics-2/choosing-between-a-nonparametric-test-and-a-parametric-test

Hypothesis and T-Distribution

We calculate the t-score from the sample data and the hypothesized mean; the sample size also gives us the degrees of freedom. These values are then supplied to a function that gives us the p-value: the probability of seeing a t-score at least this extreme if the null hypothesis were true.

The t-test can be seen as a ratio, similar to a signal-to-noise ratio. The numerator (sample mean minus hypothesized mean) centers it around zero, while the denominator is the standard error of the mean (SEM), calculated as s/sqrt(n), where s is the sample standard deviation and n is the number of samples.

The t-score indicates how many SEMs the sample mean is away from the mean given in the hypothesis. If it is far away, such a sample would be unlikely if the null-hypothesis mean were true, leading us to reject the null hypothesis.
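
The ratio described above can be written out directly. This is a minimal sketch with Python's standard library; the sample weights and the function name `one_sample_t` are hypothetical.

```python
import math
from statistics import mean, stdev

def one_sample_t(sample, hypothesized_mean):
    """Return (t-score, degrees of freedom) for a one-sample t-test."""
    n = len(sample)
    sem = stdev(sample) / math.sqrt(n)  # s / sqrt(n), with s the sample std dev
    t_score = (mean(sample) - hypothesized_mean) / sem
    return t_score, n - 1

# Hypothetical measured weights (grams) tested against a claimed mean of 100 g
weights = [96, 98, 99, 97, 100]
t_score, dof = one_sample_t(weights, 100)
print(f"t = {t_score:.3f} with {dof} degrees of freedom")
```

The t-score and the degrees of freedom are the two values that would then be passed to a t-distribution function or looked up in a t-table.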

In engineering, we typically assume that the mean and standard deviation are given and true, and we compute the probability of observing the sample. However, in hypothesis testing with a small number of samples, we are testing whether the given mean is true or not.

To address this, we need a distribution that adjusts itself based on the number of observations, widening when there are fewer samples. The t-distribution serves this purpose, as it is dependent on the sample size.

There are different types of t-tests:

  • One-sample t-test: Compares the mean of a sample with a known population mean.
    • The discussion so far is about the one-sample test
  • Two-sample t-test: Compares the means of two independent groups.
    • Scores of students who get eight hours of sleep vs four hours of sleep
    • The question we want to answer: is there any significant difference in their scores?
    • In the one-sample test (in the numerator of the t-score) we compare the sample mean with the population mean
    • In the two-sample test we compare the means of two independently drawn samples
    • The SEM formula in the denominator is modified as well
    • Example
      • A/B testing on an e-commerce site where you compare CTR before and after
        • This is a two-sample test because you don’t have a standard value of CTR before the feature
        • You will see some difference even in an A/A test
  • Paired t-test: Compares the means of two conditions using the same samples.
    • This is essentially a one-sample t-test on the differences between the values at the two conditions
    • 10 people before medication and the same 10 people after medication
      • We want to check if the medication has any effect
    • Different time points are used for market calculation
    • Example
      • Interleaving test in an e-commerce search system
      • For each search page you assign a score to control and to variant
  • One-sided t-test: Tests a hypothesis in one direction (e.g., weight of dairy milk is less than 100g).
  • Two-sided t-test: Tests a hypothesis in both directions (e.g., weight of dairy milk is not equal to 100g).
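
The two-sample and paired variants differ mainly in how the denominator is built, which the sketch below illustrates. It assumes the Welch (unequal-variance) form for the two-sample standard error, which is one common choice and not necessarily the one intended above; all data and function names are hypothetical.

```python
import math
from statistics import mean, variance

def welch_t(sample_a, sample_b):
    """Two-sample t-score using the unequal-variance (Welch) standard error."""
    se = math.sqrt(
        variance(sample_a) / len(sample_a) + variance(sample_b) / len(sample_b)
    )
    return (mean(sample_a) - mean(sample_b)) / se

def paired_t(before, after):
    """Paired t-score: a one-sample t-test of the differences against zero."""
    diffs = [a - b for a, b in zip(after, before)]
    sem = math.sqrt(variance(diffs) / len(diffs))
    return mean(diffs) / sem, len(diffs) - 1

# Hypothetical exam scores: eight hours of sleep vs four hours of sleep
t_two_sample = welch_t([80, 85, 90], [70, 75, 80])

# Hypothetical measurements on the same five people before and after medication
t_paired, dof_paired = paired_t([10, 12, 9, 11, 13], [12, 13, 11, 12, 15])

print(f"two-sample t = {t_two_sample:.3f}")
print(f"paired t = {t_paired:.3f} with {dof_paired} degrees of freedom")
```

For a one-sided test the resulting t-score is compared against one tail of the t-distribution; for a two-sided test both tails are used.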

A p-value represents the probability of finding the observed or more extreme results when the null hypothesis is true. It is often described in terms of rejecting the null hypothesis when it is actually true, but it is not directly the probability of that error, nor the probability that the null hypothesis is true.
