Power Analysis

When we run an A/B test, we need to know how long to keep it running. Suppose that after 14 days the test is still inconclusive. Should we keep it running in the hope of getting evidence in one more week, or conclude that the treatment has no significant effect? Power analysis helps answer this question.

With it we can build our own stopping guidelines, for example 5 M visitors or 14 days.

Relation with Type 1 and Type 2 Error

  • Type 1 error is wrongly rejecting the null hypothesis when it is actually true
    • Its probability is the significance level (alpha); we can reduce it by lowering alpha from 5 % to, say, 1 %
  • Type 2 error is failing to reject the null hypothesis when the alternate hypothesis is true
    • The alternate hypothesis is correct but the test could not detect it
    • We failed to reject the null hypothesis
  • Power = 1 – (probability of Type 2 error)
    • Probability that we reject the null hypothesis when we should
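A small simulation can make these definitions concrete. The sketch below is illustrative only: the function name `rejection_rate`, the sample size of 50 per group, the unit-variance normal data and the true difference of 0.5 are all assumptions, not values from a real test. It runs many simulated experiments and counts how often a two-sided z-test rejects the null hypothesis.

```python
import random
from statistics import NormalDist


def rejection_rate(true_diff, n=50, alpha=0.05, trials=2000, seed=0):
    """Fraction of simulated tests that reject the null hypothesis."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    # Standard error of the difference in means when sigma = 1 in both groups.
    std_err = (2 / n) ** 0.5
    rejections = 0
    for _ in range(trials):
        control = sum(rng.gauss(0.0, 1.0) for _ in range(n)) / n
        treatment = sum(rng.gauss(true_diff, 1.0) for _ in range(n)) / n
        if abs(treatment - control) / std_err > z_crit:
            rejections += 1
    return rejections / trials


# Null hypothesis is true: every rejection is a Type 1 error.
type1_rate = rejection_rate(true_diff=0.0)
# Alternate hypothesis is true: every rejection is a correct detection.
power = rejection_rate(true_diff=0.5)
type2_rate = 1 - power

print(f"Type 1 error rate ~ {type1_rate:.3f}")
print(f"Power ~ {power:.3f}, Type 2 error rate ~ {type2_rate:.3f}")
```

With no true difference, the rejection rate should land near alpha. With a true difference, it estimates power, and the remainder is the Type 2 error rate.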

4 Parameters Involved in Power Analysis

  • Alpha (significance level, i.e. the p-value threshold)
    • When we increase alpha from 5 % to 10 %, power increases
    • We reduce the probability of Type 2 error
    • But the probability of Type 1 error increases
  • Power
    • We defined it above
    • Typically we set it to 0.8
  • Number of samples, N
    • We want to find the number of samples the test needs to collect
    • After which we can conclude it.
  • Effect Size
    • One feature brings a 1 % lift in conversion
    • Another feature brings a 5 % lift in conversion
    • The 2nd test will converge faster: for the same N, a larger effect gives higher power

There is a formula connecting the above 4 parameters. Once we fix 3 of them, we can find the 4th one.
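As a rough sketch of one such relationship, assume a two-sided z-test comparing two equal-sized groups, with the effect size expressed as a standardized difference `d` (difference in means divided by the standard deviation). Under the normal approximation, power ≈ Φ(d·√(N/2) − z), where z is the critical value for alpha and N is the sample size per group. The function names below are made up for illustration, and other kinds of test have their own formulas.

```python
from math import ceil, sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def power(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample z-test (far tail ignored)."""
    z_crit = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    return _STD_NORMAL.cdf(abs(d) * sqrt(n_per_group / 2) - z_crit)


def required_n(d, alpha=0.05, target_power=0.8):
    """Per-group sample size needed to reach the target power."""
    z_crit = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_power = _STD_NORMAL.inv_cdf(target_power)
    return ceil(2 * ((z_crit + z_power) / d) ** 2)


# A larger effect needs far fewer samples for the same alpha and power.
n_small_effect = required_n(d=0.2)
n_large_effect = required_n(d=0.5)
print(n_small_effect, n_large_effect)

# Raising alpha from 5 % to 10 % raises power for the same N and effect size.
print(round(power(0.2, n_small_effect, alpha=0.05), 3))
print(round(power(0.2, n_small_effect, alpha=0.10), 3))
```

Fixing alpha, power and effect size gives N; fixing alpha, N and effect size gives power. The other two parameters can be solved for by rearranging the same expression.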

How can I know the effect size before the test ?

  • In most cases you are expected to have pilot data.
  • If not, another source would be similar experiments in the literature.
  • If neither of the above is available, it is wise to run a pilot study.
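If the pilot data is a pair of conversion rates, one common way to turn it into a standardized effect size is Cohen's h, h = 2·asin(√p1) − 2·asin(√p2). The sketch below assumes that choice of effect size, and the pilot counts are invented purely for illustration.

```python
from math import asin, sqrt


def cohens_h(p_treatment, p_control):
    """Cohen's h: standardized difference between two proportions."""
    return 2 * asin(sqrt(p_treatment)) - 2 * asin(sqrt(p_control))


# Hypothetical pilot: 500 of 10,000 treated visitors converted,
# against 400 of 10,000 control visitors.
pilot_h = cohens_h(500 / 10_000, 400 / 10_000)
print(round(pilot_h, 4))
```

The sign of h shows the direction of the difference, and its magnitude is what a power calculation would take as the effect size.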

Types of Power Analysis

There are four types, based on which parameter we are trying to find.

  • A priori analysis
    • Compute N given the other three
    • Most common analysis
  • Post-hoc power analysis
    • Find power given the other three
    • It is wise to do once the test is completed
    • When calculating N before the test you don’t have an exact estimate of the effect size; after the test you have the observed one
    • This helps you verify whether you had enough samples to detect the effect you found
    • Power computed from the observed effect is a direct function of the p-value, so treat it as a sanity check rather than as new evidence
  • Criterion analysis
    • Compute alpha given the other three
    • Rarely used
  • Sensitivity analysis
    • Find effect size given the other three
    • It is important when sample size is bounded
      • In online A/B tests this is generally not the case
      • For a clinical trial it would be
    • For a fixed sample size you can estimate the smallest effect you could reliably detect
      • If that is larger than any effect you can realistically expect, don’t conduct the experiment
    • This is known as the minimum detectable effect (MDE).
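For a two-sided z-test on two equal-sized groups with a standardized effect size d, the normal approximation power ≈ Φ(d·√(N/2) − z_crit) can be rearranged for whichever parameter is unknown. The sketch below shows the sensitivity and criterion cases under those assumptions. The function names are illustrative, not from any library.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def minimum_detectable_effect(n_per_group, alpha=0.05, target_power=0.8):
    """Sensitivity analysis: smallest standardized effect detectable with N."""
    z_crit = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_power = _STD_NORMAL.inv_cdf(target_power)
    return (z_crit + z_power) * sqrt(2 / n_per_group)


def criterion_alpha(d, n_per_group, target_power=0.8):
    """Criterion analysis: alpha needed to reach the target power."""
    z_power = _STD_NORMAL.inv_cdf(target_power)
    z_crit = abs(d) * sqrt(n_per_group / 2) - z_power
    # Clamp at 1.0: the approximation breaks down when the target is unreachable.
    return min(1.0, 2 * (1 - _STD_NORMAL.cdf(z_crit)))


# With a bounded sample size, check what effect is detectable before running.
for n in (100, 1_000, 10_000):
    print(n, round(minimum_detectable_effect(n), 3))

print(round(criterion_alpha(d=0.2, n_per_group=393), 3))
```

The MDE shrinks with the square root of N, so a tenfold increase in samples only improves it by a factor of about 3.2.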

Calculation of effect size and power analysis formula

  • A full treatment is beyond the scope of this post for now. I will add it in the future.

On Interleaving

This post summarises some concepts around interleaving. My main takeaways are the realisation that in some cases interleaving can be conclusive while an A/B test is not (for example, a search improvement that affects only a small % of queries), and the intuition for why it converges faster. It belongs to a broader topic, design of experiments. Interleaving is generally used in search systems and recommender systems; both come under the umbrella of retrieval systems.

What is interleaving ?

  • It is a method to evaluate ranking systems.
  • To compare the effectiveness of two rankers, we prepare a merged list of their results and observe the customer’s interactions with it.
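The notes above do not say how the merged list is built. One widely used scheme is team-draft interleaving, and the sketch below assumes it: the two rankers take turns picking their best not-yet-shown item, a coin flip decides who goes first when they are level, and each click is credited to the ranker that contributed the clicked item. All names and the sample data are hypothetical.

```python
import random


def team_draft_interleave(ranking_a, ranking_b, rng=None):
    """Merge two rankings; return the merged list and each item's team."""
    rng = rng or random.Random()
    rankings = {"A": ranking_a, "B": ranking_b}
    picks = {"A": 0, "B": 0}
    merged, team_of = [], {}

    def next_unseen(team):
        return next((item for item in rankings[team] if item not in team_of), None)

    while True:
        # The team with fewer picks goes next; a coin flip breaks ties.
        if picks["A"] != picks["B"]:
            order = ["A", "B"] if picks["A"] < picks["B"] else ["B", "A"]
        else:
            order = ["A", "B"] if rng.random() < 0.5 else ["B", "A"]
        for team in order:
            item = next_unseen(team)
            if item is not None:
                merged.append(item)
                team_of[item] = team
                picks[team] += 1
                break
        else:
            # Neither ranker has anything left to contribute.
            return merged, team_of


def credit_clicks(clicked_items, team_of):
    """Count clicks per team; the team with more clicks wins this impression."""
    wins = {"A": 0, "B": 0}
    for item in clicked_items:
        if item in team_of:
            wins[team_of[item]] += 1
    return wins


ranker_a = ["a", "b", "c", "d"]
ranker_b = ["b", "d", "a", "e"]
merged, team_of = team_draft_interleave(ranker_a, ranker_b, random.Random(42))
wins = credit_clicks([merged[0]], team_of)
print(merged, team_of, wins)
```

Aggregating the per-impression winners over many users gives a preference for one ranker over the other.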

Why do we need interleaving ?

  • It allows us to experiment faster.
    • It converges faster than A/B tests
  • For search systems an A/B test often might not converge at all
    • Say an improvement affects only a small % of search queries
  • Search systems have low conversion, say 4 %. How do we measure an algorithm that improves it by 0.5 %?
    • Lots of impressions are needed for an A/B test to converge.
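To get a feel for "lots of impressions", the sketch below applies the standard normal-approximation sample size for comparing two proportions to a 4 % baseline. It is only an estimate under assumed settings (alpha = 0.05, power = 0.8, equal-sized arms). Since "0.5 %" can be read as either an absolute or a relative improvement, both readings are computed. The function name is made up for illustration.

```python
from math import ceil
from statistics import NormalDist


def n_per_arm(p_control, p_treatment, alpha=0.05, power=0.8):
    """Approximate samples per arm for a two-sided two-proportion z-test."""
    std_normal = NormalDist()
    z_total = std_normal.inv_cdf(1 - alpha / 2) + std_normal.inv_cdf(power)
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    return ceil(z_total ** 2 * variance / (p_treatment - p_control) ** 2)


baseline = 0.04
absolute_reading = n_per_arm(baseline, baseline + 0.005)  # 4.0 % -> 4.5 %
relative_reading = n_per_arm(baseline, baseline * 1.005)  # 4.0 % -> 4.02 %

print(f"absolute +0.5 pt: {absolute_reading:,} per arm")
print(f"relative +0.5 %:  {relative_reading:,} per arm")
```

The required sample grows with the inverse square of the difference, which is why small improvements on a low baseline are so expensive to detect with an A/B test.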

Illustrative Examples

  • Shoe
    • Suppose for a given shoe design we have two competing soles, and we want to measure which one resists wear and tear better
    • We have 100 people to test on
    • option 1 – classic A/B test
      • 50 people get each type of sole
      • Two sources of variability
        • How active the person testing it is
        • Quality of sole
    • option 2 – interleaved test
      • Each person gets a different sole on the left and right shoe of the same pair
      • Just one source of variability
        • Quality of sole
  • Soda [0]
    • We want to test which is more popular among the population – Pepsi or Coke ?
    • option 1 – classic A/B test
      • Split population into two groups
        • One group receives Pepsi
        • Another one receives Coke
      • Compare soda consumption between the two groups
      • Variability
        • Wide variation in soda consumption habits
        • Heavy soda consumers are a small % of the population but contribute a large % of total consumption
    • option 2 – interleaving
      • Allow each person to choose either Pepsi or Coke.
      • The drinks carry no brand label, but are still visually distinguishable
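The shoe example can be simulated to see why removing person-to-person variability helps. In the sketch below, every number (the spread in activity, the measurement noise, the size of the sole effect) is an assumption chosen for illustration. Each simulated person has an activity level that adds to the wear on whatever they are wearing. The A/B design compares two groups of 50, while the interleaved design compares left and right shoe on all 100 people.

```python
import random
from statistics import NormalDist

Z_CRIT = NormalDist().inv_cdf(0.975)  # two-sided test at alpha = 0.05


def mean_and_variance(values):
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / (len(values) - 1)
    return mean, variance


def detection_rates(trials=1000, people=100, sole_effect=0.5,
                    activity_sd=3.0, noise_sd=1.0, seed=0):
    """Share of simulated studies in which each design detects the effect."""
    rng = random.Random(seed)
    half = people // 2
    ab_hits = paired_hits = 0
    for _ in range(trials):
        activity = [rng.gauss(0, activity_sd) for _ in range(people)]

        # A/B: half wear sole A, half wear sole B; activity stays in the noise.
        wear_a = [a + rng.gauss(0, noise_sd) for a in activity[:half]]
        wear_b = [a + sole_effect + rng.gauss(0, noise_sd) for a in activity[half:]]
        (mean_a, var_a), (mean_b, var_b) = mean_and_variance(wear_a), mean_and_variance(wear_b)
        std_err = (var_a / len(wear_a) + var_b / len(wear_b)) ** 0.5
        if abs(mean_b - mean_a) / std_err > Z_CRIT:
            ab_hits += 1

        # Interleaved: one sole per foot, so activity cancels in the difference.
        diffs = [(a + sole_effect + rng.gauss(0, noise_sd)) - (a + rng.gauss(0, noise_sd))
                 for a in activity]
        mean_d, var_d = mean_and_variance(diffs)
        if abs(mean_d) / (var_d / people) ** 0.5 > Z_CRIT:
            paired_hits += 1
    return ab_hits / trials, paired_hits / trials


ab_rate, interleaved_rate = detection_rates()
print(f"A/B detects the effect in {ab_rate:.0%} of runs")
print(f"Interleaved design detects it in {interleaved_rate:.0%} of runs")
```

With the same 100 people, the within-person comparison detects the same sole difference far more often, which is the intuition behind interleaving converging faster.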

For some tests interleaving is not possible.

  • In the shoe example, if we are comparing two different shoe designs instead of two soles
  • In search systems, if you are changing UI elements

Two tests Netflix did while introducing their new interleaving system

  1. Compare sensitivity against traditional A/B tests [0]
    1. Before the test it was known that algorithm B is better than A
    2. They measured the number of impressions each method took to converge
  2. Correlation of interleaving results with A/B tests [0]

On Metrics

  • Search systems have business metrics like conversion and revenue
  • Netflix has business metrics like retention and streaming hours

Reference

[0] Netflix blog