On Interleaving

This post summarises some concepts around interleaving. My main takeaways are the realisation that interleaving can be conclusive in cases where an a/b test is not (for example, a search improvement that affects only a small % of queries) and the intuition for why it converges faster. Interleaving is part of a broader topic, design of experiments. It is generally used in search systems and recommender systems, both of which fall under the umbrella of retrieval systems.

What is interleaving?

  • It is a method for evaluating ranking systems.
  • If we want to compare the effectiveness of two rankers, we merge their results into a single list and observe how customers interact with it.
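
The merged list can be built in several ways, and this post does not say which one is meant. As a minimal sketch, here is team-draft interleaving, one commonly used method, chosen here as an assumption. The function names, the toy rankings and the click-counting rule are all hypothetical.

```python
import random


def team_draft_interleave(ranking_a, ranking_b, length, rng=None):
    """Merge two rankings; return (merged list, {item: team that placed it})."""
    rng = rng or random.Random()
    rankings = {"A": ranking_a, "B": ranking_b}
    picks = {"A": 0, "B": 0}
    merged, credit = [], {}
    while len(merged) < length:
        # The team with fewer picks goes next; a tie is settled by a coin flip.
        if picks["A"] == picks["B"]:
            order = rng.sample(["A", "B"], 2)
        else:
            order = sorted(picks, key=picks.get)
        for team in order:
            # Take this team's best-ranked item that is not in the merged list yet.
            item = next((x for x in rankings[team] if x not in credit), None)
            if item is not None:
                merged.append(item)
                credit[item] = team
                picks[team] += 1
                break
        else:
            break  # both rankings are exhausted
    return merged, credit


def score_clicks(clicked_items, credit):
    """Count clicks per ranker, crediting each click to the team that placed the item."""
    wins = {"A": 0, "B": 0}
    for item in clicked_items:
        if item in credit:
            wins[credit[item]] += 1
    return wins


# Toy rankings from two hypothetical rankers A and B.
merged, credit = team_draft_interleave(
    ["a", "b", "c", "d"], ["c", "a", "e", "f"], length=4, rng=random.Random(0)
)
# Pretend the customer clicked the top result of the merged list.
wins = score_clicks(clicked_items=[merged[0]], credit=credit)
```

Each customer sees one list, and every click is a vote for the ranker that contributed the clicked item, so both rankers are judged by the same person in the same session.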

Why do we need interleaving?

  • It allows us to experiment faster.
    • It converges faster than a/b tests.
  • For search systems, an a/b test often might not converge at all.
    • Say an improvement affects only a small % of search queries.
  • Search systems have a low conversion rate, say 4 %. How do we measure an algorithm that improves it by 0.5 % on top of that?
    • Lots of impressions are needed for an a/b test to converge.
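
To get a feel for "lots of impressions", here is a rough sketch using the standard normal-approximation sample size formula for comparing two proportions. It reads the 4 % figure above as a baseline conversion rate and tries the 0.5 % improvement both as an absolute and as a relative lift. These readings, the 5 % significance level and the 80 % power are my assumptions, not numbers from any real experiment.

```python
import math
from statistics import NormalDist


def impressions_per_arm(p_control, p_treatment, alpha=0.05, power=0.8):
    """Approximate impressions needed in each arm of a two-sided two-proportion test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    # Sum of the Bernoulli variances of the two arms.
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    effect = p_treatment - p_control
    return math.ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)


# Reading 1 (assumption): absolute lift, 4.0 % -> 4.5 %.
n_absolute = impressions_per_arm(0.04, 0.045)
# Reading 2 (assumption): relative lift of 0.5 %, 4.00 % -> 4.02 %.
n_relative = impressions_per_arm(0.04, 0.0402)
print(n_absolute, n_relative)
```

Under these assumptions the first reading needs roughly 25 thousand impressions per arm and the second roughly 15 million, because the required sample grows with the inverse square of the effect size.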

Illustrative Examples

  • Shoe
    • Suppose for a given shoe design we have two competing soles, and we want to measure which one resists wear and tear better.
    • We have 100 people to test on.
    • option 1 – classic a/b test
      • 50 people get each type of sole
      • Two sources of variability
        • How active the person testing the shoe is
        • Quality of the sole
    • option 2 – interleaved test
      • Each person gets a different sole on the left and right shoe of the pair
      • Just one source of variability
        • Quality of the sole
  • Soda [0]
    • We want to test which is more popular among the population: Pepsi or Coke?
    • option 1 – classic a/b test
      • Split the population into two groups
        • One group receives Pepsi
        • The other group receives Coke
      • Compare soda consumption between the two groups
      • Variability
        • Soda consumption habits vary widely
        • Heavy soda consumers are a small % of the population but contribute a large % of the consumption
    • option 2 – interleaving
      • Allow each person to choose either Pepsi or Coke.
      • The drinks are not labelled, but they can be visually distinguishable
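
The shoe example can be simulated to see why removing the person-to-person variability matters. This is only a sketch with made-up numbers: the wear model (wearer activity plus sole effect plus noise), the means and the standard deviations are all assumptions chosen for illustration.

```python
import random
import statistics


def simulate_shoe_test(n_people=100, true_gap=1.0, seed=0):
    """Compare the standard error of the sole effect in both test designs."""
    rng = random.Random(seed)
    # Hypothetical model: how active each person is dominates the wear they cause.
    activity = [rng.gauss(50, 10) for _ in range(n_people)]

    def wear(person, sole_effect):
        return activity[person] + sole_effect + rng.gauss(0, 1)

    # Option 1, a/b test: the first half wears sole A, the second half sole B.
    half = n_people // 2
    wear_a = [wear(i, 0.0) for i in range(half)]
    wear_b = [wear(i, true_gap) for i in range(half, n_people)]
    ab_se = (
        statistics.variance(wear_a) / len(wear_a)
        + statistics.variance(wear_b) / len(wear_b)
    ) ** 0.5

    # Option 2, interleaved: everyone wears sole A on one foot and sole B on the other,
    # so the person's activity cancels out in the per-person difference.
    diffs = [wear(i, true_gap) - wear(i, 0.0) for i in range(n_people)]
    paired_se = statistics.stdev(diffs) / n_people ** 0.5

    return {"ab_se": ab_se, "paired_se": paired_se}


result = simulate_shoe_test()
print(result)
```

With these assumed numbers the paired standard error comes out much smaller than the a/b one, which is the intuition behind the faster convergence: the same 100 people give a far sharper estimate once each person acts as their own control.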

For some tests interleaving is not possible.

  • In the shoe example, if instead of two soles we are comparing two different shoe designs
  • In search systems, if you are changing UI elements

Two tests Netflix did while introducing its new interleaving system

  1. Compare sensitivity against traditional a/b tests [0]
    1. Before the test it was known that algorithm B is better than A
    2. They measured the number of impressions it took to converge
  2. Correlation of interleaving results with a/b tests [0]
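
These notes do not record how convergence was decided in those tests. As one simple possibility, a two-sided sign test on per-user preferences could be used: each user who interacted more with one ranker's items counts as a win for that ranker, and ties are dropped. The function name and the example counts below are hypothetical.

```python
from math import comb


def sign_test_p_value(wins_a, wins_b):
    """Two-sided exact binomial test of 'A and B are equally preferred'."""
    n = wins_a + wins_b
    k = min(wins_a, wins_b)
    # Probability of a split at least this lopsided in one direction under p = 0.5.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical counts: 60 users preferred ranker B, 40 preferred ranker A.
p_value = sign_test_p_value(wins_a=40, wins_b=60)
print(p_value)
```

Tracking this p-value as users accumulate gives one way to count how many users an interleaving test needs before it reaches a chosen significance level.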

On Metrics

  • Search systems have business metrics like conversion and revenue
  • Netflix has business metrics like retention and streaming hours

Reference

[0] Netflix blog

Parametric and Nonparametric tests

We rarely hear of nonparametric tests when reading standard statistics books. However, there are some scenarios where they should be used instead of parametric tests. [1] has a beautiful blog post about it; I am putting just a summary of it here.

 

Different Tests

The table below displays various tests. I have verified that all of these tests are available in Python's stats package (scipy.stats).

[Table: tests_1]
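
The table itself is not reproduced in these notes, so the pairings below are my assumption of the usual parametric tests and their nonparametric counterparts, not a copy of it. This sketch assumes the stats package meant is scipy.stats and runs each test on synthetic data.

```python
import numpy as np
from scipy import stats

# Synthetic samples for three hypothetical groups.
rng = np.random.default_rng(0)
a = rng.normal(10.0, 2.0, size=30)
b = rng.normal(11.0, 2.0, size=30)
c = rng.normal(12.0, 2.0, size=30)

p_values = {
    # One sample against a reference value of 10.
    "1-sample t-test": stats.ttest_1samp(a, popmean=10.0).pvalue,
    "1-sample Wilcoxon signed-rank": stats.wilcoxon(a - 10.0).pvalue,
    # Two independent samples.
    "2-sample t-test (Welch)": stats.ttest_ind(a, b, equal_var=False).pvalue,
    "Mann-Whitney U": stats.mannwhitneyu(a, b, alternative="two-sided").pvalue,
    # Three or more independent samples.
    "one-way ANOVA": stats.f_oneway(a, b, c).pvalue,
    "Kruskal-Wallis": stats.kruskal(a, b, c).pvalue,
}

for name, p in p_values.items():
    print(f"{name}: p = {p:.4f}")
```

Passing equal_var=False to ttest_ind selects Welch's t-test, which does not assume that the two groups have the same variance.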

When to Use Parametric Tests

  • Parametric tests can perform well with skewed and non-normal distributions
    • It is important to follow the sample size guidance shown in the table below
  • Parametric tests can perform well when the spread/variance of each group is different
  • Parametric tests usually have more statistical power than nonparametric tests

[Table: tests_2]

Reasons to Use Nonparametric Tests

  • Your area of study is better represented by the median
    • Income distribution is skewed, so the median is more useful than the mean
    • A few billionaires can boost the mean significantly
  • You have a very small sample size
    • Even smaller than what is mentioned in the table above
  • You have ordinal data, ranked data, or outliers that you can’t remove
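
The billionaire point is easy to check with a toy example. The incomes below are made up purely for illustration.

```python
import statistics

# Hypothetical population: 999 people earning 40,000 and one billionaire.
incomes = [40_000] * 999 + [1_000_000_000]

mean_income = statistics.mean(incomes)
median_income = statistics.median(incomes)

print(mean_income)    # pulled above a million by a single person
print(median_income)  # still 40,000, the typical income
```

One outlier moves the mean to about 26 times the typical income while the median does not move at all, which is why a test on medians describes this kind of data better.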

 

 

References:

[1] : http://blog.minitab.com/blog/adventures-in-statistics-2/choosing-between-a-nonparametric-test-and-a-parametric-test