On Interleaving

This post summarises some concepts around interleaving. My main takeaway is that in some cases an interleaving test can be conclusive where an a/b test is not (for example, a search improvement that affects only a small % of queries), along with the intuition for why it converges faster. It is also part of a broader topic: design of experiments. Interleaving is generally used in search systems and recommender systems, both of which come under the umbrella of retrieval systems.

What is interleaving?

  • It is a method for evaluating ranking systems.
  • To compare the effectiveness of two rankers, we merge their results into a single list and observe how customers interact with it. Each interaction is credited to the ranker that contributed the item.
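
As a rough illustration of how such a merged list can be built, here is a minimal sketch of team-draft interleaving, one of several known interleaving schemes. It is not taken from any particular system: the function names, the document ids and the click data are all made up for the example.

```python
import random

def team_draft_interleave(ranking_a, ranking_b, rng=None):
    """Merge two rankings. Returns (merged_list, team), where team maps
    each shown item to the ranker ("A" or "B") that contributed it."""
    rng = rng or random.Random()
    sources = {"A": ranking_a, "B": ranking_b}
    all_items = set(ranking_a) | set(ranking_b)
    merged, team = [], {}
    picks = {"A": 0, "B": 0}
    while len(merged) < len(all_items):
        # The ranker with fewer picks goes next; a tie is broken by a coin flip.
        if picks["A"] != picks["B"]:
            picker = "A" if picks["A"] < picks["B"] else "B"
        else:
            picker = rng.choice(["A", "B"])
        # Take that ranker's highest-ranked item that is not shown yet.
        candidate = next((x for x in sources[picker] if x not in team), None)
        if candidate is None:
            # This ranker has nothing left, so the other one picks instead.
            picker = "B" if picker == "A" else "A"
            candidate = next(x for x in sources[picker] if x not in team)
        merged.append(candidate)
        team[candidate] = picker
        picks[picker] += 1
    return merged, team

def credit_clicks(clicked_items, team):
    """Count clicks per ranker, based on who contributed each clicked item."""
    wins = {"A": 0, "B": 0}
    for item in clicked_items:
        if item in team:
            wins[team[item]] += 1
    return wins

# Hypothetical rankings of document ids from two rankers.
ranker_a = ["d1", "d2", "d3", "d4"]
ranker_b = ["d3", "d1", "d5", "d4"]
merged, team = team_draft_interleave(ranker_a, ranker_b, rng=random.Random(0))
wins = credit_clicks(["d2", "d5"], team)  # hypothetical clicks
print(merged, wins)
```

The customer sees only the merged list. Because both rankers are judged by the same person in the same session, differences between customers largely cancel out.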

Why do we need interleaving?

  • It allows us to experiment faster.
    • It converges faster than a/b tests.
  • For search systems, an a/b test often might not converge at all.
    • Say an improvement affects only a small % of search queries.
  • Search systems have low conversion, say 4 %. How do we measure an algorithm that improves on that by 0.5 %?
    • An a/b test needs a lot of impressions to converge.
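
To get a feel for "a lot", here is a back-of-the-envelope sketch using the standard sample-size formula for comparing two proportions. Everything in it is an assumption made for illustration: the 4 % baseline is read as a conversion rate, the 0.5 % gain is tried both as an absolute and as a relative improvement, and the significance level (5 %, two-sided) and power (80 %) are conventional defaults rather than anything from a real experiment.

```python
from math import ceil, sqrt
from statistics import NormalDist

def users_per_arm(p_control, p_treatment, alpha=0.05, power=0.8):
    """Approximate users needed in each arm of an a/b test to detect
    the difference between two conversion rates (two-sided z-test)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p_control + p_treatment) / 2
    spread = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * sqrt(p_control * (1 - p_control) + p_treatment * (1 - p_treatment))
    )
    return ceil(spread ** 2 / (p_treatment - p_control) ** 2)

baseline = 0.04
# Reading 1: 0.5 percentage points on top of 4 % (4.0 % -> 4.5 %).
n_absolute = users_per_arm(baseline, baseline + 0.005)
# Reading 2: a 0.5 % relative lift (4.0 % -> 4.02 %).
n_relative = users_per_arm(baseline, baseline * 1.005)
print(n_absolute, n_relative)
```

Under these assumptions the first reading needs tens of thousands of users per arm and the second needs millions, which is why a small improvement may never reach significance in a classic a/b test.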

Illustrative Examples

  • Shoe
    • Suppose that for a given shoe design we have two compelling soles, and we want to measure which one resists wear and tear better.
    • We have 100 people to test on.
    • option 1 – classic a/b test
      • 50 people get each type of sole
      • Two sources of variability
        • How active the person testing the shoe is
        • Quality of the sole
    • option 2 – interleaved test
      • Each person gets a different sole on the left and the right shoe
      • Just one source of variability
        • Quality of the sole
  • Soda [0]
    • We want to test which is more popular among the population: Pepsi or Coke?
    • option 1 – classic a/b test
      • Split the population into two groups
        • One group receives Pepsi
        • The other receives Coke
      • Compare soda consumption between the two groups
      • Sources of variability
        • Wide variation in soda consumption habits
        • Heavy soda consumers are a small % of the population but account for a large % of consumption
    • option 2 – interleaving
      • Allow each person to choose either Pepsi or Coke.
      • Don’t label the drinks, but keep them visually distinguishable.
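
The shoe example can be turned into a small simulation to show why removing person-to-person variability helps. This is only a sketch under an invented model: wear is a base value plus the person's activity, plus the sole's effect, plus measurement noise. The function name and all the numbers (spread of activity, noise, the true gap between soles) are assumptions chosen for illustration.

```python
import random
from statistics import mean, stdev

def simulate_sole_tests(n_people=100, true_gap=1.0, n_trials=300, seed=0):
    """Return (sd of the a/b estimate, sd of the interleaved estimate)
    of the wear gap between two soles, across repeated simulated tests."""
    rng = random.Random(seed)
    ab_estimates, paired_estimates = [], []
    half = n_people // 2
    for _ in range(n_trials):
        # How active each person is: large differences between people.
        activity = [rng.gauss(0, 5) for _ in range(n_people)]

        # Option 1: half the people wear sole A, the other half sole B.
        wear_a = [10 + act + rng.gauss(0, 1) for act in activity[:half]]
        wear_b = [10 + true_gap + act + rng.gauss(0, 1) for act in activity[half:]]
        ab_estimates.append(mean(wear_b) - mean(wear_a))

        # Option 2: every person wears sole A on one foot and sole B on the
        # other, so the activity term cancels in the per-person difference.
        diffs = [
            (10 + true_gap + act + rng.gauss(0, 1)) - (10 + act + rng.gauss(0, 1))
            for act in activity
        ]
        paired_estimates.append(mean(diffs))
    return stdev(ab_estimates), stdev(paired_estimates)

ab_sd, paired_sd = simulate_sole_tests()
print(f"a/b estimate sd: {ab_sd:.2f}, interleaved estimate sd: {paired_sd:.2f}")
```

With the same 100 people, the interleaved estimate of the gap is much less noisy, so fewer people are needed to reach a conclusion.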

For some tests interleaving is not possible, because the two variants cannot be combined into a single experience.

  • In the shoe example, if we are comparing two different shoe designs instead of two soles
  • In search systems, if you are changing UI elements

Two tests Netflix did while introducing its new interleaving system

  1. Compare sensitivity against traditional a/b tests [0]
    1. Before the test, it was known that algorithm B was better than A
    2. They measured the number of impressions each method took to converge

  2. Correlation of interleaving results with a/b test results [0]
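
One simple way to decide whether an interleaving test has "converged" is a sign test on how many users preferred each ranker. This is a generic sketch and not necessarily the test Netflix used; the function name and the win counts are hypothetical.

```python
from math import comb

def sign_test_p_value(wins_a, wins_b):
    """Two-sided exact sign test: probability of a split at least this
    lopsided if users had no real preference (each side wins with p = 0.5).
    Users with no preference are assumed to be dropped beforehand."""
    n = wins_a + wins_b
    k = max(wins_a, wins_b)
    upper_tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * upper_tail)

# Hypothetical counts of users whose clicks favoured ranker A or ranker B.
print(sign_test_p_value(60, 40))
print(sign_test_p_value(600, 400))
```

The same 60/40 split becomes far more convincing as the number of users grows, which is the sense in which a test converges with more impressions.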

On Metrics

  • Search systems have business metrics like conversion & revenue
  • Netflix has business metrics like retention and streaming hours

Reference

[0] Netflix blog