This post summarises some concepts around interleaving. My main takeaway is the realisation that in some cases interleaving can be conclusive where an A/B test is not (for example, a search improvement that affects only a small % of queries), along with the intuition for why it converges faster. There is also a broader topic behind it: design of experiments. Interleaving is generally used in search systems and recommender systems, both of which fall under the umbrella of retrieval systems.
What is interleaving?
- It is a method for evaluating ranking systems.
- To compare the effectiveness of two rankers, we merge their results into a single list and observe how customers interact with it, crediting each interaction to the ranker that contributed the item.
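
The merging step can be sketched with team-draft interleaving, one common way to build the merged list. This is a minimal sketch, not necessarily the method used by any particular system: the function names, the document ids, and the click-counting rule are all assumptions made for illustration.

```python
import random

def team_draft_interleave(ranking_a, ranking_b, length, rng=None):
    """Merge two rankings; return (merged, team), where team maps item -> 'A' or 'B'."""
    rng = rng or random.Random()
    sources = {"A": ranking_a, "B": ranking_b}
    picks = {"A": 0, "B": 0}
    merged, team = [], {}
    while len(merged) < length:
        # The ranker with fewer picks goes next; ties are broken by a coin flip.
        if picks["A"] != picks["B"]:
            side = "A" if picks["A"] < picks["B"] else "B"
        else:
            side = rng.choice(["A", "B"])
        # That ranker contributes its highest-ranked item not already merged.
        candidate = next((x for x in sources[side] if x not in team), None)
        if candidate is None:
            # This ranker has nothing left; let the other one fill in, or stop.
            side = "B" if side == "A" else "A"
            candidate = next((x for x in sources[side] if x not in team), None)
            if candidate is None:
                break
        merged.append(candidate)
        team[candidate] = side
        picks[side] += 1
    return merged, team

def credit_clicks(clicked_items, team):
    """Count clicks per ranker for one impression of the merged list."""
    wins = {"A": 0, "B": 0}
    for item in clicked_items:
        if item in team:
            wins[team[item]] += 1
    return wins

# Hypothetical document ids, for illustration only.
ranking_a = ["d1", "d2", "d3", "d4"]
ranking_b = ["d3", "d1", "d5", "d6"]
merged, team = team_draft_interleave(ranking_a, ranking_b, 4, random.Random(0))
print(merged, credit_clicks([merged[0]], team))
```

Because both rankers are shown to the same customer in the same session, each impression yields a direct preference between A and B.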
Why do we need interleaving?
- It allows us to experiment faster.
- It converges faster than A/B tests.
- For search systems, an A/B test often might not converge at all.
  - Say an improvement affects only a small % of search queries.
  - Say the search system has a low conversion rate, for example 4%. How do we measure an algorithm that improves it by 0.5% on top of that?
- An A/B test needs a lot of impressions to converge.
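
To get a feel for how many impressions an A/B test needs, here is a rough sample-size sketch using the standard normal approximation for comparing two proportions. It reads the 4% figure as a conversion rate, and the significance level (0.05), the power (0.8), and the two readings of "0.5%" (absolute or relative) are all assumptions for illustration.

```python
import math
from statistics import NormalDist

def samples_per_arm(p_control, p_treatment, alpha=0.05, power=0.8):
    """Approximate sample size per arm for a two-sided two-proportion z-test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    # Sum of the Bernoulli variances of the two arms.
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    effect = p_treatment - p_control
    return math.ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)

# Reading 1 (assumed): 0.5 percentage points absolute, 4% -> 4.5%.
print(samples_per_arm(0.04, 0.045))
# Reading 2 (assumed): 0.5% relative, 4% -> 4.02%.
print(samples_per_arm(0.04, 0.0402))
```

Under these assumptions the formula gives roughly 25,000 samples per arm for the absolute reading and roughly 15 million per arm for the relative one. If the improvement only touches a small share of queries, the overall effect shrinks further and the required sample grows with the inverse square of the effect.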
Illustrative Examples
- Shoe
  - Suppose that for a given shoe design we have two candidate soles, and we want to measure which one resists wear and tear better.
  - We have 100 people to test on.
  - Option 1: classic A/B test
    - 50 people get each type of sole.
    - Two sources of variability
      - How active the person testing the shoe is
      - Quality of the sole
  - Option 2: interleaved test
    - Each person wears a different sole on the left and right shoe.
    - Just one source of variability
      - Quality of the sole
- Soda [0]
  - We want to test which is more popular among the population: Pepsi or Coke?
  - Option 1: classic A/B test
    - Split the population into two groups.
      - One group receives Pepsi.
      - The other group receives Coke.
    - Measure soda consumption in each group and compare.
    - Variability
      - Soda consumption habits vary widely.
      - Heavy soda consumers are a small % of the population but account for a large % of consumption.
  - Option 2: interleaving
    - Let every person choose between Pepsi and Coke.
    - The drinks carry no brand label, but they remain visually distinguishable.
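
The shoe example can be turned into a small simulation to show why removing person-to-person variability helps. This is a sketch under made-up numbers: the activity distribution, the measurement noise, and the true sole effect are all hypothetical and chosen only to illustrate the point.

```python
import random
import statistics

def simulate_shoe_test(n_people=100, sole_effect=0.5, trials=1000, seed=0):
    """Return the spread (std dev) of the estimated sole effect under each design."""
    rng = random.Random(seed)
    ab_estimates, paired_estimates = [], []

    def noise():
        # Measurement noise on a single shoe (hypothetical units).
        return rng.gauss(0, 0.3)

    for _ in range(trials):
        # Wear is dominated by how active each person is (hypothetical units).
        activity = [rng.gauss(10, 3) for _ in range(n_people)]
        half = n_people // 2
        # A/B design: half the people wear sole A, the other half wear sole B.
        wear_a = [act + noise() for act in activity[:half]]
        wear_b = [act - sole_effect + noise() for act in activity[half:]]
        ab_estimates.append(statistics.mean(wear_a) - statistics.mean(wear_b))
        # Interleaved design: everyone wears A on one foot and B on the other,
        # so the person's activity cancels out of the difference.
        diffs = [(act + noise()) - (act - sole_effect + noise()) for act in activity]
        paired_estimates.append(statistics.mean(diffs))
    return statistics.stdev(ab_estimates), statistics.stdev(paired_estimates)

ab_sd, paired_sd = simulate_shoe_test()
print(f"A/B estimate spread: {ab_sd:.3f}, interleaved estimate spread: {paired_sd:.3f}")
```

With these made-up numbers the interleaved estimate is far less noisy than the A/B estimate for the same 100 people, which is the intuition behind faster convergence.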
For some tests interleaving is not possible.
- In the shoe example, if we are comparing two different shoe designs instead of two soles
- In search systems, if we are changing UI elements
Two tests Netflix did while introducing a new interleaving system
1. Compare sensitivity against traditional A/B tests [0]
- Before the test, it was already known that algorithm B was better than A.
- They measured the number of impressions each method took to converge.

2. Correlate interleaving results with A/B test results [0]
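
One simple way to decide whether an interleaving comparison has "converged" is a sign test on per-impression winners. The sketch below is an illustrative assumption, not the procedure from [0]: it counts how often each ranker won an impression (ties dropped) and computes an exact two-sided binomial p-value.

```python
from math import comb

def sign_test_p_value(wins_a, wins_b):
    """Two-sided exact binomial test of H0: each ranker is equally likely to win."""
    n = wins_a + wins_b  # tied impressions are assumed to be dropped beforehand
    k = max(wins_a, wins_b)
    # Probability of a result at least this lopsided in one direction under H0.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical win counts, for illustration only.
print(sign_test_p_value(60, 40))
print(sign_test_p_value(600, 400))
```

The same 60/40 preference becomes much more convincing as the win counts grow, so the number of impressions needed to cross a chosen p-value threshold is one way to express sensitivity.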

On Metrics
- Search systems have business metrics like conversion and revenue
- Netflix has business metrics like retention and streaming hours
Reference
[0] Netflix blog


