a/b tests | Data Stories

This post summarises some concepts around interleaving. My main takeaway is that in some cases interleaving can be conclusive when an a/b test is not (for example, a search improvement that affects only a small % of queries), along with the intuition for why it converges faster. It sits within a broader topic, design of experiments. Interleaving is generally used in search systems and recommender systems, both of which come under the umbrella of retrieval systems.

What is interleaving?

It is a method for evaluating ranking systems.
To compare the effectiveness of two rankers, we merge their results into a single list and observe how customers interact with it.
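
The post does not say how the merged list is built. One common scheme is team-draft interleaving, sketched below as an assumption rather than as the method any particular system uses: the two rankers take turns picking their best remaining item, a coin flip breaks ties over who picks next, and each click is credited to the ranker that contributed the clicked item. The names `team_draft_interleave` and `credit_clicks`, and the document ids, are hypothetical.

```python
import random


def team_draft_interleave(ranking_a, ranking_b, rng=None):
    """Merge two rankings; return (merged_list, team) where team maps item -> 'A' or 'B'."""
    rng = rng or random.Random()
    sources = {"A": ranking_a, "B": ranking_b}
    picks = {"A": 0, "B": 0}
    merged, team = [], {}
    total = len(set(ranking_a) | set(ranking_b))
    while len(merged) < total:
        # The ranker with fewer picks goes next; a coin flip breaks ties.
        if picks["A"] != picks["B"]:
            side = "A" if picks["A"] < picks["B"] else "B"
        else:
            side = rng.choice("AB")
        # Take that ranker's best item that is not already in the merged list.
        candidate = next((x for x in sources[side] if x not in team), None)
        if candidate is None:
            # This ranker has nothing new left, so the other one contributes.
            side = "B" if side == "A" else "A"
            candidate = next(x for x in sources[side] if x not in team)
        merged.append(candidate)
        team[candidate] = side
        picks[side] += 1
    return merged, team


def credit_clicks(clicked_items, team):
    """Count clicks per ranker, based on which ranker contributed each clicked item."""
    wins = {"A": 0, "B": 0}
    for item in clicked_items:
        if item in team:
            wins[team[item]] += 1
    return wins


# Hypothetical rankings from two rankers for the same query.
ranking_a = ["d1", "d2", "d3", "d4"]
ranking_b = ["d3", "d1", "d5", "d2"]
merged, team = team_draft_interleave(ranking_a, ranking_b, random.Random(0))
wins = credit_clicks([merged[0]], team)
```

Across many queries, the ranker that collects more credited clicks is taken to be the preferred one.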

Why do we need interleaving?

It allows us to experiment faster.
- It converges faster than a/b tests
For search systems, an a/b test often might not converge at all
- Say an improvement affects only a small % of search queries
- Say the search system has a low conversion rate, e.g. 4%. How do we measure an algorithm that improves it by 0.5% on top of that?
- Lots of impressions are needed for an a/b test to converge.
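
To get a feel for "lots of impressions", here is a rough sketch of the standard sample-size approximation for comparing two proportions. It reads the 4% as a conversion rate and tries both readings of "0.5%" (absolute points and relative lift). The 5% significance level and 80% power are conventional defaults chosen for illustration, and `users_per_arm` is a hypothetical helper name.

```python
import math
from statistics import NormalDist


def users_per_arm(p_control, p_treatment, alpha=0.05, power=0.8):
    """Approximate sample size per arm for a two-sided two-proportion z-test."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the significance level
    z_beta = z.inv_cdf(power)           # critical value for the desired power
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    effect = p_treatment - p_control
    return math.ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)


# Reading 1: conversion moves from 4% to 4.5% (0.5 percentage points).
n_absolute = users_per_arm(0.04, 0.045)

# Reading 2: conversion improves by 0.5% relative, i.e. 4% to 4.02%.
n_relative = users_per_arm(0.04, 0.04 * 1.005)

print(n_absolute, n_relative)
```

Under these assumptions the first reading needs tens of thousands of users per arm and the second needs on the order of ten million, which is the sense in which an a/b test may never converge in practice.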

Illustrative Examples

Shoe
- Suppose that for a given shoe design we have two compelling soles, and we want to measure which one resists wear and tear better
- We have 100 people to test on
- option 1 – classic a/b test
  - 50 people get each type of sole
  - Two sources of variability
    - How active the person testing the shoes is
    - Quality of sole
- option 2 – interleaved test
  - Each person gets a different sole on the left and right shoe
  - Just one source of variability
    - Quality of sole
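
A small simulation can make the shoe intuition concrete. Everything numeric here is made up for illustration: the wear rates, the activity range and the noise level are assumptions, and sole B is defined to be truly better. The sketch counts how often each design picks the better sole with the same 100 people.

```python
import random

WEAR_A, WEAR_B = 1.00, 0.95  # hypothetical wear per unit of activity; B is truly better


def wear(rate, activity, rng):
    """Observed wear: sole quality times how active the wearer is, plus measurement noise."""
    return rate * activity + rng.gauss(0, 0.05)


def ab_test_picks_b(rng, people=100):
    """Classic a/b test: half the people wear sole A, the other half wear sole B."""
    activity = [rng.uniform(0.5, 2.0) for _ in range(people)]
    half = people // 2
    wear_a = [wear(WEAR_A, a, rng) for a in activity[:half]]
    wear_b = [wear(WEAR_B, a, rng) for a in activity[half:]]
    return sum(wear_b) / len(wear_b) < sum(wear_a) / len(wear_a)


def interleaved_picks_b(rng, people=100):
    """Interleaved test: every person wears one sole of each type."""
    activity = [rng.uniform(0.5, 2.0) for _ in range(people)]
    # The same activity level applies to both soles, so it largely cancels in the difference.
    diffs = [wear(WEAR_A, a, rng) - wear(WEAR_B, a, rng) for a in activity]
    return sum(diffs) > 0


rng = random.Random(42)
trials = 500
ab_rate = sum(ab_test_picks_b(rng) for _ in range(trials)) / trials
interleaved_rate = sum(interleaved_picks_b(rng) for _ in range(trials)) / trials
print(f"a/b picks the better sole in {ab_rate:.0%} of trials")
print(f"interleaving picks the better sole in {interleaved_rate:.0%} of trials")
```

With these made-up parameters the interleaved design identifies the better sole far more reliably, because person-to-person differences in activity no longer hide the difference between soles.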
Soda [0]
- We want to test which is more popular in the population – Pepsi or Coke?
- option 1 – classic a/b test
  - Split the population into two groups
    - One group receives Pepsi
    - The other receives Coke
  - Compare soda consumption between the two groups
  - Variability
    - Soda consumption habits vary widely
    - Heavy soda consumers are a small % of the population but account for a large % of consumption
- option 2 – interleaving
  - Allow each person to choose either Pepsi or Coke.
  - The drinks are not labelled, but they can be visually distinguishable
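
One way to see why letting each person choose helps: a within-person preference gives every person one vote, so a single heavy consumer cannot dominate the result the way they do in pooled consumption totals. The sketch below uses invented counts and a two-sided sign test. It is an illustration of the idea, not the analysis described in [0], and the function names are hypothetical.

```python
from math import comb


def pooled_totals(consumption):
    """Total cans of (Coke, Pepsi) across everyone; heavy consumers dominate this."""
    return sum(c for c, _ in consumption), sum(p for _, p in consumption)


def per_person_votes(consumption):
    """Each person votes once for the drink they chose more often; ties cast no vote."""
    coke_votes = sum(1 for c, p in consumption if c > p)
    pepsi_votes = sum(1 for c, p in consumption if p > c)
    return coke_votes, pepsi_votes


def sign_test_p_value(wins, losses):
    """Two-sided sign test: chance of a split at least this lopsided under a fair coin."""
    n = wins + losses
    k = max(wins, losses)
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Invented data: (Coke cans, Pepsi cans) per person.
# Nine light consumers lean towards Coke, one heavy consumer drinks mostly Pepsi.
consumption = [(2, 1)] * 9 + [(5, 100)]

totals = pooled_totals(consumption)    # pooled volume favours Pepsi
votes = per_person_votes(consumption)  # per-person preference favours Coke
p_value = sign_test_p_value(*votes)
print(totals, votes, round(p_value, 4))
```

In this toy data the pooled totals point to Pepsi purely because of one heavy drinker, while nine of ten people prefer Coke.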

For some tests interleaving is not possible:

- In the shoe example, if we are comparing two different shoe designs instead of two soles
- In search systems, if we are changing UI elements

Two tests Netflix ran while introducing its new interleaving system

1. Compare sensitivity against traditional a/b tests [0]
  - Before the test it was already known that algorithm B is better than A
  - They measured the number of impressions it took to converge

2. Correlation of interleaving results with a/b tests [0]

On Metrics

Search systems have business metrics like conversion and revenue
Netflix has business metrics like retention and streaming hours

Reference

[0] Netflix blog
