Where it is used
- Suppose you have a streaming data and you want to randomly sample from it. You don’t know how many items will be coming in.
- You have large set of items to sample from and you want to do it in single pass
How it is done
- Suppose you want to sample 1000 items.[1]
- You take first 1000 items and put it into reservoir
- Next you will take 1001th item with probability 1000/1001
- You take a random number and if it is less than 1000/1001, you add this item to reservoir
- Remember the CDF trick
- When you add this item, you randomly remove any other item from reservoir
Alternative
- Pick items from stream, generate a random no and put it in priority queue
- This is how order by rand() in sql works [1]
References
[1] https://gregable.com/2007/10/reservoir-sampling.html
[2] https://www.youtube.com/watch?v=Ybra0uGEkpM (proof)

