Reservoir sampling

Where it is used

  • Suppose you have a streaming data and you want to randomly sample from it. You don’t know how many items will be coming in.
  • You have large set of items to sample from and you want to do it in single pass

How it is done

  • Suppose you want to sample 1000 items.[1]
  • You take first 1000 items and put it into reservoir
  • Next you will take 1001th item with probability 1000/1001
    • You take a random number and if it is less than 1000/1001, you add this item to reservoir
    • Remember the CDF trick
  • When you add this item, you randomly remove any other item from reservoir

Alternative

  • Pick items from stream, generate a random no and put it in priority queue
  • This is how order by rand() in sql works [1]

References

[1] https://gregable.com/2007/10/reservoir-sampling.html

[2] https://www.youtube.com/watch?v=Ybra0uGEkpM (proof)

Leave a comment