Sampling plays a central role in data science and statistics, allowing us to extract a subset from a larger population. One efficient method is reservoir sampling, which selects a fixed number of items (k) from a dataset or stream of size (n).
This article aims to present an overview of the reservoir sampling algorithm and demonstrate its implementation in C++. Reservoir sampling proves handy when drawing a sample from a massive dataset that cannot be stored entirely in memory. It maintains a fixed-size reservoir while processing each item sequentially from the stream.
We will delve into the rationale and insight behind reservoir sampling before showcasing a code snippet that illustrates its implementation in C++. The code covers initializing the reservoir, probabilistically replacing elements, and displaying the output, demonstrating how the reservoir consistently maintains a uniform sample as more data points are processed from the stream.
Comprehending reservoir sampling holds value in domains like analysis, A/B testing, recommendation systems and handling large real-world datasets effectively.
The C++ implementation showcases how the algorithm can be applied in a general-purpose programming language. Let's kick things off by diving into the idea of reservoir sampling.
What is the Reservoir Sampling Algorithm?
Reservoir sampling aims to select k items from a pool of n items, where n may be unknown. This technique enables you to acquire a uniformly distributed random sample of k items from a potentially vast dataset in just one traversal through the data.
The key ideas behind reservoir sampling are:
- Maintain a reservoir of size k to store the random sample
- Fill the reservoir with the first k items from the data stream
- For subsequent items, randomly replace an item in the reservoir with the new item with probability k/i, where i is the index of the new item.
This gives each item an equal probability k/n of ending up in the reservoir. Thus, we obtain a uniform random sample.
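For completeness, the k/n claim can be confirmed with a short induction sketch (our addition, using 1-based item indices): item i enters the reservoir with probability k/i, and any given reservoir slot is overwritten at a later step m with probability (k/m)(1/k) = 1/m.

```latex
% Item i > k must enter at step i and then survive every later step m:
\[
P(\text{item } i \in \text{reservoir})
  = \frac{k}{i}\prod_{m=i+1}^{n}\Bigl(1-\frac{1}{m}\Bigr)
  = \frac{k}{i}\cdot\frac{i}{i+1}\cdot\frac{i+1}{i+2}\cdots\frac{n-1}{n}
  = \frac{k}{n}, \qquad i > k.
\]
% The first k items start in the reservoir and must survive steps k+1..n:
\[
P(\text{item } i \in \text{reservoir})
  = \prod_{m=k+1}^{n}\Bigl(1-\frac{1}{m}\Bigr)
  = \frac{k}{k+1}\cdot\frac{k+1}{k+2}\cdots\frac{n-1}{n}
  = \frac{k}{n}, \qquad i \le k.
\]
```

The products telescope, so every item, early or late, ends up with the same probability k/n.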
The algorithm works as follows:
- Initialize a reservoir array of size k to store the random sample
- Insert the first k items from the stream into the reservoir array
- For each item i, where i > k:
  - Generate a random number between 0 and 1
  - If the random number < k/i, replace a random element in the reservoir with item i
  - Otherwise, ignore item i
- After processing all n items, the reservoir array contains the k randomly chosen items.
The probability k/i gradually decreases as i increases, so each item is less likely to replace an existing item in the reservoir. This gives each item an equal probability of selection. The reservoir maintains a uniform sample as more items are processed.
The reservoir sampling algorithm is efficient, requiring only O(n) time to process n items with a fixed-size reservoir of k items, and it makes just a single pass over the data. This makes it useful for large data streams that cannot be stored entirely in memory.
Applications and where it is used
- Dealing with large data streams: Reservoir sampling is useful when you must sample from a massive data stream that cannot fit entirely in memory. It only requires a fixed-size reservoir and processes items sequentially.
- Unknown data size: The algorithm works even if the number of items n is unknown beforehand. It maintains the uniform random sample in the reservoir regardless of n.
- Obtaining representative samples: The reservoir contains a random subset statistically representative of the larger population. It is useful for downstream analysis.
- A/B testing: Reservoir sampling allows users to be divided into uniform random samples for controlled experiments. For example, testing new features on a sample of users.
- Recommendation systems: By uniformly sampling users, items, or product ratings, the recommendation model can be trained on an unbiased sample to recommend content.
- Machine learning: Training ML models on a random subset of data points reduces bias and leads to better model generalization.
- Data analytics: Analyzing a random sample facilitates computing statistics or visualizations that correctly reflect the complete data distribution.
- Time series sampling: The reservoir algorithm can sample fixed intervals from streams like server logs, stock ticker data, or sensor readings over time.
Overall, reservoir sampling provides a way to obtain a small uniform sample from a continuously generated big data stream. It is useful for statistical sampling and analysis of massive real-world data in many domains.
Example:
C++ code for the Reservoir Sampling algorithm to randomly select k items from a stream of unknown size n.
#include <iostream>
#include <cstdlib>
#include <ctime>
using namespace std;

const int k = 5; // reservoir size
int reservoir[k];

void reservoirSampling(int stream[], int n) {
    srand((unsigned) time(NULL)); // seed the random number generator

    // insert first k items into reservoir[]
    for (int i = 0; i < k; i++) {
        reservoir[i] = stream[i];
    }

    // replace items with gradually decreasing probability k/(i+1)
    for (int i = k; i < n; i++) {
        int j = rand() % (i + 1); // random index in [0, i]
        if (j < k) {
            reservoir[j] = stream[i];
        }
    }
}

int main() {
    int stream[] = {10, 2, 43, 55, 9, 21, 61, 8, 12};
    int n = sizeof(stream) / sizeof(stream[0]);

    reservoirSampling(stream, n);

    cout << "Random k = " << k << " items out of stream are: ";
    for (int i = 0; i < k; i++) {
        cout << reservoir[i] << " ";
    }
    cout << endl;
    return 0;
}
Output (varies between runs, since the generator is seeded with the current time):
Random k = 5 items out of stream are: 55 9 61 8 12
The main concept here is that we randomly swap out elements in the reservoir array with a decreasing likelihood as more items are scanned from the stream. This ensures that items towards the end of the stream have the same chance of being selected as those at the beginning.
Explanation:
- Include the necessary headers and set up constants: cstdlib for rand and srand, ctime to seed srand with the current time, and k as the reservoir size.
- Create a global array of size k to hold the sampled items.
- Define the reservoirSampling function, which takes the stream and its size n as parameters.
- Seed the random number generator with the current time.
- Copy the first k items of the stream directly into the reservoir array; this initializes the reservoir.
- Loop from k to n-1 to handle the remaining stream items.
- For each item i, generate a random number j between 0 and i.
- If j falls between 0 and k-1, replace the item at index j in the reservoir with item i. This swaps out an existing item with likelihood k/(i+1), which decreases as i grows.
- Finally, print the k items in the reservoir array.
We start by filling the reservoir array with the first k items. Then, as we scan the stream from index k to n-1, we randomly swap elements of the reservoir with incoming items. This process guarantees a uniform random selection at every point during replacement.
Conclusion:
The Reservoir Sampling algorithm enables us to select k items randomly from a data stream without knowing the number of items, solving the problem efficiently with a time complexity of O(n) and space complexity of O(k) .
Key points in the C++ implementation:
- It starts by filling a fixed-size reservoir array with the first k stream items.
- The main strategy is replacement with a decreasing probability k/(i+1) as more stream items are scanned, falling from k/(k+1) just after the reservoir is filled down to k/n at the end. This ensures that every item is included in the reservoir with equal probability k/n.
- The functions rand and srand from cstdlib generate the random numbers for the selection strategy; srand is seeded with the current time so each run produces a different sample.
- The code is straightforward and accurately captures the essence of Reservoir Sampling, producing uniform samples even without knowledge of the total population count.
- In essence, this C++ implementation combines standard library functions, O(n) time complexity, O(k) space complexity, probabilistic selection, and simple logic to effectively draw a uniform subset of samples from an arbitrarily sized data stream. It serves as a reference for implementing Reservoir Sampling in real applications.
- The method of reservoir sampling finds use in many scenarios involving data streaming and randomization. For instance, it can be applied to analyze sensor data, monitor search trends on a search engine, simulate financial data flows, randomly pick packets for network monitoring, and more.