Counting a Billion Unique Items with Almost No Memory

The Problem: Why Exact Counting Fails at Scale

Counting unique elements sounds trivial, but exact counting breaks down at scale because it has to remember every distinct element seen so far. For instance, a stream of 1 billion distinct 64-bit integers needs roughly 8 GB for the raw values alone, before any hash-table overhead. The CVM algorithm sidesteps this by estimating the number of unique elements from a small, fixed-size buffer.
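The 8 GB figure is simple arithmetic. The sketch below assumes the worst case, where every element in the stream is distinct, and counts only the raw values; the variable names are illustrative.

```python
# Back-of-the-envelope memory estimate for exact counting.
# Assumption: every element in the stream is distinct (worst case).
num_elements = 1_000_000_000   # 1 billion elements
bytes_per_element = 8          # one 64-bit integer

raw_bytes = num_elements * bytes_per_element
raw_gb = raw_bytes / 10**9     # decimal gigabytes

print(f"Raw storage: {raw_gb:.0f} GB")  # prints "Raw storage: 8 GB"
```

A real hash set needs more than this, because it also stores table slots and, in Python, a full object per integer.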

Why This Matters

The CVM algorithm targets a common bottleneck in data processing: exact distinct counting needs memory proportional to the number of unique elements. Deduplicating with a hash table therefore keeps growing on large streams, which leads to out-of-memory errors or severe performance degradation. CVM instead returns a probabilistic estimate, so accuracy can be traded against memory by choosing the buffer size.
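For comparison, the exact approach can be sketched in a few lines. The helper name `count_unique_exact` is made up for this illustration, and `sys.getsizeof` reports only the set's own hash table, not the objects it references, so the real footprint is larger than what is printed.

```python
import sys


def count_unique_exact(stream) -> int:
    """Exact distinct count: keeps every distinct element in a set."""
    seen = set()
    for element in stream:
        seen.add(element)
    return len(seen)


print(count_unique_exact([3, 1, 3, 2, 1]))  # prints 3

# The set container grows with the number of distinct elements it holds.
small = set(range(1_000))
large = set(range(100_000))
print(sys.getsizeof(small), sys.getsizeof(large))
```

The second line of output shows the container size rising with the number of distinct elements, which is the growth that a fixed-size buffer avoids.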

Key Insights

The CVM algorithm can reach roughly 98% accuracy with a few kilobytes of memory: the buffer size is fixed up front and determines the accuracy, whereas exact counting needs memory that grows with the number of unique elements.
The algorithm uses stochastic sampling with a geometrically shrinking probability: in round k each element is kept with probability 2^-k, so the buffer holds a random sample of the distinct elements and the count is estimated as the buffer size times 2^k, without storing all elements.
The HyperLogLog algorithm, a widely used probabilistic counting algorithm, is more complex than CVM: HyperLogLog relies on a hash function and an array of registers with bias corrections, while CVM needs only a set and a random number generator. HyperLogLog is typically more compact for a given accuracy, so CVM's main advantage is simplicity.
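The sampling idea can be checked in isolation. The sketch below is not the full algorithm: it assumes a known number of distinct items, keeps each one with probability 2^-k, and scales the survivor count back up by 2^k. The function name and parameters are illustrative.

```python
import random


def sampled_estimate(n_distinct: int, round_index: int, rng: random.Random) -> float:
    """Keep each of n_distinct items with probability 2**-round_index,
    then scale the number of survivors back up."""
    keep_probability = 0.5 ** round_index
    survivors = sum(1 for _ in range(n_distinct) if rng.random() < keep_probability)
    return survivors / keep_probability


rng = random.Random(0)  # fixed seed so repeated runs agree
true_count = 100_000
estimates = [sampled_estimate(true_count, 5, rng) for _ in range(20)]
mean_estimate = sum(estimates) / len(estimates)
print(f"true={true_count} mean estimate={mean_estimate:.0f}")
```

Each run keeps only about 1 in 32 of the items, yet the scaled-up estimates cluster around the true count, which is why the buffer size times 2^k works as an estimator.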

Working Example

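import random


class AdaptiveCVMCounter:
    """Approximate distinct-element counter based on the CVM algorithm."""

    def __init__(self, initial_size: int = 100, max_size: int = 1000):
        self.memory_size = initial_size  # buffer size that triggers a new round
        self.max_size = max_size  # stored, but not used by this minimal version
        self.memory: set = set()
        self.current_round = 0  # elements are kept with probability 2 ** -current_round

    def process_element(self, element) -> None:
        # Forget any earlier decision about this element, then re-admit it
        # with the current sampling probability, so that only its latest
        # occurrence decides whether it is in the buffer.
        self.memory.discard(element)
        if random.random() < 0.5 ** self.current_round:
            self.memory.add(element)
            if len(self.memory) >= self.memory_size:
                self._start_new_round()

    def _start_new_round(self) -> None:
        # Keep each buffered element independently with probability 1/2.
        self.memory = {x for x in self.memory if random.random() < 0.5}
        self.current_round += 1

    def estimate_unique_count(self) -> int:
        # Each survivor stands for 2 ** current_round distinct elements.
        return len(self.memory) * (2 ** self.current_round)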
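To get a feel for the accuracy, here is a self-contained sketch that compares an estimate against the exact count on synthetic data. `cvm_estimate` is a hypothetical helper written for this illustration, following the standard CVM steps (drop the element, re-admit it with the current probability, halve the buffer when it fills up). The stream size, buffer size, and seed are arbitrary choices.

```python
import random


def cvm_estimate(stream, buffer_size: int, rng: random.Random) -> int:
    """Estimate the number of distinct elements in `stream` (CVM-style sketch)."""
    buffer = set()
    keep_probability = 1.0
    for element in stream:
        # Only the latest occurrence of an element decides its membership.
        buffer.discard(element)
        if rng.random() < keep_probability:
            buffer.add(element)
        if len(buffer) >= buffer_size:
            # Buffer full: evict each element with probability 1/2.
            buffer = {x for x in buffer if rng.random() < 0.5}
            keep_probability /= 2
    return round(len(buffer) / keep_probability)


rng = random.Random(42)  # fixed seed for reproducibility
stream = [rng.randrange(50_000) for _ in range(500_000)]  # many repeats
exact = len(set(stream))
estimate = cvm_estimate(stream, buffer_size=2_000, rng=rng)
relative_error = abs(estimate - exact) / exact
print(f"exact={exact} estimate={estimate} error={relative_error:.2%}")
```

The buffer never holds more than 2,000 elements, while the exact set holds tens of thousands. A larger `buffer_size` should tighten the estimate at the cost of more memory.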

Practical Applications

Use Case: Counting unique users in a web analytics application without storing every user identifier; only a small random sample stays in memory.
Pitfall: Using exact counting methods on large datasets, which leads to out-of-memory errors or performance degradation.

References:

https://github.com/RMANOV/Number-of-Unique-Elements-Prediction
Chakraborty, Vinodchandran, Meel (2024) - Original CVM paper
Flajolet, Martin — “Probabilistic Counting Algorithms for Data Base Applications” (1985)
Flajolet, Fusy, Gandouet, Meunier — “HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm” (2007)
