FA-CVM: Fast Adaptive Context Variance Model
The Problem with Standard CVM
- CVM only counts distinct elements — it estimates cardinality (how many unique symbols)
- CVM doesn't track frequencies — it can't tell you how often each symbol appears
- Can't compute entropy directly — entropy requires P(x) = frequency/total, not just "is x present?"
- No histogram replacement — standard CVM gives you {a, b, c} but not {a:50, b:30, c:20}
David's Solution: FA-CVM Extension
Modified the Chakraborty-Variyam-Meel algorithm to add sparse frequency tracking while maintaining O(1/ε) memory complexity
// Standard CVM: only tracks existence
sampleSet.Add(hash); // Just a HashSet<uint>
// FA-CVM: tracks existence + frequency
sampleSet.Add(hash);
sparseFrequencies[element]++; // Dictionary<byte, int>
Algorithm Overview
- Sample Selection: Use CVM hash-based sampling (keep element if hash < ε × uint.MaxValue)
- Frequency Tracking: For sampled elements, maintain counter in Dictionary<byte, int>
- Incremental Updates: If element already sampled, increment its frequency; if new and passes threshold, add it
- Downsampling: When sample set exceeds threshold, remove elements with hash ≥ ε and their frequencies
Entropy Estimation
- Sampled Entropy: Calculate H = -Σ p(x) log₂ p(x) for tracked frequencies
- Distinct Extrapolation: Estimate total distinct symbols using CVM formula (samples/ε)
- Unseen Contribution: Estimate entropy from unsampled symbols assuming uniform distribution
- Combined Estimate: Add sampled + extrapolated entropy components
Implementation Highlights
Memory Complexity
O(1/ε)
Default ε = 0.1 → 256 sample threshold vs 256 histogram bins
Hash Function
FNV-1a
Fast non-cryptographic hash for probabilistic sampling
Mode Flag
trackFrequencies
Constructor parameter enables FA-CVM sparse frequency tracking
Key Differences: CVM vs FA-CVM
Standard CVM Output:
Estimated Distinct: 127
Sample Set Size: 243
(No frequency information)
FA-CVM Output:
Estimated Distinct: 127
Estimated Entropy: 6.82 bits
Tracked Frequencies: {a:1523, e:981, ...}
(Full frequency map available)