FA-CVM: Fast Adaptive Context Variance Model

The Problem with Standard CVM

  • CVM only counts distinct elements — it estimates cardinality (how many unique symbols)
  • CVM doesn't track frequencies — it can't tell you how often each symbol appears
  • Can't compute entropy directly — entropy requires P(x) = frequency/total, not just "is x present?"
  • No histogram replacement — standard CVM gives you {a, b, c} but not {a:50, b:30, c:20}

David's Solution: FA-CVM Extension

Modified the Chakraborty-Variyam-Meel algorithm to add sparse frequency tracking while maintaining O(1/ε) memory complexity

// Standard CVM: only tracks existence
sampleSet.Add(hash);  // Just a HashSet<uint>

// FA-CVM: tracks existence + frequency
sampleSet.Add(hash);
sparseFrequencies[element]++;  // Dictionary<byte, int>

Algorithm Overview

  1. Sample Selection: Use CVM hash-based sampling (keep element if hash < ε × uint.MaxValue)
  2. Frequency Tracking: For sampled elements, maintain counter in Dictionary<byte, int>
  3. Incremental Updates: If element already sampled, increment its frequency; if new and passes threshold, add it
  4. Downsampling: When sample set exceeds threshold, remove elements with hash ≥ ε and their frequencies

Entropy Estimation

  1. Sampled Entropy: Calculate H = -Σ p(x) log₂ p(x) for tracked frequencies
  2. Distinct Extrapolation: Estimate total distinct symbols using CVM formula (samples/ε)
  3. Unseen Contribution: Estimate entropy from unsampled symbols assuming uniform distribution
  4. Combined Estimate: Add sampled + extrapolated entropy components

Implementation Highlights

Memory Complexity
O(1/ε)
Default ε = 0.1 → 256 sample threshold vs 256 histogram bins
Hash Function
FNV-1a
Fast non-cryptographic hash for probabilistic sampling
Mode Flag
trackFrequencies
Constructor parameter enables FA-CVM sparse frequency tracking

Key Differences: CVM vs FA-CVM

Standard CVM Output:
Estimated Distinct: 127
Sample Set Size: 243
(No frequency information)
FA-CVM Output:
Estimated Distinct: 127
Estimated Entropy: 6.82 bits
Tracked Frequencies: {a:1523, e:981, ...}
(Full frequency map available)