Segmentation Detector: Three-Way Routing

The Problem: Which Compression Head?

Different data patterns need different algorithms. HoloCodec analyzes each segment and routes it to the best compression head:

  • REPEAT: Constant runs such as 0x20 0x20 0x20... (spaces, nulls, padding)
  • STATIC: Stationary distribution (English text, structured data)
  • ADAPTIVE: Shifting distribution (mixed content, transitions)

Step 1: Boundary Detection (FA-CVM Sliding Window)

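// Slide a 2048-byte window across the chunk with a 256-byte stride
for (pos = 0; pos < chunk.Length; pos += 256) {
    end = Math.Min(pos + 2048, chunk.Length);  // clamp the final windows
    window = chunk[pos..end];
    entropy = FACVM.EstimateEntropy(window);
    distinct = FACVM.EstimateDistinct(window);

    // Detect regime change: entropy or distinct-count jump vs. the previous window
    if (pos > 0 &&
        (Math.Abs(entropy - lastEntropy) > 0.15 ||
         Math.Abs(distinct - lastDistinct) > 8)) {
        cuts.Add(pos);  // Mark segment boundary
    }

    // Remember this window's statistics for the next comparison
    lastEntropy = entropy;
    lastDistinct = distinct;
}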

Output: Segment boundaries at positions where statistical properties change (e.g., text→binary, code→whitespace)
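
The sliding-window scan above can be sketched in Python. This is a minimal sketch under stated assumptions, not HoloCodec's implementation: exact Shannon entropy and an exact distinct-byte count stand in for the FA-CVM estimators, whose internals are not described here, and the names shannon_entropy and find_cuts are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(window: bytes) -> float:
    """Exact Shannon entropy in bits per byte (assumed stand-in for FA-CVM)."""
    if not window:
        return 0.0
    n = len(window)
    return -sum(c / n * math.log2(c / n) for c in Counter(window).values())


def find_cuts(chunk: bytes, window_size: int = 2048, stride: int = 256,
              entropy_jump: float = 0.15, distinct_jump: int = 8) -> list[int]:
    """Return positions where window statistics jump relative to the previous window."""
    cuts = []
    last_entropy = last_distinct = None
    for pos in range(0, len(chunk), stride):
        window = chunk[pos:pos + window_size]  # slicing clamps at the end of the chunk
        entropy = shannon_entropy(window)
        distinct = len(set(window))  # exact count; FA-CVM would estimate this
        if last_entropy is not None and (
            abs(entropy - last_entropy) > entropy_jump
            or abs(distinct - last_distinct) > distinct_jump
        ):
            cuts.append(pos)
        last_entropy, last_distinct = entropy, distinct
    return cuts


# 4096 spaces followed by a uniform byte cycle: the first cut lands where
# a window first overlaps the new content.
cuts = find_cuts(b" " * 4096 + bytes(range(256)) * 16)
```

Note that a single transition produces a run of consecutive cuts while the window straddles it; a real detector would presumably merge or debounce these, which the text above does not specify.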

Detecting REPEAT Segments

purity = maxFreq / totalBytes;  // share of the most frequent byte

if (purity >= 0.985) {
    // Near-constant run
    if (purity >= 0.9999)
        return REPEAT; // Pure run

    // Check whether the distribution matches a prior constant table
    kl = KL_divergence(current, priorTable);  // in bits
    if (kl <= 0.03)
        return REPEAT; // Reuse table
}

Criteria:

  • Purity ≥ 98.5% (one byte dominates) is required
  • Then either purity ≥ 99.99% (pure run) or KL-divergence ≤ 0.03 bits vs a prior constant table
  • Examples: spaces (0x20), nulls (0x00), newlines (0x0A)
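
The REPEAT test can be sketched in Python as follows. This is an illustrative sketch only: the helper names (byte_distribution, kl_divergence_bits, is_repeat), the representation of a prior table as a 256-entry probability list, and the eps floor for bytes missing from the prior table are all assumptions, not details taken from HoloCodec.

```python
import math
from collections import Counter


def byte_distribution(data: bytes) -> list[float]:
    """Relative frequency of each of the 256 byte values."""
    counts = Counter(data)
    n = len(data)
    return [counts.get(b, 0) / n for b in range(256)]


def kl_divergence_bits(p, q, eps: float = 1e-9) -> float:
    """D(p || q) in bits; q is floored at eps (assumption) so the result stays finite."""
    return sum(pi * math.log2(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


def is_repeat(segment: bytes, prior_table=None) -> bool:
    """Apply the purity and KL thresholds described above."""
    if not segment:
        return False
    p = byte_distribution(segment)
    purity = max(p)
    if purity < 0.985:
        return False  # no dominant byte
    if purity >= 0.9999:
        return True  # pure run
    # Near-constant run: REPEAT only if it matches a prior constant table
    return prior_table is not None and kl_divergence_bits(p, prior_table) <= 0.03


padding = b" " * 990 + b"x" * 10            # purity 0.99
similar = byte_distribution(b" " * 989 + b"x" * 11)
print(is_repeat(b"\x00" * 1000))            # pure run
print(is_repeat(padding))                   # near-constant, no table to reuse
print(is_repeat(padding, similar))          # near-constant, close to prior table
```

A 99%-pure segment is only classified as REPEAT when a sufficiently close prior table exists; otherwise it falls through to the later tests.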

Detecting STATIC Segments

// Sample center 25% of segment
centerSample = segment[3*len/8 .. 5*len/8];
histogram = BuildHistogram(centerSample);

// Check stationarity: split into thirds and compare per-third entropies
variance = Var(H_third1, H_third2, H_third3);

if (variance < 0.01 && 
    (entropy < 5.2 || purity >= 0.25)) {
    return STATIC; // Structured & stationary
}

Criteria:

  • Low entropy variance (< 0.01) across segment thirds
  • Entropy < 5.2 bits (structured) OR purity ≥ 25%
  • Examples: English text, XML, source code
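
A minimal Python sketch of the stationarity test follows. It assumes exact Shannon entropy, population variance for Var (the text does not say whether population or sample variance is meant), and entropy and purity measured on the center 25% sample; the function names are hypothetical.

```python
import math
from collections import Counter
from statistics import pvariance


def entropy_bits(data: bytes) -> float:
    """Shannon entropy in bits per byte."""
    n = len(data)
    if n == 0:
        return 0.0
    return -sum(c / n * math.log2(c / n) for c in Counter(data).values())


def purity(data: bytes) -> float:
    """Share of the most frequent byte."""
    return max(Counter(data).values()) / len(data)


def is_static(segment: bytes) -> bool:
    n = len(segment)
    if n < 3:
        return False
    third = n // 3
    thirds = [segment[:third], segment[third:2 * third], segment[2 * third:]]
    # Population variance of the three per-third entropies (assumption)
    variance = pvariance([entropy_bits(t) for t in thirds])
    # Center 25% of the segment; fall back to the whole segment if it is tiny
    center = segment[3 * n // 8: 5 * n // 8] or segment
    h = entropy_bits(center)
    return variance < 0.01 and (h < 5.2 or purity(center) >= 0.25)


text = b"the quick brown fox jumps over the lazy dog " * 100
uniform = bytes(range(256)) * 16
shifting = b"a" * 1024 + bytes(range(256)) * 8
print(is_static(text), is_static(uniform), is_static(shifting))
```

The uniform byte cycle is stationary but fails the structure test (8 bits of entropy, purity 1/256), while the shifting input fails the variance test.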

Detecting ADAPTIVE Segments

// Non-stationary (high variance)
variance = Var(H₁, H₂, H₃);
if (variance >= 0.01)
    return ADAPTIVE;

// High entropy + diffuse (stationary but near-random)
if (entropy >= 5.2 &&
    purity < 0.25 && variance < 0.01)
    return ADAPTIVE; // Near-random

Criteria (fallback mode):

  • Non-stationary: Entropy variance ≥ 0.01 (distribution shifts across segment)
  • High-entropy stationary: H ≥ 5.2 bits, low purity (< 25%), diffuse distribution
  • Rejection path: Fails REPEAT purity test, fails STATIC stationarity/structure test
  • Examples: Compressed data, encrypted data, format transitions, mixed content

Complete Decision Tree

for each segment:
    1. Sample center 25% (min 256 bytes, max 4-16KB)
    2. Build histogram, compute: purity, entropy (H), distinct (F₀)
    
    3. IF purity >= 98.5%:
       → Check KL vs prior constant tables
       → IF KL <= 0.03 bits → REPEAT (reuse table)
       → IF purity >= 99.99% → REPEAT (new pure run)
       → ELSE → STATIC (dominant-byte distribution)
    
    4. ELSE:
       → Split segment into thirds, compute H₁, H₂, H₃
       → variance = Var(H₁, H₂, H₃)
       
       → IF variance < 0.01 AND (H < 5.2 OR purity >= 0.25):
          → Check KL vs prior static tables
          → IF KL <= 0.08 bits → STATIC (reuse table)
          → ELSE → STATIC (new table)
       
       → ELSE IF variance < 0.01 AND H >= 5.2:
          → ADAPTIVE (stationary but high-entropy)
       
       → ELSE:
          → ADAPTIVE (non-stationary)
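
The decision tree above can be sketched end to end in Python. This is a hedged sketch rather than HoloCodec's code: tables are assumed to be 256-entry probability lists, Var is taken as population variance, the upper cap on the sample size is omitted because the text gives only a range for it, and all function names are hypothetical.

```python
import math
from collections import Counter
from statistics import pvariance


def entropy_bits(data: bytes) -> float:
    n = len(data)
    if n == 0:
        return 0.0
    return -sum(c / n * math.log2(c / n) for c in Counter(data).values())


def distribution(data: bytes) -> list[float]:
    counts = Counter(data)
    return [counts.get(b, 0) / len(data) for b in range(256)]


def kl_bits(p, q, eps: float = 1e-9) -> float:
    """D(p || q) in bits, with q floored at eps (assumption)."""
    return sum(pi * math.log2(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


def classify(segment: bytes, constant_tables=(), static_tables=()):
    """Return (head, reuses_prior_table) following the decision tree above."""
    if not segment:
        raise ValueError("empty segment")
    n = len(segment)
    lo, hi = 3 * n // 8, 5 * n // 8
    # Step 1: center 25%, or the whole segment if that would be under 256 bytes
    sample = segment[lo:hi] if hi - lo >= 256 else segment
    # Step 2: histogram-derived statistics
    p = distribution(sample)
    purity, h = max(p), entropy_bits(sample)
    # Step 3: dominant-byte path
    if purity >= 0.985:
        if any(kl_bits(p, t) <= 0.03 for t in constant_tables):
            return "REPEAT", True
        if purity >= 0.9999:
            return "REPEAT", False
        return "STATIC", False
    # Step 4: stationarity across thirds
    third = n // 3
    variance = pvariance(
        [entropy_bits(segment[i * third:(i + 1) * third]) for i in range(3)]
    )
    if variance < 0.01 and (h < 5.2 or purity >= 0.25):
        return "STATIC", any(kl_bits(p, t) <= 0.08 for t in static_tables)
    return "ADAPTIVE", False  # high-entropy stationary or non-stationary


text = b"the quick brown fox jumps over the lazy dog " * 100
print(classify(b" " * 4096))
print(classify(text))
print(classify(bytes(range(256)) * 16))
```

Both ADAPTIVE branches of the tree collapse into the final return here, since they select the same head; a caller that needs the reason could return it as a third tuple element.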