Why They Didn't Work

The Harsh Reality of Specialized Heads

Despite promising theory, both XML and English Language heads failed to deliver practical compression improvements

Problem #1: Weak Standalone Performance

When used alone, specialized heads couldn't beat adaptive entropy coders

Benchmark: enwik9 (1 GB Wikipedia XML)

Adaptive Context Mixing:    5.2:1  ✓
Generic Static FSE:          4.8:1  ✓
────────────────────────────────────
XML Head (standalone):       4.1:1  ✗
English Head (standalone):   3.9:1  ✗

Problem: Specialized structure exploitation ≪ adaptive entropy modeling

Compression Gap
-21%
The XML Head underperformed the generic adaptive head by more than a full ratio point (4.1:1 vs 5.2:1)

Why? Adaptive entropy coders (ICM, ISSE) naturally learn structural patterns without explicit parsing. Hand-coded structure detection missed subtle statistical dependencies that context mixing discovered automatically.
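
To make this concrete, here is a minimal sketch in Python (my own toy frequency model, not the project's ICM/ISSE code) of an adaptive order-2 byte model. It never parses tags or words, yet repeated structure quickly drives its predicted code length down, which is exactly the effect the specialized heads tried to engineer by hand.

import math
from collections import defaultdict

def adaptive_code_length(data: bytes, order: int = 2) -> float:
    """Ideal code length (bits/byte) of an adaptive order-N byte model with add-1 smoothing."""
    counts = defaultdict(lambda: defaultdict(int))   # context -> next byte -> count
    totals = defaultdict(int)                        # context -> total observations
    bits = 0.0
    for i, sym in enumerate(data):
        ctx = data[max(0, i - order):i]
        p = (counts[ctx][sym] + 1) / (totals[ctx] + 256)   # predict before updating
        bits -= math.log2(p)
        counts[ctx][sym] += 1                              # then adapt to the byte just seen
        totals[ctx] += 1
    return bits / len(data)

# Repetitive XML-like structure is learned with no explicit parsing at all.
sample = b"<page><title>Example</title><text>Some wiki text.</text></page>\n" * 500
print(f"{adaptive_code_length(sample):.2f} bits/byte")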

Problem #2: Entropy Increase as Preprocessor

When used as a preprocessor stage, they increased entropy above natural language baseline

Entropy Analysis:

Raw English text:        4.5 bits/byte
After XML Head:          5.8 bits/byte  ↑
After English Head:      5.6 bits/byte  ↑

Raw XML document:        3.2 bits/byte
After XML Head:          4.9 bits/byte  ↑

Problem: Preprocessing destroyed natural redundancy patterns

Entropy Inflation
+53%
XML preprocessing increased entropy from 3.2 to 4.9 bits/byte (English text rose 29%, from 4.5 to 5.8)

Why? Dictionary encoding and space folding created uniform token streams that eliminated the low-entropy patterns (repetitive words, predictable spacing) that downstream compressors rely on.
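
The bits/byte figures above are the project's measurements; for reference, an order-0 Shannon entropy estimate of any byte stream (the quantity those figures approximate, assuming a plain frequency-based estimator, which may differ from how the original numbers were produced) takes only a few lines of Python:

import math
from collections import Counter

def entropy_bits_per_byte(data: bytes) -> float:
    """Order-0 Shannon entropy: -sum p(b) * log2(p(b)) over observed byte values."""
    counts = Counter(data)
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Compare a raw input against its preprocessed form (hypothetical file names):
# print(entropy_bits_per_byte(open("article.xml", "rb").read()))
# print(entropy_bits_per_byte(open("article.tokens", "rb").read()))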

Problem #3: Two-Stage Pipeline Inefficiency

The preprocessor + entropy coder pipeline suffered from information loss at the boundary

Intended Pipeline:

Raw XML
  ↓ [XML Head: structure parsing]
Token stream (lower entropy?)
  ↓ [Adaptive Head: entropy coding]
Compressed output

Expected: 6.5:1 compression
Actual Result:

Raw XML (H = 3.2 bits/byte)
  ↓ [XML Head]
Token stream (H = 4.9 bits/byte) 🔥
  ↓ [Adaptive Head struggles]
Compressed output

Achieved: 3.8:1 compression ✗

Root cause: The preprocessor removed human-readable redundancy (tag names, spacing) but created machine-readable randomness (dictionary indices, structure codes). Downstream compressors trained on natural text patterns couldn't adapt to the artificial token distribution.

Pipeline Loss
-41%
The two-stage pipeline achieved 3.8:1 against the expected 6.5:1, and still trailed the direct adaptive baseline of 5.2:1

Attempted Fixes (All Failed)

Fix Attempt #1
Rice-Golomb Coding for dictionary entries — still worse than adaptive baseline
Fix Attempt #2
Multiple dictionary sizes (100, 500, 1000 words) — no configuration beat adaptive
Fix Attempt #3
Hybrid approaches (partial tokenization) — added complexity without gains

Rice-Golomb Experiment:

Dictionary index distribution: Zipfian (common words frequent)
Rice-Golomb parameter k: tuned for top-100 frequency
Result: 4.3:1 compression (vs 5.2:1 adaptive baseline)

Conclusion: Even optimal variable-length coding of dictionary
indices couldn't recover the entropy increase from tokenization.
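
For context on fix attempt #1: a Rice code writes an index n as the unary quotient n >> k followed by the low k bits, so small (frequent) indices get short codes while rare ones grow linearly. A minimal sketch (my own illustration with an arbitrary k, not the experiment's tuned implementation):

def rice_encode(n: int, k: int) -> str:
    """Rice code for a non-negative integer: unary quotient, '0' terminator, k-bit remainder."""
    q, r = n >> k, n & ((1 << k) - 1)
    return "1" * q + "0" + format(r, f"0{k}b")

for idx in (0, 3, 42, 500):                 # hypothetical dictionary indices
    print(idx, rice_encode(idx, k=5))       # 6, 6, 7, and 21 bits respectively

However well k matches the Zipfian index distribution, the coder can only approach the entropy of the token stream, and as shown above that entropy was inflated by the tokenization step itself.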

The Fundamental Lesson

Why Domain-Specific Heads Failed:

  1. Adaptive entropy coders already capture structure: ICM/ISSE chains learn XML patterns and English grammar without explicit parsing
  2. Preprocessing destroys context: Converting "the dog" → [dict:42, dict:89] loses bigram frequency information (see the sketch after this list)
  3. Entropy doesn't lie: If preprocessing increases entropy, downstream compression can't magically recover lost information
  4. Pipeline boundaries are costly: Each stage boundary introduces irreversible information loss
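
A toy illustration of point 2 (the dictionary indices 42 and 89 are hypothetical): the character-level contexts that an adaptive byte model keys on simply do not exist in the token stream.

from collections import Counter

text = "the dog"
tokens = [42, 89]                 # hypothetical dictionary indices for "the" and "dog"

char_bigrams = Counter(zip(text, text[1:]))
token_bigrams = Counter(zip(tokens, tokens[1:]))

print(sorted(char_bigrams))       # six letter-level contexts: ('t','h'), ('h','e'), ('e',' '), ...
print(sorted(token_bigrams))      # [(42, 89)] - one opaque pair; the letter statistics are gone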

Key insight: End-to-end learning (adaptive context mixing) beats hand-crafted multi-stage pipelines because it optimizes the entire compression objective jointly, not stage-by-stage.
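
What "optimizes the entire compression objective jointly" looks like in code: a context-mixing compressor stretches each model's bit prediction, mixes the results with weights, and trains those weights online against the actual coding loss. A minimal sketch of such a mixer (a generic PAQ/ZPAQ-style update rule, not this project's implementation):

import math

def stretch(p: float) -> float:
    """Logit transform; p must be strictly between 0 and 1."""
    return math.log(p / (1.0 - p))

def squash(x: float) -> float:
    """Inverse of stretch: map a real number back to a probability."""
    return 1.0 / (1.0 + math.exp(-x))

class LogisticMixer:
    """Mixes several bit predictions; weights are updated by online gradient
    descent on the coding loss, so all models are tuned toward one objective."""
    def __init__(self, n_models: int, lr: float = 0.01):
        self.w = [0.0] * n_models
        self.t = [0.0] * n_models
        self.lr = lr

    def mix(self, probs: list) -> float:
        self.t = [stretch(p) for p in probs]
        return squash(sum(w * t for w, t in zip(self.w, self.t)))

    def update(self, p_mixed: float, bit: int) -> None:
        err = bit - p_mixed                    # gradient of the log loss in the stretched domain
        for i, t in enumerate(self.t):
            self.w[i] += self.lr * err * t

# Per bit: call mix() with every model's P(next bit = 1), arithmetic-code the bit
# with the mixed probability, then call update() with the bit actually seen.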

What Actually Works

FA-CVM segmentation succeeded where these failed because it routes to existing optimized coders rather than creating artificial intermediate representations.