Despite promising theory, both the XML and English heads failed to deliver practical compression improvements
When used alone, specialized heads couldn't beat adaptive entropy coders
Benchmark: enwik9 (1 GB Wikipedia XML)
Adaptive Context Mixing: 5.2:1 ✓
Generic Static FSE: 4.8:1 ✓
────────────────────────────────────
XML Head (standalone): 4.1:1 ✗
English Head (standalone): 3.9:1 ✗
Problem: Specialized structure exploitation ≪ adaptive entropy modeling
Why? Adaptive entropy coders (ICM, ISSE) naturally learn structural patterns without explicit parsing. Hand-coded structure detection missed subtle statistical dependencies that context mixing discovered automatically.
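To make that concrete, here is a minimal sketch of the kind of adaptive bit model an ICM-style component builds: a hashed-context table of probabilities nudged toward each observed bit, so recurring structure (tags, attribute layouts, word boundaries) shortens the ideal code length without any explicit parser. The context definition, table size, and learning rate below are illustrative choices, not the actual head's parameters.

import math

class AdaptiveBitModel:
    def __init__(self, context_bits=16, rate=0.02):
        self.table = [0.5] * (1 << context_bits)   # P(next bit = 1) per hashed context
        self.mask = (1 << context_bits) - 1
        self.rate = rate

    def _index(self, ctx, partial):
        # Hash the previous two bytes together with the bits already seen
        # in the current byte.
        return hash((ctx, partial)) & self.mask

    def predict(self, ctx, partial):
        return self.table[self._index(ctx, partial)]

    def update(self, ctx, partial, bit):
        i = self._index(ctx, partial)
        self.table[i] += self.rate * (bit - self.table[i])

def ideal_code_length_bits(data, model):
    # Bits an arithmetic coder driven by this model would ideally spend;
    # recurring structure in the data shows up as shorter codes.
    total, ctx = 0.0, (0, 0)
    for byte in data:
        partial = 1                       # leading 1 marks the bit position
        for k in range(7, -1, -1):
            bit = (byte >> k) & 1
            p = model.predict(ctx, partial)
            total += -math.log2(p if bit else 1.0 - p)
            model.update(ctx, partial, bit)
            partial = (partial << 1) | bit
        ctx = (ctx[1], byte)              # sliding two-byte context
    return total

data = b"<row><a>1</a><b>2</b></row>" * 100
print(ideal_code_length_bits(data, AdaptiveBitModel()) / (8 * len(data)))  # ideal compressed size as a fraction of the original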
When used as a preprocessor stage, the heads raised entropy above the raw-input baseline
Entropy Analysis:
Raw English text: 4.5 bits/byte
After XML Head: 5.8 bits/byte ↑
After English Head: 5.6 bits/byte ↑
Raw XML document: 3.2 bits/byte
After XML Head: 4.9 bits/byte ↑
Problem: Preprocessing destroyed natural redundancy patterns
Why? Dictionary encoding and space folding created uniform token streams that eliminated the low-entropy patterns (repetitive words, predictable spacing) that downstream compressors rely on.
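For reference, the figures above are in bits per byte; a zero-order estimate over the empirical byte frequencies is the simplest way to compute such a number (the original measurements may well have used context-conditioned estimates instead). A minimal sketch:

import math
from collections import Counter

def order0_entropy_bits_per_byte(data):
    # Zero-order Shannon entropy: average bits per byte under the empirical
    # byte-frequency distribution, ignoring all context.
    counts, n = Counter(data), len(data)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

sample = b"<page><title>Example</title><text>Some example text.</text></page>"
print(f"{order0_entropy_bits_per_byte(sample):.2f} bits/byte")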
The preprocessor + entropy coder pipeline suffered from information loss at the boundary between the two stages
Intended Pipeline:
Raw XML
↓ [XML Head: structure parsing]
Token stream (lower entropy?)
↓ [Adaptive Head: entropy coding]
Compressed output
Expected: 6.5:1 compression
Actual Result:
Raw XML (H = 3.2 bits/byte)
↓ [XML Head]
Token stream (H = 4.9 bits/byte) 🔥
↓ [Adaptive Head struggles]
Compressed output
Achieved: 3.8:1 compression ✗
Root cause: The preprocessor removed human-readable redundancy (tag names, spacing) but created machine-readable randomness (dictionary indices, structure codes). Downstream compressors trained on natural text patterns couldn't adapt to the artificial token distribution.
Rice-Golomb Experiment:
Dictionary index distribution: Zipfian (common words frequent)
Rice-Golomb parameter k: tuned for top-100 frequency
Result: 4.3:1 compression (vs 5.2:1 adaptive baseline)
Conclusion: Even optimal variable-length coding of dictionary indices couldn't recover the entropy increase from tokenization.
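For concreteness, here is a sketch of Rice coding, the power-of-two special case of Golomb coding that such an experiment would apply to non-negative dictionary indices; the parameter k and the sample index stream are illustrative, not the experiment's actual configuration.

def rice_encode(values, k):
    # Rice code (Golomb with divisor 2**k): for each non-negative value n,
    # emit q = n >> k in unary (q ones, then a zero), then the low k bits of n.
    # A bitstring is returned for clarity; a real coder would pack bits.
    out = []
    for n in values:
        q, r = n >> k, n & ((1 << k) - 1)
        out.append("1" * q + "0" + (format(r, f"0{k}b") if k else ""))
    return "".join(out)

def rice_decode(bits, k):
    values, i = [], 0
    while i < len(bits):
        q = 0
        while bits[i] == "1":            # unary quotient
            q, i = q + 1, i + 1
        i += 1                           # skip the terminating zero
        r = int(bits[i:i + k], 2) if k else 0
        i += k
        values.append((q << k) | r)
    return values

# A Zipf-like index stream: small (frequent-word) indices dominate, so a small
# k keeps their codes short while rare large indices pay with long unary parts.
indices = [0, 2, 1, 0, 7, 0, 3, 41, 0, 1]
assert rice_decode(rice_encode(indices, k=2), k=2) == indices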
Key insight: End-to-end learning (adaptive context mixing) beats hand-crafted multi-stage pipelines because it optimizes the entire compression objective jointly, not stage-by-stage.
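A minimal sketch of what "jointly" means here, using logistic mixing in the PAQ/ZPAQ style: each component model's prediction is weighted in the logit domain, and the weights are updated from the same coding loss the arithmetic coder pays, so every part is tuned against the final objective rather than a stage-local one. The two fixed stand-in models and the learning rate are illustrative.

import math

def mix_predict(probs, weights):
    # Logistic mixing: combine each model's P(bit = 1) in the logit domain.
    stretch = [math.log(p / (1.0 - p)) for p in probs]
    z = sum(w * s for w, s in zip(weights, stretch))
    return 1.0 / (1.0 + math.exp(-z)), stretch

def mix_update(weights, stretch, p_mix, bit, rate=0.01):
    # Gradient step on the coding loss -log P(bit): every weight moves in
    # proportion to how much its model's vote would have shortened the code.
    err = bit - p_mix
    return [w + rate * err * s for w, s in zip(weights, stretch)]

# Toy run: two fixed stand-in models disagree about a stream that is mostly
# ones; the mixer learns to trust the better-calibrated one.
weights = [0.0, 0.0]
for bit in [1, 1, 0, 1, 1, 1, 0, 1, 1, 1] * 50:
    p_mix, stretch = mix_predict([0.9, 0.3], weights)
    weights = mix_update(weights, stretch, p_mix, bit)
print(weights)   # the weight on the first (more accurate) model grows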
FA-CVM segmentation succeeded where these failed because it routes to existing optimized coders rather than creating artificial intermediate representations.
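A schematic sketch of that routing idea, with stand-in parts: split the input into segments, classify each with a crude heuristic, and hand it to an existing coder. The segmentation rule and the coders below (zlib/bz2) are placeholders, not FA-CVM's actual criteria or the coders it routes to.

import bz2
import zlib

CODERS = {0: zlib.compress, 1: bz2.compress}

def route_segment(segment):
    # Toy heuristic: markup-dense segments go to coder 0, prose-like to coder 1.
    markup = segment.count(b"<") + segment.count(b">")
    return 0 if markup > len(segment) // 50 else 1

def compress_routed(data, segment_size=1 << 16):
    # Split into fixed-size segments, pick a coder per segment, and keep the
    # coder id alongside each compressed chunk so decompression can route back.
    out = []
    for start in range(0, len(data), segment_size):
        seg = data[start:start + segment_size]
        coder_id = route_segment(seg)
        out.append((coder_id, CODERS[coder_id](seg)))
    return out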