Beyond the core three-way routing, we experimented with domain-specific compression heads that exploit structural patterns for "space folding" effects.
Concept: Exploit XML's highly repetitive structure (tags, attributes, whitespace) to achieve space folding, similar to run-length encoding (RLE).
Input XML:
<article>
  <title>Data Compression</title>
  <author>Smith</author>
  <section>
    <paragraph>Text...</paragraph>
  </section>
</article>
Space Folding Opportunities:
• Indentation: runs of repeated " " (space) characters
• Tag pairs: <tag>...</tag> patterns
• Attribute quotes: key="value"
• Common tags: <p>, <div>, <span>
• Tag closure: after an opening <tag>, predict the closing </tag> and encode only the content
Compression Pipeline:
1. Parse XML structure
2. Dictionary-encode common tags
3. RLE-compress whitespace runs
4. Predict tag closures
5. FSE-encode remaining content
Result: ~15-25% better than the generic static head on XML data
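The middle steps of this pipeline can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not the head's actual implementation: tags carry no attributes, there are no comments or self-closing tags, and step 5 (FSE coding) is left out. The names `fold_xml` and `unfold_xml` and the tuple-based op format are hypothetical.

```python
import re
from itertools import groupby

# Assumption: only plain <name> and </name> tags, no attributes.
TAG_RE = re.compile(r"(</?[A-Za-z][\w.-]*>)")

def fold_xml(xml):
    """Turn XML text into a list of ops (hypothetical format)."""
    tag_ids = {}   # step 2: tag name -> small integer id
    stack = []     # step 4: tags that are currently open
    ops = []
    for piece in TAG_RE.split(xml):
        if not piece:
            continue
        if TAG_RE.fullmatch(piece):
            name = piece.strip("</>")
            if piece.startswith("</"):
                if stack and stack[-1] == name:
                    stack.pop()
                    ops.append(("close",))  # predicted: no tag name stored
                else:
                    tag_id = tag_ids.setdefault(name, len(tag_ids))
                    ops.append(("close_explicit", tag_id))
            else:
                stack.append(name)
                ops.append(("open", tag_ids.setdefault(name, len(tag_ids))))
        elif piece.isspace():
            # step 3: run-length encode whitespace as (char, count) pairs
            runs = [(ch, len(list(g))) for ch, g in groupby(piece)]
            ops.append(("ws", runs))
        else:
            ops.append(("text", piece))  # left for the entropy coder
    return ops, tag_ids

def unfold_xml(ops, tag_ids):
    """Inverse of fold_xml: rebuild the original text."""
    names = {i: n for n, i in tag_ids.items()}
    stack, out = [], []
    for op in ops:
        kind = op[0]
        if kind == "open":
            stack.append(names[op[1]])
            out.append(f"<{names[op[1]]}>")
        elif kind == "close":
            out.append(f"</{stack.pop()}>")  # closing name comes from context
        elif kind == "close_explicit":
            out.append(f"</{names[op[1]]}>")
        elif kind == "ws":
            out.append("".join(ch * n for ch, n in op[1]))
        else:
            out.append(op[1])
    return "".join(out)
```

For well-nested input, every closing tag becomes a payload-free `("close",)` op, which is the "predict closing tag, encode only content" idea from the list above. A mismatched closing tag falls back to an explicit op, so the round trip stays lossless.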
Concept: Exploit English language patterns (word boundaries, common words, punctuation spacing) for space folding.
Input English Text:
The quick brown fox jumps over
the lazy dog. The dog sleeps.
Space Folding Opportunities:
• Word spacing: " " after words
• Common words: "the", "and", "of"
• Punctuation patterns: ". ", ", "
• Case transitions: "The" vs "the"
Compression Pipeline:
1. Tokenize into words
2. Dictionary-encode common words
3. Fold spaces (implicit between words)
4. Predict punctuation+space patterns
5. Case-flag encoding for capitals
6. FSE-encode rare words
Result: ~10-20% better than the generic static head on English prose
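As a rough illustration of steps 1 to 5, the sketch below folds away the single space between two words, replaces common words and punctuation patterns with small ids, and stores capitalization as a flag. It is written under assumptions: the word list `COMMON`, the pattern list `PUNCT`, and the names `fold_text` and `unfold_text` are hypothetical, words are ASCII-only, and step 6 (FSE coding of rare words) is omitted.

```python
import re

# Hypothetical dictionaries; a real head would learn or tune these.
COMMON = ["the", "and", "of", "a", "to", "in"]
WORD_ID = {w: i for i, w in enumerate(COMMON)}
PUNCT = [". ", ", ", ".", ","]
PUNCT_ID = {p: i for i, p in enumerate(PUNCT)}

# Tokens strictly alternate between words and separators.
TOKEN_RE = re.compile(r"([A-Za-z]+)|([^A-Za-z]+)")

def fold_text(text):
    tokens = [m.groups() for m in TOKEN_RE.finditer(text)]
    ops = []
    for i, (word, sep) in enumerate(tokens):
        if word is not None:
            lower = word.lower()
            if word == lower:
                flag = 0                    # all lowercase
            elif word == lower.capitalize():
                flag = 1                    # leading capital only
            else:
                ops.append(("raw", word))   # mixed case: store verbatim
                continue
            if lower in WORD_ID:
                ops.append(("word", WORD_ID[lower], flag))
            else:
                ops.append(("rare", lower, flag))
        elif sep == " " and 0 < i < len(tokens) - 1:
            continue                        # folded: implied between two words
        elif sep in PUNCT_ID:
            ops.append(("punct", PUNCT_ID[sep]))
        else:
            ops.append(("sep", sep))
    return ops

def unfold_text(ops):
    out, prev_was_word = [], False
    for op in ops:
        kind = op[0]
        is_word = kind in ("word", "rare", "raw")
        if is_word and prev_was_word:
            out.append(" ")                 # restore the folded space
        if kind == "word":
            w = COMMON[op[1]]
            out.append(w.capitalize() if op[2] else w)
        elif kind == "rare":
            out.append(op[1].capitalize() if op[2] else op[1])
        elif kind == "punct":
            out.append(PUNCT[op[1]])
        else:
            out.append(op[1])               # "raw" word or "sep" literal
        prev_was_word = is_word
    return "".join(out)
```

On the two-line sample above, the ordinary word-separating spaces produce no ops at all. Only the line break, the ". " pattern, and the final "." are emitted explicitly, and "The" and "the" share one dictionary id and differ only in the case flag.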
Both heads exploit the same core principle: predictable patterns can be "folded away", much as RLE compresses constant runs.
Generalized Pattern:
if (pattern_is_predictable) {
    encode(deviation_from_prediction);  // Small value
} else {
    encode(full_content);               // Fallback to generic compression
}
Space Folding = eliminating bytes that can be predicted from context
(similar to how RLE eliminates redundant constant bytes)
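The generalized pattern can be made concrete with a small predictor-driven sketch. It assumes the simplest possible "deviation": a hit marker when the prediction is right, and the literal symbol otherwise. The names `fold`, `unfold`, `HIT`, and `repeat_last` are hypothetical and chosen only for this example.

```python
HIT = object()  # sentinel meaning "the prediction was correct"

def fold(data, predict):
    """Replace every correctly predicted symbol with HIT."""
    history, out = [], []
    for sym in data:
        out.append(HIT if predict(history) == sym else sym)
        history.append(sym)
    return out

def unfold(folded, predict):
    """Rebuild the data by re-running the same predictor."""
    data = []
    for item in folded:
        data.append(predict(data) if item is HIT else item)
    return data

def repeat_last(history):
    """RLE-like predictor: guess that the previous symbol repeats."""
    return history[-1] if history else None

original = list("aaaabbbc")
folded = fold(original, repeat_last)
assert unfold(folded, repeat_last) == original
```

With `repeat_last`, only the first symbol of each run survives as a literal, which is why this scheme behaves like RLE. Swapping in a different predictor (for example, "a closing tag follows its opening tag") gives the domain-specific variants, provided the encoder and decoder use exactly the same predictor.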
Lesson learned: Domain-specific optimization is tempting, but robust general-purpose algorithms (FA-CVM + adaptive routing) often win on real-world mixed data. Still, these experiments informed the final architecture.