Honorable Mentions: Specialized Domain Heads

Experimental Compression Heads

Beyond the core three-way routing, we experimented with domain-specific compression heads that exploit structural patterns to achieve "space folding" effects.

XML Head

Structure-Aware Compression

Concept: Exploit XML's highly repetitive structure (tags, attributes, whitespace) to achieve space folding, similar to RLE.

Input XML:
<article>
  <title>Data Compression</title>
  <author>Smith</author>
  <section>
    <paragraph>Text...</paragraph>
  </section>
</article>

Space Folding Opportunities:
• Indentation: "  " repeated
• Tag pairs: <tag>...</tag> patterns
• Attribute quotes: key="value"
• Common tags: <p>, <div>, <span>

Pattern Exploitation (3 Techniques):
  1. Whitespace Folding: Collapse indentation runs (spaces/tabs) using RLE-like encoding
  2. Tag Dictionary: Build frequency table of common tags, encode as short indices
  3. Structure Prediction: After <tag>, predict closing </tag> and encode only content
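
As a rough illustration of the whitespace-folding idea, the sketch below replaces each line's leading indentation run with a (character, count) pair and restores it on decode. The function names `fold_indentation` and `unfold_indentation` are hypothetical and chosen only for this example; the actual head's encoding is not specified here, so treat this as one possible reading of "RLE-like", not the implementation.

```python
import re


def fold_indentation(text):
    """Split each line into (indent_char, run_length, rest_of_line)."""
    folded = []
    for line in text.split("\n"):
        # Match a leading run of spaces OR a leading run of tabs (not a mix).
        match = re.match(r"( +|\t+)", line)
        run = match.group(0) if match else ""
        indent_char = run[0] if run else ""
        folded.append((indent_char, len(run), line[len(run):]))
    return folded


def unfold_indentation(folded):
    """Rebuild the original text by expanding each (char, count) pair."""
    return "\n".join(ch * count + rest for ch, count, rest in folded)
```

A four-space indent becomes the pair (" ", 4), so deeper nesting costs a larger count and not more bytes. Mixed indentation is still lossless because anything after the first run stays in the line's remainder.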

Compression Pipeline:
1. Parse XML structure
2. Dictionary-encode common tags
3. RLE-compress whitespace runs
4. Predict tag closures
5. FSE-encode remaining content
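
Steps 2 and 4 of this pipeline can be sketched together: opening tags become dictionary indices, and a closing tag that matches the innermost open tag is stored as a bare marker with no name. This is a minimal sketch under stated assumptions: it only recognizes attribute-free tags such as `<title>` (everything else passes through as text), the symbol names ("open", "close", "text", "raw") are invented for illustration, and the FSE stage is omitted.

```python
import re
from collections import Counter

# Assumption: attribute-free tags like <title>; anything else is kept as text.
TOKEN_RE = re.compile(r"</?[A-Za-z][\w.-]*>|<|[^<]+")
TAG_RE = re.compile(r"<(/?)([A-Za-z][\w.-]*)>")


def encode_tags(xml):
    tokens = TOKEN_RE.findall(xml)
    tags = [(tok, TAG_RE.fullmatch(tok)) for tok in tokens]
    # Frequency table of opening tags: most frequent gets the smallest index.
    opens = Counter(m.group(2) for _, m in tags if m and not m.group(1))
    dictionary = [name for name, _ in opens.most_common()]
    index = {name: i for i, name in enumerate(dictionary)}

    stack, encoded = [], []
    for tok, m in tags:
        if m is None:
            encoded.append(("text", tok))
        elif not m.group(1):                      # opening tag
            stack.append(m.group(2))
            encoded.append(("open", index[m.group(2)]))
        elif stack and stack[-1] == m.group(2):   # predicted closing tag
            stack.pop()
            encoded.append(("close",))
        else:                                     # misprediction: store verbatim
            encoded.append(("raw", tok))
    return dictionary, encoded


def decode_tags(dictionary, encoded):
    stack, parts = [], []
    for item in encoded:
        if item[0] == "open":
            stack.append(dictionary[item[1]])
            parts.append(f"<{stack[-1]}>")
        elif item[0] == "close":
            parts.append(f"</{stack.pop()}>")
        else:                                     # "text" or "raw"
            parts.append(item[1])
    return "".join(parts)
```

On well-formed input every closing tag is predicted, so only the opening index and the content need to reach the entropy coder. Mismatched closers fall back to verbatim storage, which keeps the round trip lossless.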

Result: ~15-25% better than the generic static head on XML data.

English Language Head

Linguistic Pattern Compression

Concept: Exploit English language patterns (word boundaries, common words, punctuation spacing) for space folding.

Input English Text:
The quick brown fox jumps over
the lazy dog. The dog sleeps.

Space Folding Opportunities:
• Word spacing: " " after words
• Common words: "the", "and", "of"
• Punctuation patterns: ". ", ", "
• Case transitions: "The" vs "the"

Pattern Exploitation (6 Techniques):
  1. Space Folding: Implicit word spacing (no need to encode space after every word)
  2. Common Word Dictionary: Top 100 English words → 7-bit codes instead of full spelling
  3. Common Nouns: Frequent nouns (person, place, thing) encoded with short indices
  4. Prepositional Phrases: Patterns like "of the", "in the", "to the" as single tokens
  5. Punctuation Context: After period, predict space + capital letter
  6. Case Flags: Encode a case change as a single bit instead of a separate character
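
Technique 1 (implicit word spacing) can be made concrete with a small sketch. It assumes a deliberately simple tokenizer in which a "word" is a run of ASCII letters and every other character is its own token; the names `fold_spaces` and `unfold_spaces` are hypothetical. A single space is dropped only when it sits directly between two words, so the decoder can reinsert it whenever two word tokens are adjacent.

```python
import re

WORD_RE = re.compile(r"[A-Za-z]+")


def is_word(token):
    return WORD_RE.fullmatch(token) is not None


def fold_spaces(text):
    """Tokenize, then drop each single space that separates two words."""
    tokens = re.findall(r"[A-Za-z]+|[^A-Za-z]", text)
    folded = []
    for i, tok in enumerate(tokens):
        between_words = (
            0 < i < len(tokens) - 1
            and is_word(tokens[i - 1])
            and is_word(tokens[i + 1])
        )
        if tok == " " and between_words:
            continue  # implied by word-word adjacency; nothing to store
        folded.append(tok)
    return folded


def unfold_spaces(folded):
    """Reinsert a space wherever two word tokens are adjacent."""
    parts = []
    for prev, tok in zip([None] + folded, folded):
        if prev is not None and is_word(prev) and is_word(tok):
            parts.append(" ")
        parts.append(tok)
    return "".join(parts)
```

Because the tokenizer merges adjacent letters into one word, two word tokens can never be neighbors in the original stream, which is what makes the folded space unambiguous. Double spaces and spaces after punctuation are kept explicitly.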

Compression Pipeline:
1. Tokenize into words
2. Dictionary-encode common words
3. Fold spaces (implicit between words)
4. Predict punctuation+space patterns
5. Case-flag encoding for capitals
6. FSE-encode rare words
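
Steps 2 and 5 (dictionary codes plus case flags) combine naturally: a 7-bit word index leaves one bit free in a byte for the capitalization flag. The sketch below is illustrative only. `COMMON_WORDS` is a tiny stand-in for the top-100 list, the byte layout in `pack_byte` is an assumption, and how literals are told apart from dictionary bytes in the real stream is left out.

```python
# Illustrative subset; the real table would hold up to 100 (at most 128) words.
COMMON_WORDS = ["the", "of", "and", "to", "in", "a", "is", "that"]
WORD_TO_CODE = {word: code for code, word in enumerate(COMMON_WORDS)}


def encode_word(word):
    """Return ("dict", code, capitalized) for common words, else ("lit", word)."""
    lower = word.lower()
    # Only all-lowercase or Capitalized forms are representable with one flag.
    if lower in WORD_TO_CODE and word in (lower, lower.capitalize()):
        return ("dict", WORD_TO_CODE[lower], word != lower)
    return ("lit", word)


def decode_word(symbol):
    if symbol[0] == "dict":
        _, code, capitalized = symbol
        base = COMMON_WORDS[code]
        return base.capitalize() if capitalized else base
    return symbol[1]


def pack_byte(code, capitalized):
    """Assumed layout: high bit = case flag, low 7 bits = dictionary code."""
    assert 0 <= code < 128
    return (int(capitalized) << 7) | code
```

With this layout "The" and "the" share one dictionary entry and differ only in the high bit. Words such as "THE" that need more than one flag fall through to the literal path, which would go to the FSE stage.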

Result: ~10-20% better than the generic static head on English prose.

The Space Folding Concept

Both heads exploit the same core principle: predictable patterns can be "folded away", much as RLE compresses constant runs.

RLE (REPEAT Head)
0x20 × 1000
Pure constant run: encode value once + count

XML Space Folding
<tag>...</tag>
Predictable structure: encode opening, predict closing

English Space Folding
word1 word2
Predictable spacing: encode words, fold spaces implicitly

Generalized Pattern:

if (pattern_is_predictable) {
    encode(deviation_from_prediction);  // Small value
} else {
    encode(full_content);  // Fallback to generic compression
}

Space Folding = eliminating bytes that can be predicted from context
              (similar to how RLE eliminates redundant constant bytes)
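
To show how RLE falls out as a special case of this pattern, here is a minimal sketch in which the predictor is pluggable. With the hypothetical `repeat_last` predictor ("the next byte equals the previous one"), correct predictions collapse into a hit count and only mispredicted bytes are stored. The symbol format and function names are assumptions made for illustration, not the heads' actual bitstream.

```python
def repeat_last(history):
    """RLE-style predictor: guess that the next byte repeats the last one."""
    return history[-1]


def fold(data, predict):
    """Encode bytes as ("miss", byte) literals and ("hit", count) runs."""
    view = memoryview(data)
    symbols, hits = [], 0
    for i, byte in enumerate(data):
        if i > 0 and predict(view[:i]) == byte:
            hits += 1                      # predictable: fold it away
            continue
        if hits:
            symbols.append(("hit", hits))
            hits = 0
        symbols.append(("miss", byte))     # fallback: store the byte itself
    if hits:
        symbols.append(("hit", hits))
    return symbols


def unfold(symbols, predict):
    """Replay the same predictor to regenerate the folded bytes."""
    out = bytearray()
    for kind, value in symbols:
        if kind == "hit":
            for _ in range(value):
                out.append(predict(out))
        else:
            out.append(value)
    return bytes(out)
```

For the REPEAT case, `fold(b" " * 1000, repeat_last)` yields `[("miss", 32), ("hit", 999)]`: the value once plus a count. Swapping in a structure-aware or language-aware predictor changes what counts as a hit, while the encode and decode loops stay the same.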

Why "Honorable Mentions"?

✓ Strengths: Effective on target data (XML, English), with meaningful compression gains over generic methods.
⚠ Limitations: Narrow applicability, plus the complexity overhead of specialized parsers.
📊 Reality Check: The generic adaptive head often matched or beat them on mixed enwik9 data.

Lesson learned: Domain-specific optimization is tempting, but robust general-purpose algorithms (FA-CVM + adaptive routing) often win on real-world mixed data. Still, these experiments informed the final architecture.