Honorable Mentions: Specialized Domain Heads

XML Head

Structure-Aware Compression

Concept: Exploit XML's highly repetitive structure (tags, attributes, whitespace) to achieve space folding similar to RLE

Input XML:
<article>
  <title>Data Compression</title>
  <author>Smith</author>
  <section>
    <paragraph>Text...</paragraph>
  </section>
</article>

Space Folding Opportunities:
• Indentation: "  " repeated
• Tag pairs: <tag>...</tag> patterns
• Attribute quotes: key="value"
• Common tags: <p>, <div>, <span>

Pattern Exploitation

3 Techniques

Whitespace Folding: Collapse indentation runs (spaces/tabs) using RLE-like encoding
Tag Dictionary: Build frequency table of common tags, encode as short indices
Structure Prediction: After <tag>, predict closing </tag> and encode only content
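
As a rough illustration of the Whitespace Folding technique, the sketch below replaces each line's leading run of spaces with a single count. The function names `fold_indentation` and `unfold_indentation` are hypothetical, and the sketch assumes space-only indentation (tabs would need a second symbol); it is not the head's actual encoder.

```python
def fold_indentation(text: str) -> list[tuple[int, str]]:
    """Replace each line's leading run of spaces with a count (RLE-like)."""
    folded = []
    for line in text.split("\n"):
        body = line.lstrip(" ")
        # The run length is all the decoder needs to rebuild the indentation.
        folded.append((len(line) - len(body), body))
    return folded


def unfold_indentation(folded: list[tuple[int, str]]) -> str:
    """Inverse of fold_indentation: expand each count back into spaces."""
    return "\n".join(" " * count + body for count, body in folded)


sample = (
    "<article>\n"
    "  <title>Data Compression</title>\n"
    "  <section>\n"
    "    <paragraph>Text...</paragraph>\n"
    "  </section>\n"
    "</article>"
)
folded = fold_indentation(sample)
restored = unfold_indentation(folded)
```

Each indentation run costs one small integer regardless of its length, which is the same trade RLE makes for constant runs.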

Compression Pipeline:
1. Parse XML structure
2. Dictionary-encode common tags
3. RLE-compress whitespace runs
4. Predict tag closures
5. FSE-encode remaining content

Result: ~15-25% better than the generic static head on XML data
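
Steps 2 and 4 of the pipeline (tag dictionary and closure prediction) could look roughly like the sketch below. It is a simplified assumption of how such a head might work: the names `encode_xml` and `decode_xml` are hypothetical, and the tokenizer only accepts plain `<tag>` and `</tag>` markup with no attributes, comments, or self-closing tags.

```python
import re
from collections import Counter

# Either an opening/closing tag without attributes, or a run of text.
TOKEN = re.compile(r"<(/?)([A-Za-z][\w.-]*)>|([^<]+)")


def tokenize(xml: str):
    pos = 0
    while pos < len(xml):
        match = TOKEN.match(xml, pos)
        if match is None:
            raise ValueError(f"unsupported markup at offset {pos}")
        yield match.groups()
        pos = match.end()


def encode_xml(xml: str):
    tokens = list(tokenize(xml))
    opened = [name for closing, name, _ in tokens if name and not closing]
    # Most frequent tags get the smallest indices.
    table = [name for name, _ in Counter(opened).most_common()]
    index = {name: i for i, name in enumerate(table)}
    events, stack = [], []
    for closing, name, text in tokens:
        if text is not None:
            events.append(("text", text))
        elif closing:
            if not stack or stack.pop() != name:
                raise ValueError(f"unexpected </{name}>")
            events.append(("close",))  # name is implied by the open-tag stack
        else:
            stack.append(name)
            events.append(("open", index[name]))
    return table, events


def decode_xml(table, events) -> str:
    out, stack = [], []
    for event in events:
        if event[0] == "text":
            out.append(event[1])
        elif event[0] == "open":
            stack.append(table[event[1]])
            out.append(f"<{stack[-1]}>")
        else:
            out.append(f"</{stack.pop()}>")  # predicted closing tag
    return "".join(out)


doc = "<article><title>Data Compression</title><author>Smith</author></article>"
table, events = encode_xml(doc)
```

Closing tags carry no name at all in the event stream, since a well-formed document lets the decoder recover them from its stack of open tags.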

English Language Head

Linguistic Pattern Compression

Concept: Exploit English language patterns (word boundaries, common words, punctuation spacing) for space folding

Input English Text:
The quick brown fox jumps over
the lazy dog. The dog sleeps.

Space Folding Opportunities:
• Word spacing: " " after words
• Common words: "the", "and", "of"
• Punctuation patterns: ". ", ", "
• Case transitions: "The" vs "the"

Pattern Exploitation

6 Techniques

Space Folding: Implicit word spacing (no need to encode space after every word)
Common Word Dictionary: Top 100 English words → 7-bit codes instead of full spelling
Common Nouns: Frequent nouns (person, place, thing) encoded with short indices
Prepositional Phrases: Patterns like "of the", "in the", "to the" as single tokens
Punctuation Context: After period, predict space + capital letter
Case Flags: Encode a case change as a single bit instead of a separate character
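
A minimal sketch of the Common Word Dictionary and Case Flags techniques working together is shown below. The eight-word `COMMON_WORDS` table is an illustrative stand-in, not the actual top-100 list, and `encode_word` / `decode_word` are hypothetical names.

```python
# Illustrative stand-in for the real table; a table of up to 128 entries
# would let every index fit in 7 bits.
COMMON_WORDS = ["the", "of", "and", "to", "in", "a", "is", "that"]
WORD_INDEX = {word: i for i, word in enumerate(COMMON_WORDS)}


def encode_word(word: str):
    lower = word.lower()
    # Only all-lowercase or first-letter-capitalized forms are representable
    # with a single case flag; anything else falls back to a literal.
    if lower in WORD_INDEX and word in (lower, lower.capitalize()):
        return ("dict", WORD_INDEX[lower], word != lower)
    return ("literal", word)


def decode_word(token) -> str:
    if token[0] == "dict":
        _, idx, capitalized = token
        word = COMMON_WORDS[idx]
        return word.capitalize() if capitalized else word
    return token[1]


words = ["The", "dog", "is", "in", "the", "house", "THE"]
tokens = [encode_word(w) for w in words]
decoded = [decode_word(t) for t in tokens]
```

Under these assumptions "The" and "the" share one dictionary entry and differ only in the flag bit, while unusual casing such as "THE" is passed through as a literal for the generic coder.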

Compression Pipeline:
1. Tokenize into words
2. Dictionary-encode common words
3. Fold spaces (implicit between words)
4. Predict punctuation+space patterns
5. Case-flag encoding for capitals
6. FSE-encode rare words

Result: ~10-20% better than the generic static head on English prose
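
Step 3 of the pipeline (implicit spaces between words) can be sketched as follows. This is one possible reading of the idea, with hypothetical names `fold_spaces` and `unfold_spaces`: a single space that sits directly between two words is dropped, and the decoder reinserts a space wherever two word tokens are adjacent. All other whitespace and punctuation is kept explicitly.

```python
import re

WORD = re.compile(r"\w+")


def fold_spaces(text: str) -> list[str]:
    # Split into maximal word tokens and single non-word characters.
    tokens = re.findall(r"\w+|\W", text)
    folded = []
    for i, tok in enumerate(tokens):
        prev_is_word = i > 0 and WORD.fullmatch(tokens[i - 1])
        next_is_word = i + 1 < len(tokens) and WORD.fullmatch(tokens[i + 1])
        if tok == " " and prev_is_word and next_is_word:
            continue  # predictable space: fold it away
        folded.append(tok)
    return folded


def unfold_spaces(folded: list[str]) -> str:
    out = []
    for i, tok in enumerate(folded):
        # Two adjacent word tokens can only come from a folded space.
        if i > 0 and WORD.fullmatch(tok) and WORD.fullmatch(folded[i - 1]):
            out.append(" ")
        out.append(tok)
    return "".join(out)


text = "The quick brown fox jumps over\nthe lazy dog. The dog sleeps."
folded = fold_spaces(text)
restored = unfold_spaces(folded)
```

Because the tokenizer always produces maximal words, two word tokens can never be adjacent in the unfolded stream, so the decoder's rule is unambiguous and the round trip is lossless.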

The Space Folding Concept

Both heads exploit the same core principle: predictable patterns can be "folded away" — similar to how RLE compresses constant runs

RLE (REPEAT Head)

0x20 × 1000

Pure constant run: encode value once + count

XML Space Folding

<tag>...</tag>

Predictable structure: encode opening, predict closing

English Space Folding

word1 word2

Predictable spacing: encode words, fold spaces implicitly

Generalized Pattern:

if (pattern_is_predictable) {
    encode(deviation_from_prediction);  // Small value
} else {
    encode(full_content);  // Fallback to generic compression
}

Space Folding = eliminating bytes that can be predicted from context
              (similar to how RLE eliminates redundant constant bytes)
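
The generalized pattern can be made concrete with a small predictor-driven sketch. The names `fold`, `unfold`, and `repeat_last` are hypothetical, and the one-flag-per-symbol output is an assumption chosen for clarity; a real head would pass these flags and literals on to an entropy coder such as FSE.

```python
def fold(symbols, predict):
    """Emit (True,) on a correct prediction, (False, symbol) otherwise."""
    encoded, history = [], []
    for sym in symbols:
        if predict(history) == sym:
            encoded.append((True,))       # predictable: nothing else to store
        else:
            encoded.append((False, sym))  # fallback: store the symbol itself
        history.append(sym)
    return encoded


def unfold(encoded, predict):
    """Replay the same predictor to rebuild the original symbols."""
    history = []
    for item in encoded:
        history.append(predict(history) if item[0] else item[1])
    return history


def repeat_last(history):
    # RLE-like predictor: assume the next symbol repeats the previous one.
    return history[-1] if history else None


data = list(b"     ab")
encoded = fold(data, repeat_last)
restored = unfold(encoded, repeat_last)
```

Swapping `repeat_last` for a predictor that knows about open tags or word boundaries gives the XML and English variants of the same idea; the encoder and decoder only need to agree on the predictor.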

Why "Honorable Mentions"?