enwik9

The Benchmark Dataset

What Is It?

enwik9 is the first 10⁹ bytes (1 gigabyte) of the English Wikipedia XML dump of March 3, 2006. It serves as the standard benchmark dataset for the Hutter Prize.
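A minimal sketch of how the dataset is obtained, assuming a local copy of the raw dump (the file names below are placeholders, not part of the benchmark): carve off exactly the first 10⁹ bytes and verify the size.

import os

DUMP_PATH = "enwiki-dump.xml"   # placeholder path to a full XML dump
ENWIK9_PATH = "enwik9"

# enwik9 is defined as a byte-exact prefix: the first 10**9 bytes.
remaining = 10**9
with open(DUMP_PATH, "rb") as src, open(ENWIK9_PATH, "wb") as dst:
    while remaining > 0:
        chunk = src.read(min(1 << 20, remaining))
        if not chunk:
            break
        dst.write(chunk)
        remaining -= len(chunk)

# The defining property is the exact size: 1,000,000,000 bytes.
assert os.path.getsize(ENWIK9_PATH) == 10**9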

What Makes It Challenging?

The dataset is a complex mixture of:

  • Natural Language: Human-readable article text in English
  • XML Structure: Opening/closing tags, attributes, nested hierarchies
  • Wikipedia Markup: MediaWiki-specific formatting syntax
  • Metadata: Timestamps, revision IDs, contributor information
  • Special Characters: Unicode, entities, escaped sequences

Example Structure

<page>
  <title>Article Name</title>
  <revision>
    <timestamp>2006-01-12T10:30:00Z</timestamp>
    <text>Article text with 
    '''bold''' and [[links]]...
    </text>
  </revision>
</page>
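Because enwik9 is a byte-exact prefix, it is cut off mid-document and is not well-formed XML, so a strict XML parser will fail at the end of the file. A rough sketch for inspecting its structure is a plain byte-level scan (the path is an assumption):

import re

page_count = 0
first_titles = []

with open("enwik9", "rb") as f:          # path is an assumption
    for line in f:
        if b"<page>" in line:
            page_count += 1
        m = re.search(rb"<title>(.*?)</title>", line)
        if m and len(first_titles) < 5:
            first_titles.append(m.group(1).decode("utf-8", errors="replace"))

print("pages seen:", page_count)
print("first titles:", first_titles)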

Why This Dataset?

The heterogeneous nature of enwik9 makes it an excellent test for general-purpose compression algorithms. It requires handling multiple data patterns simultaneously: structured XML, natural language statistics, and repetitive markup—mimicking real-world data complexity.
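One way to get a feel for the benchmark is to run standard general-purpose codecs over a prefix of the file and compare compressed sizes. This is only an illustration, not the Hutter Prize methodology: the prize scores the full self-extracting archive (decompressor plus data) over all 10⁹ bytes, while the sketch below (path assumed) only compresses the first 10 MB with Python's standard-library codecs.

import bz2, lzma, zlib

with open("enwik9", "rb") as f:              # path is an assumption
    sample = f.read(10 * 1024 * 1024)        # first 10 MB, for speed

codecs = [
    ("zlib (DEFLATE)", lambda d: zlib.compress(d, 9)),
    ("bz2",            lambda d: bz2.compress(d, 9)),
    ("lzma/xz",        lambda d: lzma.compress(d, preset=9)),
]

for name, compress in codecs:
    out = compress(sample)
    ratio = len(out) / len(sample)
    print(f"{name:15s} {len(out):>10,d} bytes  ({ratio:.3f} of original)")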
