enwik9

The Benchmark Dataset

What Is It?

enwik9 is the first 10⁹ bytes (1 gigabyte) of the English Wikipedia XML dump of March 3, 2006. It serves as the standard benchmark dataset for the Hutter Prize.
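A minimal sketch of how the dataset is obtained, assuming a local copy of the raw dump (the file names below are placeholders, not part of the benchmark): carve off exactly the first 10⁹ bytes and verify the size.

import os

DUMP_PATH = "enwiki-dump.xml"   # placeholder path to a full XML dump
ENWIK9_PATH = "enwik9"

# enwik9 is defined as a byte-exact prefix: the first 10**9 bytes.
remaining = 10**9
with open(DUMP_PATH, "rb") as src, open(ENWIK9_PATH, "wb") as dst:
    while remaining > 0:
        chunk = src.read(min(1 << 20, remaining))
        if not chunk:
            break
        dst.write(chunk)
        remaining -= len(chunk)

# The defining property is the exact size: 1,000,000,000 bytes.
assert os.path.getsize(ENWIK9_PATH) == 10**9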

What Makes It Challenging?

The dataset is a complex mixture of:

  • Natural Language: Human-readable article text in English
  • XML Structure: Opening/closing tags, attributes, nested hierarchies
  • Wikipedia Markup: MediaWiki-specific formatting syntax
  • Metadata: Timestamps, revision IDs, contributor information
  • Special Characters: Unicode, entities, escaped sequences

Example Structure

<page>
  <title>Article Name</title>
  <revision>
    <timestamp>2006-01-12T10:30:00Z</timestamp>
    <text>Article text with 
    '''bold''' and [[links]]...
    </text>
  </revision>
</page>
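Because enwik9 is a byte-exact prefix, it is cut off mid-document and is not well-formed XML, so a strict XML parser will fail at the end of the file. A rough sketch for inspecting its structure is a plain byte-level scan (the path is an assumption):

import re

page_count = 0
first_titles = []

with open("enwik9", "rb") as f:          # path is an assumption
    for line in f:
        if b"<page>" in line:
            page_count += 1
        m = re.search(rb"<title>(.*?)</title>", line)
        if m and len(first_titles) < 5:
            first_titles.append(m.group(1).decode("utf-8", errors="replace"))

print("pages seen:", page_count)
print("first titles:", first_titles)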

Why This Dataset?

The heterogeneous nature of enwik9 makes it an excellent test for general-purpose compression algorithms. It requires handling multiple data patterns simultaneously: structured XML, natural language statistics, and repetitive markup—mimicking real-world data complexity.
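One way to get a feel for the benchmark is to run standard general-purpose codecs over a prefix of the file and compare compressed sizes. This is only an illustration, not the Hutter Prize methodology: the prize scores the full self-extracting archive (decompressor plus data) over all 10⁹ bytes, while the sketch below (path assumed) only compresses the first 10 MB with Python's standard-library codecs.

import bz2, lzma, zlib

with open("enwik9", "rb") as f:              # path is an assumption
    sample = f.read(10 * 1024 * 1024)        # first 10 MB, for speed

codecs = [
    ("zlib (DEFLATE)", lambda d: zlib.compress(d, 9)),
    ("bz2",            lambda d: bz2.compress(d, 9)),
    ("lzma/xz",        lambda d: lzma.compress(d, preset=9)),
]

for name, compress in codecs:
    out = compress(sample)
    ratio = len(out) / len(sample)
    print(f"{name:15s} {len(out):>10,d} bytes  ({ratio:.3f} of original)")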
