enwik9 is the first 10⁹ bytes (1 GB) of an English Wikipedia XML dump, taken from the dump of 3 March 2006. It serves as the standard benchmark dataset for the Hutter Prize.
The dataset is a complex mixture of structured markup and natural-language text. A typical page record looks like this:
<page>
  <title>Article Name</title>
  <revision>
    <timestamp>2006-03-03T00:00:00Z</timestamp>
    <text xml:space="preserve">Article text with
      '''bold''' and [[links]]...
    </text>
  </revision>
</page>
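To get a feel for this structure, the sketch below scans a local copy of the file and prints the first few page titles. It assumes a file named enwik9 in the working directory; because the file is truncated at exactly 10⁹ bytes, it is not well-formed XML, so a line-based scan is safer than a strict XML parser.

import re

# Each <title> element sits on its own line in the dump, so a simple
# line scan works even though the truncated file breaks XML parsers.
TITLE_RE = re.compile(r"<title>(.*?)</title>")

def iter_titles(path="enwik9", limit=10):
    count = 0
    with open(path, encoding="utf-8", errors="replace") as f:
        for line in f:
            m = TITLE_RE.search(line)
            if m:
                yield m.group(1)
                count += 1
                if count >= limit:
                    return

if __name__ == "__main__":
    for title in iter_titles():
        print(title)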
The heterogeneous nature of enwik9 makes it an excellent test for general-purpose compression algorithms: a compressor must handle structured XML, English prose with natural-language statistics, and repetitive wiki markup simultaneously, mimicking real-world data complexity.
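One quick way to see this in practice is to run several general-purpose compressors over a prefix of the file and compare ratios. A minimal sketch, again assuming a local enwik9; the 10 MB prefix size is an arbitrary choice to keep the run fast.

import bz2, lzma, zlib

# Compare standard-library compressors on a prefix of enwik9.
PREFIX = 10 * 1024 * 1024

with open("enwik9", "rb") as f:
    data = f.read(PREFIX)

for name, compress in [
    ("zlib (DEFLATE)", lambda d: zlib.compress(d, 9)),
    ("bz2  (BWT)",     lambda d: bz2.compress(d, 9)),
    ("lzma (LZMA2)",   lambda d: lzma.compress(d, preset=9)),
]:
    out = compress(data)
    print(f"{name}: {len(out):,} bytes ({len(out) / len(data):.1%} of original)")

The context-mixing compressors that lead the Hutter Prize rankings model the XML structure, the English text, and the wiki markup jointly, which is why they achieve substantially smaller outputs than general-purpose codecs like these.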