What if we could train a neural network to mimic cmix's decision-making but run orders of magnitude faster?
Pipeline Overview:
Input Files (Wikipedia content)
↓
[Instrumented cmix compressor]
↓
Decision logs: context → symbol → probability
↓
[Convert to Parquet format]
↓
Training shards (millions of examples)
↓
[PyTorch neural network training]
↓
Trained model (faster inference than cmix)
↓
Compression: output is 28% of original size on Wikipedia-style data
Instrumented the cmix source code to log every compression decision it makes
Logged Information:
For each byte position:
• Context buffer (last N bytes)
• All model predictions
- ICM probabilities
- ISSE predictions
- Mixer weights
• Final symbol probability
• Actual symbol encoded
• Arithmetic coder state
Output format:
[context_hash, model_probs[],
final_prob, actual_symbol]
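The log record above can be modeled as a small data structure. This is a minimal sketch: the field names mirror the format shown, but the `DecisionRecord` class, the choice of blake2b for the context hash, and the sample values are all illustrative assumptions, not cmix's actual binary layout.

```python
import hashlib
from dataclasses import dataclass
from typing import List

@dataclass
class DecisionRecord:
    # One record per byte position, mirroring the log format above.
    context_hash: int         # hash of the last-N-bytes context buffer
    model_probs: List[float]  # per-model P(symbol) estimates (ICM, ISSE, ...)
    final_prob: float         # mixed probability cmix actually coded with
    actual_symbol: int        # the byte that was encoded (0..255)

def hash_context(context: bytes) -> int:
    """Hash the context buffer to a stable 64-bit id (illustrative choice)."""
    return int.from_bytes(hashlib.blake2b(context, digest_size=8).digest(), "big")

record = DecisionRecord(
    context_hash=hash_context(b"Wikipedia is a fr"),
    model_probs=[0.41, 0.38, 0.52],   # hypothetical per-model estimates
    final_prob=0.47,
    actual_symbol=ord("e"),
)
```

Hashing the context keeps the log compact while still letting identical contexts be grouped during analysis.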
Converted raw logs into efficient columnar storage for training
Parquet Schema:
context: binary (last 32 bytes)
order0_prob: float32[256]
order1_prob: float32[256]
order2_prob: float32[256]
...
mixer_weights: float32[8]
final_prob: float32[256]
target_symbol: uint8
Shard organization:
shard_0000.parquet (10M examples)
shard_0001.parquet (10M examples)
... (50 shards total)
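The shard layout above (fixed-size files with zero-padded names) can be sketched as a simple planning function; `shard_plan` is a hypothetical helper, and the actual Parquet writing (e.g. via pyarrow) is omitted.

```python
def shard_plan(num_examples: int, shard_size: int = 10_000_000):
    """Map a stream of examples onto fixed-size shard files, as laid out above."""
    shards = []
    for i, start in enumerate(range(0, num_examples, shard_size)):
        end = min(start + shard_size, num_examples)
        shards.append((f"shard_{i:04d}.parquet", start, end))
    return shards

plan = shard_plan(500_000_000)  # 50 shards of 10M examples each
```

Fixed-size shards keep training I/O predictable: each worker can stream one file at a time without a global index.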
PyTorch Model (Transformer-based):
Input: context embedding (last 32 bytes)
↓
Byte Embedding Layer (256 → 128 dims)
↓
Positional Encoding (32 positions)
↓
Transformer Encoder (6 layers, 8 heads)
↓
Context Mixer (learned weights)
↓
Output Layer (256 logits → softmax)
↓
Predicted probability distribution P(symbol | context)
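The architecture diagram above maps onto a compact PyTorch module. This is a sketch under stated assumptions: the class name `ByteContextModel` is hypothetical, the "Context Mixer" is rendered here as a linear layer with ReLU over mean-pooled positions, and feed-forward widths/dropout are left at PyTorch defaults.

```python
import torch
import torch.nn as nn

class ByteContextModel(nn.Module):
    """Distillation student sketch: 32-byte context -> logits over next byte."""
    def __init__(self, ctx_len=32, d_model=128, n_heads=8, n_layers=6):
        super().__init__()
        self.embed = nn.Embedding(256, d_model)                  # byte embedding (256 -> 128)
        self.pos = nn.Parameter(torch.zeros(ctx_len, d_model))   # learned positional encoding
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)    # 6 layers, 8 heads
        self.mix = nn.Linear(d_model, d_model)                   # "context mixer" stand-in
        self.out = nn.Linear(d_model, 256)                       # 256 logits

    def forward(self, ctx_bytes):                                # (B, 32) ints in [0, 255]
        h = self.embed(ctx_bytes) + self.pos
        h = self.encoder(h)
        h = torch.relu(self.mix(h.mean(dim=1)))                  # pool over positions
        return self.out(h)                                       # softmax applied at the loss

model = ByteContextModel()
logits = model(torch.randint(0, 256, (4, 32)))                   # logits shape: (4, 256)
```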
Loss function: soft-target cross-entropy (distillation) between:
- the student model's predicted distribution
- cmix's final probability distribution (teacher signal)
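Because the teacher signal is a full probability distribution rather than a hard label, the loss is the cross-entropy of the student's softmax against cmix's distribution. A minimal sketch (the function name `distill_loss` is hypothetical):

```python
import torch
import torch.nn.functional as F

def distill_loss(student_logits, teacher_probs):
    """Soft-target cross-entropy against cmix's final distribution."""
    log_p = F.log_softmax(student_logits, dim=-1)
    return -(teacher_probs * log_p).sum(dim=-1).mean()

logits = torch.zeros(2, 256)               # uniform student
teacher = torch.full((2, 256), 1.0 / 256)  # uniform teacher
loss = distill_loss(logits, teacher)       # -> ln(256) ≈ 5.545
```

When the teacher is one-hot this reduces to ordinary cross-entropy on the encoded symbol; the soft targets carry extra information about near-miss symbols, which is the point of distillation.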
Training:
- Batch size: 1024
- Learning rate: 1e-4 (Adam)
- Epochs: 10
- Hardware: 8× A100 GPUs
- Time: ~72 hours
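The hyperparameters above slot into a standard Adam training step. This sketch uses a toy stand-in model so it runs anywhere; `train_step` and the stand-in architecture are hypothetical, while the batch size (1024) and learning rate (1e-4, Adam) come from the list above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-in for the transformer student (illustrative only).
model = nn.Sequential(nn.Embedding(256, 16), nn.Flatten(), nn.Linear(32 * 16, 256))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # lr from the slide

def train_step(ctx, teacher_probs):
    """One distillation step: soft-target cross-entropy, Adam update."""
    optimizer.zero_grad()
    log_p = F.log_softmax(model(ctx), dim=-1)
    loss = -(teacher_probs * log_p).sum(dim=-1).mean()
    loss.backward()
    optimizer.step()
    return loss.item()

ctx = torch.randint(0, 256, (1024, 32))      # batch size 1024, 32-byte contexts
teacher = torch.full((1024, 256), 1.0 / 256)
l1 = train_step(ctx, teacher)
l2 = train_step(ctx, teacher)                # loss on the same batch should drop
```

The full run (50 shards × 10 epochs on 8× A100s) just wraps this step in a dataloader loop over the Parquet shards.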
| Compressor | Compression Rate | Ratio | Speed (MB/s) | Notes |
|---|---|---|---|---|
| cmix (teacher) | 12% | 8.3:1 | ~0.8 | Ground truth, extremely slow |
| Neural Model (student) | 28% | 3.6:1 | ~12 | 15× faster, but lower ratio |
| HoloCodec | 20% | 5.0:1 | ~10 | Better ratio, similar speed |
| 7-Zip LZMA2 | 24% | 4.2:1 | ~15 | Faster, competitive ratio |
Verdict: a promising research direction, but not production-ready. Neural compression is an active research area (DeepMind's Perceiver IO, etc.), but classical algorithmic approaches like HoloCodec's FA-CVM + multi-head architecture still offer better ratio/speed trade-offs for general-purpose compression.