Neural Network Compression Approach

Learning from the Master: Distilling cmix Knowledge

What if we could train a neural network to mimic cmix's decision-making but run orders of magnitude faster?

The Approach: Supervised Learning from cmix

Step 1: Instrument cmix
Step 2: Create Parquet shards
Step 3: Train PyTorch model
Result: 28% compression rate
Pipeline Overview:

Input Files (Wikipedia content)
    ↓
[Instrumented cmix compressor]
    ↓
Decision logs: context → symbol → probability
    ↓
[Convert to Parquet format]
    ↓
Training shards (millions of examples)
    ↓
[PyTorch neural network training]
    ↓
Trained model (faster inference than cmix)
    ↓
Compression: 28% size on Wikipedia-style data

Step 1: Instrumenting cmix

Modified cmix source code to log every compression decision it makes

Logged Information:

For each byte position:
• Context buffer (last N bytes)
• All model predictions
  - ICM probabilities
  - ISSE predictions
  - Mixer weights
• Final symbol probability
• Actual symbol encoded
• Arithmetic coder state

Output format:
[context_hash, model_probs[], 
 final_prob, actual_symbol]
Data volume: ~500 GB of decision logs from compressing a 100 MB Wikipedia sample
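Sketch of one logged record (Python): a minimal dataclass mirroring the output format above. The field names here are illustrative assumptions; the actual instrumentation lives in cmix's C++ source and writes its own format.

from dataclasses import dataclass
from typing import List

@dataclass
class DecisionRecord:
    context: bytes              # raw context buffer (last 32 bytes)
    context_hash: int           # hash of the context buffer
    model_probs: List[float]    # per-model predictions (ICM, ISSE, ...)
    mixer_weights: List[float]  # cmix mixer weights at this position
    final_prob: List[float]     # final 256-way distribution after mixing
    actual_symbol: int          # byte actually encoded (0-255)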

Step 2: Parquet Sharding

Converted raw logs into efficient columnar storage for training

Parquet Schema:

context: binary (last 32 bytes)
order0_prob: float32[256]
order1_prob: float32[256]
order2_prob: float32[256]
...
mixer_weights: float32[8]
final_prob: float32[256]
target_symbol: uint8

Shard organization:
shard_0000.parquet (10M examples)
shard_0001.parquet (10M examples)
... (50 shards total)
Training set size: 500M examples (50 shards × 10 million examples each)
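Shard-writing sketch (Python, PyArrow): a minimal example of materializing one shard with the schema above, assuming records arrive as dicts whose keys match the column names; the actual conversion script is not shown in this section.

import pyarrow as pa
import pyarrow.parquet as pq

def write_shard(records, path):
    # records: list of dicts with keys matching the Parquet schema above
    table = pa.table({
        "context":       pa.array([r["context"] for r in records], type=pa.binary()),
        "order0_prob":   pa.array([r["order0_prob"] for r in records],
                                  type=pa.list_(pa.float32())),   # 256 floats per row
        # order1_prob, order2_prob, ... follow the same pattern
        "mixer_weights": pa.array([r["mixer_weights"] for r in records],
                                  type=pa.list_(pa.float32())),   # 8 floats per row
        "final_prob":    pa.array([r["final_prob"] for r in records],
                                  type=pa.list_(pa.float32())),   # 256 floats per row
        "target_symbol": pa.array([r["target_symbol"] for r in records], type=pa.uint8()),
    })
    pq.write_table(table, path, compression="zstd")

# e.g. write_shard(batch_of_records, "shard_0000.parquet")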

Step 3: Neural Network Architecture

PyTorch Model (Transformer-based):

Input: context (last 32 bytes)
    ↓
Byte Embedding Layer (256 → 128 dims)
    ↓
Positional Encoding (32 positions)
    ↓
Transformer Encoder (6 layers, 8 heads)
    ↓
Context Mixer (learned weights)
    ↓
Output Layer (256 logits → softmax)
    ↓
Predicted probability distribution P(symbol | context)
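Model sketch (PyTorch): a condensed version of the architecture above. Layer sizes follow the diagram; the feed-forward width and the exact form of the "context mixer" head are not specified in this section, so the choices below are assumptions rather than the exact 45M-parameter configuration.

import torch
import torch.nn as nn

class ByteTransformerStudent(nn.Module):
    def __init__(self, ctx_len=32, d_model=128, n_heads=8, n_layers=6):
        super().__init__()
        self.embed = nn.Embedding(256, d_model)                  # byte embedding (256 -> 128)
        self.pos = nn.Parameter(torch.zeros(ctx_len, d_model))   # learned positional encoding
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)    # 6 layers, 8 heads
        self.mixer = nn.Linear(ctx_len * d_model, d_model)       # "context mixer" (assumed pooling)
        self.out = nn.Linear(d_model, 256)                       # 256 logits -> softmax

    def forward(self, ctx_bytes):                # ctx_bytes: (batch, 32) long tensor
        x = self.embed(ctx_bytes) + self.pos     # (batch, 32, 128)
        x = self.encoder(x)                      # (batch, 32, 128)
        x = self.mixer(x.flatten(1))             # (batch, 128)
        return self.out(x)                       # logits; softmax gives P(symbol | context)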

Loss function: Cross-entropy between:
  - Neural model's predicted distribution (student)
  - cmix's final probability distribution (soft teacher targets)

Training:
  - Batch size: 1024
  - Learning rate: 1e-4 (Adam)
  - Epochs: 10
  - Hardware: 8× A100 GPUs
  - Time: ~72 hours
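Training-step sketch (PyTorch): the distillation objective is cross-entropy against cmix's final distribution used as soft targets. Data loading from the Parquet shards and multi-GPU handling are omitted; this assumes the ByteTransformerStudent sketch above.

import torch
import torch.nn.functional as F

model = ByteTransformerStudent()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

def distill_step(context_bytes, teacher_probs):
    # context_bytes: (B, 32) long tensor; teacher_probs: (B, 256) cmix final_prob
    logits = model(context_bytes)                        # student prediction
    log_p = F.log_softmax(logits, dim=-1)
    loss = -(teacher_probs * log_p).sum(dim=-1).mean()   # soft-target cross-entropy
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()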
Model size: 45M params (~180 MB on disk)
Inference speed: ~12 MB/s (15× faster than cmix)
Compression rate: 28% (0.28 bytes per input byte)

Results: Neural Compression Performance

Compressor               Compression Rate   Ratio   Speed (MB/s)   Notes
cmix (teacher)           12%                8.3:1   ~0.8           Ground truth, extremely slow
Neural Model (student)   28%                3.6:1   ~12            15× faster, but lower ratio
HoloCodec                20%                5.0:1   ~10            Better ratio, similar speed
7-Zip LZMA2              24%                4.2:1   ~15            Faster, competitive ratio
✓ Speed improvement: 15× faster than the cmix teacher (0.8 → 12 MB/s)
✗ Compression gap: -57% ratio loss vs cmix (8.3:1 → 3.6:1)
⚠ vs HoloCodec: -28% worse ratio (3.6:1 vs HoloCodec's 5.0:1)
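For reference, the ratio and gap figures above follow directly from the compression rates (ratio = 1 / rate; gaps are relative changes in ratio). A quick check in Python:

rates = {"cmix": 0.12, "neural": 0.28, "holocodec": 0.20}
ratios = {k: round(1.0 / v, 1) for k, v in rates.items()}   # {'cmix': 8.3, 'neural': 3.6, 'holocodec': 5.0}
speedup = 12 / 0.8                                          # 15x faster than cmix
gap_vs_cmix = 1 - ratios["neural"] / ratios["cmix"]         # ~0.57 -> "-57%"
gap_vs_holo = 1 - ratios["neural"] / ratios["holocodec"]    # 0.28 -> "-28%"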

Why Neural Compression Didn't Make the Cut

Critical Problems:

  1. Knowledge distillation loss: The neural model couldn't capture the full complexity of cmix's multi-model mixer
  2. Worse than simpler methods: 28% compression rate vs HoloCodec's 20%, i.e. output roughly 40% larger for only a marginal speed gain (~12 vs ~10 MB/s)
  3. Training overhead: 72 hours on 8× A100 GPUs to train the model means expensive infrastructure
  4. Model size: The 180 MB model file must be distributed alongside the compressor
  5. Inference complexity: Transformer attention scales as O(n²) in context length, limiting the usable context window

Fundamental Issues:

  • Teacher-student gap: Neural networks struggle to distill algorithmic reasoning (context mixing logic)
  • Overfitting risk: Model trained on Wikipedia may not generalize to other domains
  • GPU dependency: Fast inference requires GPU; CPU-only inference ~2 MB/s (slower than HoloCodec)
  • Non-deterministic: Floating-point ops vary slightly across hardware, breaking the exact encoder/decoder agreement that arithmetic coding requires
  • Black box: Unlike algorithmic compressors, hard to debug or explain failures

Verdict: Promising research direction, but not production-ready. Neural compression is an active research area (DeepMind's Perceiver IO, etc.), but classical algorithmic approaches like HoloCodec's FA-CVM + multi-head architecture still offer better ratio/speed trade-offs for general-purpose compression.