Neural Network Compression Approach

Learning from the Master: Distilling cmix Knowledge

What if we could train a neural network to mimic cmix's decision-making but run orders of magnitude faster?

The Approach: Supervised Learning from cmix

Step 1: Instrument cmix
Step 2: Create Parquet shards
Step 3: Train PyTorch model
Result: 28% compression rate
Pipeline Overview:

Input Files (Wikipedia content)
    ↓
[Instrumented cmix compressor]
    ↓
Decision logs: context → symbol → probability
    ↓
[Convert to Parquet format]
    ↓
Training shards (millions of examples)
    ↓
[PyTorch neural network training]
    ↓
Trained model (faster inference than cmix)
    ↓
Compression: 28% size on Wikipedia-style data

Step 1: Instrumenting cmix

Modified cmix source code to log every compression decision it makes

Logged Information:

For each byte position:
• Context buffer (last N bytes)
• All model predictions
  - ICM probabilities
  - ISSE predictions
  - Mixer weights
• Final symbol probability
• Actual symbol encoded
• Arithmetic coder state

Output format:
[context_hash, model_probs[], 
 final_prob, actual_symbol]
Data volume: ~500 GB of decision logs from compressing a 100 MB Wikipedia sample
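Sketch of one logged record (Python): a minimal dataclass mirroring the output format above. The field names here are illustrative assumptions; the actual instrumentation lives in cmix's C++ source and writes its own format.

from dataclasses import dataclass
from typing import List

@dataclass
class DecisionRecord:
    context: bytes              # raw context buffer (last 32 bytes)
    context_hash: int           # hash of the context buffer
    model_probs: List[float]    # per-model predictions (ICM, ISSE, ...)
    mixer_weights: List[float]  # cmix mixer weights at this position
    final_prob: List[float]     # final 256-way distribution after mixing
    actual_symbol: int          # byte actually encoded (0-255)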

Step 2: Parquet Sharding

Converted raw logs into efficient columnar storage for training

Parquet Schema:

context: binary (last 32 bytes)
order0_prob: float32[256]
order1_prob: float32[256]
order2_prob: float32[256]
...
mixer_weights: float32[8]
final_prob: float32[256]
target_symbol: uint8

Shard organization:
shard_0000.parquet (10M examples)
shard_0001.parquet (10M examples)
... (50 shards total)
Training set size: 500M examples (50 shards × 10 million examples each)
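Shard-writing sketch (Python, PyArrow): a minimal example of materializing one shard with the schema above, assuming records arrive as dicts whose keys match the column names; the actual conversion script is not shown in this section.

import pyarrow as pa
import pyarrow.parquet as pq

def write_shard(records, path):
    # records: list of dicts with keys matching the Parquet schema above
    table = pa.table({
        "context":       pa.array([r["context"] for r in records], type=pa.binary()),
        "order0_prob":   pa.array([r["order0_prob"] for r in records],
                                  type=pa.list_(pa.float32())),   # 256 floats per row
        # order1_prob, order2_prob, ... follow the same pattern
        "mixer_weights": pa.array([r["mixer_weights"] for r in records],
                                  type=pa.list_(pa.float32())),   # 8 floats per row
        "final_prob":    pa.array([r["final_prob"] for r in records],
                                  type=pa.list_(pa.float32())),   # 256 floats per row
        "target_symbol": pa.array([r["target_symbol"] for r in records], type=pa.uint8()),
    })
    pq.write_table(table, path, compression="zstd")

# e.g. write_shard(batch_of_records, "shard_0000.parquet")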

Step 3: Neural Network Architecture

PyTorch Model (Transformer-based):

Input: context (last 32 bytes)
    ↓
Byte Embedding Layer (256 → 128 dims)
    ↓
Positional Encoding (32 positions)
    ↓
Transformer Encoder (6 layers, 8 heads)
    ↓
Context Mixer (learned weights)
    ↓
Output Layer (256 logits → softmax)
    ↓
Predicted probability distribution P(symbol | context)
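Model sketch (PyTorch): a condensed version of the architecture above. Layer sizes follow the diagram; the feed-forward width and the exact form of the "context mixer" head are not specified in this section, so the choices below are assumptions rather than the exact 45M-parameter configuration.

import torch
import torch.nn as nn

class ByteTransformerStudent(nn.Module):
    def __init__(self, ctx_len=32, d_model=128, n_heads=8, n_layers=6):
        super().__init__()
        self.embed = nn.Embedding(256, d_model)                  # byte embedding (256 -> 128)
        self.pos = nn.Parameter(torch.zeros(ctx_len, d_model))   # learned positional encoding
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)    # 6 layers, 8 heads
        self.mixer = nn.Linear(ctx_len * d_model, d_model)       # "context mixer" (assumed pooling)
        self.out = nn.Linear(d_model, 256)                       # 256 logits -> softmax

    def forward(self, ctx_bytes):                # ctx_bytes: (batch, 32) long tensor
        x = self.embed(ctx_bytes) + self.pos     # (batch, 32, 128)
        x = self.encoder(x)                      # (batch, 32, 128)
        x = self.mixer(x.flatten(1))             # (batch, 128)
        return self.out(x)                       # logits; softmax gives P(symbol | context)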

Loss function: Cross-entropy between:
  - Neural model's predicted distribution (student)
  - cmix's final probability distribution (soft teacher targets)

Training:
  - Batch size: 1024
  - Learning rate: 1e-4 (Adam)
  - Epochs: 10
  - Hardware: 8× A100 GPUs
  - Time: ~72 hours
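Training-step sketch (PyTorch): the distillation objective is cross-entropy against cmix's final distribution used as soft targets. Data loading from the Parquet shards and multi-GPU handling are omitted; this assumes the ByteTransformerStudent sketch above.

import torch
import torch.nn.functional as F

model = ByteTransformerStudent()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

def distill_step(context_bytes, teacher_probs):
    # context_bytes: (B, 32) long tensor; teacher_probs: (B, 256) cmix final_prob
    logits = model(context_bytes)                        # student prediction
    log_p = F.log_softmax(logits, dim=-1)
    loss = -(teacher_probs * log_p).sum(dim=-1).mean()   # soft-target cross-entropy
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()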
Model size: 45M params (~180 MB on disk)
Inference speed: ~12 MB/s (15× faster than cmix)
Compression rate: 28% (0.28 bytes per input byte)

Results: Neural Compression Performance

Compressor               Compression Rate   Ratio   Speed (MB/s)   Notes
cmix (teacher)           12%                8.3:1   ~0.8           Ground truth, extremely slow
Neural Model (student)   28%                3.6:1   ~12            15× faster, but lower ratio
HoloCodec                20%                5.0:1   ~10            Better ratio, similar speed
7-Zip LZMA2              24%                4.2:1   ~15            Faster, competitive ratio
✓ Speed improvement: 15× faster than the cmix teacher (0.8 → 12 MB/s)
✗ Compression gap: -57% ratio loss vs cmix (8.3:1 → 3.6:1)
⚠ vs HoloCodec: -28% worse ratio (3.6:1 vs HoloCodec's 5.0:1)
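For reference, the ratio and gap figures above follow directly from the compression rates (ratio = 1 / rate; gaps are relative changes in ratio). A quick check in Python:

rates = {"cmix": 0.12, "neural": 0.28, "holocodec": 0.20}
ratios = {k: round(1.0 / v, 1) for k, v in rates.items()}   # {'cmix': 8.3, 'neural': 3.6, 'holocodec': 5.0}
speedup = 12 / 0.8                                          # 15x faster than cmix
gap_vs_cmix = 1 - ratios["neural"] / ratios["cmix"]         # ~0.57 -> "-57%"
gap_vs_holo = 1 - ratios["neural"] / ratios["holocodec"]    # 0.28 -> "-28%"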

Why Neural Compression Didn't Make the Cut

Critical Problems:

  1. Knowledge distillation loss: The neural model couldn't capture the full complexity of cmix's multi-model mixer
  2. Worse than simpler methods: 28% compression rate vs HoloCodec's 20%, i.e. output roughly 40% larger for only a marginal speed gain (~12 vs ~10 MB/s)
  3. Training overhead: 72 hours on 8× A100 GPUs to train the model means expensive infrastructure
  4. Model size: The 180 MB model file must be distributed alongside the compressor
  5. Inference complexity: Transformer attention scales as O(n²) in context length, limiting the usable context window

Fundamental Issues:

  • Teacher-student gap: Neural networks struggle to distill algorithmic reasoning (context mixing logic)
  • Overfitting risk: Model trained on Wikipedia may not generalize to other domains
  • GPU dependency: Fast inference requires GPU; CPU-only inference ~2 MB/s (slower than HoloCodec)
  • Non-deterministic: Floating-point ops vary slightly across hardware, breaking the exact encoder/decoder agreement that arithmetic coding requires
  • Black box: Unlike algorithmic compressors, hard to debug or explain failures

Verdict: Promising research direction, but not production-ready. Neural compression is an active research area (DeepMind's Perceiver IO, etc.), but classical algorithmic approaches like HoloCodec's FA-CVM + multi-head architecture still offer better ratio/speed trade-offs for general-purpose compression.