Unweight: how we compressed an LLM 22% without sacrificing quality
AI-Generated Summary: This summary was generated automatically. For full details and context, please read the original post.
Cloudflare's Unweight: A Lossless Compression System for LLM Weights
Cloudflare has developed Unweight, a lossless compression system that shrinks Large Language Model (LLM) weights by up to 22% without sacrificing quality, making inference on Cloudflare's network faster and cheaper. Unweight compresses model weights selectively, yielding a 15-22% reduction in model size and saving around 3 GB of VRAM per model. The system is designed to work seamlessly with Cloudflare's Rust-based inference engine and NVIDIA H100 GPUs.
Key Technical Details
Unweight combines several techniques to achieve lossless compression, including entropy coding and selective compression of model weights. The system chooses among multiple execution strategies (some prioritizing simplicity, others minimizing memory traffic) and uses an autotuner to pick the best one for each weight matrix and batch size. At runtime, Unweight decompresses weights in fast on-chip memory and feeds them directly to the tensor cores, avoiding an extra round trip through slow main memory.
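To give a feel for why lossless weight compression works at all, here is a toy sketch of entropy-coded, selective compression. It is an illustration under stated assumptions, not Unweight's actual scheme: `zlib` stands in for the real entropy coder, the Gaussian "weights", the bfloat16 byte split, and the `maybe_compress` helper are all invented for this example.

```python
import math
import random
import struct
import zlib

def shannon_entropy(data: bytes) -> float:
    """Empirical entropy of a byte stream, in bits per byte."""
    counts = {}
    for b in data:
        counts[b] = counts.get(b, 0) + 1
    n = len(data)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

# Synthetic "weights": small Gaussian values, like trained LLM weight matrices.
random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(65536)]

# bfloat16 is the top two bytes of a float32. The high byte (sign plus most
# of the exponent) clusters around a handful of values, so it has low entropy;
# the low byte (mostly mantissa bits) is close to uniform.
hi = bytes(struct.pack(">f", w)[0] for w in weights)
lo = bytes(struct.pack(">f", w)[1] for w in weights)

# Entropy-code each stream, and do it *selectively*: keep the coded form only
# when it actually shrinks (a real format would also store a flag saying
# which form was kept, so the decompressor knows what to do).
def maybe_compress(stream: bytes) -> bytes:
    packed = zlib.compress(stream, 9)
    return packed if len(packed) < len(stream) else stream

hi_packed = maybe_compress(hi)
lo_packed = maybe_compress(lo)
print(f"hi byte: {shannon_entropy(hi):.2f} bits/byte, "
      f"size ratio {len(hi_packed) / len(hi):.2f}")
print(f"lo byte: {shannon_entropy(lo):.2f} bits/byte, "
      f"size ratio {len(lo_packed) / len(lo):.2f}")

# Lossless round trip: the low-entropy stream compressed, and decompresses
# back to exactly the original bytes.
assert zlib.decompress(hi_packed) == hi
```

The low-entropy exponent stream compresses well while the mantissa stream barely does, which is the intuition behind compressing weights selectively rather than uniformly.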
Practical Implications for Developers
Unweight's lossless compression system opens up new possibilities for developers working with LLMs. By reducing model size and VRAM requirements, developers can:
- Fit more models on a single GPU, making inference cheaper and enabling more diverse use cases
- Lower the VRAM requirement of each model, freeing memory for larger batches
- Take advantage of Cloudflare's Rust-based inference engine and NVIDIA H100 GPUs
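To make the VRAM arithmetic behind those benefits concrete, here is a back-of-envelope sketch. The specific figures (an 8B-parameter bf16 model, an 80 GB H100, a flat 22% saving, and ignoring KV cache and activations) are illustrative assumptions, not numbers from the post:

```python
# Back-of-envelope: how many 8B-parameter bf16 models fit in one 80 GB H100?
PARAMS = 8e9
BYTES_PER_PARAM = 2          # bf16
H100_VRAM_GB = 80
SAVINGS = 0.22               # top of the post's 15-22% range

raw_gb = PARAMS * BYTES_PER_PARAM / 1e9      # 16 GB of weights per model
compressed_gb = raw_gb * (1 - SAVINGS)       # ~12.5 GB after compression

fit_raw = int(H100_VRAM_GB // raw_gb)
fit_compressed = int(H100_VRAM_GB // compressed_gb)
print(fit_raw, fit_compressed)  # -> 5 6 (weights only, ignoring KV cache)
```

Under these assumptions, one extra model fits per GPU; in practice KV cache and activations also consume VRAM, so the headroom matters even when the model count doesn't change.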
Open-Source and Technical Paper
Cloudflare has published a technical paper and open-sourced the GPU kernels for Unweight, encouraging innovation in this rapidly developing space. The initial results on Llama-3.1-8B show ~30% compression of Multi-Layer Perceptron (MLP) weights alone.
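As a rough consistency check on that MLP-only figure, the arithmetic below relates it to the headline whole-model number. This is my own decomposition, not the paper's: it assumes the public Llama-3.1-8B shapes (hidden size 4096, intermediate size 14336, 32 layers, three projection matrices per MLP) and applies the ~30% saving uniformly across MLP weights:

```python
# How does ~30% compression of MLP weights alone relate to a ~22%
# whole-model saving? (Illustrative decomposition, not from the paper.)
hidden, intermediate, layers = 4096, 14336, 32
total_params = 8.03e9                          # Llama-3.1-8B

mlp_params = 3 * hidden * intermediate * layers  # gate, up, down projections
mlp_frac = mlp_params / total_params             # MLP share of all weights
overall_saving = 0.30 * mlp_frac                 # if only MLP weights shrink

print(round(mlp_frac, 2), round(overall_saving, 2))
```

MLP weights are roughly 70% of the model, so ~30% compression of them alone already accounts for a saving of about 21% of the whole model, consistent with the 15-22% range reported above.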