Unweight: how we compressed an LLM 22% without sacrificing quality
AI-Generated Summary: This summary was generated automatically. For full details and context, please read the original post.
Cloudflare's Unweight: A Lossless Compression System for LLM Weights
Cloudflare has developed Unweight, a lossless compression system that shrinks Large Language Model (LLM) weights by up to 22% without sacrificing quality, making inference on Cloudflare's network faster and cheaper. Unweight compresses model weights selectively, yielding a 15-22% reduction in model size and saving around 3 GB of VRAM per model. The system is designed to work seamlessly with Cloudflare's Rust-based inference engine and NVIDIA H100 GPUs.
Key Technical Details
Unweight combines several techniques to achieve lossless compression, including entropy coding and selective compression of model weights. The system chooses among multiple execution strategies (some prioritizing simplicity, others minimizing memory traffic) and uses an autotuner to pick the best one for each weight matrix and batch size. At runtime, Unweight decompresses weights in fast on-chip memory and feeds them directly to the tensor cores, avoiding an extra round trip through slow main memory.
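To give a feel for why lossless weight compression works at all, here is a toy sketch of entropy-coded, selective compression. It is an illustration under stated assumptions, not Unweight's actual scheme: `zlib` stands in for the real entropy coder, the Gaussian "weights", the bfloat16 byte split, and the `maybe_compress` helper are all invented for this example.

```python
import math
import random
import struct
import zlib

def shannon_entropy(data: bytes) -> float:
    """Empirical entropy of a byte stream, in bits per byte."""
    counts = {}
    for b in data:
        counts[b] = counts.get(b, 0) + 1
    n = len(data)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

# Synthetic "weights": small Gaussian values, like trained LLM weight matrices.
random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(65536)]

# bfloat16 is the top two bytes of a float32. The high byte (sign plus most
# of the exponent) clusters around a handful of values, so it has low entropy;
# the low byte (mostly mantissa bits) is close to uniform.
hi = bytes(struct.pack(">f", w)[0] for w in weights)
lo = bytes(struct.pack(">f", w)[1] for w in weights)

# Entropy-code each stream, and do it *selectively*: keep the coded form only
# when it actually shrinks (a real format would also store a flag saying
# which form was kept, so the decompressor knows what to do).
def maybe_compress(stream: bytes) -> bytes:
    packed = zlib.compress(stream, 9)
    return packed if len(packed) < len(stream) else stream

hi_packed = maybe_compress(hi)
lo_packed = maybe_compress(lo)
print(f"hi byte: {shannon_entropy(hi):.2f} bits/byte, "
      f"size ratio {len(hi_packed) / len(hi):.2f}")
print(f"lo byte: {shannon_entropy(lo):.2f} bits/byte, "
      f"size ratio {len(lo_packed) / len(lo):.2f}")

# Lossless round trip: the low-entropy stream compressed, and decompresses
# back to exactly the original bytes.
assert zlib.decompress(hi_packed) == hi
```

The low-entropy exponent stream compresses well while the mantissa stream barely does, which is the intuition behind compressing weights selectively rather than uniformly.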
Practical Implications for Developers
Unweight's lossless compression system opens up new possibilities for developers working with LLMs. By reducing model size and VRAM requirements, developers can:
- Fit more models on a single GPU, making inference cheaper and enabling more diverse use cases
- Lower the VRAM requirement of each model, freeing memory for larger batches
- Take advantage of Cloudflare's Rust-based inference engine and NVIDIA H100 GPUs
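To make the VRAM arithmetic behind those benefits concrete, here is a back-of-envelope sketch. The specific figures (an 8B-parameter bf16 model, an 80 GB H100, a flat 22% saving, and ignoring KV cache and activations) are illustrative assumptions, not numbers from the post:

```python
# Back-of-envelope: how many 8B-parameter bf16 models fit in one 80 GB H100?
PARAMS = 8e9
BYTES_PER_PARAM = 2          # bf16
H100_VRAM_GB = 80
SAVINGS = 0.22               # top of the post's 15-22% range

raw_gb = PARAMS * BYTES_PER_PARAM / 1e9      # 16 GB of weights per model
compressed_gb = raw_gb * (1 - SAVINGS)       # ~12.5 GB after compression

fit_raw = int(H100_VRAM_GB // raw_gb)
fit_compressed = int(H100_VRAM_GB // compressed_gb)
print(fit_raw, fit_compressed)  # -> 5 6 (weights only, ignoring KV cache)
```

Under these assumptions, one extra model fits per GPU; in practice KV cache and activations also consume VRAM, so the headroom matters even when the model count doesn't change.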
Open-Source and Technical Paper
Cloudflare has published a technical paper and open-sourced the GPU kernels for Unweight, encouraging innovation in this rapidly developing space. The initial results on Llama-3.1-8B show ~30% compression of Multi-Layer Perceptron (MLP) weights alone.
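As a rough consistency check on that MLP-only figure, the arithmetic below relates it to the headline whole-model number. This is my own decomposition, not the paper's: it assumes the public Llama-3.1-8B shapes (hidden size 4096, intermediate size 14336, 32 layers, three projection matrices per MLP) and applies the ~30% saving uniformly across MLP weights:

```python
# How does ~30% compression of MLP weights alone relate to a ~22%
# whole-model saving? (Illustrative decomposition, not from the paper.)
hidden, intermediate, layers = 4096, 14336, 32
total_params = 8.03e9                          # Llama-3.1-8B

mlp_params = 3 * hidden * intermediate * layers  # gate, up, down projections
mlp_frac = mlp_params / total_params             # MLP share of all weights
overall_saving = 0.30 * mlp_frac                 # if only MLP weights shrink

print(round(mlp_frac, 2), round(overall_saving, 2))
```

MLP weights are roughly 70% of the model, so ~30% compression of them alone already accounts for a saving of about 21% of the whole model, consistent with the 15-22% range reported above.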