Unweight: how we compressed an LLM 22% without sacrificing quality

Mari Galicer, Ivan Nikulin, Chris Branch
Agents Week Research AI

AI-Generated Summary: This is an automated summary created using AI. For the full details and context, please read the original post.

Cloudflare's Unweight: A Lossless Compression System for LLM Weights

Cloudflare has developed Unweight, a lossless compression system that reduces Large Language Model (LLM) weights by 15-22% without any change to model outputs, enabling faster and cheaper inference on Cloudflare's network. Unweight achieves this by selectively compressing model weights, which in practice saves around 3 GB of VRAM per model. The system is designed to work seamlessly with Cloudflare's Rust-based inference engine and NVIDIA H100 GPUs.
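To make the headline numbers concrete, here is a back-of-the-envelope calculation (the parameter count and fp16 assumption are illustrative; the post's exact measurement methodology may differ):

```python
# Rough arithmetic: an 8B-parameter model stored in fp16 (2 bytes/param)
# is ~16 GB, so a 15-22% lossless reduction saves a few GB of VRAM.
params = 8e9                  # assumed: Llama-3.1-8B scale
bytes_per_param = 2           # assumed: fp16 weights
model_gb = params * bytes_per_param / 1e9

for ratio in (0.15, 0.22):
    saved = model_gb * ratio
    print(f"{ratio:.0%} reduction -> ~{saved:.1f} GB saved")
```

At the 22% end this works out to roughly 3.5 GB, in line with the ~3 GB VRAM savings cited above.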

Key Technical Details

Unweight uses a combination of techniques to achieve lossless compression, including entropy coding and selective compression of model weights. The system selects from multiple execution strategies, prioritizing simplicity or minimizing memory traffic, and uses an autotuner to pick the best one per weight matrix and batch size. Unweight's runtime decompresses weights in fast on-chip memory and feeds them directly to the tensor cores, avoiding an extra round-trip through slow main memory.
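The combination of entropy coding and selective compression can be sketched in a few lines. This is a minimal illustration, not Unweight's implementation: the entropy-threshold heuristic, the `zlib` codec, and the function names here are assumptions standing in for Cloudflare's GPU kernels.

```python
# Sketch: estimate a tensor's byte-level entropy, and compress it only
# when the estimate predicts a worthwhile gain (selective compression).
import math
import zlib
from collections import Counter

def byte_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte; 8.0 means incompressible."""
    counts = Counter(data)
    n = len(data)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def maybe_compress(tensor_bytes: bytes, threshold_bits: float = 7.0):
    """Compress only low-entropy tensors; leave the rest untouched.

    The 7.0-bit threshold is an illustrative choice, not Unweight's.
    """
    if byte_entropy(tensor_bytes) < threshold_bits:
        return ("compressed", zlib.compress(tensor_bytes))
    return ("raw", tensor_bytes)

# Skewed data (mostly zeros, low entropy) gets compressed;
# near-uniform data would be left raw.
skewed = bytes([0] * 900 + list(range(100)))
kind, payload = maybe_compress(skewed)
print(kind, len(payload), "<", len(skewed))
```

The selectivity matters because entropy coding only pays off on weights whose value distribution is skewed; compressing everything indiscriminately would waste decode time on incompressible tensors.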

Practical Implications for Developers

Unweight's lossless compression system opens up new possibilities for developers working with LLMs. By reducing model size and VRAM requirements, developers can:

  • Fit more models on a single GPU, making inference cheaper and enabling more diverse use cases
  • Reduce per-model VRAM requirements with no loss in output quality
  • Take advantage of Cloudflare's Rust-based inference engine on NVIDIA H100 GPUs
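The "more models per GPU" point follows directly from the size reduction. A rough capacity sketch (the 80 GB GPU, ~16 GB model size, and 20% reduction are assumed round numbers; real deployments also need headroom for KV cache and activations):

```python
# Illustrative: how many ~16 GB models fit on an 80 GB GPU,
# before and after a ~20% lossless size reduction.
gpu_gb = 80.0                 # assumed: H100 80 GB
model_gb = 16.0               # assumed: 8B params in fp16
compressed_gb = model_gb * (1 - 0.20)   # 12.8 GB

print(int(gpu_gb // model_gb), "models uncompressed")
print(int(gpu_gb // compressed_gb), "models compressed")
```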

Open-Source and Technical Paper

Cloudflare has published a technical paper and open-sourced the GPU kernels for Unweight, encouraging innovation in this rapidly developing space. The initial results on Llama-3.1-8B show ~30% compression of Multi-Layer Perceptron (MLP) weights alone.
