# Compress your LLM's KV cache 33x with zero training

Source: DEV Community
Running out of GPU memory at long context lengths? The KV cache grows linearly with sequence length — at 128K tokens, a 7B model accumulates over 60 GB of KV state. That's more than a single A100. I built NexusQuant, a library that compresses the KV cache 10-33x at inference time. No training, no calibration data, no model changes.

Before:

```python
# OOM at 32K tokens on a 24GB GPU
output = model.generate(input_ids, max_new_tokens=512)
```

After:

```python
from nexusquant import nexusquant_evict

with nexusquant_evict(model, quality="balanced"):
    output = model.generate(input_ids, max_new_tokens=512)
```

128K context now fits in the memory that used to hold 7.5K.

## What it does

Six stages, applied once after the prefill pass:

1. Rank tokens by attention importance
2. Drop the lowest-scoring tokens (token eviction)
3. Undo rotary position embeddings on keys
4. Apply Hadamard rotation to spread energy uniformly
5. Quantize 8-float groups onto the E8 lattice (densest sphere packing in 8D)
6. Delta-code consecutive indices and compress wit
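Stages 1-2 (importance ranking and eviction) can be sketched as follows. This is a toy illustration, not NexusQuant's API: the names (`evict_tokens`, `keep_ratio`) are made up. It scores each cached token by the total attention mass it receives, summed over heads and query positions, then keeps only the top fraction, preserving original token order:

```python
import numpy as np

def evict_tokens(keys, values, attn_weights, keep_ratio=0.25):
    # attn_weights: (heads, q_len, kv_len) softmaxed attention probabilities.
    # Stage 1: score each cached token by the attention it receives.
    scores = attn_weights.sum(axis=(0, 1))              # (kv_len,)
    n_keep = max(1, int(len(scores) * keep_ratio))
    # Stage 2: keep the top scorers, in their original positions.
    keep = np.sort(np.argsort(scores)[-n_keep:])
    return keys[keep], values[keep], keep

# Toy data: 4 heads, 8 queries, 16 cached tokens, head_dim 8.
rng = np.random.default_rng(0)
attn = rng.random((4, 8, 16))
attn /= attn.sum(axis=-1, keepdims=True)                # rows sum to 1, like softmax
k = rng.standard_normal((16, 8))
v = rng.standard_normal((16, 8))
k_small, v_small, kept = evict_tokens(k, v, attn)
print(k_small.shape)  # (4, 8): 75% of the cache evicted
```

The real library presumably computes scores inside the attention kernels during prefill; the point here is only the rank-then-gather shape of the operation.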
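Stage 4 can be illustrated with a fast Walsh-Hadamard transform. An orthonormal rotation preserves vector norms but smears any single outlier's energy across all coordinates, so the subsequent lattice quantizer sees well-conditioned inputs. A minimal sketch, independent of NexusQuant's actual implementation:

```python
import numpy as np

def hadamard_rotate(x):
    """Orthonormal fast Walsh-Hadamard transform of a length-2^k vector."""
    n = len(x)
    h = x.astype(float).copy()
    step = 1
    while step < n:
        for i in range(0, n, 2 * step):
            for j in range(i, i + step):
                a, b = h[j], h[j + step]
                h[j], h[j + step] = a + b, a - b
        step *= 2
    return h / np.sqrt(n)                 # normalize so the map is orthogonal

x = np.array([10.0, 0, 0, 0, 0, 0, 0, 0])     # one extreme outlier
y = hadamard_rotate(x)
print(y)                                      # all components equal: energy spread uniformly
print(np.allclose(hadamard_rotate(y), x))     # True: the normalized transform inverts itself
```

Because the normalized Hadamard transform is its own inverse, undoing the rotation at decode time costs the same single pass.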
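Stage 5 rests on the classic nearest-point algorithm for the E8 lattice: E8 is the union of D8 (integer vectors with even coordinate sum) and D8 shifted by 1/2 in every coordinate, so decoding means finding the nearest point in each coset and keeping the closer one. A sketch under that standard construction (again, not NexusQuant's code):

```python
import numpy as np

def nearest_d8(x):
    """Nearest point of D8 = {z in Z^8 : sum(z) is even}."""
    f = np.round(x)
    if int(f.sum()) % 2 != 0:
        # Parity is wrong: re-round the coordinate with the largest rounding error.
        i = int(np.argmax(np.abs(x - f)))
        step = np.sign(x[i] - f[i])
        f[i] += step if step != 0 else 1.0
    return f

def nearest_e8(x):
    """Decode both cosets of E8 = D8 ∪ (D8 + 1/2) and keep the closer point."""
    c0 = nearest_d8(x)
    c1 = nearest_d8(x - 0.5) + 0.5
    return c0 if np.sum((x - c0) ** 2) <= np.sum((x - c1) ** 2) else c1

g = np.array([0.9, 1.1, 0.2, -0.1, 0.0, 0.0, 0.0, 0.0])   # one 8-float group
print(nearest_e8(g))   # snaps to the lattice point [1, 1, 0, ..., 0]
```

Each 8-float group is replaced by the index of its nearest lattice point, which is why the densest-packing property of E8 matters: it minimizes quantization error per bit in 8 dimensions.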
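Stage 6's delta coding is the standard trick for sorted index lists: store the first index plus successive gaps, which are small positive integers that a downstream entropy coder handles far better than raw positions. A hedged sketch of the idea:

```python
def delta_encode(indices):
    """Sorted indices -> first value plus successive gaps."""
    return [indices[0]] + [b - a for a, b in zip(indices, indices[1:])]

def delta_decode(deltas):
    """Invert delta_encode with a running sum."""
    out, acc = [], 0
    for d in deltas:
        acc += d
        out.append(acc)
    return out

kept = [3, 4, 9, 10, 11, 40, 41]          # hypothetical surviving token positions
deltas = delta_encode(kept)
print(deltas)                             # [3, 1, 5, 1, 1, 29, 1]
print(delta_decode(deltas) == kept)       # True
```

Runs of consecutive kept tokens become runs of 1s, which is exactly the kind of low-entropy stream a general-purpose compressor squeezes well.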