4-bit KV Cache Quantization (Laguna-XS.2)
A study of low-bit KV-cache quantization for vLLM. Full writeup, findings, and implementation notes are in PROBLEM.md.
Finding
The lever for 4-bit KV is the block layout, not the number format or the calibration. Uniform INT4 with per-channel-K (KIVI) blocking, i.e. a 16-token block per channel for K and per-token for V, beats vLLM's shipped NVFP4 / head_dim / absmax baseline by ~25% on K reconstruction error at identical memory (3.56× vs BF16), and the advantage grows with context (neutral below ~512 tokens, ~20-25% KL reduction beyond 1k). 4-bit is the quality floor; INT3 costs ~2-3× the distortion. The catch: per-channel blocking can't ride NVFP4's hardware microscale, so capturing it needs a software INT4 KV path.
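To make the layout claim concrete, here is a minimal sketch that round-trips a synthetic K tensor through symmetric INT4 with absmax scales under the two layouts. It is not the repo's kv_quant.py API: every function name is hypothetical, the data is synthetic (Gaussian with a few large-magnitude channels, the pattern that motivates per-channel K in KIVI), and the memory figure assumes one 8-bit scale per 16-element block. It covers K only, and the numbers it prints say nothing about real activations.

```python
import math
import random

BLOCK = 16  # elements sharing one scale, identical in both layouts
QMAX = 7    # symmetric INT4 codes in [-7, 7]

def quant_block(values):
    """Absmax-quantize one block to symmetric INT4, then dequantize it."""
    absmax = max(abs(v) for v in values)
    if absmax == 0.0:
        return list(values)
    scale = absmax / QMAX
    return [max(-QMAX, min(QMAX, round(v / scale))) * scale for v in values]

def roundtrip_head_dim(k):
    """Baseline layout: one scale per 16 contiguous channels of one token."""
    out = []
    for row in k:
        new_row = []
        for start in range(0, len(row), BLOCK):
            new_row.extend(quant_block(row[start:start + BLOCK]))
        out.append(new_row)
    return out

def roundtrip_per_channel(k):
    """Per-channel layout: one scale per 16 consecutive tokens of one channel."""
    n_tokens, head_dim = len(k), len(k[0])
    out = [[0.0] * head_dim for _ in range(n_tokens)]
    for c in range(head_dim):
        for start in range(0, n_tokens, BLOCK):
            stop = min(start + BLOCK, n_tokens)
            column = [k[t][c] for t in range(start, stop)]
            for offset, v in enumerate(quant_block(column)):
                out[start + offset][c] = v
    return out

def rmse(a, b):
    sq = [(x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)]
    return math.sqrt(sum(sq) / len(sq))

def synthetic_k(n_tokens=64, head_dim=64, outlier_gain=10.0, seed=0):
    """Assumed toy data: one outlier channel in every group of 16 channels."""
    rng = random.Random(seed)
    gains = [outlier_gain if c % BLOCK == 3 else 1.0 for c in range(head_dim)]
    return [[rng.gauss(0.0, 1.0) * g for g in gains] for _ in range(n_tokens)]

def compression_vs_bf16(code_bits=4, scale_bits=8, block=BLOCK):
    """Memory ratio vs 16-bit storage, assuming one scale per block."""
    return 16 / (code_bits + scale_bits / block)

k = synthetic_k()
err_head_dim = rmse(k, roundtrip_head_dim(k))
err_per_channel = rmse(k, roundtrip_per_channel(k))
print(f"head_dim blocks:    RMSE = {err_head_dim:.4f}")
print(f"per-channel blocks: RMSE = {err_per_channel:.4f}")
print(f"compression vs BF16: {compression_vs_bf16():.2f}x")
```

Both layouts store one scale per 16 elements, so their memory cost is the same; only the axis along which a block runs changes. In the head_dim layout an outlier channel inflates the scale of the 15 ordinary channels that share its block, whereas per-channel blocks keep outliers in blocks of their own.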
Layout
- kv_quant.py: quantization primitives, covering formats (INT4 / INT3 / NVFP4-e2m1), block layouts (per-head-dim, per-channel), absmax / MSE-optimal calibration, the {format} × {layout} × {calib} sweep, and roundtrip.
- scripts/quant_sweep.py: reconstruction-RMSE grid over real KV activations.
- scripts/quant_ab.py: frozen-page teacher-forced KL A/B vs BF16 (incl. INT3).
- scripts/quant_longctx.py: long-context KL-by-position via an SDPA patch.
- PROBLEM.md: full analysis, vLLM kernel notes, cost/benefit.
Run
uv sync
.venv/bin/python scripts/quant_sweep.py # RMSE grid
.venv/bin/python scripts/quant_ab.py # downstream KL A/B
.venv/bin/python scripts/quant_longctx.py # long-context trend
Scripts target poolside/Laguna-XS.2 and run on the Blackwell node.
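For readers unfamiliar with the KL metric the A/B and long-context scripts report, here is a minimal sketch of per-position KL between a reference and a quantized run under teacher forcing, where both runs score the same token sequence so positions line up one to one. The direction KL(BF16 || quantized), the function names, and the toy logits are assumptions for illustration; the scripts may compute it differently.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized logit vector."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def kl_divergence(ref_logits, test_logits):
    """KL(ref || test) in nats for a single position."""
    ref_lp = log_softmax(ref_logits)
    test_lp = log_softmax(test_logits)
    return sum(math.exp(r) * (r - t) for r, t in zip(ref_lp, test_lp))

def kl_by_position(ref_seq, test_seq):
    """One KL value per position; inputs are lists of logit vectors."""
    return [kl_divergence(r, t) for r, t in zip(ref_seq, test_seq)]

# Toy logits over a 3-token vocabulary at two positions (made-up numbers).
ref_run = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]]    # stands in for the BF16 KV cache
quant_run = [[2.0, 0.5, -1.0], [0.3, 0.0, 2.5]]  # stands in for a quantized KV cache

per_pos = kl_by_position(ref_run, quant_run)
mean_kl = sum(per_pos) / len(per_pos)
print("KL by position:", [f"{v:.5f}" for v in per_pos])
print(f"mean KL: {mean_kl:.5f}")
```

Averaging the list gives a single A/B number. Keeping it per position is what lets a trend with context length show up.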