4-bit KV Cache Quantization (Laguna-XS.2)

A study of low-bit KV-cache quantization for vLLM. Full writeup, findings, and implementation notes are in PROBLEM.md.

Finding

The lever for 4-bit KV is the block layout, not the number format or the calibration. Uniform INT4 with per-channel-K (KIVI) blocking (a 16-token block per channel for K, per-token for V) beats vLLM's shipped NVFP4 / head_dim / absmax baseline by ~25% on K reconstruction error at identical memory (3.56× vs BF16), and the advantage grows with context (neutral below ~512 tokens, ~20–25% KL reduction beyond 1k). 4-bit is the quality floor; INT3 costs ~2–3× the distortion. The catch: per-channel blocking can't ride NVFP4's hardware microscale, so capturing it needs a software INT4 KV path.
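The layout difference described above can be illustrated with a minimal pure-Python sketch. It is not the code in kv_quant.py: the function names, the symmetric [-7, 7] INT4 grid, and the synthetic K matrix (one channel with a large, token-consistent magnitude standing in for a K-cache outlier channel) are all assumptions made for illustration.

```python
import math
import random

QMAX = 7  # assumed symmetric INT4 grid: integers in [-7, 7]


def quant_block(block):
    """Absmax-quantize one block to INT4 and return the dequantized values."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return [0.0] * len(block)
    scale = amax / QMAX
    return [max(-QMAX, min(QMAX, round(x / scale))) * scale for x in block]


def quant_along_head_dim(k, block=16):
    """Baseline layout: each token row is split into groups of `block` channels."""
    out = []
    for row in k:
        deq = []
        for start in range(0, len(row), block):
            deq.extend(quant_block(row[start:start + block]))
        out.append(deq)
    return out


def quant_along_tokens(k, block=16):
    """Per-channel layout: each channel is split into groups of `block` tokens."""
    n_tok, dim = len(k), len(k[0])
    out = [[0.0] * dim for _ in range(n_tok)]
    for c in range(dim):
        col = [k[t][c] for t in range(n_tok)]
        for start in range(0, n_tok, block):
            deq = quant_block(col[start:start + block])
            for i, v in enumerate(deq):
                out[start + i][c] = v
    return out


def rmse(a, b):
    """Root-mean-square error between two equally shaped 2-D lists."""
    n = len(a) * len(a[0])
    total = sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    return math.sqrt(total / n)


# Synthetic K: 64 tokens x 32 channels. Channel 3 gets a large offset on every
# token (an assumption, not measured Laguna-XS.2 data).
rng = random.Random(0)
k = [[rng.gauss(0.0, 1.0) for _ in range(32)] for _ in range(64)]
for row in k:
    row[3] += 20.0

rmse_head_dim = rmse(k, quant_along_head_dim(k))
rmse_tokens = rmse(k, quant_along_tokens(k))
print(f"blocks along head_dim: RMSE = {rmse_head_dim:.4f}")
print(f"blocks along tokens:   RMSE = {rmse_tokens:.4f}")
```

With blocks along head_dim, the outlier channel sets the scale for the 15 ordinary channels that share its block on every token; with blocks along tokens, it only affects its own channel. On memory: assuming one 8-bit scale per 16-element block (NVFP4's block size), either layout costs 4 + 8/16 = 4.5 bits per element, and 16 / 4.5 ≈ 3.56, consistent with the ratio quoted above.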

Layout

  • kv_quant.py: quantization primitives, namely formats (INT4 / INT3 / NVFP4-e2m1), block layouts (per-head-dim, per-channel), absmax / MSE-optimal calibration, the {format} × {layout} × {calib} sweep, and roundtrip.
  • scripts/quant_sweep.py: reconstruction-RMSE grid over real KV activations.
  • scripts/quant_ab.py: frozen-page teacher-forced KL A/B vs BF16 (incl. INT3).
  • scripts/quant_longctx.py: long-context KL-by-position via an SDPA patch.
  • PROBLEM.md: full analysis, vLLM kernel notes, cost/benefit.
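The two calibration modes named in the list can be contrasted with a short sketch. This is an illustration under stated assumptions, not kv_quant.py's implementation: the function names, the symmetric [-7, 7] grid, and the clip-ratio grid search are hypothetical choices, and the repository may calibrate differently.

```python
import random

QMAX = 7  # assumed symmetric INT4 grid: integers in [-7, 7]


def roundtrip(block, scale):
    """Quantize to the INT4 grid with the given scale, clip, and dequantize."""
    if scale == 0.0:
        return [0.0] * len(block)
    return [max(-QMAX, min(QMAX, round(x / scale))) * scale for x in block]


def mse(block, scale):
    """Mean squared reconstruction error of `block` at the given scale."""
    deq = roundtrip(block, scale)
    return sum((x - y) ** 2 for x, y in zip(block, deq)) / len(block)


def absmax_scale(block):
    """Scale that maps the largest magnitude exactly onto the grid edge."""
    return max(abs(x) for x in block) / QMAX


def mse_optimal_scale(block, steps=50):
    """Grid-search a clip ratio in [0.5, 1.0] and keep the lowest-MSE scale.

    The ratio 1.0 reproduces absmax, so the result is never worse than absmax
    on the block it was calibrated on.
    """
    base = absmax_scale(block)
    candidates = [base * (i / steps) for i in range(steps // 2, steps + 1)]
    return min(candidates, key=lambda s: mse(block, s))


# One 16-element block with a single large value (synthetic, for illustration).
rng = random.Random(0)
block = [rng.gauss(0.0, 1.0) for _ in range(15)] + [6.0]

s_abs = absmax_scale(block)
s_opt = mse_optimal_scale(block)
print(f"absmax:      scale = {s_abs:.4f}, MSE = {mse(block, s_abs):.5f}")
print(f"MSE-optimal: scale = {s_opt:.4f}, MSE = {mse(block, s_opt):.5f}")
```

MSE-optimal calibration trades some clipping error on the largest value for a finer step on the rest of the block. Per the Finding section, this choice matters less than the block layout.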

Run

uv sync
.venv/bin/python scripts/quant_sweep.py     # RMSE grid
.venv/bin/python scripts/quant_ab.py        # downstream KL A/B
.venv/bin/python scripts/quant_longctx.py   # long-context trend

Scripts target poolside/Laguna-XS.2 and run on the Blackwell node.
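For readers interpreting the KL numbers, the following sketch shows one way a teacher-forced KL-by-position metric can be computed from two sets of next-token logits (BF16 cache vs quantized cache). It is an assumption about the metric's shape, not the code in quant_ab.py or quant_longctx.py: the function names, the KL direction (reference to quantized), and the bucket size are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized list."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def kl_divergence(ref_logits, test_logits):
    """KL(p_ref || p_test) in nats for one token position."""
    lp = log_softmax(ref_logits)
    lq = log_softmax(test_logits)
    return sum(math.exp(a) * (a - b) for a, b in zip(lp, lq))


def kl_by_position(ref_seq, test_seq, bucket=512):
    """Mean KL per bucket of `bucket` consecutive positions.

    Both sequences hold per-position logits produced under teacher forcing,
    i.e. both runs are fed the same ground-truth tokens.
    """
    sums, counts = {}, {}
    for pos, (ref, test) in enumerate(zip(ref_seq, test_seq)):
        b = pos // bucket
        sums[b] = sums.get(b, 0.0) + kl_divergence(ref, test)
        counts[b] = counts.get(b, 0) + 1
    return {b * bucket: sums[b] / counts[b] for b in sorted(sums)}


# Toy data: 4 positions, vocabulary of 3. The last two positions are perturbed
# to mimic quantization drift later in the context.
ref_seq = [[2.0, 0.5, -1.0]] * 4
test_seq = [[2.0, 0.5, -1.0]] * 2 + [[1.5, 0.8, -1.0]] * 2
trend = kl_by_position(ref_seq, test_seq, bucket=2)
for start, value in trend.items():
    print(f"positions {start}+: mean KL = {value:.6f}")
```

Bucketing by position is what makes a context-length trend visible: a single sequence-wide mean would hide a gap that only opens beyond a few hundred tokens.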
