fraQtl D2 β KV-cache compression for Mistral-7B-Instruct-v0.3
Two small sidecar files that compress the KV cache of Mistral-7B in llama.cpp to ~6.6 bits/token (below 8-bit Q8), letting you run 128K context in ~12.4 GB instead of ~22 GB at near-lossless quality. No retraining, no fine-tune, no custom inference server β drop-in sidecars + a patched llama.cpp.
TL;DR (all numbers from hardened, paired, greedy, reproducible runs)
- Memory @128K: D2 12.4 GB vs Q8 15.1 GB vs fp16 22.1 GB (peak, A100-80GB) β β44% vs fp16, β18% vs Q8.
- Retrieval (needle-in-haystack, 63 paired cells, 32Kβ128K): correct answer recovered in 62/63 (one genuine digit error; two cosmetic hyphen-splits where every digit is correct). Exact-string match: 60/63.
- Real long-doc QA (LongBench @64K): D2 matches or beats fp16 on 3 of 4 subsets, and beats Q8 on the quantization-sensitive one (qasper).
What this is / isn't
- Is: a memory win β Q8-class retrieval quality at below-Q8 memory, on a full 128K context, with public receipts.
- Isn't: lossless, and not a retrieval win over Q8 (Q8 hits 63/63 exact; D2 hits 60/63 exact / 62/63 content). The value is memory at near-equal quality. No speed claim β this is a memory technique.
Recipe (locked)
K16 INT3 + uniform V32 INT3, sink=8 tokens, residual window=1024, YaRN
above 32K. Both K and V are rotated into a per-(layer,head) eigenbasis;
the top directions are kept fp16, the tail is INT3 β no dimensions are
dropped (full geometry, mixed precision).
Full results
KV memory (peak nvidia-smi, GB)
| ctx | fp16 | Q8 | D2 | D2 vs fp16 | D2 vs Q8 |
|---|---|---|---|---|---|
| 32K | 9.1 | 7.3 | 6.8 | β25% | β7% |
| 64K | 13.4 | 9.8 | 8.6 | β36% | β12% |
| 128K | 22.1 | 15.1 | 12.4 | β44% | β18% |
NIAH retrieval (paired; 7 depths Γ 3 distinct needles Γ 3 ctx = 63)
| ctx | fp16 | Q8 | D2 (exact) | D2 (content) |
|---|---|---|---|---|
| 32K | 21/21 | 21/21 | 21/21 | 21/21 |
| 64K | 21/21 | 21/21 | 20/21 | 20/21 |
| 128K | 21/21 | 21/21 | 19/21 | 21/21 |
| total | 63/63 | 63/63 | 60/63 | 62/63 |
The 3 exact-match misses: one digit flip (β¦43045β¦ββ¦43035β¦, 64K β genuine),
two hyphen-splits (β¦35099β¦ββ¦350-99β¦, all digits correct β cosmetic).
LongBench QA F1 @64K (16 longest samples/subset, official prompts)
| Subset | fp16 | Q8 | D2 |
|---|---|---|---|
| hotpotqa | 0.265 | 0.265 | 0.274 |
| 2wikimqa | 0.229 | 0.229 | 0.292 |
| multifieldqa_en | 0.428 | 0.429 | 0.470 |
| qasper | 0.542 | 0.456 | 0.479 |
How it was measured (so you can trust it)
Every run: greedy decode (--temp 0 --top-k 1, verified from the engine's own
sampler log), context length token-verified to β₯95% of target (no
"fake 128K"), three arms (fp16 / Q8 / D2) on byte-identical prompts,
runtime markers asserting the compressed path actually executed, and
peak VRAM from a background nvidia-smi poller. Failures are classified
against the fp16 baseline so a model limitation is never blamed on D2.
Reproduce it
You need: a CUDA GPU (A100-80GB used here), the patched llama.cpp fork, the
GGUF (bartowski/Mistral-7B-Instruct-v0.3-GGUF, Q4_K_M), and the two sidecars
in this repo.
# build the fork
cmake -B build -DGGML_CUDA=ON -DCMAKE_BUILD_TYPE=Release -DCMAKE_CUDA_ARCHITECTURES=80
cmake --build build -j --target llama-completion
# run D2 at 128K
FRAQTL_V_DESC_COMPACT=1 FRAQTL_DISABLE_SCRATCH_PREALLOC=1 \
FRAQTL_SKIP_DEAD_STAGING=1 FRAQTL_K_TAIL_BITS=3 FRAQTL_KV_UBATCH_CAP=2048 \
./build/bin/llama-completion \
-m Mistral-7B-Instruct-v0.3-Q4_K_M.gguf -c 131072 -ngl 99 -fa on \
-ub 2048 -b 2048 --temp 0 --top-k 1 -no-cnv -n 64 \
--fraqtl-kv \
--fraqtl-eigenbasis sidecar_real_u_mistral-7b-instruct-v0.3.bin \
--fraqtl-kv-protect 32 \
--fraqtl-k-eigenbasis mistral-7b-instruct-v0.3-k16-int3.fraqtl-k-eigenbasis.bin \
--fraqtl-sink-tokens 8 \
--fraqtl-residual-window 1024 \
--rope-scaling yarn --rope-scale 4.0 --yarn-orig-ctx 32768 \
-f your_128k_prompt.txt
# baseline arms for comparison
# fp16: (drop all --fraqtl flags and env)
# Q8: --cache-type-k q8_0 --cache-type-v q8_0
The exact paired NIAH/RULER/LongBench harnesses are in
scripts/modal/s2_d2_clean_harnesses/ in the fork repo.
Caveats
- One model (Mistral-7B-Instruct-v0.3 Q4_K_M), one recipe. n=16/LongBench subset, 3 needles/NIAH cell β disclosed-rate evidence, not tight CIs.
- Single-sequence inference; KV-cache shift/defrag not supported under compression. Memory technique only β no throughput claim.
- Patched llama.cpp fork required (custom CUDA kernels); not upstreamed.
Model tree for fraQtl/fraqtl-d2-mistral-7b-instruct-v0.3
Base model
mistralai/Mistral-7B-v0.3
