fraQtl D2 β€” KV-cache compression for Mistral-7B-Instruct-v0.3

Two small sidecar files that compress the KV cache of Mistral-7B in llama.cpp to ~6.6 bits/token (below 8-bit Q8), letting you run 128K context in ~12.4 GB instead of ~22 GB at near-lossless quality. No retraining, no fine-tune, no custom inference server β€” drop-in sidecars + a patched llama.cpp.

VRAM

TL;DR (all numbers from hardened, paired, greedy, reproducible runs)

  • Memory @128K: D2 12.4 GB vs Q8 15.1 GB vs fp16 22.1 GB (peak, A100-80GB) β€” βˆ’44% vs fp16, βˆ’18% vs Q8.
  • Retrieval (needle-in-haystack, 63 paired cells, 32K–128K): correct answer recovered in 62/63 (one genuine digit error; two cosmetic hyphen-splits where every digit is correct). Exact-string match: 60/63.
  • Real long-doc QA (LongBench @64K): D2 matches or beats fp16 on 3 of 4 subsets, and beats Q8 on the quantization-sensitive one (qasper).

LongBench

What this is / isn't

  • Is: a memory win β€” Q8-class retrieval quality at below-Q8 memory, on a full 128K context, with public receipts.
  • Isn't: lossless, and not a retrieval win over Q8 (Q8 hits 63/63 exact; D2 hits 60/63 exact / 62/63 content). The value is memory at near-equal quality. No speed claim β€” this is a memory technique.

Recipe (locked)

K16 INT3 + uniform V32 INT3, sink=8 tokens, residual window=1024, YaRN above 32K. Both K and V are rotated into a per-(layer,head) eigenbasis; the top directions are kept fp16, the tail is INT3 β€” no dimensions are dropped (full geometry, mixed precision).

Full results

KV memory (peak nvidia-smi, GB)

ctx fp16 Q8 D2 D2 vs fp16 D2 vs Q8
32K 9.1 7.3 6.8 βˆ’25% βˆ’7%
64K 13.4 9.8 8.6 βˆ’36% βˆ’12%
128K 22.1 15.1 12.4 βˆ’44% βˆ’18%

NIAH retrieval (paired; 7 depths Γ— 3 distinct needles Γ— 3 ctx = 63)

ctx fp16 Q8 D2 (exact) D2 (content)
32K 21/21 21/21 21/21 21/21
64K 21/21 21/21 20/21 20/21
128K 21/21 21/21 19/21 21/21
total 63/63 63/63 60/63 62/63

The 3 exact-match misses: one digit flip (…43045…→…43035…, 64K β€” genuine), two hyphen-splits (…35099…→…350-99…, all digits correct β€” cosmetic).

LongBench QA F1 @64K (16 longest samples/subset, official prompts)

Subset fp16 Q8 D2
hotpotqa 0.265 0.265 0.274
2wikimqa 0.229 0.229 0.292
multifieldqa_en 0.428 0.429 0.470
qasper 0.542 0.456 0.479

How it was measured (so you can trust it)

Every run: greedy decode (--temp 0 --top-k 1, verified from the engine's own sampler log), context length token-verified to β‰₯95% of target (no "fake 128K"), three arms (fp16 / Q8 / D2) on byte-identical prompts, runtime markers asserting the compressed path actually executed, and peak VRAM from a background nvidia-smi poller. Failures are classified against the fp16 baseline so a model limitation is never blamed on D2.

Reproduce it

You need: a CUDA GPU (A100-80GB used here), the patched llama.cpp fork, the GGUF (bartowski/Mistral-7B-Instruct-v0.3-GGUF, Q4_K_M), and the two sidecars in this repo.

# build the fork
cmake -B build -DGGML_CUDA=ON -DCMAKE_BUILD_TYPE=Release -DCMAKE_CUDA_ARCHITECTURES=80
cmake --build build -j --target llama-completion

# run D2 at 128K
FRAQTL_V_DESC_COMPACT=1 FRAQTL_DISABLE_SCRATCH_PREALLOC=1 \
FRAQTL_SKIP_DEAD_STAGING=1 FRAQTL_K_TAIL_BITS=3 FRAQTL_KV_UBATCH_CAP=2048 \
./build/bin/llama-completion \
  -m Mistral-7B-Instruct-v0.3-Q4_K_M.gguf -c 131072 -ngl 99 -fa on \
  -ub 2048 -b 2048 --temp 0 --top-k 1 -no-cnv -n 64 \
  --fraqtl-kv \
  --fraqtl-eigenbasis   sidecar_real_u_mistral-7b-instruct-v0.3.bin \
  --fraqtl-kv-protect   32 \
  --fraqtl-k-eigenbasis mistral-7b-instruct-v0.3-k16-int3.fraqtl-k-eigenbasis.bin \
  --fraqtl-sink-tokens  8 \
  --fraqtl-residual-window 1024 \
  --rope-scaling yarn --rope-scale 4.0 --yarn-orig-ctx 32768 \
  -f your_128k_prompt.txt

# baseline arms for comparison
#   fp16:  (drop all --fraqtl flags and env)
#   Q8:    --cache-type-k q8_0 --cache-type-v q8_0

The exact paired NIAH/RULER/LongBench harnesses are in scripts/modal/s2_d2_clean_harnesses/ in the fork repo.

Caveats

  • One model (Mistral-7B-Instruct-v0.3 Q4_K_M), one recipe. n=16/LongBench subset, 3 needles/NIAH cell β€” disclosed-rate evidence, not tight CIs.
  • Single-sequence inference; KV-cache shift/defrag not supported under compression. Memory technique only β€” no throughput claim.
  • Patched llama.cpp fork required (custom CUDA kernels); not upstreamed.
Downloads last month

-

Downloads are not tracked for this model. How to track
Inference Providers NEW
This model isn't deployed by any Inference Provider. πŸ™‹ Ask for provider support

Model tree for fraQtl/fraqtl-d2-mistral-7b-instruct-v0.3

Finetuned
(510)
this model