fraQtl D2 — KV-cache compression for Mistral-7B-Instruct-v0.3

Two small sidecar files that compress the Mistral-7B KV cache in llama.cpp to ~6.6 bits per cached value (below 8-bit Q8), so a 128K context fits in ~12.4 GB of peak VRAM instead of ~22 GB, at near-lossless quality. No retraining, no fine-tuning, no custom inference server: just drop-in sidecars plus a patched llama.cpp.

TL;DR (all numbers from hardened, paired, greedy, reproducible runs)

Memory @128K: D2 12.4 GB vs Q8 15.1 GB vs fp16 22.1 GB (peak, A100-80GB) — −44% vs fp16, −18% vs Q8.
Retrieval (needle-in-haystack, 63 paired cells, 32K–128K): correct answer recovered in 62/63; the one miss is a genuine digit error. Exact-string match: 60/63, because two further outputs insert a cosmetic hyphen while every digit stays correct.
Real long-doc QA (LongBench @64K): D2 scores at or above fp16 on 3 of 4 subsets; on the fourth (qasper, the quantization-sensitive one) it trails fp16 but beats Q8.

What this is / isn't

Is: a memory win — Q8-class retrieval quality at below-Q8 memory, on a full 128K context, with public receipts.
Isn't: lossless, and not a retrieval win over Q8 (Q8 hits 63/63 exact; D2 hits 60/63 exact / 62/63 content). The value is memory at near-equal quality. No speed claim — this is a memory technique.

Recipe (locked)

K16 INT3 + uniform V32 INT3, sink=8 tokens, residual window=1024, YaRN above 32K. Both K and V are rotated into a per-(layer,head) eigenbasis; the top directions are kept fp16, the tail is INT3 — no dimensions are dropped (full geometry, mixed precision).

Full results

Memory (peak VRAM from nvidia-smi, GB; includes model weights, not the KV cache alone)

ctx	fp16	Q8	D2	D2 vs fp16	D2 vs Q8
32K	9.1	7.3	6.8	−25%	−7%
64K	13.4	9.8	8.6	−36%	−12%
128K	22.1	15.1	12.4	−44%	−18%
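The percentage columns can be rechecked from the GB columns with a few lines of shell. This is a minimal sketch; the helper name `pct_change` is hypothetical and not part of the fork.

```shell
# pct_change BASE NEW -> prints the rounded percent change of NEW relative to BASE.
# Hypothetical helper for rechecking the table; not part of the fraQtl tooling.
pct_change() {
  awk -v base="$1" -v new="$2" 'BEGIN { printf "%.0f\n", (new / base - 1) * 100 }'
}

pct_change 22.1 12.4   # D2 vs fp16 at 128K -> -44
pct_change 15.1 12.4   # D2 vs Q8 at 128K   -> -18
```

The same call with the 32K and 64K rows reproduces the other four percentages.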

NIAH retrieval (paired; 7 depths × 3 distinct needles × 3 ctx = 63)

ctx	fp16	Q8	D2 (exact)	D2 (content)
32K	21/21	21/21	21/21	21/21
64K	21/21	21/21	20/21	20/21
128K	21/21	21/21	19/21	21/21
total	63/63	63/63	60/63	62/63

The 3 exact-match misses: one genuine digit flip at 64K (…43045…→…43035…) and two cosmetic hyphen-splits at 128K (…35099…→…350-99…, all digits correct).
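The difference between the "exact" and "content" columns can be illustrated with two small checks. This is only a sketch under assumptions: the function names are hypothetical, the needle value is made up, and the real harness may score differently.

```shell
# exact_match EXPECTED OUTPUT -> succeeds if OUTPUT contains EXPECTED verbatim.
exact_match() {
  case "$2" in
    *"$1"*) return 0 ;;
    *)      return 1 ;;
  esac
}

# content_match EXPECTED OUTPUT -> same check after deleting hyphens from OUTPUT,
# so a hyphen-split answer with all digits correct still counts.
content_match() {
  exact_match "$1" "$(printf '%s' "$2" | sed 's/-//g')"
}

# Made-up needle, for illustration only.
exact_match   12345678 "The code is 1234-5678." || echo "exact: miss"
content_match 12345678 "The code is 1234-5678." && echo "content: hit"
```

Under this rule a digit flip fails both checks, while a hyphen-split fails only the exact one.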

LongBench QA F1 @64K (16 longest samples/subset, official prompts)

Subset	fp16	Q8	D2
hotpotqa	0.265	0.265	0.274
2wikimqa	0.229	0.229	0.292
multifieldqa_en	0.428	0.429	0.470
qasper	0.542	0.456	0.479
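The "3 of 4 subsets" summary can be recounted directly from the rows above. A minimal sketch with hypothetical helper names; with n=16 per subset the individual differences are noisy, so the count is a tally, not a significance test.

```shell
# longbench_rows -> the F1 table above as "subset fp16 q8 d2" rows.
longbench_rows() {
  cat <<'EOF'
hotpotqa 0.265 0.265 0.274
2wikimqa 0.229 0.229 0.292
multifieldqa_en 0.428 0.429 0.470
qasper 0.542 0.456 0.479
EOF
}

# count_wins BASE_COL NEW_COL -> prints "wins/total", where a win means
# the value in NEW_COL is >= the value in BASE_COL on that row.
count_wins() {
  awk -v b="$1" -v n="$2" '{ total++; if ($n + 0 >= $b + 0) wins++ }
    END { printf "%d/%d\n", wins, total }'
}

longbench_rows | count_wins 2 4   # D2 vs fp16 -> 3/4
longbench_rows | count_wins 3 4   # D2 vs Q8   -> 4/4
```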

How it was measured (so you can trust it)

Every run used the same controls:
Greedy decode (--temp 0 --top-k 1), verified from the engine's own sampler log.
Context length token-verified to ≥95% of target (no "fake 128K").
Three arms (fp16 / Q8 / D2) on byte-identical prompts.
Runtime markers asserting that the compressed path actually executed.
Peak VRAM from a background nvidia-smi poller.
Failures are classified against the fp16 baseline so a model limitation is never blamed on D2.
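Two of these controls, the context-length check and the background VRAM poller, could be approximated as follows. This is a sketch under assumptions, not the harness from the fork: the function names and log path are hypothetical, and a one-second polling interval can miss short spikes.

```shell
# ctx_ok TARGET ACTUAL -> succeeds if ACTUAL tokens reach at least 95% of TARGET.
ctx_ok() {
  awk -v t="$1" -v a="$2" 'BEGIN { exit !(a >= 0.95 * t) }'
}

# peak_mib FILE -> prints the largest reading in FILE (one number per line).
peak_mib() {
  awk '($1 + 0) > max { max = $1 + 0 } END { print max + 0 }' "$1"
}

# run_with_vram_poll LOG CMD... -> polls used GPU memory (MiB) once per second
# while CMD runs, then reports the peak on stderr. On a multi-GPU machine the
# log holds one line per GPU per poll, so the peak is taken across all GPUs.
run_with_vram_poll() {
  log=$1; shift
  nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits -l 1 > "$log" &
  poller=$!
  "$@"
  status=$?
  kill "$poller" 2>/dev/null
  wait "$poller" 2>/dev/null
  echo "peak MiB: $(peak_mib "$log")" >&2
  return "$status"
}

# Usage (not executed here):
#   run_with_vram_poll vram.log ./build/bin/llama-completion ...
```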

Reproduce it

You need: a CUDA GPU (A100-80GB used here), the patched llama.cpp fork, the GGUF (bartowski/Mistral-7B-Instruct-v0.3-GGUF, Q4_K_M), and the two sidecars in this repo.

# build the fork (CUDA architecture 80 targets the A100; change it for other GPUs)
cmake -B build -DGGML_CUDA=ON -DCMAKE_BUILD_TYPE=Release -DCMAKE_CUDA_ARCHITECTURES=80
cmake --build build -j --target llama-completion

# run D2 at 128K
FRAQTL_V_DESC_COMPACT=1 FRAQTL_DISABLE_SCRATCH_PREALLOC=1 \
FRAQTL_SKIP_DEAD_STAGING=1 FRAQTL_K_TAIL_BITS=3 FRAQTL_KV_UBATCH_CAP=2048 \
./build/bin/llama-completion \
  -m Mistral-7B-Instruct-v0.3-Q4_K_M.gguf -c 131072 -ngl 99 -fa on \
  -ub 2048 -b 2048 --temp 0 --top-k 1 -no-cnv -n 64 \
  --fraqtl-kv \
  --fraqtl-eigenbasis   sidecar_real_u_mistral-7b-instruct-v0.3.bin \
  --fraqtl-kv-protect   32 \
  --fraqtl-k-eigenbasis mistral-7b-instruct-v0.3-k16-int3.fraqtl-k-eigenbasis.bin \
  --fraqtl-sink-tokens  8 \
  --fraqtl-residual-window 1024 \
  --rope-scaling yarn --rope-scale 4.0 --yarn-orig-ctx 32768 \
  -f your_128k_prompt.txt

# baseline arms for comparison (same command and prompt, without any --fraqtl flags or FRAQTL_* env vars)
#   fp16:  no extra flags
#   Q8:    add --cache-type-k q8_0 --cache-type-v q8_0
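To keep the baseline arms on byte-identical prompts, it can help to build every command line from one shared set of flags. The sketch below only prints the commands and does not run them. It assumes the baselines keep the same batch, sampling, and YaRN settings as the D2 run, which the notes above imply but do not spell out; the function names are hypothetical.

```shell
MODEL=Mistral-7B-Instruct-v0.3-Q4_K_M.gguf
PROMPT=your_128k_prompt.txt

# common_flags -> flags shared by every arm, copied from the D2 command above.
common_flags() {
  echo "-m $MODEL -c 131072 -ngl 99 -fa on -ub 2048 -b 2048 --temp 0 --top-k 1 -no-cnv -n 64 --rope-scaling yarn --rope-scale 4.0 --yarn-orig-ctx 32768 -f $PROMPT"
}

# arm_flags ARM -> flags specific to one baseline arm.
arm_flags() {
  case "$1" in
    fp16) echo "" ;;
    q8)   echo "--cache-type-k q8_0 --cache-type-v q8_0" ;;
    *)    echo "unknown arm: $1" >&2; return 1 ;;
  esac
}

# print_arm_cmd ARM -> prints (does not run) the full command for that arm.
print_arm_cmd() {
  flags=$(arm_flags "$1") || return 1
  echo "./build/bin/llama-completion $(common_flags) $flags"
}

print_arm_cmd fp16
print_arm_cmd q8
```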

The exact paired NIAH/RULER/LongBench harnesses are in scripts/modal/s2_d2_clean_harnesses/ in the fork repo.

Caveats

One model (Mistral-7B-Instruct-v0.3 Q4_K_M), one recipe. Sample sizes are small (n=16 per LongBench subset, 3 needles per NIAH depth and context length), so treat the numbers as disclosed rates, not tight confidence intervals.
Single-sequence inference; KV-cache shift/defrag not supported under compression. Memory technique only — no throughput claim.
Patched llama.cpp fork required (custom CUDA kernels); not upstreamed.
