atomicmilkshake's picture
Add README
402c910 verified
metadata
tags:
  - llama-cpp
  - turboquant
  - triattention
  - kv-cache
  - windows
  - cuda
license: mit

llama.cpp TurboQuant + TriAttention — Windows CUDA 13 Binaries

Pre-built Windows x64 Release binaries for the atomicmilkshake/llama-cpp-turboquant fork.

This builds adds TurboQuant (custom quantization) and TriAttention (GPU-accelerated KV cache pruning based on arXiv 2604.04921) on top of llama.cpp.

Download

llama-turboquant-triattention-win-cu13-x64.zip (~179 MB)

Requirements

  • Windows 10/11 x64
  • NVIDIA GPU (Turing+, GTX 1600 / RTX 2000 series or newer)
  • CUDA 13.x runtime — install from developer.nvidia.com/cuda-downloads (the cublasLt64_13.dll is NOT included in the zip due to its 432 MB size)

Usage

llama-server.exe -m YourModel.gguf -c 32768 -ngl 99 --port 8080 ^ --triattention-stats model.triattention ^ --triattention-budget 4096 ^ --triattention-window 256 ^ --triattention-log

TriAttention Performance

Tested on Qwen3-8B Q4_K_M, RTX 3080, -c 512, udget=256:

Mode Prune time Generation
No pruning 17.5 tok/s
CPU scoring ~5900 ms/event 17.5 tok/s
GPU scoring ~4-9 ms/event 75.0 tok/s

~1000x speedup on pruning events; 4.3x overall throughput improvement.

TriAttention Flags

Flag Description Default
--triattention-stats Calibration file (required to enable)
--triattention-budget Max KV tokens to retain 512
--triattention-window Recent-token protection window 64
--triattention-trigger slack|interval| ill slack
--triattention-log Log each prune event off
--triattention-no-protect-prefill Allow evicting prompt tokens off

Source

github.com/atomicmilkshake/llama-cpp-turboquant — branch eature/triattention