DBTrimKV is the dynamic-budget variant of TrimKV: a single global KV budget is shared across layers and heads and reallocated on the fly, with the retention-gate's final projection tied across layers.

This repository hosts the DBTrimKV retention-gate weights for Qwen/Qwen3-4B-Instruct-2507 (131072-token training context, M = 512). The base-model weights are not included — they are loaded from Qwen/Qwen3-4B-Instruct-2507 at runtime and the retention-gate weights from trimkv_weights.pth are overlaid on top.

This model was introduced in the paper Make Each Token Count: Towards Improving Long-Context Performance with KV Cache Eviction.

For the full list of released checkpoints, training recipes, and benchmark scripts, see the GitHub repository: https://github.com/ngocbh/trimkv.

Quick start

To use this model, you need to install the trimkv library from the official repository.

import torch
from trimkv.models.qwen3 import TrimKVQwen3ForCausalLM
from trimkv.cache_utils import PagedTrimKVCache
from transformers import AutoTokenizer

model = TrimKVQwen3ForCausalLM.from_pretrained(
    "ngocbh/DBTrimKV-Qwen3-4B-Instruct-2507",
    torch_dtype=torch.bfloat16,
    load_trimkv_weights=True,
    download_from="huggingface",
    use_cache=True,
    device_map="cuda",
)
model.config._attn_implementation = "flash_attention_2"

tokenizer = AutoTokenizer.from_pretrained(
    model.config.base_model, use_fast=True, padding_side="left"
)

past_key_values = PagedTrimKVCache(
    num_layers=model.config.num_hidden_layers,
    num_heads=model.config.num_key_value_heads,
    max_seq_len=32768,
    memory_size=512,
    num_blocks_ratio=1.0,
    buffer_size=32,
    strategy="fixed_budget",
    device="cuda",
)

# Use as a normal HF model — pass `past_key_values=past_key_values` to .generate

See examples/test_qwen3.py in the GitHub repo for a full runnable example.

Training details

Base model: Qwen/Qwen3-4B-Instruct-2507
Variant: DBTrimKV (retention_gate=rg10)
Training dataset: Synth-Long, BookSum, Buddhi
Training memory size M: 512
Training context length: 131072
Loss: fwkl_ntp
Attention impl: rg_attn_flex

Citation

@article{bui2025make,
  title={Make Each Token Count: Towards Improving Long-Context Performance with KV Cache Eviction},
  author={Bui, Ngoc and Nguyen, Hieu Trung and Cohan, Arman and Ying, Rex},
  journal={arXiv preprint arXiv:2512.03324},
  year={2025}
}

Downloads last month: 39

Model tree for ngocbh/DBTrimKV-Qwen3-4B-Instruct-2507

Base model

Qwen/Qwen3-4B-Instruct-2507

Finetuned

(1703)

this model

Collection including ngocbh/DBTrimKV-Qwen3-4B-Instruct-2507

TrimKV

Collection

A set of models that can run with bounded memory • 13 items • Updated 9 days ago • 1

Papers for ngocbh/DBTrimKV-Qwen3-4B-Instruct-2507

Make Each Token Count: Towards Improving Long-Context Performance with KV Cache Eviction

Paper • 2605.09649 • Published 11 days ago • 11

Cache What Lasts: Token Retention for Memory-Bounded KV Cache in LLMs

Paper • 2512.03324 • Published Dec 3, 2025 • 2