QCFuse
QCFuse is a pipeline-constrained, query-aware KV cache fusion system for efficient long-context RAG generation. This repository contains the QCFuse research artifact described in arXiv:2606.05875.
✨ Highlights
QCFuse builds a compact query-aware view for pipelined cache fusion in RAG serving.
- Query-aware compressed view. QCFuse shortens selector analysis time, reduces selection-signal noise, and better balances query awareness with pipeline execution efficiency.
- Matched-quality speedup. Under matched-quality comparisons, QCFuse achieves an average prefill-time speedup of 1.7x over full prefill and 1.5x over ProphetKV, the strongest quality-preserving baseline.
- Fastest under strict quality control. Under a 1% relative quality drop criterion, QCFuse is the only compared method that satisfies the constraint, with a 1.9x average TTFT speedup over full prefill.
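The 1% relative quality drop criterion and the speedup figures above can be read as simple ratios. The sketch below is one plausible reading, not the paper's evaluation code: it assumes quality is a higher-is-better score compared against full prefill, and the function names are hypothetical.

```python
def relative_quality_drop(reference_score: float, method_score: float) -> float:
    """Fraction of the reference score lost by the method (negative if it improves)."""
    if reference_score <= 0:
        raise ValueError("reference_score must be positive")
    return (reference_score - method_score) / reference_score


def meets_quality_constraint(reference_score: float, method_score: float,
                             max_relative_drop: float = 0.01) -> bool:
    """True if the method loses at most max_relative_drop of the reference score."""
    return relative_quality_drop(reference_score, method_score) <= max_relative_drop


def speedup(reference_seconds: float, method_seconds: float) -> float:
    """Speedup of the method over the reference, e.g. for TTFT or prefill time."""
    if method_seconds <= 0:
        raise ValueError("method_seconds must be positive")
    return reference_seconds / method_seconds
```

Under this reading, a method counts toward the strict comparison only when `meets_quality_constraint` holds against the full-prefill score on the same task.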
📊 Results
Quality and TTFT trade-off on LongBench and RULER.
🗂️ Repository Layout
QCFuse/
├── blend/ # QCFuse evaluation runner and configs
│ ├── sglang_blend_ssd.py
│ ├── blend_common.py
│ ├── qcfuse_config.py
│ └── utils.py
├── srt/ # SGLang runtime changes for QCFuse
│ ├── entrypoints/
│ ├── managers/
│ ├── layers/attention/
│ ├── models/
│ └── utils/
├── data/ # Dataset preprocessing
│ ├── build_longbench_data.py
│ └── build_ruler_data.py
🗄️ Datasets
The evaluation runner expects each evaluation split as a local JSONL file named
{dataset}.jsonl under --data_dir.
Use the scripts in data/ to build the evaluation splits reproducibly. See
data/README.md for preprocessing details.
| Benchmark | Official source | Tasks used in this artifact |
|---|---|---|
| LongBench | THUDM/LongBench | musique, 2wikimqa, hotpotqa |
| RULER | NVIDIA/RULER | ruler_mv (MV), ruler_mq (MQ), ruler_vt (VT) |
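Before launching a run, it can help to confirm that every split follows the {dataset}.jsonl naming convention described above. The following is a minimal sketch, not part of the artifact: `missing_splits` is a hypothetical helper, and the dataset names are the --dataset values listed in this README.

```python
from pathlib import Path

# Dataset names accepted by --dataset, as listed in this README.
DATASETS = ["hotpotqa", "2wikimqa", "musique", "ruler_mv", "ruler_mq", "ruler_vt"]


def missing_splits(data_dir, datasets=DATASETS):
    """Return the dataset names whose {dataset}.jsonl file is absent from data_dir."""
    root = Path(data_dir)
    return [name for name in datasets if not (root / f"{name}.jsonl").is_file()]


if __name__ == "__main__":
    # data/final_data is the --data_dir used in the example commands of this README.
    missing = missing_splits("data/final_data")
    print("missing splits:", ", ".join(missing) if missing else "none")
```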
⚙️ Installation
Option A: uv (recommended)
From the QCFuse repository root:
uv sync
bash apply_sglang_patch.sh
uv sync installs SGLang 0.5.4 from upstream, but QCFuse also ships modified
SGLang runtime code under srt/. Run apply_sglang_patch.sh to overlay those
changes onto the installed sglang.srt package.
Re-run bash apply_sglang_patch.sh after any uv sync that reinstalls sglang.
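To check whether the overlay is still in place after a reinstall, the files under srt/ can be compared with the installed sglang.srt package. This is a hedged sketch rather than a description of what apply_sglang_patch.sh does: it assumes the script copies files verbatim and only inspects .py files, and both helper names are hypothetical.

```python
import filecmp
import importlib.util
from pathlib import Path


def overlay_mismatches(source_dir, installed_dir):
    """List .py files under source_dir that are missing from or differ in installed_dir."""
    source, installed = Path(source_dir), Path(installed_dir)
    mismatches = []
    for src_file in sorted(source.rglob("*.py")):
        rel = src_file.relative_to(source)
        dst_file = installed / rel
        # shallow=False forces a byte-by-byte comparison instead of trusting stat() data.
        if not dst_file.is_file() or not filecmp.cmp(src_file, dst_file, shallow=False):
            mismatches.append(rel.as_posix())
    return mismatches


def installed_srt_dir():
    """Return the directory of the installed sglang.srt package, or None if unavailable."""
    try:
        spec = importlib.util.find_spec("sglang.srt")
    except ImportError:
        return None
    if spec is None or not spec.submodule_search_locations:
        return None
    return Path(list(spec.submodule_search_locations)[0])
```

Run from the QCFuse repository root, `overlay_mismatches("srt", installed_srt_dir())` returning an empty list would suggest the patch is applied; a non-empty list suggests the patch script should be re-run. Check that `installed_srt_dir()` is not None first.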
Option B: pip editable install
Clone SGLang 0.5.4, replace its runtime package with the QCFuse srt/ directory, and install it in editable mode:
git clone -b v0.5.4 https://github.com/sgl-project/sglang.git
cd sglang
rm -rf python/sglang/srt
cp -r /path/to/QCFuse/srt python/sglang/srt
pip install --upgrade pip
pip install -e "python"
Return to the QCFuse repository root before running the commands below.
Evaluation dependencies
If you did not install with uv sync, install the evaluation dependencies used by the Blend
runner:
pip install rouge-score
Use a CUDA/PyTorch environment compatible with your GPU and SGLang 0.5.4. The runner expects local model files and local JSONL datasets.
🚀 Running QCFuse
Run the SSD-backed QCFuse method:
python blend/sglang_blend_ssd.py \
  --model qwen3-8b \
  --model_dir models \
  --data_dir data/final_data \
  --dataset hotpotqa \
  --baseline ours \
  --size 200 \
  --cache_dir cache/qcfuse
--cache_dir stores the SSD-backed chunk and query caches. With
--baseline ours, the runner performs offline cache preparation before the
online evaluation pass.
Run the full-prefill baseline:
python blend/sglang_blend_ssd.py \
  --model qwen3-8b \
  --model_dir models \
  --data_dir data/final_data \
  --dataset hotpotqa \
  --baseline fullcomp \
  --size 200 \
  --cache_dir cache/qcfuse
Supported --baseline values are ours and fullcomp. Supported --dataset
values are hotpotqa, 2wikimqa, musique, ruler_mv, ruler_mq, and
ruler_vt.
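To cover every supported dataset and baseline, the two commands above can be generated in a loop. The sketch below only builds and prints the command lines. It reuses the flags and default values shown in this README, while `build_command` is a hypothetical helper, not part of the artifact.

```python
import shlex
import sys

# Values listed in this README as supported by the runner.
DATASETS = ["hotpotqa", "2wikimqa", "musique", "ruler_mv", "ruler_mq", "ruler_vt"]
BASELINES = ["ours", "fullcomp"]


def build_command(dataset, baseline, model="qwen3-8b", model_dir="models",
                  data_dir="data/final_data", size=200, cache_dir="cache/qcfuse"):
    """Return the argv list for one evaluation run of blend/sglang_blend_ssd.py."""
    if dataset not in DATASETS:
        raise ValueError(f"unsupported dataset: {dataset}")
    if baseline not in BASELINES:
        raise ValueError(f"unsupported baseline: {baseline}")
    return [
        sys.executable, "blend/sglang_blend_ssd.py",
        "--model", model,
        "--model_dir", model_dir,
        "--data_dir", data_dir,
        "--dataset", dataset,
        "--baseline", baseline,
        "--size", str(size),
        "--cache_dir", cache_dir,
    ]


if __name__ == "__main__":
    # Print one shell-quoted command per (dataset, baseline) pair.
    for dataset in DATASETS:
        for baseline in BASELINES:
            print(shlex.join(build_command(dataset, baseline)))
```

To execute the runs instead of printing them, each list could be passed to `subprocess.run(cmd, check=True)` from the QCFuse repository root.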
📚 Citation
If you find QCFuse useful, please cite:
@misc{yan2026qcfusequeryawarecachefusion,
title={QCFuse: Query-Aware Cache Fusion via Compressed View for Efficient RAG Serving},
author={Jianxin Yan and Wangze Ni and Zhenxin Li and Jiabao Jin and Zhitao Shen and Haoyang Li and Jia Zhu and Peng Cheng and Xuemin Lin and Lei Chen and Kui Ren},
year={2026},
eprint={2606.05875},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2606.05875},
}