Buckets: leideng / QCFuse

466 GB, 2,808 files, updated 5 days ago

Name	Size	Uploaded	Xet hash
blend		5 days ago	7 items
cache		5 days ago	1,600 items
data		5 days ago	18 items
lang		5 days ago	13 items
md		5 days ago	2 items
srt		5 days ago	1,026 items
test		5 days ago	46 items
third_party		5 days ago	71 items
.gitignore	3.85 kB xet	5 days ago	cabf7403
.python-version	5 Bytes xet	5 days ago	40141211
README.md	5.3 kB xet	5 days ago	89f5d099
__init__.py	1.82 kB xet	5 days ago	3173fd9b
apply_sglang_patch.sh	2.6 kB xet	5 days ago	b392935e
bench_offline_throughput.py	14.8 kB xet	5 days ago	a32a4a09
bench_one_batch.py	23.8 kB xet	5 days ago	de5d799c
bench_one_batch_server.py	26.6 kB xet	5 days ago	7a31b144
bench_serving.py	94 kB xet	5 days ago	eb0b9ba8
check_env.py	8.41 kB xet	5 days ago	6dad5fc8
compile_deep_gemm.py	6.61 kB xet	5 days ago	aa7ab84f
global_config.py	767 Bytes xet	5 days ago	0eebcfd0
launch_server.py	620 Bytes xet	5 days ago	72a2ac8d
main.py	84 Bytes xet	5 days ago	c1b898d8
profiler.py	4.85 kB xet	5 days ago	1b230af8
pyproject.toml	598 Bytes xet	5 days ago	61ee52b2
run_fullattn.sh	207 Bytes xet	5 days ago	15c467d5
run_fullattn_2wikimqa_0616.txt	9.47 kB xet	5 days ago	c497b8ac
run_longbench_preprocess.sh	292 Bytes xet	5 days ago	6cd4a36e
run_qcfuse.sh	203 Bytes xet	5 days ago	1323dab4
run_qcfuse_2wikimqa_0616.txt	9.47 kB xet	5 days ago	9e808654
run_ruler_preprocess.sh	376 Bytes xet	5 days ago	7e18779a
utils.py	17.3 kB xet	5 days ago	83076d97
uv.lock	845 kB xet	5 days ago	d594bc35
version.py	22 Bytes xet	5 days ago	6d14070f

README.md

QCFuse

QCFuse is a pipeline-constrained, query-aware KV cache fusion system for efficient long-context RAG generation. This repository contains the QCFuse research artifact described in arXiv:2606.05875.

✨ Highlights

Figure: QCFuse framework overview. QCFuse builds a compact query-aware view for pipelined cache fusion in RAG serving.

Query-aware compressed view. QCFuse shortens selector analysis time, reduces selection-signal noise, and better balances query awareness with pipeline execution efficiency.
Matched-quality speedup. Under matched-quality comparisons, QCFuse achieves an average prefill-time speedup of 1.7x over full prefill and 1.5x over ProphetKV, the strongest quality-preserving baseline.
Fastest under strict quality control. Under a 1% relative quality drop criterion, QCFuse is the only compared method that satisfies the constraint, with a 1.9x average TTFT speedup over full prefill.
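
The two quantities behind these highlights, a relative quality drop and a speedup ratio, can be written down as a minimal sketch. It assumes the quality drop is measured relative to the full-prefill score and that speedup is the ratio of baseline time to method time; the README does not spell out either definition, and the function names and numbers below are hypothetical.

```python
def relative_quality_drop(baseline_score: float, method_score: float) -> float:
    """Fraction of the baseline score lost by the method (negative if it improves)."""
    if baseline_score <= 0:
        raise ValueError("baseline_score must be positive")
    return (baseline_score - method_score) / baseline_score


def speedup(baseline_seconds: float, method_seconds: float) -> float:
    """Ratio of baseline latency to method latency (assumed definition)."""
    if method_seconds <= 0:
        raise ValueError("method_seconds must be positive")
    return baseline_seconds / method_seconds


def meets_quality_constraint(
    baseline_score: float, method_score: float, max_drop: float = 0.01
) -> bool:
    """True when the relative drop stays within max_drop (1% by default)."""
    return relative_quality_drop(baseline_score, method_score) <= max_drop


if __name__ == "__main__":
    # Hypothetical numbers for illustration only, not results from the paper.
    print(meets_quality_constraint(baseline_score=50.0, method_score=49.6))
    print(speedup(baseline_seconds=2.0, method_seconds=1.0))
```

Under this reading, a method counts toward the strict comparison only if `meets_quality_constraint` holds, and its speedup is reported afterwards.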

📊 Results

Figure: Quality and TTFT trade-off on LongBench and RULER.

🗂️ Repository Layout

QCFuse/
├── blend/                         # QCFuse evaluation runner and configs
│   ├── sglang_blend_ssd.py
│   ├── blend_common.py
│   ├── qcfuse_config.py
│   └── utils.py
├── srt/                           # SGLang runtime changes for QCFuse
│   ├── entrypoints/
│   ├── managers/
│   ├── layers/attention/
│   ├── models/
│   └── utils/
├── data/                          # Dataset preprocessing
│   ├── build_longbench_data.py
│   └── build_ruler_data.py

🗄️ Datasets

The evaluation runner expects each evaluation split as a local JSONL file named {dataset}.jsonl under --data_dir.

Use the scripts in data/ to build the evaluation splits reproducibly. See data/README.md for preprocessing details.
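
Before launching a run, it can help to confirm that a split exists where the runner will look for it and that every line parses as JSON. The following is a minimal sketch of such a check. The helper names are hypothetical, and it makes no assumption about which fields each record contains, since the README does not list them.

```python
import json
from pathlib import Path
from typing import Any, Iterator


def dataset_path(data_dir: str, dataset: str) -> Path:
    """Build the {dataset}.jsonl path under data_dir, as described above."""
    return Path(data_dir) / f"{dataset}.jsonl"


def iter_jsonl(path: Path) -> Iterator[Any]:
    """Yield one parsed JSON value per non-empty line, reporting bad lines."""
    with path.open("r", encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            try:
                yield json.loads(line)
            except json.JSONDecodeError as exc:
                raise ValueError(
                    f"{path}:{line_number}: invalid JSON ({exc.msg})"
                ) from exc


def count_records(data_dir: str, dataset: str) -> int:
    """Count records in a split, failing early if the file is missing."""
    path = dataset_path(data_dir, dataset)
    if not path.is_file():
        raise FileNotFoundError(f"expected evaluation split at {path}")
    return sum(1 for _ in iter_jsonl(path))
```

For example, `count_records("data/final_data", "hotpotqa")` would report how many examples are available before choosing a value for `--size`.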

Benchmark	Official source	Tasks used in this artifact
LongBench	THUDM/LongBench	`musique`, `2wikimqa`, `hotpotqa`
RULER	NVIDIA/RULER	`ruler_mv` (`MV`), `ruler_mq` (`MQ`), `ruler_vt` (`VT`)

⚙️ Installation

Option A: uv (recommended)

From the QCFuse repository root:

uv sync
bash apply_sglang_patch.sh

uv sync installs upstream SGLang 0.5.4. QCFuse also ships modified SGLang runtime code under srt/, so run apply_sglang_patch.sh afterwards to overlay those changes onto the installed sglang.srt package.

Re-run bash apply_sglang_patch.sh after any uv sync that reinstalls sglang.
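
Because a later uv sync can silently restore the upstream files, a quick check that the overlay is still in place may be useful. The sketch below compares the Python files under srt/ with the installed package byte for byte. It assumes the patch script copies files verbatim, which the README does not state, and the function name is hypothetical.

```python
import filecmp
from pathlib import Path


def files_missing_or_different(source_dir: str, installed_dir: str) -> list[str]:
    """Return relative paths of .py files in source_dir that are absent from,
    or differ from, their counterparts in installed_dir."""
    source = Path(source_dir)
    installed = Path(installed_dir)
    mismatched = []
    for src_file in sorted(source.rglob("*.py")):
        relative = src_file.relative_to(source)
        dst_file = installed / relative
        # shallow=False forces a content comparison instead of a stat comparison.
        if not dst_file.is_file() or not filecmp.cmp(src_file, dst_file, shallow=False):
            mismatched.append(relative.as_posix())
    return mismatched
```

The installed location can be found with `importlib.util.find_spec("sglang.srt").submodule_search_locations[0]`. An empty result from `files_missing_or_different("srt", installed_location)` suggests the overlay is current; any listed path is a hint to re-run the patch script.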

Option B: pip editable install

Clone SGLang 0.5.4, replace its srt package with the QCFuse version, and install it in editable mode:

git clone -b v0.5.4 https://github.com/sgl-project/sglang.git
cd sglang
rm -rf python/sglang/srt
cp -r /path/to/QCFuse/srt python/sglang/srt
pip install --upgrade pip
pip install -e "python"

Return to the QCFuse repository root before running the commands below.

Evaluation dependencies

If you are not using uv sync, install the evaluation dependencies used by the Blend runner:

pip install rouge-score

Use a CUDA/PyTorch environment compatible with your GPU and SGLang 0.5.4. The runner expects local model files and local JSONL datasets.

🚀 Running QCFuse

Run the SSD-backed QCFuse method:

python blend/sglang_blend_ssd.py \
  --model qwen3-8b \
  --model_dir models \
  --data_dir data/final_data \
  --dataset hotpotqa \
  --baseline ours \
  --size 200 \
  --cache_dir cache/qcfuse

--cache_dir stores the SSD-backed chunk and query caches. With --baseline ours, the runner performs offline cache preparation before the online evaluation pass.

Run the full-prefill baseline:

python blend/sglang_blend_ssd.py \
  --model qwen3-8b \
  --model_dir models \
  --data_dir data/final_data \
  --dataset hotpotqa \
  --baseline fullcomp \
  --size 200 \
  --cache_dir cache/qcfuse

Supported --baseline values are ours and fullcomp. Supported --dataset values are hotpotqa, 2wikimqa, musique, ruler_mv, ruler_mq, and ruler_vt.
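
To sweep several datasets or both baselines, the commands above can be generated programmatically. The sketch below only assembles argument lists from the flags and values shown in this README. The helper name and its defaults are hypothetical, and it prints the commands rather than running them.

```python
import shlex

SUPPORTED_BASELINES = ("ours", "fullcomp")
SUPPORTED_DATASETS = (
    "hotpotqa", "2wikimqa", "musique", "ruler_mv", "ruler_mq", "ruler_vt",
)


def build_command(
    dataset: str,
    baseline: str,
    model: str = "qwen3-8b",
    model_dir: str = "models",
    data_dir: str = "data/final_data",
    size: int = 200,
    cache_dir: str = "cache/qcfuse",
) -> list[str]:
    """Assemble the runner invocation, rejecting unsupported values early."""
    if dataset not in SUPPORTED_DATASETS:
        raise ValueError(f"unsupported dataset: {dataset!r}")
    if baseline not in SUPPORTED_BASELINES:
        raise ValueError(f"unsupported baseline: {baseline!r}")
    return [
        "python", "blend/sglang_blend_ssd.py",
        "--model", model,
        "--model_dir", model_dir,
        "--data_dir", data_dir,
        "--dataset", dataset,
        "--baseline", baseline,
        "--size", str(size),
        "--cache_dir", cache_dir,
    ]


if __name__ == "__main__":
    # Print one command per dataset and baseline combination.
    for dataset in SUPPORTED_DATASETS:
        for baseline in SUPPORTED_BASELINES:
            print(shlex.join(build_command(dataset, baseline)))
```

Each returned list can be passed to `subprocess.run(command, check=True)` from the repository root to execute the runs one after another.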

📚 Citation

If you find QCFuse useful, please cite:

@misc{yan2026qcfusequeryawarecachefusion,
      title={QCFuse: Query-Aware Cache Fusion via Compressed View for Efficient RAG Serving},
      author={Jianxin Yan and Wangze Ni and Zhenxin Li and Jiabao Jin and Zhitao Shen and Haoyang Li and Jia Zhu and Peng Cheng and Xuemin Lin and Lei Chen and Kui Ren},
      year={2026},
      eprint={2606.05875},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2606.05875},
}

Last updated: Jun 16