FastContext: Training Efficient Repository Explorer for Coding Agents
Paper • 2606.14066 • Published • 82
NVFP4 (W4A4) quantization of microsoft/FastContext-1.0-4B-RL — a specialized repository-exploration subagent for coding agents.
| Property | Value |
|---|---|
| Source model | microsoft/FastContext-1.0-4B-RL (BF16, 7.6 GB) |
| Quantization | NVFP4 (W4A4, group_size=16) |
| Tool | NVIDIA ModelOpt 0.44.0 (NVFP4_DEFAULT_CFG) |
| Calibration | CNN/DailyMail, 512 samples × 1024 tokens × batch 16 |
| Output size | 2.7 GB (2.8× compression) |
| Quantized layers | 903 (all attention QKV/O + MLP linear layers) |
| Excluded | Norms, biases, lm_head (tied to embed_tokens) |
tie_word_embeddings |
True |
Identical prompt, vLLM 0.23.0, FlashInfer attention, FP8 KV cache:
| Metric | BF16 Baseline | NVFP4 (this model) | Ratio |
|---|---|---|---|
| Decode throughput | 22.8 tok/s | 66.3 tok/s | 2.9× faster |
| TTFT (time to first token) | 43 ms | 22 ms | 2.0× faster |
| Model size | 7.6 GB | 2.7 GB | 2.8× smaller |
| GPU power | ~15 W | ~11 W | 1.4× less |
| GPU temp | ~47°C | 47°C | Same |
Matmul-level microbenchmark confirms 2.8–4.5× speedup across all layer types:
vllm serve r0b0tlab/FastContext-1.0-4B-RL-NVFP4 \
--quantization modelopt \
--tensor-parallel-size 1 \
--trust-remote-code \
--dtype auto \
--kv-cache-dtype fp8 \
--attention-backend flashinfer \
--gpu-memory-utilization 0.40 \
--max-model-len 131072 \
--max-num-seqs 16 \
--enable-chunked-prefill \
--enable-auto-tool-choice \
--tool-call-parser hermes \
--port 30000
Requires an NVFP4-capable NVIDIA GPU. vLLM falls back to EMULATION on older GPUs.
hermes tool-call parser outputs <tool_call> XML in the content field. The FastContext CLI parses this internally.tie_word_embeddings=true: embed_tokens.weight serves as both input embedding and output projection. ModelOpt's tied weight handling correctly preserves this.@misc{zhang2026fastcontext,
title={FastContext: Training Efficient Repository Explorer for Coding Agents},
author={Shaoqiu Zhang and Maoquan Wang and Yuling Shi and Yuhang Wang and Xiaodong Gu and Yongqiang Yao and Rao Fu and Shengyu Fu},
year={2026},
eprint={2606.14066},
archivePrefix={arXiv},
primaryClass={cs.SE}
}
Base model
Qwen/Qwen3-4B-Instruct-2507