FastContext-1.0-4B-RL-NVFP4

NVFP4 (W4A4) quantization of microsoft/FastContext-1.0-4B-RL — a specialized repository-exploration subagent for coding agents.

Credits and Attribution

Base Model: microsoft/FastContext-1.0-4B-RL by Microsoft (MIT License). Built on Qwen3-4B-Instruct by Alibaba Qwen Team.
Quantization Tool: NVIDIA Model Optimizer (ModelOpt) v0.44.0 by NVIDIA.
Calibration Data: CNN/DailyMail by See et al. (Apache 2.0).
Paper: Zhang et al., "FastContext: Training Efficient Repository Explorer for Coding Agents," arXiv:2606.14066, 2026.
Quantization © 2026 r0b0tlab; base model © Microsoft, MIT License; calibration data © See et al., Apache 2.0; distributed under MIT License.

Quantization Details

Property	Value
Source model	microsoft/FastContext-1.0-4B-RL (BF16, 7.6 GB)
Quantization	NVFP4 (W4A4, group_size=16)
Tool	NVIDIA ModelOpt 0.44.0 (`NVFP4_DEFAULT_CFG`)
Calibration	CNN/DailyMail, 512 samples × 1024 tokens × batch 16
Output size	2.7 GB (2.8× compression)
Quantized layers	903 (all attention QKV/O + MLP linear layers)
Excluded	Norms, biases, lm_head (tied to embed_tokens)
`tie_word_embeddings`	True

Benchmark Results (NVIDIA GB10 / SM121)

Identical prompt, vLLM 0.23.0, FlashInfer attention, FP8 KV cache:

Metric	BF16 Baseline	NVFP4 (this model)	Ratio
Decode throughput	22.8 tok/s	66.3 tok/s	2.9× faster
TTFT (time to first token)	43 ms	22 ms	2.0× faster
Model size	7.6 GB	2.7 GB	2.8× smaller
GPU power	~15 W	~11 W	1.4× less
GPU temp	~47°C	47°C	Same

Matmul-level microbenchmark confirms 2.8–4.5× speedup across all layer types:

MLP down_proj [2560, 9728]: 4.48×
MLP gate_proj [9728, 2560]: 2.81×
Attention Q proj [4096, 2560]: 3.07×
Attention O proj [2560, 4096]: 3.89×

How to Serve

vllm serve r0b0tlab/FastContext-1.0-4B-RL-NVFP4 \
    --quantization modelopt \
    --tensor-parallel-size 1 \
    --trust-remote-code \
    --dtype auto \
    --kv-cache-dtype fp8 \
    --attention-backend flashinfer \
    --gpu-memory-utilization 0.40 \
    --max-model-len 131072 \
    --max-num-seqs 16 \
    --enable-chunked-prefill \
    --enable-auto-tool-choice \
    --tool-call-parser hermes \
    --port 30000

Requires an NVFP4-capable NVIDIA GPU. vLLM falls back to EMULATION on older GPUs.

Notes and Limitations

This is a post-hoc PTQ quantization, not QAD (Quantization-Aware Distillation). Minor quality regression is possible.
The hermes tool-call parser outputs <tool_call> XML in the content field. The FastContext CLI parses this internally.
tie_word_embeddings=true: embed_tokens.weight serves as both input embedding and output projection. ModelOpt's tied weight handling correctly preserves this.
Benchmark results are from a single NVIDIA GB10 (SM121) device and may vary on other hardware.

BibTeX

@misc{zhang2026fastcontext,
    title={FastContext: Training Efficient Repository Explorer for Coding Agents},
    author={Shaoqiu Zhang and Maoquan Wang and Yuling Shi and Yuhang Wang and Xiaodong Gu and Yongqiang Yao and Rao Fu and Shengyu Fu},
    year={2026},
    eprint={2606.14066},
    archivePrefix={arXiv},
    primaryClass={cs.SE}
}

Downloads last month: 16

Safetensors

Model size

2B params

Tensor type

BF16

F8_E4M3

Model tree for r0b0tlab/FastContext-1.0-4B-RL-NVFP4

Base model

Qwen/Qwen3-4B-Instruct-2507

Finetuned

microsoft/FastContext-1.0-4B-RL

Quantized

(13)

this model

Paper for r0b0tlab/FastContext-1.0-4B-RL-NVFP4

FastContext: Training Efficient Repository Explorer for Coding Agents

Paper • 2606.14066 • Published 7 days ago • 82