FastContext-1.0-4B-RL-NVFP4

NVFP4 (W4A4) quantization of microsoft/FastContext-1.0-4B-RL — a specialized repository-exploration subagent for coding agents.

Credits and Attribution

  • Base Model: microsoft/FastContext-1.0-4B-RL by Microsoft (MIT License). Built on Qwen3-4B-Instruct by Alibaba Qwen Team.
  • Quantization Tool: NVIDIA Model Optimizer (ModelOpt) v0.44.0 by NVIDIA.
  • Calibration Data: CNN/DailyMail by See et al. (Apache 2.0).
  • Paper: Zhang et al., "FastContext: Training Efficient Repository Explorer for Coding Agents," arXiv:2606.14066, 2026.
  • Quantization © 2026 r0b0tlab; base model © Microsoft, MIT License; calibration data © See et al., Apache 2.0; distributed under MIT License.

Quantization Details

Property Value
Source model microsoft/FastContext-1.0-4B-RL (BF16, 7.6 GB)
Quantization NVFP4 (W4A4, group_size=16)
Tool NVIDIA ModelOpt 0.44.0 (NVFP4_DEFAULT_CFG)
Calibration CNN/DailyMail, 512 samples × 1024 tokens × batch 16
Output size 2.7 GB (2.8× compression)
Quantized layers 903 (all attention QKV/O + MLP linear layers)
Excluded Norms, biases, lm_head (tied to embed_tokens)
tie_word_embeddings True

Benchmark Results (NVIDIA GB10 / SM121)

Identical prompt, vLLM 0.23.0, FlashInfer attention, FP8 KV cache:

Metric BF16 Baseline NVFP4 (this model) Ratio
Decode throughput 22.8 tok/s 66.3 tok/s 2.9× faster
TTFT (time to first token) 43 ms 22 ms 2.0× faster
Model size 7.6 GB 2.7 GB 2.8× smaller
GPU power ~15 W ~11 W 1.4× less
GPU temp ~47°C 47°C Same

Matmul-level microbenchmark confirms 2.8–4.5× speedup across all layer types:

  • MLP down_proj [2560, 9728]: 4.48×
  • MLP gate_proj [9728, 2560]: 2.81×
  • Attention Q proj [4096, 2560]: 3.07×
  • Attention O proj [2560, 4096]: 3.89×

How to Serve

vllm serve r0b0tlab/FastContext-1.0-4B-RL-NVFP4 \
    --quantization modelopt \
    --tensor-parallel-size 1 \
    --trust-remote-code \
    --dtype auto \
    --kv-cache-dtype fp8 \
    --attention-backend flashinfer \
    --gpu-memory-utilization 0.40 \
    --max-model-len 131072 \
    --max-num-seqs 16 \
    --enable-chunked-prefill \
    --enable-auto-tool-choice \
    --tool-call-parser hermes \
    --port 30000

Requires an NVFP4-capable NVIDIA GPU. vLLM falls back to EMULATION on older GPUs.

Notes and Limitations

  • This is a post-hoc PTQ quantization, not QAD (Quantization-Aware Distillation). Minor quality regression is possible.
  • The hermes tool-call parser outputs <tool_call> XML in the content field. The FastContext CLI parses this internally.
  • tie_word_embeddings=true: embed_tokens.weight serves as both input embedding and output projection. ModelOpt's tied weight handling correctly preserves this.
  • Benchmark results are from a single NVIDIA GB10 (SM121) device and may vary on other hardware.

BibTeX

@misc{zhang2026fastcontext,
    title={FastContext: Training Efficient Repository Explorer for Coding Agents},
    author={Shaoqiu Zhang and Maoquan Wang and Yuling Shi and Yuhang Wang and Xiaodong Gu and Yongqiang Yao and Rao Fu and Shengyu Fu},
    year={2026},
    eprint={2606.14066},
    archivePrefix={arXiv},
    primaryClass={cs.SE}
}
Downloads last month
16
Safetensors
Model size
2B params
Tensor type
BF16
·
F8_E4M3
·
U8
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for r0b0tlab/FastContext-1.0-4B-RL-NVFP4

Quantized
(13)
this model

Paper for r0b0tlab/FastContext-1.0-4B-RL-NVFP4