GRM-2.6-Plus-NVFP4

NVFP4 post-training quantization of OrionLLM/GRM-2.6-Plus produced with NVIDIA ModelOpt on a single NVIDIA RTX PRO 6000 Blackwell Workstation Edition.

Quantization

  • Quant config: ModelOpt NVFP4_DEFAULT_CFG
  • Scheme: ModelOpt standard NVFP4 dynamic 4-bit quantization with the preset's built-in exclusions for lm_head, output layers, routing gates, and convolutional linear-attention components.
  • Tooling: nvidia-modelopt via mtq.quantize and export_hf_checkpoint.
  • Calibration: 512 samples from cnn_dailymail, sequence length 512, batch size 2.

Runtime

Use a recent vLLM build with ModelOpt quantization support on NVIDIA Blackwell:

vllm serve rressl/GRM-2.6-Plus-NVFP4 \
  --quantization modelopt \
  --max-model-len 262144 \
  --gpu-memory-utilization 0.90 \
  --kv-cache-dtype fp8 \
  --attention-backend flashinfer \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --reasoning-parser qwen3 \
  --trust-remote-code

NVFP4 requires Blackwell-class NVIDIA hardware for the fast path.

Downloads last month
-
Safetensors
Model size
15B params
Tensor type
BF16
·
F8_E4M3
·
U8
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for ressl/GRM-2.6-Plus-NVFP4

Base model

Qwen/Qwen3.6-27B
Quantized
(9)
this model