VibeThinker-3B-NVFP4

NVFP4 (W4A4) quantization of WeiboAI/VibeThinker-3B — a 3B reasoning model optimized for math, coding, and STEM tasks.

Compression: 5.8 GB BF16 → 2.20 GB (2.6×)
Performance: 2.60× throughput vs BF16 at c1 (71.3 vs 27.4 tok/s)
Quality: Identity tasks 4/4 correct. Full reasoning benchmarks (AIME, LiveCodeBench) have not yet been re-run on this checkpoint.

Credits and Attribution

  • Base model: WeiboAI/VibeThinker-3B — 3B reasoning model by WeiboAI, fine-tuned from Qwen2.5-Coder-3B (Alibaba/Qwen team)
  • Quantization: NVIDIA Model Optimizer 0.44.0 (NVIDIA/TensorRT-Model-Optimizer)
  • Calibration data: CNN/DailyMail (abisee/cnn_dailymail) by See et al.
  • Inference: vLLM 0.23.0 (vllm-project/vllm) with FlashInfer CUTLASS NVFP4 backend
  • Hardware: Tested on NVIDIA DGX Spark (GB10, SM121)
  • Prior art: NVIDIA official NVFP4 checkpoints, bg-digitalservices Gemma-4 MoE unfuse plugin, FastContext-4B NVFP4 precedent

Quick Start

# Download from HuggingFace
hf download r0b0tlab/VibeThinker-3B-NVFP4 --local-dir ./vibethinker-3b-nvfp4

# Serve with vLLM 0.22.0+ (pip or Docker)
vllm serve ./vibethinker-3b-nvfp4 \
    --quantization modelopt \
    --kv-cache-dtype fp8 \
    --attention-backend flashinfer \
    --gpu-memory-utilization 0.85 \
    --max-model-len 32768 \
    --enable-prefix-caching \
    --enforce-eager \
    --trust-remote-code

# Or via Docker
docker run --gpus all -v $(pwd):/mnt/model:ro \
    -p 8000:8000 ghcr.io/r0b0tlab/vibethinker-3b-nvfp4

Quantization Recipe

| Parameter | Value |
| --- | --- |
| Tool | NVIDIA Model Optimizer 0.44.0 |
| Config | NVFP4_DEFAULT_CFG (W4A4) |
| Group size | 16 |
| Calibration dataset | abisee/cnn_dailymail 3.0.0 |
| Calibration samples | 512 |
| Sequence length | 1024 |
| Batch size | 16 |
| Export | export_hf_checkpoint with torch.inference_mode() |
| Quantized layers | All Linear layers (903 quantizers) |
| Exclusions | lm_head (tied with embed_tokens: tie_word_embeddings=true) |
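
To make the group size of 16 concrete, here is a minimal Python sketch of block-wise FP4 fake quantization. It is not ModelOpt's implementation: the helper names (quantize_block, fake_quantize, bits_per_weight) are hypothetical, block scales are kept as plain Python floats rather than FP8 (E4M3) values, and the per-tensor FP32 scale that NVFP4 also uses is omitted.

```python
# Illustrative sketch only, not ModelOpt code. Assumptions: scales stay as
# Python floats (real NVFP4 stores them in FP8 E4M3) and the per-tensor
# FP32 scale is left out.
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # magnitudes representable in FP4 (E2M1)
GROUP_SIZE = 16  # group size from the recipe


def quantize_block(block):
    """Scale one block so its largest magnitude maps to 6.0, then snap to the FP4 grid."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return [0.0] * len(block), 1.0
    scale = amax / E2M1_GRID[-1]
    codes = []
    for x in block:
        # Nearest representable magnitude for the scaled value.
        mag = min(E2M1_GRID, key=lambda g: abs(abs(x) / scale - g))
        codes.append(mag if x >= 0 else -mag)
    return codes, scale


def fake_quantize(weights, group_size=GROUP_SIZE):
    """Quantize and immediately dequantize a flat list of weights, block by block."""
    out = []
    for start in range(0, len(weights), group_size):
        codes, scale = quantize_block(weights[start:start + group_size])
        out.extend(c * scale for c in codes)
    return out


def bits_per_weight(group_size=GROUP_SIZE, value_bits=4, scale_bits=8):
    """Storage cost per weight: the 4-bit value plus its share of the block scale."""
    return value_bits + scale_bits / group_size
```

With one 8-bit scale shared by 16 four-bit values, bits_per_weight() evaluates to 4.5. Layers left unquantized, such as the tied lm_head and embedding weights, stay at 16 bits, which is one reason the whole checkpoint compresses by less than 4×.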

Reproduce

uv venv .venv --python 3.12
source .venv/bin/activate
uv pip install torch --index-url https://download.pytorch.org/whl/cu130
uv pip install "transformers>=5.4" safetensors accelerate datasets
uv pip install "nvidia-modelopt[hf]>=0.44.0"

python3 - <<'PY'
import torch, modelopt.torch.quantization as mtq
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
from modelopt.torch.export import export_hf_checkpoint

# Load the BF16 weights on the CPU, then move each parameter and buffer to the GPU.
model = AutoModelForCausalLM.from_pretrained('WeiboAI/VibeThinker-3B',
    torch_dtype=torch.bfloat16, device_map='cpu', low_cpu_mem_usage=True)
for n, p in model.named_parameters(): p.data = p.data.to('cuda')
for n, b in model.named_buffers(): b.data = b.data.to('cuda')
tokenizer = AutoTokenizer.from_pretrained('WeiboAI/VibeThinker-3B')

# Calibration set: the first 512 CNN/DailyMail training articles.
calib = load_dataset('abisee/cnn_dailymail', '3.0.0', split='train[:512]')

def fwd(m):
    # 32 calibration batches of 16 articles, each truncated to 1024 tokens.
    for i in range(0, 512, 16):
        b = calib[i:i+16]['article']
        m(**tokenizer(b, return_tensors='pt', padding=True, truncation=True,
            max_length=1024).to('cuda'))

# Calibrate and quantize to NVFP4, then export a Hugging Face-style checkpoint.
mtq.quantize(model, mtq.NVFP4_DEFAULT_CFG, fwd)
with torch.inference_mode():
    export_hf_checkpoint(model, export_dir='./output')
print('Done!')
PY

How to Verify

# 1. Download and serve (see Quick Start above)
# 2. Test identity/reasoning
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"messages":[{"role":"user","content":"What is 15 * 7 + 3?"}],"max_tokens":256}'
# Expected: 108 (correct arithmetic through <think> reasoning)
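
The same check can be scripted. The sketch below uses only the Python standard library and assumes the server from Quick Start is listening on localhost:8000. The helper names (build_payload, strip_think, ask) are hypothetical, and strip_think assumes the reasoning arrives inline in the message content and ends with a closing </think> tag. If the server runs with a reasoning parser, the content may already be free of reasoning text, in which case strip_think returns it unchanged.

```python
import json
import urllib.request

# Assumption: the vLLM server from Quick Start is reachable at this address.
URL = "http://localhost:8000/v1/chat/completions"


def build_payload(question, max_tokens=256):
    """Build the same request body as the curl example."""
    return {"messages": [{"role": "user", "content": question}], "max_tokens": max_tokens}


def strip_think(text):
    """Return the text after the last </think> tag, or the whole text if there is none."""
    return text.split("</think>")[-1].strip()


def ask(question, url=URL, timeout=120):
    """POST one question to the chat completions endpoint and return the final answer text."""
    data = json.dumps(build_payload(question)).encode("utf-8")
    req = urllib.request.Request(url, data=data, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        body = json.load(resp)
    # The content can be empty if generation stops early, so fall back to "".
    content = body["choices"][0]["message"]["content"] or ""
    return strip_think(content)
```

Once the server is up, print(ask("What is 15 * 7 + 3?")) should print an answer containing 108. If the reply is cut off inside the reasoning block, raise max_tokens.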

Benchmarks (GB10 / SM121)

c1 tg128 (prefill=2048, output=128, vLLM 0.23.0, random dataset):

| Metric | BF16 | NVFP4 | Speedup |
| --- | --- | --- | --- |
| Output tok/s | 27.4 | 71.3 | 2.60× |
| TTFT | 233 ms | 51 ms | 4.57× |
| TPOT | 34.8 ms | 13.6 ms | 2.56× |
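
The three c1 metrics are consistent with each other if one assumes that a request's wall time is TTFT plus one TPOT interval for each of the remaining 127 output tokens. That relationship is an assumption about how the benchmark tool defines its metrics, not something stated in the report, and estimated_tok_per_s is a hypothetical helper.

```python
def estimated_tok_per_s(ttft_ms, tpot_ms, output_tokens=128):
    """Estimate single-request throughput from latency metrics.

    Assumes total time = TTFT + (output_tokens - 1) * TPOT.
    """
    total_s = (ttft_ms + (output_tokens - 1) * tpot_ms) / 1000.0
    return output_tokens / total_s


# Latency values taken from the c1 results (milliseconds).
bf16 = estimated_tok_per_s(ttft_ms=233, tpot_ms=34.8)
nvfp4 = estimated_tok_per_s(ttft_ms=51, tpot_ms=13.6)

print(f"BF16  estimate: {bf16:.1f} tok/s (reported 27.4)")
print(f"NVFP4 estimate: {nvfp4:.1f} tok/s (reported 71.3)")
print(f"Estimated speedup: {nvfp4 / bf16:.2f}x")
```

Both estimates land within about 1% of the reported output throughput, which suggests the throughput and latency figures come from the same runs.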

Concurrency ramp (tg128):

| Concurrency | BF16 tok/s | NVFP4 tok/s | Speedup |
| --- | --- | --- | --- |
| c1 | 27.4 | 71.3 | 2.60× |
| c2 | 65.5 | 128.0 | 1.96× |
| c4 | 123.3 | 243.1 | 1.97× |
| c8 | 219.5 | 423.0 | 1.93× |

Depth stability (c1 tg128):

| Depth | NVFP4 tok/s | TTFT |
| --- | --- | --- |
| d0 | 71.3 | 51 ms |
| d4096 | 67.7 | 76 ms |
| d8192 | 60.5 | 274 ms |
| d16384 | 56.4 | 383 ms |
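
A short Python sketch that turns the depth results into throughput retention relative to d0. The values are copied from the depth results above, and the retention helper is hypothetical.

```python
# Depth -> (NVFP4 tok/s, TTFT in ms), copied from the depth stability results.
DEPTH_RESULTS = {
    0: (71.3, 51),
    4096: (67.7, 76),
    8192: (60.5, 274),
    16384: (56.4, 383),
}


def retention(results, baseline_depth=0):
    """Return throughput at each depth as a fraction of the baseline depth's throughput."""
    base_tps = results[baseline_depth][0]
    return {depth: tps / base_tps for depth, (tps, _ttft_ms) in results.items()}


for depth, frac in retention(DEPTH_RESULTS).items():
    print(f"d{depth}: {frac:.1%} of d0 throughput")
```

By this measure decode throughput at a 16K-token depth stays at roughly 79% of the d0 figure, while TTFT grows from 51 ms to 383 ms.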

Full benchmark report: github.com/r0b0tlab/vibethinker-3b-nvfp4

hf_quant_config.json

{
    "producer": {"name": "modelopt", "version": "0.44.0"},
    "quantization": {
        "quant_algo": "NVFP4",
        "kv_cache_quant_algo": null,
        "group_size": 16,
        "exclude_modules": ["lm_head"]
    }
}
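
A quick way to inspect this file after download is a few lines of Python. The sketch below is a hypothetical helper (summarize_quant_config is not part of ModelOpt or vLLM) that relies only on the fields shown above.

```python
import json


def summarize_quant_config(cfg):
    """Flatten the fields of hf_quant_config.json into a small summary dict."""
    producer = cfg["producer"]
    quant = cfg["quantization"]
    return {
        "producer": f"{producer['name']} {producer['version']}",
        "weights": quant["quant_algo"],
        # null means the checkpoint itself carries no KV cache quantization settings.
        "kv_cache": quant["kv_cache_quant_algo"] or "none in checkpoint",
        "group_size": quant["group_size"],
        "excluded": list(quant["exclude_modules"]),
    }


# Same content as the file shown above, inlined so the sketch runs without a download.
EXAMPLE = json.loads("""
{
    "producer": {"name": "modelopt", "version": "0.44.0"},
    "quantization": {
        "quant_algo": "NVFP4",
        "kv_cache_quant_algo": null,
        "group_size": 16,
        "exclude_modules": ["lm_head"]
    }
}
""")

print(summarize_quant_config(EXAMPLE))
```

To check a downloaded copy, load ./vibethinker-3b-nvfp4/hf_quant_config.json with json.load and pass the result to the same function. Because kv_cache_quant_algo is null, the FP8 KV cache in Quick Start appears to be a serving-time choice made through --kv-cache-dtype rather than a property of the checkpoint.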

Notes and Limitations

  • This is an NVFP4 quantization of a reasoning model. The model outputs <think> tags — use --reasoning-parser deepseek_r1 in vLLM for chat use.
  • Quantization scope: All Linear layers (903 quantizers). lm_head excluded (tied with embed_tokens). No multimodal exclusion needed (text-only model).
  • Calibration is text-only. The model has no vision/audio components.
  • Small model dimensions (hidden=2048): This model benefits significantly from FP4 tensor cores. Larger models with hidden ≥ 4096 may show smaller or even negative speedups, so always benchmark your specific model.
  • Not re-tested on full benchmark suites: Correctness was verified on identity tasks only. Full benchmark evals (AIME, LiveCodeBench) have not yet been reproduced on this quantized checkpoint. The 4/4 identity-task pass is a basic sanity check; it suggests, but does not demonstrate, that the original model's reasoning quality is preserved.

License

MIT (inherited from base model WeiboAI/VibeThinker-3B).
Quantization artifact copyright 2026 r0b0tlab, distributed under the same MIT license.
Calibration data from CNN/DailyMail (Apache 2.0).

Citation

@misc{vibethinker2026,
  title={VibeThinker: Optimizing Post-training for Small Model Reasoning},
  author={WeiboAI},
  year={2026},
  url={https://huggingface.co/papers/2606.16140}
}

@misc{qwen2.5-coder2025,
  title={Qwen2.5-Coder: Code is More Than Language},
  author={Qwen Team, Alibaba Group},
  year={2025},
  url={https://huggingface.co/Qwen/Qwen2.5-Coder-3B}
}

@misc{modelopt2025,
  title={NVIDIA TensorRT Model Optimizer},
  author={NVIDIA},
  year={2025},
  url={https://github.com/NVIDIA/TensorRT-Model-Optimizer}
}

@inproceedings{see2017cnndailymail,
  title={Get To The Point: Summarization with Pointer-Generator Networks},
  author={See, Abigail and Liu, Peter J. and Manning, Christopher D.},
  year={2017},
  booktitle={Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL)},
  url={https://huggingface.co/datasets/abisee/cnn_dailymail}
}

@misc{vllm2025,
  title={vLLM: Easy, Fast, and Cheap LLM Serving},
  author={vLLM Team},
  year={2025},
  url={https://github.com/vllm-project/vllm}
}