Libraries

How to use r0b0tlab/VibeThinker-3B-NVFP4 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="r0b0tlab/VibeThinker-3B-NVFP4")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("r0b0tlab/VibeThinker-3B-NVFP4")
model = AutoModelForCausalLM.from_pretrained("r0b0tlab/VibeThinker-3B-NVFP4")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use r0b0tlab/VibeThinker-3B-NVFP4 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "r0b0tlab/VibeThinker-3B-NVFP4"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "r0b0tlab/VibeThinker-3B-NVFP4",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/r0b0tlab/VibeThinker-3B-NVFP4

SGLang

How to use r0b0tlab/VibeThinker-3B-NVFP4 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "r0b0tlab/VibeThinker-3B-NVFP4" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "r0b0tlab/VibeThinker-3B-NVFP4",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "r0b0tlab/VibeThinker-3B-NVFP4" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "r0b0tlab/VibeThinker-3B-NVFP4",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use r0b0tlab/VibeThinker-3B-NVFP4 with Docker Model Runner:
```
docker model run hf.co/r0b0tlab/VibeThinker-3B-NVFP4
```

VibeThinker-3B-NVFP4

NVFP4 (W4A4) quantization of WeiboAI/VibeThinker-3B — a 3B reasoning model optimized for math, coding, and STEM tasks.

Compression: 5.8 GB BF16 → 2.20 GB (2.6×)
Performance: 2.60× throughput vs BF16 at c1 (71.3 vs 27.4 tok/s)
Quality: Identity tasks 4/4 correct. Reasoning quality is expected to be preserved but has not yet been re-measured on full benchmark suites.

Credits and Attribution

Base model: WeiboAI/VibeThinker-3B — 3B reasoning model by WeiboAI, fine-tuned from Qwen2.5-Coder-3B (Alibaba/Qwen team)
Quantization: NVIDIA Model Optimizer 0.44.0 (NVIDIA/TensorRT-Model-Optimizer)
Calibration data: CNN/DailyMail (`abisee/cnn_dailymail`) by See et al.
Inference: vLLM 0.23.0 (vllm-project/vllm) with FlashInfer CUTLASS NVFP4 backend
Hardware: Tested on NVIDIA DGX Spark (GB10, SM121)
Prior art: NVIDIA official NVFP4 checkpoints, bg-digitalservices Gemma-4 MoE unfuse plugin, FastContext-4B NVFP4 precedent

Quick Start

# Download from HuggingFace
hf download r0b0tlab/VibeThinker-3B-NVFP4 --local-dir ./vibethinker-3b-nvfp4

# Serve with vLLM 0.22.0+ (pip or Docker)
vllm serve ./vibethinker-3b-nvfp4 \
    --quantization modelopt \
    --kv-cache-dtype fp8 \
    --attention-backend flashinfer \
    --gpu-memory-utilization 0.85 \
    --max-model-len 32768 \
    --enable-prefix-caching \
    --enforce-eager \
    --trust-remote-code

# Or via Docker
docker run --gpus all -v $(pwd):/mnt/model:ro \
    -p 8000:8000 ghcr.io/r0b0tlab/vibethinker-3b-nvfp4

Quantization Recipe

Parameter	Value
Tool	NVIDIA Model Optimizer 0.44.0
Config	`NVFP4_DEFAULT_CFG` (W4A4)
Group size	16
Calibration dataset	`abisee/cnn_dailymail` 3.0.0
Calibration samples	512
Sequence length	1024
Batch size	16
Export	`export_hf_checkpoint` with `torch.inference_mode()`
Quantizers	903 (covering all Linear layers)
Exclusions	lm_head (tied with embed_tokens: `tie_word_embeddings=true`)
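
To make the "Group size 16" row concrete, the sketch below simulates block-scaled 4-bit quantization in plain Python. It assumes the E2M1 value grid (magnitudes 0, 0.5, 1, 1.5, 2, 3, 4, 6) and one scale per 16-element block, and it keeps that scale in full precision instead of a low-precision format. The names `quantize_block` and `fake_quantize` are hypothetical; this illustrates the idea and is not Model Optimizer's implementation.

```python
# Simplified simulation of block-scaled 4-bit quantization.
# Assumptions: E2M1 magnitude grid, one full-precision scale per block.
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
GROUP_SIZE = 16  # matches "Group size" in the recipe table


def quantize_block(block):
    """Quantize one block to the E2M1 grid; return (scale, dequantized values)."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return 0.0, [0.0] * len(block)
    scale = amax / E2M1_MAGNITUDES[-1]  # map the largest magnitude onto 6.0
    out = []
    for x in block:
        target = abs(x) / scale
        # Round to the nearest representable magnitude.
        q = min(E2M1_MAGNITUDES, key=lambda m: abs(m - target))
        out.append((q if x >= 0 else -q) * scale)
    return scale, out


def fake_quantize(values, group_size=GROUP_SIZE):
    """Quantize and dequantize a flat list of weights block by block."""
    result = []
    for start in range(0, len(values), group_size):
        _, deq = quantize_block(values[start:start + group_size])
        result.extend(deq)
    return result


if __name__ == "__main__":
    weights = [((i * 37) % 101 - 50) / 50.0 for i in range(32)]  # toy data
    deq = fake_quantize(weights)
    max_err = max(abs(a - b) for a, b in zip(weights, deq))
    print(f"max abs error over {len(weights)} toy weights: {max_err:.4f}")
```

Because each block of 16 values gets its own scale, one outlier only degrades the resolution of its own block. Under the further assumption that each block scale is stored in 8 bits, the cost is 4 + 8/16 = 4.5 bits per quantized weight.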

Reproduce

uv venv .venv --python 3.12
source .venv/bin/activate
uv pip install torch --index-url https://download.pytorch.org/whl/cu130
uv pip install "transformers>=5.4" safetensors accelerate datasets
uv pip install "nvidia-modelopt[hf]>=0.44.0"

python3 -c "
import torch, modelopt.torch.quantization as mtq
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
from modelopt.torch.export import export_hf_checkpoint

# Load the BF16 base model on CPU, then move parameters and buffers to the GPU.
model = AutoModelForCausalLM.from_pretrained('WeiboAI/VibeThinker-3B',
    torch_dtype=torch.bfloat16, device_map='cpu', low_cpu_mem_usage=True)
for n, p in model.named_parameters(): p.data = p.data.to('cuda')
for n, b in model.named_buffers(): b.data = b.data.to('cuda')
tokenizer = AutoTokenizer.from_pretrained('WeiboAI/VibeThinker-3B')
# Calibration set: the first 512 CNN/DailyMail training articles.
calib = load_dataset('abisee/cnn_dailymail', '3.0.0', split='train[:512]')
# Calibration forward loop: 32 batches of 16 articles, truncated to 1024 tokens.
def fwd(m):
    for i in range(0, 512, 16):
        b = calib[i:i+16]['article']
        m(**tokenizer(b, return_tensors='pt', padding=True, truncation=True,
            max_length=1024).to('cuda'))
# Insert NVFP4 quantizers and calibrate, then export an HF-format checkpoint.
mtq.quantize(model, mtq.NVFP4_DEFAULT_CFG, fwd)
with torch.inference_mode(): export_hf_checkpoint(model, export_dir='./output')
print('Done')
"

How to Verify

# 1. Download and serve (see Quick Start above)
# 2. Test identity/reasoning
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"messages":[{"role":"user","content":"What is 15 * 7 + 3?"}],"max_tokens":256}'
# Expected: 108 (correct arithmetic through <think> reasoning)
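
The same check can be scripted. The sketch below is a minimal standard-library client, assuming the server from Quick Start is listening on `http://localhost:8000` and returns OpenAI-compatible chat responses. The helper names (`build_payload`, `strip_think`, `extract_answer`, `ask`) are hypothetical and exist only for illustration.

```python
import json
import re
import urllib.request

# Assumption: the vLLM server from Quick Start is reachable at this URL.
BASE_URL = "http://localhost:8000/v1/chat/completions"


def build_payload(prompt, max_tokens=256):
    """Build the same request body as the curl command above."""
    return {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }


def strip_think(text):
    """Remove <think>...</think> spans so only the final answer remains."""
    return re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()


def extract_answer(response):
    """Pull the assistant text out of an OpenAI-compatible chat response."""
    content = response["choices"][0]["message"]["content"] or ""
    return strip_think(content)


def ask(prompt, url=BASE_URL):
    """Send one chat request and return the answer without reasoning text."""
    body = json.dumps(build_payload(prompt)).encode("utf-8")
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        return extract_answer(json.load(resp))


if __name__ == "__main__":
    try:
        answer = ask("What is 15 * 7 + 3?")
        print(answer)
        print("contains 108:", "108" in answer)
    except OSError as exc:  # server not running or unreachable
        print(f"request failed: {exc}")
```

If generation stops at `max_tokens` before the closing `</think>` tag, `strip_think` leaves the partial reasoning in place, so a missing "108" may mean the limit is too low, not that the arithmetic is wrong.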

Benchmarks (GB10 / SM121)

c1 tg128 (prefill=2048, output=128, vLLM 0.23.0, random dataset):

Metric	BF16	NVFP4	Speedup
Output tok/s	27.4	71.3	2.60×
TTFT	233 ms	51 ms	4.57×
TPOT	34.8 ms	13.6 ms	2.56×
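
The three metrics are related. As a rough cross-check, the sketch below estimates single-request throughput from TTFT and TPOT, assuming end-to-end time is TTFT plus one TPOT interval for each output token after the first. That model and the function name `estimated_tok_per_s` are assumptions for illustration, not the benchmark tool's exact accounting.

```python
def estimated_tok_per_s(ttft_ms, tpot_ms, output_tokens=128):
    """Estimate single-request output throughput from TTFT and TPOT."""
    # Assumed latency model: first token after TTFT, then one TPOT per token.
    total_ms = ttft_ms + (output_tokens - 1) * tpot_ms
    return output_tokens / (total_ms / 1000.0)


# (TTFT ms, TPOT ms, reported tok/s) copied from the table above.
ROWS = {"BF16": (233.0, 34.8, 27.4), "NVFP4": (51.0, 13.6, 71.3)}

for name, (ttft, tpot, reported) in ROWS.items():
    est = estimated_tok_per_s(ttft, tpot)
    print(f"{name}: estimated {est:.1f} tok/s, reported {reported} tok/s")
```

The estimates land within about 1% of the reported figures (roughly 27.5 and 72.0 tok/s), which is plausible given that the table values are rounded.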

Concurrency ramp (tg128):

Concurrency	BF16 tok/s	NVFP4 tok/s	Speedup
c1	27.4	71.3	2.60×
c2	65.5	128.0	1.96×
c4	123.3	243.1	1.97×
c8	219.5	423.0	1.93×
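
The ramp can also be read as scaling efficiency: aggregate throughput divided by what perfect linear scaling from c1 would give. The sketch below recomputes both the speedup column and that efficiency from the table values. The function names are hypothetical, and "efficiency" as defined here is an illustrative metric that does not appear in the benchmark report.

```python
# Concurrency -> (BF16 tok/s, NVFP4 tok/s), copied from the table above.
RAMP = {1: (27.4, 71.3), 2: (65.5, 128.0), 4: (123.3, 243.1), 8: (219.5, 423.0)}


def speedup(bf16, nvfp4):
    """NVFP4 throughput relative to BF16 at the same concurrency."""
    return nvfp4 / bf16


def scaling_efficiency(tok_s, base_tok_s, concurrency):
    """Aggregate throughput relative to perfect linear scaling from c1."""
    return tok_s / (base_tok_s * concurrency)


base_bf16, base_nvfp4 = RAMP[1]
for c, (bf16, nvfp4) in RAMP.items():
    print(
        f"c{c}: speedup {speedup(bf16, nvfp4):.2f}x, "
        f"BF16 efficiency {scaling_efficiency(bf16, base_bf16, c):.2f}, "
        f"NVFP4 efficiency {scaling_efficiency(nvfp4, base_nvfp4, c):.2f}"
    )
```

Recomputing from the rounded table values can differ in the last digit: c2 comes out as 1.95× here against 1.96× in the table, which is consistent with the table having been computed from unrounded throughputs.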

Depth stability (c1 tg128):

Depth	NVFP4 tok/s	TTFT
d0	71.3	51 ms
d4096	67.7	76 ms
d8192	60.5	274 ms
d16384	56.4	383 ms

Full benchmark report: github.com/r0b0tlab/vibethinker-3b-nvfp4

`hf_quant_config.json`

{
    "producer": {"name": "modelopt", "version": "0.44.0"},
    "quantization": {
        "quant_algo": "NVFP4",
        "kv_cache_quant_algo": null,
        "group_size": 16,
        "exclude_modules": ["lm_head"]
    }
}
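
A short script can confirm that a downloaded checkpoint carries the expected settings. The sketch below parses the JSON shown above from an inline string so that it runs on its own. The function `summarize_quant_config` is hypothetical, and the field layout is assumed to match this file exactly.

```python
import json

# Copied from the hf_quant_config.json shown above.
HF_QUANT_CONFIG = """
{
    "producer": {"name": "modelopt", "version": "0.44.0"},
    "quantization": {
        "quant_algo": "NVFP4",
        "kv_cache_quant_algo": null,
        "group_size": 16,
        "exclude_modules": ["lm_head"]
    }
}
"""


def summarize_quant_config(cfg):
    """Flatten the fields that matter when checking a checkpoint."""
    q = cfg["quantization"]
    return {
        "algo": q["quant_algo"],
        "group_size": q["group_size"],
        "kv_cache_quantized": q["kv_cache_quant_algo"] is not None,
        "excluded": list(q.get("exclude_modules", [])),
        "producer": f'{cfg["producer"]["name"]} {cfg["producer"]["version"]}',
    }


if __name__ == "__main__":
    summary = summarize_quant_config(json.loads(HF_QUANT_CONFIG))
    for key, value in summary.items():
        print(f"{key}: {value}")
```

To check a local copy, read `./vibethinker-3b-nvfp4/hf_quant_config.json` (the directory used in Quick Start) instead of the inline string. Since `kv_cache_quant_algo` is null, the `--kv-cache-dtype fp8` flag in Quick Start appears to be a serving-time choice and not a property of the checkpoint.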

Notes and Limitations

This is an NVFP4 quantization of a reasoning model. The model outputs `<think>` tags, so pass `--reasoning-parser deepseek_r1` to vLLM for chat use.
Quantization scope: All Linear layers (903 quantizers). lm_head excluded (tied with embed_tokens). No multimodal exclusion needed (text-only model).
Calibration is text-only. The model has no vision/audio components.
Small model dimensions (hidden=2048): This model benefits significantly from FP4 tensor cores. Larger models with hidden ≥ 4096 may show smaller or even negative speedups, so always benchmark your specific model.
Not re-tested on full benchmark suites: Correctness was verified on identity tasks only (4/4 passed). Full benchmark evals (AIME, LiveCodeBench) have not yet been reproduced on this quantized checkpoint, so preservation of the original model's reasoning quality is expected but not yet measured.

License

MIT (inherited from base model WeiboAI/VibeThinker-3B).
Quantization artifact copyright 2026 r0b0tlab, distributed under the same MIT license.
Calibration data from CNN/DailyMail (Apache 2.0).

Citation

@misc{vibethinker2026,
  title={VibeThinker: Optimizing Post-training for Small Model Reasoning},
  author={WeiboAI},
  year={2026},
  url={https://huggingface.co/papers/2606.16140}
}

@misc{qwen2.5-coder2025,
  title={Qwen2.5-Coder: Code is More Than Language},
  author={Qwen Team, Alibaba Group},
  year={2025},
  url={https://huggingface.co/Qwen/Qwen2.5-Coder-3B}
}

@misc{modelopt2025,
  title={NVIDIA TensorRT Model Optimizer},
  author={NVIDIA},
  year={2025},
  url={https://github.com/NVIDIA/TensorRT-Model-Optimizer}
}

@inproceedings{see2017cnndailymail,
  title={Get To The Point: Summarization with Pointer-Generator Networks},
  author={See, Abigail and Liu, Peter J. and Manning, Christopher D.},
  year={2017},
  booktitle={Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics},
  url={https://huggingface.co/datasets/abisee/cnn_dailymail}
}

@misc{vllm2025,
  title={vLLM: Easy, Fast, and Cheap LLM Serving},
  author={vLLM Team},
  year={2025},
  url={https://github.com/vllm-project/vllm}
}