VibeThinker-3B-MLX-nvfp4

This repository contains the 4-bit NVFP4 quantized weights for WeiboAI/VibeThinker-3B, optimized for deployment on Apple Silicon via MLX framework wrappers like oMLX.

VibeThinker-3B is a 3-billion parameter dense reasoning specialist developed by Sina Weibo Inc. Built on top of Qwen2.5-Coder-3B using the Spectrum-to-Signal post-training principle (Curriculum SFT + Multi-Domain RL + Offline Self-Distillation), it delivers frontier-tier performance in strict math, coding, and verifiable logical reasoning tasks while maintaining a remarkably small foot-print.


🚀 Quantization Efficiency Performance (vs. BF16 Baseline)

Benchmarked on an Apple Silicon M5 Max using the oMLX inference engine, this nvfp4 quantization achieves profound speedups and massive memory reductions over the unquantized BF16 model, with virtually no degradation in core reasoning accuracy.

Key Efficiency Wins:

  • ⚡ Generation Speedup: Achieves a massive ~3.05× throughput increase during generation (tg TPS jumps from ~80 tok/s to 246.0 tok/s at single-request context).
  • 📉 VRAM Footprint Reduction: Memory consumption drops by ~61.2%, requiring only 2.46 GB of peak memory compared to the 6.35 GB required by the BF16 baseline.
  • 📈 Batched Scaling: Under continuous batching (4x batch size), token generation throughput scales efficiently to 459.0 tok/s (a 1.87× speedup over the 1x baseline).
  • ⏱️ Latency: End-to-End time for standard context generation drops by 57.2%, finishing tasks in just 0.80 seconds compared to 1.87 seconds on BF16.

📊 Evaluation Intelligence Benchmarks

Evaluated directly using this VibeThinker-3B-MLX-nvfp4 checkpoint:

Benchmark Accuracy
MMLU 71.5%
HUMANEVAL 92.0%
MBPP 82.0%
GSM8K 95.0%
MATHQA 91.0%

⚠️ Note on Scope: VibeThinker-3B is an extreme reasoning core tailored specifically for domains with clear verification signals (Math, Competitive Programming, STEM). It is not optimized for open-domain factual knowledge, general chat conversation, or agentic tool calling.


🛠️ Usage & Quickstart

To run this model, ensure you are utilizing an inference engine capable of loading the nvfp4 metadata wrapper layout natively (such as oMLX or updated versions of mlx-lm).

Example using oMLX CLI

# Clone and build omlx environment if you haven't already
# Run the model natively using the Auto engine:
omlx bench --model your-hf-username/VibeThinker-3B-MLX-nvfp4 --prompt "Your math/code problem here"
Benchmark table
Downloads last month
-
Safetensors
Model size
0.8B params
Tensor type
U8
·
U32
·
BF16
·
MLX
Hardware compatibility
Log In to add your hardware

4-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for bkideas/VibeThinker-3B-MLX-nvfp4

Base model

Qwen/Qwen2.5-3B
Quantized
(42)
this model