Instructions to use bkideas/VibeThinker-3B-MLX-nvfp4 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- MLX
How to use bkideas/VibeThinker-3B-MLX-nvfp4 with MLX:

    # Make sure mlx-lm is installed
    # pip install --upgrade mlx-lm

    # Generate text with mlx-lm
    from mlx_lm import load, generate

    model, tokenizer = load("bkideas/VibeThinker-3B-MLX-nvfp4")

    prompt = "Write a story about Einstein"
    messages = [{"role": "user", "content": prompt}]
    prompt = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True
    )

    text = generate(model, tokenizer, prompt=prompt, verbose=True)

- Notebooks
- Google Colab
- Kaggle
- Local Apps
- LM Studio
- Pi
How to use bkideas/VibeThinker-3B-MLX-nvfp4 with Pi:

Start the MLX server

    # Install MLX LM:
    uv tool install mlx-lm

    # Start a local OpenAI-compatible server:
    mlx_lm.server --model "bkideas/VibeThinker-3B-MLX-nvfp4"

Configure the model in Pi

    # Install Pi:
    npm install -g @mariozechner/pi-coding-agent

    # Add to ~/.pi/agent/models.json:
    {
      "providers": {
        "mlx-lm": {
          "baseUrl": "http://localhost:8080/v1",
          "api": "openai-completions",
          "apiKey": "none",
          "models": [
            { "id": "bkideas/VibeThinker-3B-MLX-nvfp4" }
          ]
        }
      }
    }

Run Pi

    # Start Pi in your project directory:
    pi
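
If `~/.pi/agent/models.json` already holds other providers, editing it by hand risks clobbering them. The following is a minimal sketch of merging the `mlx-lm` provider entry shown above into an existing config. The helper names `with_mlx_provider` and `update_models_file` are hypothetical, and the sketch assumes the file is plain JSON with a top-level `providers` object, as in the snippet above.

```python
import json
from pathlib import Path

MODEL_ID = "bkideas/VibeThinker-3B-MLX-nvfp4"


def with_mlx_provider(config: dict, base_url: str = "http://localhost:8080/v1") -> dict:
    """Return a copy of `config` with the mlx-lm provider entry added or replaced."""
    merged = dict(config)
    providers = dict(merged.get("providers", {}))
    # Same fields as the models.json fragment in the Pi instructions.
    providers["mlx-lm"] = {
        "baseUrl": base_url,
        "api": "openai-completions",
        "apiKey": "none",
        "models": [{"id": MODEL_ID}],
    }
    merged["providers"] = providers
    return merged


def update_models_file(path: Path) -> None:
    """Hypothetical helper: read `path` if it exists, merge the provider, write it back."""
    existing = json.loads(path.read_text()) if path.exists() else {}
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(with_mlx_provider(existing), indent=2) + "\n")
```

Calling `update_models_file(Path.home() / ".pi" / "agent" / "models.json")` would apply the merge to the path named in the instructions. Other providers in the file are left untouched.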
- Hermes Agent
How to use bkideas/VibeThinker-3B-MLX-nvfp4 with Hermes Agent:

Start the MLX server

    # Install MLX LM:
    uv tool install mlx-lm

    # Start a local OpenAI-compatible server:
    mlx_lm.server --model "bkideas/VibeThinker-3B-MLX-nvfp4"

Configure Hermes

    # Install Hermes:
    curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
    hermes setup

    # Point Hermes at the local server:
    hermes config set model.provider custom
    hermes config set model.base_url http://127.0.0.1:8080/v1
    hermes config set model.default bkideas/VibeThinker-3B-MLX-nvfp4

Run Hermes

    hermes

- MLX LM
How to use bkideas/VibeThinker-3B-MLX-nvfp4 with MLX LM:

Generate or start a chat session

    # Install MLX LM
    uv tool install mlx-lm

    # Interactive chat REPL
    mlx_lm.chat --model "bkideas/VibeThinker-3B-MLX-nvfp4"

Run an OpenAI-compatible server

    # Install MLX LM
    uv tool install mlx-lm

    # Start the server (listens on port 8080 by default)
    mlx_lm.server --model "bkideas/VibeThinker-3B-MLX-nvfp4"

    # Call the OpenAI-compatible server with curl
    curl -X POST "http://localhost:8080/v1/chat/completions" \
      -H "Content-Type: application/json" \
      --data '{
        "model": "bkideas/VibeThinker-3B-MLX-nvfp4",
        "messages": [
          {"role": "user", "content": "Hello"}
        ]
      }'

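
The same request can be sent from Python using only the standard library. This is a minimal sketch, assuming the server is listening on the default `mlx_lm.server` port 8080 (change `BASE_URL` if you started it with a different `--port`) and that the response follows the usual OpenAI chat-completions shape. The function names `build_chat_request` and `chat` are illustrative, not part of mlx-lm.

```python
import json
import urllib.request

# Assumption: server started with `mlx_lm.server` on its default port.
BASE_URL = "http://localhost:8080/v1"
MODEL_ID = "bkideas/VibeThinker-3B-MLX-nvfp4"


def build_chat_request(prompt: str, max_tokens: int = 512) -> dict:
    """Build an OpenAI-style chat completion payload for a single user turn."""
    return {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }


def chat(prompt: str, base_url: str = BASE_URL, timeout: float = 120.0) -> str:
    """POST the payload to /chat/completions and return the reply text."""
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(build_chat_request(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    # Assumes the standard OpenAI response layout: choices[0].message.content
    return body["choices"][0]["message"]["content"]
```

With the server running, `chat("What is 17 * 23? Show your reasoning.")` returns the model's reply as a string. Reasoning models can produce long outputs, so `max_tokens` may need to be raised for harder problems.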
VibeThinker-3B-MLX-nvfp4
This repository contains the 4-bit NVFP4 quantized weights for WeiboAI/VibeThinker-3B, intended for deployment on Apple Silicon through MLX-based inference engines such as oMLX.
VibeThinker-3B is a 3-billion-parameter dense reasoning specialist developed by Sina Weibo Inc. Built on top of Qwen2.5-Coder-3B using the Spectrum-to-Signal post-training principle (Curriculum SFT + Multi-Domain RL + Offline Self-Distillation), it delivers frontier-tier performance on math, coding, and verifiable logical reasoning tasks while maintaining a small footprint.
🚀 Quantization Efficiency Performance (vs. BF16 Baseline)
Benchmarked on an Apple Silicon M5 Max using the oMLX inference engine, this nvfp4 quantization generates roughly 3× faster and uses about 61% less peak memory than the unquantized BF16 model, with virtually no degradation in core reasoning accuracy.
Key Efficiency Wins:
- ⚡ Generation Speedup: Achieves a ~3.05× throughput increase during generation (`tg TPS` rises from ~80 tok/s to 246.0 tok/s for a single request).
- 📉 VRAM Footprint Reduction: Memory consumption drops by ~61.2%, requiring only 2.46 GB of peak memory compared to the 6.35 GB required by the BF16 baseline.
- 📈 Batched Scaling: Under continuous batching with a batch size of 4, token generation throughput scales to 459.0 tok/s (a 1.87× speedup over the single-request figure of 246.0 tok/s).
- ⏱️ Latency: End-to-end time for standard context generation drops by 57.2%, from 1.87 seconds on BF16 to 0.80 seconds.
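
The derived figures in the list above can be re-computed from the raw numbers it quotes. The sketch below does that arithmetic and nothing more; it uses only the rounded values given in the bullets, so small differences from the reported percentages are expected (for example, the rounded memory figures give about 61.3% rather than 61.2%).

```python
def percent_drop(before: float, after: float) -> float:
    """Relative reduction from `before` to `after`, in percent."""
    return (before - after) / before * 100.0


# Values quoted in the efficiency bullets (rounded as published).
single_tps = 246.0       # nvfp4 tok/s, single request
batched_tps = 459.0      # nvfp4 tok/s, batch size 4
reported_speedup = 3.05  # nvfp4 vs. BF16 generation speedup

# BF16 throughput implied by the reported speedup (listed as "~80 tok/s").
implied_bf16_tps = single_tps / reported_speedup

memory_drop = percent_drop(before=6.35, after=2.46)   # peak memory, GB
batch_scaling = batched_tps / single_tps              # batch 4 vs. single request
latency_drop = percent_drop(before=1.87, after=0.80)  # end-to-end time, seconds

print(f"implied BF16 throughput: {implied_bf16_tps:.1f} tok/s")
print(f"memory reduction:        {memory_drop:.1f}%")
print(f"batch scaling:           {batch_scaling:.2f}x")
print(f"latency reduction:       {latency_drop:.1f}%")
```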
📊 Evaluation Intelligence Benchmarks
Evaluated directly using this VibeThinker-3B-MLX-nvfp4 checkpoint:
| Benchmark | Accuracy |
|---|---|
| MMLU | 71.5% |
| HumanEval | 92.0% |
| MBPP | 82.0% |
| GSM8K | 95.0% |
| MathQA | 91.0% |
⚠️ Note on Scope: VibeThinker-3B is a specialized reasoning model tailored to domains with clear verification signals (Math, Competitive Programming, STEM). It is not optimized for open-domain factual knowledge, general conversation, or agentic tool calling.
🛠️ Usage & Quickstart
To run this model, use an inference engine that can load the nvfp4 quantization layout natively (such as oMLX or a recent version of mlx-lm).
Example using the oMLX CLI

    # Clone and build the oMLX environment if you haven't already.
    # Run the model using the Auto engine:
    omlx bench --model bkideas/VibeThinker-3B-MLX-nvfp4 --prompt "Your math/code problem here"