Instructions for using bkideas/VibeThinker-3B-MLX-nvfp4 with libraries and local apps.

Libraries

How to use bkideas/VibeThinker-3B-MLX-nvfp4 with MLX:

# Make sure mlx-lm is installed
# pip install --upgrade mlx-lm

# Generate text with mlx-lm
from mlx_lm import load, generate

model, tokenizer = load("bkideas/VibeThinker-3B-MLX-nvfp4")

prompt = "Write a story about Einstein"
messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True
)

text = generate(model, tokenizer, prompt=prompt, verbose=True)

Local Apps

How to use bkideas/VibeThinker-3B-MLX-nvfp4 with Pi:

Start the MLX server

# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "bkideas/VibeThinker-3B-MLX-nvfp4"
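
An agent started right after the server command may fail to connect if the server is not yet accepting connections. A minimal readiness check could look like the sketch below. It assumes the server listens on 127.0.0.1:8080, the address used in the client configuration in this section. `wait_for_port` is a hypothetical helper, not part of mlx-lm, and it only confirms that something is listening on the port, not that the model has finished loading.

```python
import socket
import time


# Hypothetical helper (not part of mlx-lm): poll a TCP port until it accepts connections.
def wait_for_port(host: str = "127.0.0.1", port: int = 8080,
                  timeout: float = 60.0, interval: float = 0.5) -> bool:
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means some process is accepting connections.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Connection refused or timed out: wait a little and retry.
            time.sleep(interval)
    return False
```

Calling `wait_for_port()` before launching the agent returns `True` as soon as the port is open, or `False` after 60 seconds.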

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "mlx-lm": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "bkideas/VibeThinker-3B-MLX-nvfp4"
        }
      ]
    }
  }
}
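
If `~/.pi/agent/models.json` already lists other providers, pasting the snippet over it would drop them. The sketch below merges the same `mlx-lm` entry into an existing file instead. `add_mlx_provider` is a hypothetical helper written for this illustration, not part of Pi, and it assumes the file, if present, contains valid JSON.

```python
import json
from pathlib import Path


# Hypothetical helper (not part of Pi): add or replace the "mlx-lm" provider entry.
def add_mlx_provider(config_path: Path, model_id: str,
                     base_url: str = "http://localhost:8080/v1") -> dict:
    """Merge the "mlx-lm" provider entry into a Pi models.json file."""
    # Start from the existing file if there is one, so other providers survive.
    config = json.loads(config_path.read_text()) if config_path.exists() else {}
    config.setdefault("providers", {})["mlx-lm"] = {
        "baseUrl": base_url,
        "api": "openai-completions",
        "apiKey": "none",
        "models": [{"id": model_id}],
    }
    # Create ~/.pi/agent (or whichever parent directory) if it does not exist yet.
    config_path.parent.mkdir(parents=True, exist_ok=True)
    config_path.write_text(json.dumps(config, indent=2) + "\n")
    return config
```

For example: `add_mlx_provider(Path.home() / ".pi" / "agent" / "models.json", "bkideas/VibeThinker-3B-MLX-nvfp4")`.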

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use bkideas/VibeThinker-3B-MLX-nvfp4 with Hermes Agent:

Start the MLX server

# Install MLX LM:
uv tool install mlx-lm
# Start a local OpenAI-compatible server:
mlx_lm.server --model "bkideas/VibeThinker-3B-MLX-nvfp4"

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default bkideas/VibeThinker-3B-MLX-nvfp4

Run Hermes

hermes

MLX LM

How to use bkideas/VibeThinker-3B-MLX-nvfp4 with MLX LM:

Generate or start a chat session

# Install MLX LM
uv tool install mlx-lm
# Interactive chat REPL
mlx_lm.chat --model "bkideas/VibeThinker-3B-MLX-nvfp4"

Run an OpenAI-compatible server

# Install MLX LM
uv tool install mlx-lm
# Start the server
mlx_lm.server --model "bkideas/VibeThinker-3B-MLX-nvfp4"
# Calling the OpenAI-compatible server with curl
curl -X POST "http://localhost:8080/v1/chat/completions" \
   -H "Content-Type: application/json" \
   --data '{
     "model": "bkideas/VibeThinker-3B-MLX-nvfp4",
     "messages": [
       {"role": "user", "content": "Hello"}
     ]
   }'
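
The same request can be sent from Python using only the standard library. This is a minimal sketch, not an official client. It assumes the server is on port 8080, which is the `mlx_lm.server` default and the port the Pi and Hermes configurations above point at; pass a different `base_url` if the server was started with `--port`. The function names are hypothetical, and the response is assumed to follow the usual OpenAI chat-completions shape.

```python
import json
import urllib.request


# Hypothetical helper: build the POST request without sending it.
def build_chat_request(prompt: str, model: str,
                       base_url: str = "http://localhost:8080/v1") -> urllib.request.Request:
    """Build a POST request for the OpenAI-style /chat/completions endpoint."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical helper: send one user message to the running server.
def chat(prompt: str, model: str = "bkideas/VibeThinker-3B-MLX-nvfp4") -> str:
    """Send one user message and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt, model)) as resp:
        body = json.load(resp)
    # Assumed response shape: {"choices": [{"message": {"content": ...}}]}.
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(chat("Hello"))` prints the model's reply.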

VibeThinker-3B-MLX-nvfp4

This repository contains the 4-bit NVFP4-quantized weights for WeiboAI/VibeThinker-3B, packaged for deployment on Apple Silicon with MLX-based inference engines such as oMLX.

VibeThinker-3B is a 3-billion-parameter dense reasoning model developed by Sina Weibo Inc. It is built on Qwen2.5-Coder-3B using the Spectrum-to-Signal post-training principle (Curriculum SFT + Multi-Domain RL + Offline Self-Distillation) and targets math, coding, and verifiable logical reasoning tasks, where it delivers frontier-tier performance while keeping a small footprint.

🚀 Quantization Efficiency Performance (vs. BF16 Baseline)

Benchmarked on an Apple Silicon M5 Max with the oMLX inference engine, this nvfp4 quantization generates roughly 3× faster and uses about 61% less peak memory than the unquantized BF16 model, with virtually no degradation in core reasoning accuracy.

Key Efficiency Wins:

⚡ Generation Speedup: ~3.05× higher generation throughput (tg TPS rises from ~80 tok/s to 246.0 tok/s for a single request).
📉 Memory Footprint Reduction: Peak memory drops by ~61.2%, to 2.46 GB from the 6.35 GB required by the BF16 baseline.
📈 Batched Scaling: Under continuous batching (batch size 4), token generation throughput scales to 459.0 tok/s (a 1.87× speedup over the single-request figure).
⏱️ Latency: End-to-end time for standard-context generation drops by 57.2%, from 1.87 seconds on BF16 to 0.80 seconds.
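
The ratios in this list can be rechecked from the reported figures. The sketch below only redoes the arithmetic on the rounded numbers quoted above; it does not run any benchmark, and the helper names are illustrative.

```python
# Figures as reported in the list above (M5 Max, oMLX).
bf16_tps, nvfp4_tps = 80.0, 246.0        # single-request generation, tok/s (BF16 value is approximate)
bf16_mem_gb, nvfp4_mem_gb = 6.35, 2.46   # peak memory, GB
batched_tps = 459.0                      # batch size 4, tok/s
bf16_e2e_s, nvfp4_e2e_s = 1.87, 0.80     # end-to-end latency, seconds


def speedup(new: float, old: float) -> float:
    """Throughput ratio: how many times faster `new` is than `old`."""
    return new / old


def reduction_pct(new: float, old: float) -> float:
    """Percentage by which `new` is smaller than `old`."""
    return (1.0 - new / old) * 100.0


print(f"generation speedup: {speedup(nvfp4_tps, bf16_tps):.2f}x")
print(f"memory reduction:   {reduction_pct(nvfp4_mem_gb, bf16_mem_gb):.1f}%")
print(f"batched scaling:    {speedup(batched_tps, nvfp4_tps):.2f}x")
print(f"latency reduction:  {reduction_pct(nvfp4_e2e_s, bf16_e2e_s):.1f}%")
```

The results land within rounding distance of the quoted values. Small differences in the first two are expected because the BF16 throughput is given only approximately and the memory figures are rounded.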

📊 Evaluation Intelligence Benchmarks

Evaluated directly using this VibeThinker-3B-MLX-nvfp4 checkpoint:

| Benchmark | Accuracy |
|-----------|----------|
| MMLU      | 71.5%    |
| HumanEval | 92.0%    |
| MBPP      | 82.0%    |
| GSM8K     | 95.0%    |
| MathQA    | 91.0%    |

⚠️ Note on Scope: VibeThinker-3B is a specialized reasoning model tailored to domains with clear verification signals (math, competitive programming, STEM). It is not optimized for open-domain factual knowledge, general chat conversation, or agentic tool calling.

🛠️ Usage & Quickstart

To run this model, use an inference engine that can natively load the nvfp4 layout and its metadata, such as oMLX or a recent version of mlx-lm.

Example using the `oMLX` CLI

# Clone and build the oMLX environment if you haven't already
# Run the model natively using the Auto engine:
omlx bench --model bkideas/VibeThinker-3B-MLX-nvfp4 --prompt "Your math/code problem here"

Safetensors

Model size: 0.8B params
Tensor type: U32, BF16
MLX: 4-bit

Model tree for bkideas/VibeThinker-3B-MLX-nvfp4

Base model: Qwen/Qwen2.5-3B
Finetuned: Qwen/Qwen2.5-Coder-3B
Finetuned: WeiboAI/VibeThinker-3B
Quantized (42): this model