Instructions to use WeiboAI/VibeThinker-3B with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use WeiboAI/VibeThinker-3B with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="WeiboAI/VibeThinker-3B")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForMultimodalLM

tokenizer = AutoTokenizer.from_pretrained("WeiboAI/VibeThinker-3B")
model = AutoModelForMultimodalLM.from_pretrained("WeiboAI/VibeThinker-3B")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))


vLLM

How to use WeiboAI/VibeThinker-3B with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "WeiboAI/VibeThinker-3B"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "WeiboAI/VibeThinker-3B",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
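
The same request can also be sent from Python. The following is a minimal sketch using only the standard library, assuming the server started above is listening on `http://localhost:8000` and returns the usual OpenAI-style chat completion body. The helper names `build_chat_request` and `chat` are hypothetical and exist only for this illustration.

```python
import json
import urllib.request


def build_chat_request(prompt, base_url="http://localhost:8000",
                       model="WeiboAI/VibeThinker-3B"):
    """Build the POST request that mirrors the curl call (no network access here)."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, **kwargs):
    """Send the request and return the first choice's text.

    Assumes an OpenAI-style response: choices[0].message.content.
    """
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


if __name__ == "__main__":
    # Requires a running server; this call fails otherwise.
    print(chat("What is the capital of France?"))
```

Passing `base_url="http://localhost:30000"` should point the same helper at a server listening on port 30000 instead.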

Use Docker

docker model run hf.co/WeiboAI/VibeThinker-3B

SGLang

How to use WeiboAI/VibeThinker-3B with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "WeiboAI/VibeThinker-3B" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "WeiboAI/VibeThinker-3B",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "WeiboAI/VibeThinker-3B" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "WeiboAI/VibeThinker-3B",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use WeiboAI/VibeThinker-3B with Docker Model Runner:
```
docker model run hf.co/WeiboAI/VibeThinker-3B
```

some benchmark results for ZebraLogic

#10

by khanh2023 - opened 4 days ago

khanh2023

4 days ago

edited 2 days ago

I benchmarked VibeThinker-3B on a subset of ZebraLogic consisting of all 3x3 and 4x4 examples, 80 data points in total. I used the recommended sampling parameters from the technical report and set the maximum completion tokens to 40k, which is plenty for ZebraLogic. On average, VibeThinker-3B uses 7000 tokens per question, including the tokens spent refining the JSON output by iteratively asking it to produce the correct format.
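
For readers who want to reproduce the format-refinement step, here is a rough sketch of what such a retry loop could look like. It is not the actual harness used for these numbers: `ask_model` is a hypothetical callable that takes a list of chat messages and returns the reply text, the wording of the correction prompt is made up, and the sketch assumes the reply is expected to be a bare JSON object.

```python
import json
from typing import Callable, Optional, Tuple

MAX_RETRIES = 5  # the post gives up after 5 retries


def solve_with_json_retries(
    ask_model: Callable[[list], str],  # hypothetical: chat messages in, reply text out
    question: str,
    max_retries: int = MAX_RETRIES,
) -> Tuple[Optional[dict], int]:
    """Return (parsed_answer, retries_used); parsed_answer is None if nothing parsed."""
    messages = [{"role": "user", "content": question}]
    for attempt in range(max_retries + 1):  # first try plus up to max_retries re-asks
        reply = ask_model(messages)
        try:
            parsed = json.loads(reply)
            if isinstance(parsed, dict):
                return parsed, attempt
            error = "top-level JSON value is not an object"
        except json.JSONDecodeError as exc:
            error = str(exc)
        # Feed the parse error back so the model can correct its own output.
        messages.append({"role": "assistant", "content": reply})
        messages.append({
            "role": "user",
            "content": f"Your answer was not valid JSON ({error}). "
                       "Reply with only the corrected JSON object.",
        })
    return None, max_retries
```

A reply that parses on the first attempt comes back with `retries_used == 0`, which is how a "correct first try" count can be separated from answers that needed re-asking.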

Out of the 80 examples, the bf16 version answered correctly on the first try 54 times. There are 3 examples where it could not produce the correct JSON format after 5 retries. Even if we count all 3 of them as correct, the accuracy of VibeThinker-3B only reaches 85%, similar to Qwen3.5 4B q4.
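
As a quick sanity check on these figures, the arithmetic can be worked out as below. This is only a sketch: it assumes the 85% is measured over all 80 examples and already includes the 3 format failures counted as correct. The function name `summarize` is made up for the illustration.

```python
def summarize(total=80, first_try_correct=54, format_failures=3,
              reported_accuracy=0.85):
    """Derive the counts implied by the figures quoted in the post."""
    # Correct answers if 85% is out of all 80 examples (assumption).
    credited_correct = round(reported_accuracy * total)
    return {
        "first_try_rate": first_try_correct / total,
        "credited_correct": credited_correct,
        # Correct answers when the 3 format failures are not given credit.
        "correct_without_credit": credited_correct - format_failures,
    }


stats = summarize()
print(f"first-try accuracy: {stats['first_try_rate']:.1%}")  # 67.5%
print(f"correct with credit: {stats['credited_correct']} / 80")  # 68 / 80
print(f"correct without credit: {stats['correct_without_credit']} / 80")  # 65 / 80
```

Under that reading, the first-try accuracy is 67.5% and the retries account for the rest of the gap up to the 85% ceiling.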

Below are the benchmark numbers for some of the models I have access to.

lsx666

WeiboAI org 3 days ago

Thanks for running this benchmark — this is very useful.

One thing I’d be curious about is how much of the result comes from the mixed_4_6 quantization. VibeThinker-3B seems quite sensitive to quantization, especially on tasks that require both long reasoning and strict structured output like JSON.

Would you be willing to also test the unquantized HF version if possible? That would help us understand whether the JSON failures are mainly from the base model behavior, the quantization, or the inference/template setup.

Either way, the self-correction after parse errors is an interesting signal. Thanks again for sharing the numbers.

khanh2023

2 days ago

I have updated the results for both the mixed_4_6 quantization and bf16. VibeThinker-3B is nowhere near frontier level in logical reasoning, and even the full-precision model doesn't surpass Qwen3.5 q4.
