Instructions to use WeiboAI/VibeThinker-3B with libraries, inference providers, notebooks, and local apps.

Libraries

How to use WeiboAI/VibeThinker-3B with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="WeiboAI/VibeThinker-3B")
messages = [
    {"role": "user", "content": "Who are you?"},
]
# Print the result so the generated reply is visible when run as a script
print(pipe(messages))

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("WeiboAI/VibeThinker-3B")
model = AutoModelForCausalLM.from_pretrained("WeiboAI/VibeThinker-3B")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use WeiboAI/VibeThinker-3B with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "WeiboAI/VibeThinker-3B"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "WeiboAI/VibeThinker-3B",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
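
The same request can also be sent from Python. The following is a minimal sketch using only the standard library, assuming the vLLM server started above is listening on `localhost:8000`. The helper names `build_chat_request` and `chat` are hypothetical and chosen for illustration; they are not part of vLLM or of the model release.

```python
import json
import urllib.request


def build_chat_request(prompt, model="WeiboAI/VibeThinker-3B"):
    """Build the JSON body for an OpenAI-compatible chat completion call."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }


def chat(prompt, base_url="http://localhost:8000/v1", timeout=300):
    """POST the prompt to a running server and return the reply text.

    base_url is an assumption: it must match the host and port the
    server was actually started with.
    """
    body = json.dumps(build_chat_request(prompt)).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        payload = json.load(response)
    # OpenAI-compatible servers return the reply under choices[0].message.content.
    return payload["choices"][0]["message"]["content"]
```

With the server running, `print(chat("What is the capital of France?"))` should print the model's reply. Passing `base_url="http://localhost:30000/v1"` would target a server on port 30000 instead, such as the SGLang server in this document.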

Use Docker

docker model run hf.co/WeiboAI/VibeThinker-3B

SGLang

How to use WeiboAI/VibeThinker-3B with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "WeiboAI/VibeThinker-3B" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "WeiboAI/VibeThinker-3B",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "WeiboAI/VibeThinker-3B" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "WeiboAI/VibeThinker-3B",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use WeiboAI/VibeThinker-3B with Docker Model Runner:
```
docker model run hf.co/WeiboAI/VibeThinker-3B
```

thank u

#24

by StarpowerTechnology - opened 6 days ago

Discussion

StarpowerTechnology

6 days ago

I knew the day would come when someone would prove that all it takes is the right build. I wanna cry .. this is beautiful, bro.

junlinzhang

WeiboAI org 6 days ago

Appreciate it! Let’s keep pushing to make large models cheaper and accessible to everyone.

StarpowerTechnology

5 days ago

I have been doing some research, bro .. I am convinced that a model's training token count doesn't matter as long as it covers most communication .. I know the Qwen3 models were trained on about 36 trillion tokens, yet they range from 0.6B to 235B parameters .. which means a model can be smaller overall by using a smaller token count .. the Chinchilla model was supposedly 70B parameters with 1.4T tokens, making for a better ratio of connectivity to token count.

With that being said, I think if you try this on a model architecture that has more connectivity but fewer tokens, you can get better performance .. this is only my own speculation, though. I haven't proven it in my own experiments.
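
As a rough illustration of the ratio being discussed, the sketch below computes training tokens per parameter. The figures are assumptions taken from publicly reported numbers (about 1.4T tokens for the 70B Chinchilla model, about 36T tokens for Qwen3 pretraining, and 0.6B parameters for the smallest Qwen3 model), and `tokens_per_parameter` is a hypothetical helper, not anything from the VibeThinker release.

```python
def tokens_per_parameter(train_tokens, parameters):
    """Return how many training tokens a model saw per parameter."""
    return train_tokens / parameters


# Assumed figures, written as integers to keep the arithmetic exact.
chinchilla_ratio = tokens_per_parameter(1_400_000_000_000, 70_000_000_000)
qwen3_small_ratio = tokens_per_parameter(36_000_000_000_000, 600_000_000)

print(f"Chinchilla 70B: {chinchilla_ratio:,.0f} tokens per parameter")   # 20
print(f"Qwen3 0.6B: {qwen3_small_ratio:,.0f} tokens per parameter")      # 60,000
```

Under these assumptions, the small Qwen3 model sees about 3,000 times more tokens per parameter than Chinchilla did, so the two models sit at very different points on the token-to-parameter scale.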

urroxyz

3 days ago (edited 3 days ago)

Yes, this is a great project.

> i am convinced that a models trained token count doesnt matter as long as it covers most communication

Actually, the literature consistently reports that a higher count of diverse tokens often leads to a better model. So a model trained on more data than another model of the same size may perform better. But that holds only if both data mixtures are of high quality. Otherwise, a smaller, cleaner dataset could outperform a larger, noisier one.

But the important part is that being smaller (fewer params) doesn't always mean performing worse, and that's what VibeThinker helps prove.

You may be interested in this blog post on how the tiny Falcon reasoning models were made "mighty": https://huggingface.co/spaces/tiiuae/tiny-h1-blogpost

The key points over the last few years of research:

A little good data outperforms lots of bad data, but only cut the dataset down if you're actually filtering out the bad stuff.
The principal training mode is not SFT; CPT plus RL are extremely important for real learning.
Small models are awesome when they stick to a single task, or a small selection of tasks, that they can learn verifiably.
