Instructions to use WeiboAI/VibeThinker-3B with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use WeiboAI/VibeThinker-3B with Transformers:
    # Use a pipeline as a high-level helper
    from transformers import pipeline

    pipe = pipeline("text-generation", model="WeiboAI/VibeThinker-3B")
    messages = [
        {"role": "user", "content": "Who are you?"},
    ]
    pipe(messages)

    # Load model directly
    from transformers import AutoTokenizer, AutoModelForCausalLM

    tokenizer = AutoTokenizer.from_pretrained("WeiboAI/VibeThinker-3B")
    model = AutoModelForCausalLM.from_pretrained("WeiboAI/VibeThinker-3B")
    messages = [
        {"role": "user", "content": "Who are you?"},
    ]
    inputs = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        tokenize=True,
        return_dict=True,
        return_tensors="pt",
    ).to(model.device)

    outputs = model.generate(**inputs, max_new_tokens=40)
    print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

- Inference
- Local Apps
- vLLM
How to use WeiboAI/VibeThinker-3B with vLLM:
Install from pip and serve model
    # Install vLLM from pip:
    pip install vllm

    # Start the vLLM server:
    vllm serve "WeiboAI/VibeThinker-3B"

    # Call the server using curl (OpenAI-compatible API):
    curl -X POST "http://localhost:8000/v1/chat/completions" \
        -H "Content-Type: application/json" \
        --data '{
            "model": "WeiboAI/VibeThinker-3B",
            "messages": [
                {
                    "role": "user",
                    "content": "What is the capital of France?"
                }
            ]
        }'

Use Docker
docker model run hf.co/WeiboAI/VibeThinker-3B
- SGLang
How to use WeiboAI/VibeThinker-3B with SGLang:
Install from pip and serve model
    # Install SGLang from pip:
    pip install sglang

    # Start the SGLang server:
    python3 -m sglang.launch_server \
        --model-path "WeiboAI/VibeThinker-3B" \
        --host 0.0.0.0 \
        --port 30000

    # Call the server using curl (OpenAI-compatible API):
    curl -X POST "http://localhost:30000/v1/chat/completions" \
        -H "Content-Type: application/json" \
        --data '{
            "model": "WeiboAI/VibeThinker-3B",
            "messages": [
                {
                    "role": "user",
                    "content": "What is the capital of France?"
                }
            ]
        }'

Use Docker images

    docker run --gpus all \
        --shm-size 32g \
        -p 30000:30000 \
        -v ~/.cache/huggingface:/root/.cache/huggingface \
        --env "HF_TOKEN=<secret>" \
        --ipc=host \
        lmsysorg/sglang:latest \
        python3 -m sglang.launch_server \
            --model-path "WeiboAI/VibeThinker-3B" \
            --host 0.0.0.0 \
            --port 30000

    # Call the server using curl (OpenAI-compatible API):
    curl -X POST "http://localhost:30000/v1/chat/completions" \
        -H "Content-Type: application/json" \
        --data '{
            "model": "WeiboAI/VibeThinker-3B",
            "messages": [
                {
                    "role": "user",
                    "content": "What is the capital of France?"
                }
            ]
        }'

- Docker Model Runner
How to use WeiboAI/VibeThinker-3B with Docker Model Runner:
docker model run hf.co/WeiboAI/VibeThinker-3B
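The curl calls shown for the vLLM and SGLang servers can also be made from Python. The following is a minimal sketch using only the standard library; the helper names `build_chat_request` and `ask` are hypothetical, and it assumes the server returns the usual OpenAI-style chat completion body (`choices[0].message.content`).

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build a POST request for an OpenAI-compatible /v1/chat/completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(base_url: str, model: str, prompt: str, timeout: float = 60.0) -> str:
    """Send the prompt to a running server and return the assistant's reply text.

    Assumes an OpenAI-style response body; requires the server to be up.
    """
    request = build_chat_request(base_url, model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With a server running, `ask("http://localhost:8000", "WeiboAI/VibeThinker-3B", "What is the capital of France?")` would target the vLLM port used above; swap in `http://localhost:30000` for the SGLang examples.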
thank u
i knew this day would come that someone would prove all it takes is the right build i wana cry .. this shii beautiful bro
Appreciate it! Let’s keep pushing to make large models cheaper and accessible to everyone.
i have been doing some research bro .. i am convinced that a models trained token count doesnt matter as long as it covers most communication .. i know the qwen3 models were trained on about 36 trillion tokens but range from 0.6b to 235b parameters .. which means a model can be a smaller overall size by using a smaller token count .. the chinchilla model was 70b parameters w 1.4t tokens, making a better ratio for connectivity/token count
with that being said i think if u try this on a model architecture that has more connectivity but with less tokens u can get a better performance .. this is only my own speculation though. havent proven this to be the case in my own experiments
Yes, this is a great project.
i am convinced that a models trained token count doesnt matter as long as it covers most communication
Actually, the literature consistently reports that training on more, and more diverse, tokens often leads to a better model. So a smaller model trained on more data than another model of the same size may perform better. But that only holds if both data mixtures are of high quality. Otherwise, a smaller clean dataset can outperform a larger noisy one.
But the important part is that smaller itself (less params) doesn't always mean worse performing, and that's what VibeThinker helps prove.
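To make the data-versus-size ratio concrete, here is a small sketch that computes training tokens per parameter. The figures are assumptions taken from commonly reported numbers, not from this thread: roughly 1.4T tokens for the 70B Chinchilla model, and roughly 36T tokens for Qwen3, applied here to its 0.6B size on the assumption that the full corpus was used for it.

```python
def tokens_per_param(train_tokens: float, params: float) -> float:
    """Return how many training tokens were seen per model parameter."""
    return train_tokens / params


# (training tokens, parameter count); assumed, commonly reported figures
MODELS = {
    "Chinchilla-70B": (1.4e12, 70e9),
    "Qwen3-0.6B": (36e12, 0.6e9),
}

for name, (tokens, params) in MODELS.items():
    ratio = tokens_per_param(tokens, params)
    print(f"{name}: {ratio:,.0f} tokens per parameter")
```

Under these assumptions Chinchilla lands at about 20 tokens per parameter, while the small Qwen3 model sits several orders of magnitude higher, which is the "small model, lots of data" regime described above.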
You may be interested in this blog post on how the tiny Falcon reasoning models were made "mighty": https://huggingface.co/spaces/tiiuae/tiny-h1-blogpost
The key points over the last few years of research:
- A little good data outperforms lots of bad data, but only shrink the dataset if you're actually filtering out the bad stuff.
- The principal training mode is not SFT; CPT plus RL are extremely important for real learning.
- Small models are awesome when they stick to a single task, or a small selection of tasks, that they can learn verifiably.
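As an illustration of what "verifiable" can mean in the last point, here is a minimal sketch of a binary reward for numeric answers, the kind of check an RL loop could score against. This is a hypothetical example, not VibeThinker's or Falcon's actual reward; `extract_final_number` and `verifiable_reward` are made-up names, and it assumes the final number in the output is the model's answer.

```python
import re
from typing import Optional


def extract_final_number(text: str) -> Optional[str]:
    """Return the last integer or decimal number in the text, or None if absent."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", text)
    return matches[-1] if matches else None


def verifiable_reward(model_output: str, reference: str) -> float:
    """Return 1.0 if the final number in the output equals the reference, else 0.0."""
    answer = extract_final_number(model_output)
    if answer is None:
        return 0.0
    return 1.0 if float(answer) == float(reference) else 0.0
```

The point of the design is that the score comes from a mechanical check rather than a learned judge, so a small model gets an unambiguous signal on a narrow task.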