About AIME26 Results for This Model
I tried to reproduce the AIME26 results for this model and did not quite understand the specific setting of @Avg16. A standard reproduction with the eval script I have yields a large discrepancy (3/30, i.e. 10%, vs. the claimed 40%). My reproduction settings are:
- Max tokens: 16384
- Thinking: true
- Scoring: a response counts as correct as long as the right answer appears anywhere in it.
- Runtime: Transformers
- Platform: NVIDIA H20
I think I got something wrong. Could we maybe have eval scripts in future releases? Thank you very much, I love the series. We haven't seen many functional open-source tiny LLMs since Qwen3.5.
Hi @Zephinue, thank you for your interest in MiniCPM5-1B and for the kind words!
We believe the discrepancy is primarily due to the max_tokens setting. Here are the details of our evaluation setup:
- @Avg16 = averaging over 16 independent samples
We run each of the 30 AIME problems 16 times with temperature=0.9, top_p=0.95, then average the per-run accuracy across all 16 runs. This is a standard variance-reduction technique.
- max_tokens should be set to at least 65,536
Our evaluation uses max_tokens=65536. The actual generation length statistics on AIME 2026 are:
Mean: ~33,000 tokens per problem
Median (P50): ~32,000 tokens
P90: ~61,000 tokens
P95: ~65,000+ tokens
This is consistent with the broader community's practice: for AIME-level competition math, most reasoning models require a max_tokens of 65K–80K to perform well, and some models need 80K+ to fully express their reasoning chains. With your max_tokens=16,384, most responses will be truncated mid-reasoning before reaching the final \boxed{} answer, which explains the 3/30 result.
- Inference backend
We recommend using SGLang or vLLM for inference: they provide significantly faster generation (especially important given the long outputs of ~33K tokens per problem) and will most closely match our internal evaluation setup. HuggingFace Transformers should also produce correct results given the same generation parameters, but will be considerably slower.
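For illustration only (this is not the official evaluation script), here is a minimal sketch of how an Avg@16 score could be computed from saved responses. It assumes the final answer is taken from the last `\boxed{...}` in each response and that AIME answers are integers; the function names `last_boxed`, `is_correct`, and `avg_at_k` are hypothetical. A response that was truncated before reaching `\boxed{}` is scored as wrong.

```python
from statistics import mean
from typing import List, Optional


def last_boxed(response: str) -> Optional[str]:
    """Return the content of the last \\boxed{...} in a response, or None."""
    marker = r"\boxed{"
    start = response.rfind(marker)
    if start == -1:
        return None  # no boxed answer, e.g. the response was truncated
    depth = 1
    chars = []
    for ch in response[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars).strip()
        chars.append(ch)
    return None  # unbalanced braces: treat as truncated


def _normalize(answer: str) -> str:
    """Assumption: answers are integers, so '042' and '42' should match."""
    answer = answer.strip()
    return str(int(answer)) if answer.isdigit() else answer


def is_correct(response: str, gold: str) -> bool:
    """Score a single response against the reference answer."""
    predicted = last_boxed(response)
    return predicted is not None and _normalize(predicted) == _normalize(gold)


def avg_at_k(responses_per_run: List[List[str]], gold_answers: List[str]) -> float:
    """Average per-run accuracy over k independent runs.

    responses_per_run[r][p] is the response of run r to problem p.
    """
    per_run_accuracy = []
    for run in responses_per_run:
        correct = [is_correct(r, g) for r, g in zip(run, gold_answers)]
        per_run_accuracy.append(sum(correct) / len(gold_answers))
    return mean(per_run_accuracy)
```

With 30 problems and 16 runs, `responses_per_run` would be a 16 x 30 nested list. Scoring only the last boxed answer is stricter than counting a response as correct whenever the right number appears anywhere in the text, so the two scoring rules can give different numbers on the same outputs.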
Recommended reproduction settings:
Inference backend: SGLang or vLLM
max_tokens: 65536 (or higher)
temperature: 0.9
top_p: 0.95
Sampling: 16 independent runs, average accuracy
Thinking: enabled
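As a rough sketch of how these settings could be sent to an OpenAI-compatible server (as exposed by vLLM or SGLang), the snippet below requests 16 samples per problem in one call. Several things here are assumptions rather than part of the setup described above: the base URL `http://localhost:8000/v1`, the helper names `build_payload` and `sample`, the use of the `n` field to obtain the 16 samples, and a server context length large enough for 65,536 output tokens. How thinking mode is switched on is model- and server-specific and is not shown; the sketch relies on the server default.

```python
import json
import urllib.request
from typing import Dict, List, Tuple


def build_payload(question: str, model: str = "openbmb/MiniCPM5-1B", n: int = 16) -> Dict:
    """Build a chat-completions request using the recommended sampling settings."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": question}],
        "temperature": 0.9,
        "top_p": 0.95,
        "max_tokens": 65536,
        "n": n,  # n independent samples for the same prompt
    }


def sample(question: str, base_url: str = "http://localhost:8000/v1") -> List[Tuple[str, bool]]:
    """Return (response_text, was_truncated) for each sampled completion.

    base_url is an assumed local server address; adjust it to your deployment.
    """
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(build_payload(question)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    # finish_reason == "length" means the completion hit max_tokens.
    return [
        (choice["message"]["content"], choice["finish_reason"] == "length")
        for choice in body["choices"]
    ]
```

Tracking the `was_truncated` flag is a quick way to check whether the token budget is the cause of a low score: if a large share of completions end with `finish_reason == "length"`, the limit is still too small.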
Thanks again for trying MiniCPM5!