Instructions for using openbmb/MiniCPM5-1B with libraries and local apps.

Libraries

How to use openbmb/MiniCPM5-1B with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="openbmb/MiniCPM5-1B")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("openbmb/MiniCPM5-1B")
model = AutoModelForCausalLM.from_pretrained("openbmb/MiniCPM5-1B")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use openbmb/MiniCPM5-1B with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "openbmb/MiniCPM5-1B"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "openbmb/MiniCPM5-1B",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/openbmb/MiniCPM5-1B

SGLang

How to use openbmb/MiniCPM5-1B with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "openbmb/MiniCPM5-1B" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "openbmb/MiniCPM5-1B",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "openbmb/MiniCPM5-1B" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "openbmb/MiniCPM5-1B",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use openbmb/MiniCPM5-1B with Docker Model Runner:
```
docker model run hf.co/openbmb/MiniCPM5-1B
```

About AIME26 Results for This Model

by Zephinue - opened 4 days ago

Zephinue

4 days ago

I tried to reproduce the AIME26 results for this model and did not quite understand the specific setting of @Avg16. A standard reproduction with my own eval script yields a large discrepancy (3/30 correct, i.e. 10%, versus the claimed 40%). My reproduction settings are:

Max tokens: 16384
Thinking: true
Scoring: correct as long as the right answer appears in the whole response.
Runtime: Transformers
Platform: NVIDIA H20

I think I got something wrong. Could we maybe have eval scripts in future releases? Thank you very much, and I love the series. We haven't seen many functional open-source tiny LLMs since Qwen3.5.

beyoung

OpenBMB org 4 days ago

Hi @Zephinue, thank you for your interest in MiniCPM5-1B and for the kind words!

We believe the discrepancy is primarily due to the max_tokens setting. Here are the details of our evaluation setup:

@Avg16 = averaging over 16 independent samples

We run each of the 30 AIME problems 16 times with temperature=0.9, top_p=0.95, then average the per-run accuracy across all 16 runs. This is a standard variance-reduction technique.
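
As a rough illustration of the averaging described above (not the actual evaluation script, which is not shown in this thread), the metric can be sketched as the mean of per-run accuracies. The function name `avg_at_k` and the toy data are hypothetical.

```python
from statistics import mean


def avg_at_k(correct):
    """Mean of per-run accuracies.

    correct[r][p] is True when run r solved problem p.
    For Avg@16 on AIME this would be a 16 x 30 grid.
    """
    if not correct:
        raise ValueError("need at least one run")
    per_run = [sum(run) / len(run) for run in correct]
    return mean(per_run)


# Toy data (hypothetical): 2 runs over 4 problems instead of 16 over 30.
runs = [
    [True, False, False, True],  # run 0: 2/4 correct
    [True, True, False, True],   # run 1: 3/4 correct
]
print(f"Avg@{len(runs)} = {avg_at_k(runs):.3f}")  # Avg@2 = 0.625
```

Because every run covers the same 30 problems, this equals the fraction of correct samples over all 16 × 30 generations.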

max_tokens should be set to at least 65,536

Our evaluation uses max_tokens=65536. The actual generation length statistics on AIME 2026 are:

Mean: ~33,000 tokens per problem
Median (P50): ~32,000 tokens
P90: ~61,000 tokens
P95: ~65,000+ tokens

This is consistent with the broader community's practice: for AIME-level competition math, most reasoning models require a max_tokens of 65K–80K to perform well, and some need 80K+ to fully express their reasoning chains. With max_tokens=16,384, most responses will be truncated mid-reasoning before reaching the final \boxed{} answer, which explains the 3/30 result.
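
To make the truncation effect concrete, here is a minimal scoring sketch that only accepts an answer found in the final \boxed{} and treats a response cut off by the token limit as wrong. It is an assumption about how one might score, not the evaluation code used for the reported numbers. The helper names are hypothetical, and `finish_reason == "length"` assumes an OpenAI-compatible server response.

```python
from typing import Optional


def extract_last_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in text, or None."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1  # we are inside the opening brace of \boxed{
    out = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(out).strip()
        out.append(ch)
    return None  # braces never closed, e.g. output cut off mid-answer


def is_correct(response: str, reference: str, finish_reason: str = "stop") -> bool:
    """Score one response; truncated generations count as incorrect."""
    if finish_reason == "length":  # generation hit max_tokens
        return False
    answer = extract_last_boxed(response)
    if answer is None:
        return False
    ref = reference.strip()
    if answer.isdigit() and ref.isdigit():
        return int(answer) == int(ref)  # AIME answers are integers
    return answer == ref
```

Compared with "correct if the right answer appears anywhere in the response", this avoids crediting a number that merely shows up in intermediate reasoning.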

Inference backend

We recommend using SGLang or vLLM for inference. They generate significantly faster (especially important given the long outputs of ~33K tokens per problem) and will most closely match our internal evaluation setup. HuggingFace Transformers should also produce correct results given the same generation parameters, but will be considerably slower.

Recommended reproduction settings:

Inference backend: SGLang or vLLM
max_tokens: 65536 (or higher)
temperature: 0.9
top_p: 0.95
Sampling: 16 independent runs, average accuracy
Thinking: enabled
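
For anyone scripting this, the settings above could be sent to an OpenAI-compatible endpoint roughly as follows. This is a sketch under assumptions: the base URL matches the default `vllm serve` port, the server honors the standard `n` parameter (otherwise send 16 separate requests), and its context window is large enough for 65,536 generated tokens. How thinking mode is toggled is model-specific and is not shown here. `build_payload` and `sample` are hypothetical helper names.

```python
import json
import urllib.request


def build_payload(question, model="openbmb/MiniCPM5-1B", n=16):
    """Chat-completions request body using the recommended sampling settings."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": question}],
        "max_tokens": 65536,
        "temperature": 0.9,
        "top_p": 0.95,
        "n": n,  # 16 independent samples per problem
    }


def sample(question, base_url="http://localhost:8000/v1"):
    """Return (text, finish_reason) pairs for one problem. Needs a running server."""
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(build_payload(question)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return [(c["message"]["content"], c["finish_reason"]) for c in body["choices"]]
```

Keeping `finish_reason` alongside each sample makes it easy to count how many generations were cut off by the token limit.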

Thanks again for trying MiniCPM5!
