About AIME26 Results for This Model

#9
by Zephinue - opened

I tried to reproduce the AIME26 results for this model and did not quite understand the specific setting of @Avg16. A standard reproduction with my eval script gives a large discrepancy (3/30 correct vs. the claimed 40%). My reproduction settings are:

  • Max tokens: 16384
  • Thinking: true
  • Scoring: correct as long as the right answer appears in the whole response.
  • Runtime: Transformers
  • Platform: NVIDIA H20

I think I got something wrong. Could we maybe have eval scripts in future releases? Thank you very much, and I love the series. We don't see many functional open-source tiny LLMs after Qwen3.5.

OpenBMB org

Hi @Zephinue, thank you for your interest in MiniCPM5-1B and for the kind words!

We believe the discrepancy is primarily due to the max_tokens setting. Here are the details of our evaluation setup:

  1. @Avg16 = averaging over 16 independent samples

We run each of the 30 AIME problems 16 times with temperature=0.9, top_p=0.95, then average the per-run accuracy across all 16 runs. This is a standard variance-reduction technique.
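As a minimal sketch of the averaging described above (the function name `avg_at_k` and the boolean result grid are illustrative assumptions, not part of the actual evaluation script):

```python
from statistics import mean
from typing import List


def avg_at_k(correct: List[List[bool]]) -> float:
    """Average per-run accuracy over k independent runs.

    correct[r][p] is True when run r solved problem p.
    For Avg@16 on AIME, this would be a 16 x 30 grid.
    """
    # Accuracy of each run: fraction of problems solved in that run.
    per_run_accuracy = [sum(run) / len(run) for run in correct]
    # Avg@k: mean of the per-run accuracies.
    return mean(per_run_accuracy)


# Toy example: 2 runs over 2 problems, one problem solved in total.
print(avg_at_k([[True, False], [False, False]]))
```

Because every run covers the same 30 problems, this equals the total number of correct samples divided by 16 × 30.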

  2. max_tokens should be set to at least 65,536

Our evaluation uses max_tokens=65536. The actual generation length statistics on AIME 2026 are:

Mean: ~33,000 tokens per problem
Median (P50): ~32,000 tokens
P90: ~61,000 tokens
P95: ~65,000+ tokens

This is consistent with the broader community's practice: for AIME-level competition math, most reasoning models require a max_tokens of 65K–80K to perform well, and some models need 80K+ to fully express their reasoning chains. With your max_tokens=16,384, most responses will be truncated mid-reasoning before reaching the final \boxed{} answer, which explains the 3/30 result.
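To check whether truncation is what is happening, one option is to score only the final \boxed{} answer and count a response with no complete box as wrong. The following is a rough sketch under that assumption; the helper names `extract_last_boxed` and `score_response` are hypothetical, and the exact scoring rule of the official evaluation is not stated in this thread.

```python
from typing import Optional


def extract_last_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...}, or None if absent or unclosed."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None  # no boxed answer at all (e.g. cut off mid-reasoning)
    depth = 1
    chars = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars).strip()
        chars.append(ch)
    return None  # braces never closed: the response was likely truncated


def score_response(response: str, gold: int) -> bool:
    """Assumed rule: the final boxed answer must parse to the gold integer."""
    answer = extract_last_boxed(response)
    if answer is None:
        return False
    try:
        return int(answer) == gold
    except ValueError:
        return False


print(score_response("... so the answer is \\boxed{204}.", 204))
print(score_response("... let me reconsider the case where", 204))
```

Counting how many responses return `None` from `extract_last_boxed` gives a quick estimate of how often the token limit was hit before a final answer.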

  3. Inference backend

We recommend using SGLang or vLLM for inference. They provide significantly faster generation (especially important given the long outputs of ~33K tokens per problem) and will most closely match our internal evaluation setup. HuggingFace Transformers should also produce correct results given the same generation parameters, but will be considerably slower.

Recommended reproduction settings:

Inference backend: SGLang or vLLM
max_tokens: 65536 (or higher)
temperature: 0.9
top_p: 0.95
Sampling: 16 independent runs, average accuracy
Thinking: enabled
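
For vLLM users, the settings above could be expressed roughly as follows. This is only a sketch: it assumes vLLM's `SamplingParams` class, and it assumes that drawing `n=16` samples per prompt is an acceptable stand-in for 16 independent runs. The names `REPRO_SETTINGS` and `build_sampling_params` are illustrative.

```python
# Values copied from the recommended settings above.
REPRO_SETTINGS = {
    "temperature": 0.9,
    "top_p": 0.95,
    "max_tokens": 65536,
    "n": 16,  # assumption: 16 samples per prompt in place of 16 separate runs
}


def build_sampling_params():
    """Build vLLM sampling parameters from the settings above."""
    # Imported lazily so the settings can be inspected without vLLM installed.
    from vllm import SamplingParams

    return SamplingParams(**REPRO_SETTINGS)
```

The engine's context length must also be large enough to hold the prompt plus 65,536 generated tokens; otherwise outputs may still be cut short.
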
Thanks again for trying MiniCPM5!
