Add evaluation results
#4
by SaylorTwift HF Staff - opened
Add Evaluation Results for openpangu/openPangu-2.0-Flash
Summary
This PR adds evaluation results extracted from the openPangu-2.0-Flash model card benchmark table to the .eval_results/ directory, following the Hugging Face Hub evaluation-results specification.
The model card reports two variants (Thinking and Non-Thinking); each is included as a separate entry.
Benchmarks Added
| Benchmark | Thinking | Non-Thinking | Hub Dataset | Task ID |
|---|---|---|---|---|
| AIME 2026 | 93.3 | 86.5 | `MathArena/aime_2026` | `MathArena/aime_2026` |
| GPQA-Diamond | 83.7 | 79.8 | `Idavidrein/gpqa` | `diamond` |
| Claw-Eval | 57.7 | 58.2 | `claw-eval/Claw-Eval` | `general` |
| WildClawBench | 41.5 | 35.0 | `internlm/WildClawBench` | `overall` |
| SWE-bench Verified | 63.1 | 57.6 | `SWE-bench/SWE-bench_Verified` | `swe_bench_%_resolved` |
Notes:
- Claw-Eval: the model card reports a single score, which is mapped to the `general` task/split (the default Claw-Eval split).
- HMMT Feb 2025 was not added: the only registered Hub benchmark is `MathArena/hmmt_feb_2026` (Feb 2026), a different exam.
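As a rough illustration of how the table above expands into one entry per variant, here is a minimal Python sketch. The dictionary keys (`benchmark`, `dataset_id`, `task_id`, `variant`, `value`) are hypothetical names chosen for this example; they are not taken from the evaluation-results specification or from the YAML file added in this PR.

```python
# Rows copied from the "Benchmarks Added" table:
# (benchmark, Thinking score, Non-Thinking score, Hub dataset, task ID)
ROWS = [
    ("AIME 2026", 93.3, 86.5, "MathArena/aime_2026", "MathArena/aime_2026"),
    ("GPQA-Diamond", 83.7, 79.8, "Idavidrein/gpqa", "diamond"),
    ("Claw-Eval", 57.7, 58.2, "claw-eval/Claw-Eval", "general"),
    ("WildClawBench", 41.5, 35.0, "internlm/WildClawBench", "overall"),
    ("SWE-bench Verified", 63.1, 57.6, "SWE-bench/SWE-bench_Verified",
     "swe_bench_%_resolved"),
]


def build_entries(rows):
    """Expand each table row into two entries, one per model variant.

    The key names are illustrative assumptions, not the Hub schema.
    """
    entries = []
    for name, thinking, non_thinking, dataset_id, task_id in rows:
        for variant, value in (("Thinking", thinking),
                               ("Non-Thinking", non_thinking)):
            entries.append({
                "benchmark": name,
                "dataset_id": dataset_id,
                "task_id": task_id,
                "variant": variant,
                "value": value,
            })
    return entries


entries = build_entries(ROWS)
for entry in entries:
    print(entry["benchmark"], entry["variant"], entry["value"])
```

Five benchmarks with two variants each give ten entries, which is a quick count to compare against the committed file.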
Benchmarks Skipped (Not Registered on Hub)
| Benchmark | Thinking | Non-Thinking |
|---|---|---|
| CL-Bench | 20.4 | 15.5 |
| IFEval | 95.9 | 89.3 |
| IFBench | 79.6 | 54.4 |
| AgentIF | 44.9 | 43.9 |
| SysBench | 91.1 | 87.9 |
| Multichallenge | 68.4 | 51.9 |
| HMMT Feb 2025 | 91.5 | 67.1 |
| IMO-AnswerBench | 76.5 | 62.3 |
| BBEH | 62.5 | 51.5 |
| BrowseComp | 57.0 | - |
| SkillsBench | 42.6 | 40.0 |
| PinchBench | 85.6 | 82.5 |
| MCP-Atlas | 58.9 | 47.9 |
| TAU2-Bench | 88.0 | 74.0 |
| LiveCodeBench V6 | 85.1 | 50.9 |
| DeepCodeBench | 76.5 | 70.9 |
| FeatBench | 45.9 | 45.8 |
These can be added once the benchmark authors register their eval.yaml on the Hub.
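One way to revisit the skipped rows later is sketched below in Python, using an excerpt of the table above. The `registered` set is a hypothetical stand-in for whatever lookup confirms that a benchmark has been registered on the Hub; the sketch also shows the missing BrowseComp Non-Thinking score (`-`) being kept as `None` rather than coerced to a number.

```python
# Excerpt of the "Benchmarks Skipped" table, cells kept as the raw strings
# shown there: (benchmark, Thinking, Non-Thinking)
SKIPPED_EXCERPT = [
    ("IFEval", "95.9", "89.3"),
    ("HMMT Feb 2025", "91.5", "67.1"),
    ("BrowseComp", "57.0", "-"),
    ("LiveCodeBench V6", "85.1", "50.9"),
]


def parse_score(cell):
    """Convert a table cell to a float; a lone '-' means no reported score."""
    cell = cell.strip()
    return None if cell == "-" else float(cell)


def newly_eligible(rows, registered):
    """Return the skipped rows whose benchmark name is in `registered`.

    `registered` is a hypothetical set of benchmark names, supplied by the
    caller, that are now known to be registered on the Hub.
    """
    return [
        {
            "benchmark": name,
            "thinking": parse_score(thinking),
            "non_thinking": parse_score(non_thinking),
        }
        for name, thinking, non_thinking in rows
        if name in registered
    ]


# Example: pretend only BrowseComp has been registered since this PR.
print(newly_eligible(SKIPPED_EXCERPT, {"BrowseComp"}))
```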
Source
Files Added
.eval_results/openPangu-2.0-Flash.yaml
Verification
These results were extracted from the official benchmark table published in the openPangu-2.0-Flash model card. No verified token is provided as these were not run via HF Jobs with inspect-ai.