Add evaluation results

#4, opened by SaylorTwift (HF Staff)

Add Evaluation Results for openpangu/openPangu-2.0-Flash

Summary

This PR adds evaluation results extracted from the openPangu-2.0-Flash model card benchmark table to the .eval_results/ directory, following the Hugging Face Hub evaluation-results specification.

The model card reports two variants (Thinking and Non-Thinking); each is included as a separate entry.

Benchmarks Added

| Benchmark | Thinking | Non-Thinking | Hub Dataset | Task ID |
|---|---|---|---|---|
| AIME 2026 | 93.3 | 86.5 | MathArena/aime_2026 | MathArena/aime_2026 |
| GPQA-Diamond | 83.7 | 79.8 | Idavidrein/gpqa | diamond |
| Claw-Eval | 57.7 | 58.2 | claw-eval/Claw-Eval | general |
| WildClawBench | 41.5 | 35.0 | internlm/WildClawBench | overall |
| SWE-bench Verified | 63.1 | 57.6 | SWE-bench/SWE-bench_Verified | swe_bench_%_resolved |

Notes:

  • Claw-Eval: the model card reports a single score per variant, with no per-split breakdown; it is mapped to the general task/split (the default Claw-Eval split).
  • HMMT Feb 2025 was not added: the only registered Hub benchmark is MathArena/hmmt_feb_2026 (Feb 2026), which is a different exam.
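
As a rough illustration of how the table above could expand into one entry per variant, here is a minimal Python sketch. The field names (`dataset.id`, `dataset.task_id`, `value`, `notes`) are assumptions about the evaluation-results format, and tagging the variant in `notes` is a hypothetical choice; the file in this PR may be laid out differently.

```python
import json

# Rows copied from the "Benchmarks Added" table:
# (benchmark, thinking score, non-thinking score, Hub dataset, task ID)
ROWS = [
    ("AIME 2026", 93.3, 86.5, "MathArena/aime_2026", "MathArena/aime_2026"),
    ("GPQA-Diamond", 83.7, 79.8, "Idavidrein/gpqa", "diamond"),
    ("Claw-Eval", 57.7, 58.2, "claw-eval/Claw-Eval", "general"),
    ("WildClawBench", 41.5, 35.0, "internlm/WildClawBench", "overall"),
    ("SWE-bench Verified", 63.1, 57.6, "SWE-bench/SWE-bench_Verified",
     "swe_bench_%_resolved"),
]


def build_entries(rows):
    """Expand each table row into two entries, one per model variant."""
    entries = []
    for _name, thinking, non_thinking, dataset_id, task_id in rows:
        for variant, value in (("Thinking", thinking),
                               ("Non-Thinking", non_thinking)):
            entries.append({
                # Assumed field names; check them against the Hub spec.
                "dataset": {"id": dataset_id, "task_id": task_id},
                "value": value,
                # Hypothetical: the variant is recorded as a free-text note.
                "notes": variant,
            })
    return entries


entries = build_entries(ROWS)
# Printed as JSON for quick inspection; the file in this PR is YAML.
print(json.dumps(entries[:2], indent=2))
```

Five benchmarks with two variants each give ten entries in total.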

Benchmarks Skipped (Not Registered on Hub)

| Benchmark | Thinking | Non-Thinking |
|---|---|---|
| CL-Bench | 20.4 | 15.5 |
| IFEval | 95.9 | 89.3 |
| IFBench | 79.6 | 54.4 |
| AgentIF | 44.9 | 43.9 |
| SysBench | 91.1 | 87.9 |
| Multichallenge | 68.4 | 51.9 |
| HMMT Feb 2025 | 91.5 | 67.1 |
| IMO-AnswerBench | 76.5 | 62.3 |
| BBEH | 62.5 | 51.5 |
| BrowseComp | 57.0 | - |
| SkillsBench | 42.6 | 40.0 |
| PinchBench | 85.6 | 82.5 |
| MCP-Atlas | 58.9 | 47.9 |
| TAU2-Bench | 88.0 | 74.0 |
| LiveCodeBench V6 | 85.1 | 50.9 |
| DeepCodeBench | 76.5 | 70.9 |
| FeatBench | 45.9 | 45.8 |

These can be added once the benchmark authors register their eval.yaml on the Hub.
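
The added/skipped split can be pictured as a lookup against the set of benchmarks registered on the Hub. The sketch below is illustrative only: `REGISTERED` is a hand-written, hypothetical mapping built from the tables in this PR, not the result of querying the Hub, and `reported` holds just a few sample rows.

```python
# Hypothetical registry: benchmark name -> (Hub dataset, task ID),
# transcribed from the "Benchmarks Added" table.
REGISTERED = {
    "AIME 2026": ("MathArena/aime_2026", "MathArena/aime_2026"),
    "GPQA-Diamond": ("Idavidrein/gpqa", "diamond"),
    "Claw-Eval": ("claw-eval/Claw-Eval", "general"),
    "WildClawBench": ("internlm/WildClawBench", "overall"),
    "SWE-bench Verified": ("SWE-bench/SWE-bench_Verified",
                           "swe_bench_%_resolved"),
}

# A few model-card rows: name -> (Thinking, Non-Thinking).
# None stands for the "-" shown for BrowseComp (no score reported).
reported = {
    "AIME 2026": (93.3, 86.5),
    "HMMT Feb 2025": (91.5, 67.1),
    "IFEval": (95.9, 89.3),
    "BrowseComp": (57.0, None),
}


def partition(scores, registered):
    """Split reported scores into those with a registered Hub benchmark
    and those without one."""
    added = {n: s for n, s in scores.items() if n in registered}
    skipped = {n: s for n, s in scores.items() if n not in registered}
    return added, skipped


added, skipped = partition(reported, REGISTERED)
print("added:", sorted(added))
print("skipped:", sorted(skipped))
```

Matching on the exact benchmark name keeps HMMT Feb 2025 out of the added set even though a Feb 2026 edition is registered, which mirrors the note above about the two being different exams.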

Source

Files Added

  • .eval_results/openPangu-2.0-Flash.yaml

Verification

These results were extracted from the official benchmark table published in the openPangu-2.0-Flash model card. No verified token is provided as these were not run via HF Jobs with inspect-ai.
