Add evaluation results

This PR adds evaluation results extracted from the openPangu-2.0-Flash model card benchmark table to the .eval_results/ directory, following the Hugging Face Hub evaluation-results specification.

The model card reports two variants (Thinking and Non-Thinking); each is included as a separate entry.

Benchmarks Added

| Benchmark | Thinking | Non-Thinking | Hub Dataset | Task ID |
| --- | --- | --- | --- | --- |
| AIME 2026 | 93.3 | 86.5 | `MathArena/aime_2026` | `MathArena/aime_2026` |
| GPQA-Diamond | 83.7 | 79.8 | `Idavidrein/gpqa` | `diamond` |
| Claw-Eval | 57.7 | 58.2 | `claw-eval/Claw-Eval` | `general` |
| WildClawBench | 41.5 | 35.0 | `internlm/WildClawBench` | `overall` |
| SWE-bench Verified | 63.1 | 57.6 | `SWE-bench/SWE-bench_Verified` | `swe_bench_%_resolved` |

Notes:

Claw-Eval: the model card reports a single score per variant, which is mapped to the `general` task/split (the default Claw-Eval split).
HMMT Feb 2025 was not added: the only registered Hub benchmark is `MathArena/hmmt_feb_2026` (Feb 2026), which is a different exam.
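
As a rough illustration of how the table above expands into one entry per variant, here is a minimal Python sketch. The key names (`dataset.id`, `dataset.task_id`, `value`, `source`, `notes`) and the `build_entries` helper are assumptions made for illustration; they have not been checked against the Hub specification or against the YAML file in this PR.

```python
import json

# Rows copied from the "Benchmarks Added" table:
# (benchmark, Thinking score, Non-Thinking score, Hub dataset, task ID)
ROWS = [
    ("AIME 2026", 93.3, 86.5, "MathArena/aime_2026", "MathArena/aime_2026"),
    ("GPQA-Diamond", 83.7, 79.8, "Idavidrein/gpqa", "diamond"),
    ("Claw-Eval", 57.7, 58.2, "claw-eval/Claw-Eval", "general"),
    ("WildClawBench", 41.5, 35.0, "internlm/WildClawBench", "overall"),
    ("SWE-bench Verified", 63.1, 57.6, "SWE-bench/SWE-bench_Verified", "swe_bench_%_resolved"),
]

SOURCE_URL = "https://huggingface.co/openpangu/openPangu-2.0-Flash"


def build_entries(rows):
    """Return one entry per (benchmark, variant) pair.

    The dictionary layout is hypothetical, not the confirmed Hub schema.
    """
    entries = []
    for name, thinking, non_thinking, dataset_id, task_id in rows:
        for variant, value in (("Thinking", thinking), ("Non-Thinking", non_thinking)):
            entries.append(
                {
                    "dataset": {"id": dataset_id, "task_id": task_id},
                    "value": value,
                    "source": {"url": SOURCE_URL, "name": "Model card"},
                    "notes": f"{name}, {variant} variant",
                }
            )
    return entries


entries = build_entries(ROWS)
# Show the two AIME 2026 entries as a quick visual check.
print(json.dumps(entries[:2], indent=2))
```

Five benchmarks with two variants each give ten entries, which is a quick count to compare against the committed file.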

Benchmarks Skipped (Not Registered on Hub)

| Benchmark | Thinking | Non-Thinking |
| --- | --- | --- |
| CL-Bench | 20.4 | 15.5 |
| IFEval | 95.9 | 89.3 |
| IFBench | 79.6 | 54.4 |
| AgentIF | 44.9 | 43.9 |
| SysBench | 91.1 | 87.9 |
| Multichallenge | 68.4 | 51.9 |
| HMMT Feb 2025 | 91.5 | 67.1 |
| IMO-AnswerBench | 76.5 | 62.3 |
| BBEH | 62.5 | 51.5 |
| BrowseComp | 57.0 | - |
| SkillsBench | 42.6 | 40.0 |
| PinchBench | 85.6 | 82.5 |
| MCP-Atlas | 58.9 | 47.9 |
| TAU2-Bench | 88.0 | 74.0 |
| LiveCodeBench V6 | 85.1 | 50.9 |
| DeepCodeBench | 76.5 | 70.9 |
| FeatBench | 45.9 | 45.8 |

These can be added once the benchmark authors register their eval.yaml on the Hub.
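
The add-or-skip rule used in this PR can be sketched as a simple membership check. In the sketch below, `REGISTERED` is a hard-coded set of the dataset IDs that this description names as registered; it is an assumption standing in for a real Hub lookup, and `split_by_registration` is a hypothetical helper.

```python
# Dataset IDs named in this PR as registered on the Hub.
# Hard-coded assumption: no Hub lookup is performed here.
REGISTERED = {
    "MathArena/aime_2026",
    "Idavidrein/gpqa",
    "claw-eval/Claw-Eval",
    "internlm/WildClawBench",
    "SWE-bench/SWE-bench_Verified",
    "MathArena/hmmt_feb_2026",
}

# (benchmark, matching Hub dataset ID, or None when no registered dataset matches)
CANDIDATES = [
    ("AIME 2026", "MathArena/aime_2026"),
    ("SWE-bench Verified", "SWE-bench/SWE-bench_Verified"),
    ("IFEval", None),
    ("HMMT Feb 2025", None),  # hmmt_feb_2026 is a different exam, so no match
]


def split_by_registration(candidates, registered):
    """Partition benchmarks into (added, skipped) by Hub registration."""
    added, skipped = [], []
    for benchmark, dataset_id in candidates:
        if dataset_id is not None and dataset_id in registered:
            added.append(benchmark)
        else:
            skipped.append(benchmark)
    return added, skipped


added, skipped = split_by_registration(CANDIDATES, REGISTERED)
print("added:", added)
print("skipped:", skipped)
```

Only a few benchmarks are listed here to keep the sketch short; the full lists are the two tables in this description.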

Source

Model card: https://huggingface.co/openpangu/openPangu-2.0-Flash

Files Added

.eval_results/openPangu-2.0-Flash.yaml

Verification

These results were extracted from the official benchmark table published in the openPangu-2.0-Flash model card. No verified token is provided because the evaluations were not run via HF Jobs with inspect-ai.

Add Thinking/Non-Thinking variant notes (b314055b)
