🚩 Report: Spam

#1
by manaf1234 - opened

The authors of this repository have uploaded unverified, hardcoded metadata claiming a 25.79% score on the GPQA Diamond benchmark.

GPQA Diamond is a graduate-level 4-option multiple-choice evaluation where 25.00% is the exact mathematical baseline for pure random guessing. A 350M-parameter model fundamentally lacks the architectural capacity to ground true reasoning at this level, and a score of 25.79% sits well within the range expected from random guessing.

More importantly, the authors have provided zero evaluation artifacts, scripts, or reproducibility trails—no results.json config and no line-by-line generation traces (samples.jsonl). Pasting arbitrary benchmark metrics into a markdown file with zero verification data qualifies as misleading spam that misrepresents model capabilities to the community. I request that the unverified metrics either be removed or backed up with legitimate log files.

I agree with the baseline point: GPQA Diamond is a 4-option benchmark, so 25% is the random baseline, and a score near that should be interpreted carefully.
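To put a rough number on "near the baseline", here is a minimal sketch of a normal-approximation check of the reported 25.79% against the 25% chance level. It assumes a single evaluation run over 198 questions (the commonly cited size of GPQA Diamond); the thread itself does not state the question count or the number of runs, and the function name `z_score_vs_chance` is only illustrative.

```python
import math


def z_score_vs_chance(accuracy: float, n_questions: int, chance: float = 0.25) -> float:
    """Normal-approximation z-score of an observed accuracy against a chance baseline."""
    # Standard error of the accuracy if every answer were a uniform random guess.
    standard_error = math.sqrt(chance * (1.0 - chance) / n_questions)
    return (accuracy - chance) / standard_error


# Assumption: 198 questions, one run. Neither value is confirmed in this thread.
N_QUESTIONS = 198
REPORTED_ACCURACY = 0.2579  # the 25.79% figure from the model card

z = z_score_vs_chance(REPORTED_ACCURACY, N_QUESTIONS)

# One-sided p-value: probability of scoring at least this high by guessing,
# using the standard normal survival function.
p_one_sided = 0.5 * math.erfc(z / math.sqrt(2.0))

print(f"z = {z:.2f}, one-sided p = {p_one_sided:.2f}")
```

Under those assumptions the score lands well under one standard error above chance, so on its own it cannot be distinguished from guessing. A different question count or an average over several runs would change the exact numbers, but not by enough to alter that reading unless the sample were far larger.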

The part I would push back on is the claim that a 350M model “fundamentally lacks” any capacity for reasoning. That is too absolute. A 350M model is obviously small for GPQA-level science questions, and this score should not be presented as strong graduate-level reasoning, but model size alone does not mathematically prove zero reasoning ability.

The fair criticism is that the original metric needed supporting artifacts. I’ve now added the eval report so the result can be inspected.
The file I uploaded is https://huggingface.co/Muse-research/Muse-2-350M/blob/main/.eval_results/muse2_gpqa_20260609-144937.json
