arxiv:2412.17758

In Case You Missed It: ARC 'Challenge' Is Not That Challenging

Published on Dec 23

Submitted by Borchmann on Dec 25

Authors:

Łukasz Borchmann

Abstract

ARC Challenge appears more difficult than ARC Easy for modern LLMs primarily because of an evaluation setup that prevents direct comparison of answer choices, not because of inherent complexity. Although some researchers have quietly shifted to a more appropriate scheme over the last year, the implications of this change have yet to be widely acknowledged. We highlight this overlooked shift, show how similar evaluation practices falsely imply reasoning deficits in other benchmarks, and demonstrate that fairer methods dramatically reduce performance gaps (e.g., on SIQA) and even yield superhuman results (on OpenBookQA). In doing so, we reveal how evaluation shapes perceived difficulty and offer guidelines to ensure that multiple-choice evaluations accurately reflect actual model capabilities.
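To make the contrast in the abstract concrete, here is a minimal sketch of the two ways a multiple-choice item can be scored: each answer judged in isolation, or all answers shown together so the model can compare them. The abstract does not give the exact prompts or scoring rules, so everything here is an assumption for illustration: the prompt templates, the function names, and the `Scorer` signature (a hypothetical callable that returns the log-likelihood a model assigns to a continuation given a prompt).

```python
import string
from typing import Callable, Sequence

# Hypothetical scorer: log-likelihood of `continuation` given `prompt`.
# In practice this would wrap a language model; it is an assumption here.
Scorer = Callable[[str, str], float]


def isolated_prompt(question: str) -> str:
    """Prompt that shows only the question, never the alternatives."""
    return f"Question: {question}\nAnswer:"


def options_prompt(question: str, choices: Sequence[str]) -> str:
    """Prompt that lists every lettered choice before asking for an answer."""
    letters = string.ascii_uppercase[: len(choices)]
    listing = "\n".join(f"{l}. {c}" for l, c in zip(letters, choices))
    return f"Question: {question}\n{listing}\nAnswer:"


def pick_isolated(score: Scorer, question: str, choices: Sequence[str]) -> int:
    """Score each choice text separately and return the index of the best one.

    The model cannot compare choices, because no prompt contains them.
    """
    prompt = isolated_prompt(question)
    scores = [score(prompt, " " + choice) for choice in choices]
    return max(range(len(scores)), key=scores.__getitem__)


def pick_with_options(score: Scorer, question: str, choices: Sequence[str]) -> int:
    """Show all choices in the prompt and score only the answer letters."""
    letters = string.ascii_uppercase[: len(choices)]
    prompt = options_prompt(question, choices)
    scores = [score(prompt, " " + letter) for letter in letters]
    return max(range(len(scores)), key=scores.__getitem__)
```

Both functions take the same scorer, so any accuracy difference between them on the same items comes from the evaluation setup rather than from the model. A question such as "Which of these is largest?" shows why the setup matters: it can only be answered sensibly when the alternatives are visible in the prompt.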

Community

Borchmann (paper author, paper submitter), 1 day ago

We invite readers and colleagues to reflect on how different evaluation methods can dramatically affect our perception of model capabilities and to join us in exploring more transparent testing strategies.