Is the `lm-eval-harness` revision or the command to reproduce wrong/outdated?

#396
by alvarobartt - opened

Hi there :hugs:

I tried to run the benchmarks locally and couldn't reproduce the DROP results, because I was relying on the `big-refactor` branch of `lm-eval-harness` (as you point to it, since it comes with great updated features). In any case, the About section states that the revision used is https://github.com/EleutherAI/lm-evaluation-harness/tree/b281b0921b636bc36ad05c0b0b0763bd6dd43463, yet the reproduction command uses `--model=hf-causal`. It should be changed to `--model=hf-causal-experimental` instead, because at that revision `hf-causal` does not support the `use_accelerate` flag.
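For reference, a minimal sketch of what the corrected invocation could look like at that pinned revision. The model name, task, few-shot count, and output path below are placeholders/assumptions (the actual values are whatever the About section specifies); the only point being made is the switch from `hf-causal` to `hf-causal-experimental`:

```bash
# Hedged sketch of the reproduction command at revision
# b281b0921b636bc36ad05c0b0b0763bd6dd43463 of lm-evaluation-harness.
# <your_model>, the task, the few-shot count, and <output_path> are
# placeholders; the key change is --model=hf-causal-experimental,
# which accepts the use_accelerate flag at this revision.
python main.py \
    --model=hf-causal-experimental \
    --model_args="pretrained=<your_model>,use_accelerate=True,revision=<your_model_revision>" \
    --tasks=drop \
    --num_fewshot=3 \
    --batch_size=1 \
    --output_path=<output_path>
```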

Thanks in advance!

Open LLM Leaderboard org

Hi! The command should be changed to `hf-causal-experimental`, and it's very likely that the diff was lost somewhere in an edit :)

clefourrier changed discussion status to closed

Thanks @alvarobartt for finding this error, and Hugging Face for resolving it.

However, something is still not right about DROP, so please consider giving it more attention. Because of the DROP scores, there is now little to no correlation between leaderboard scores and real-world performance.

For example, Falcon 180B didn't just do better than Qwen 14B at everything I threw at it, it did vastly better. Yet Qwen 14B now has a higher score than Falcon 180B on the leaderboard, thanks to a DROP score that is 20 points higher. And this isn't the exception but the rule, nor is it just my personal testing: Qwen 14B scored 10 points lower on ARC and WinoGrande, gigantic differences that leave it barely on par with small Llamas.

I'm sorry for the rant and for wasting your time. But the DROP test not only produces widely varying scores that in most cases don't reflect real-world LLM performance, it was also averaged in with multiple-choice tests like ARC and MMLU, effectively wiping out their impact on the final score. A 5-point gain on either of those tests represents a huge and unmistakable boost in LLM performance, while a DROP differential of up to 40 points can have little to no bearing on overall performance, resulting in far weaker LLMs like Qwen 14B scoring higher than vastly superior LLMs like Falcon 180B.
