Add HumanEval+ benchmark to the leaderboard
There are many good models now but we don't have any coding benchmark on the leaderboard.
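For context, a HumanEval-style benchmark scores a model by executing its generated code against hidden unit tests and reporting pass@1 (the fraction of problems solved on the first attempt). Here is a minimal sketch of that scoring loop; the problem dicts and the hard-coded completions stand in for real model generations and are purely illustrative, not the actual HumanEval+ data format:

```python
# Minimal sketch of HumanEval-style scoring: each problem has a prompt,
# a candidate completion (hard-coded here in place of a model call),
# and unit tests; pass@1 is the fraction of problems whose completion
# passes its tests. Field names are illustrative only.

problems = [
    {
        "prompt": "def add(a, b):\n",
        "completion": "    return a + b\n",
        "test": "assert add(2, 3) == 5\nassert add(-1, 1) == 0\n",
    },
    {
        "prompt": "def is_even(n):\n",
        "completion": "    return n % 2 == 0\n",
        "test": "assert is_even(4)\nassert not is_even(7)\n",
    },
]

def passes(problem):
    """Execute prompt + completion, then run the unit tests against it."""
    namespace = {}
    try:
        exec(problem["prompt"] + problem["completion"], namespace)
        exec(problem["test"], namespace)
        return True
    except Exception:
        return False

pass_at_1 = sum(passes(p) for p in problems) / len(problems)
print(f"pass@1 = {pass_at_1:.2f}")  # prints "pass@1 = 1.00"
```

In the real benchmark the generated code is sandboxed before execution, since running untrusted model output with `exec` is unsafe.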
Hi! You can find a specialized coding benchmark here.
Sure, that's a great leaderboard for coding models. But models on the Open LLM Leaderboard are capable of coding too; if we could benchmark them as well for comparison, that would be great.
@HoangHa I think it's best if the leaderboard only uses non-specialized tests, like reasoning (ARC), knowledge (MMLU), language skills (WinoGrande & HellaSwag), math skills (GSM8K), and comprehension (DROP).
Specialized skills like coding, medical knowledge, etc. are best evaluated outside of the leaderboard. This is primarily because making a general-purpose LLM notably better at coding requires feeding it tons of code, and this makes it a poor-performing general-purpose LLM (it starts producing nonsense outside of coding). In short, once you turn a non-coding LLM into a proficient coder, that's basically all it should be used for.
So letting a bump in the coding score overshadow the scores from general-ability tests like ARC & MMLU would mislead people into thinking a model is worth trying (because of the high average score), when in reality it performs worse at most things than other LLMs much lower on the leaderboard.
As @Phil337 said, the Open LLM Leaderboard only focuses on more general benchmarks. While it would be possible to add more specialized tasks, doing so would require a lot of compute and time, so we have to choose carefully which tasks to add next. The best approach would be to have many different leaderboards for specialized tasks (coding, multilingual, chat, etc.).
Great, understood. Thank you both.