Add HumanEval+ benchmark to the leaderboard
There are many good models now but we don't have any coding benchmark on the leaderboard.
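For context, a HumanEval-style benchmark scores a model by executing its generated code against hidden unit tests and reporting pass@1 (the fraction of problems solved on the first attempt). Here is a minimal sketch of that scoring loop; the problem dicts and the hard-coded completions stand in for real model generations and are purely illustrative, not the actual HumanEval+ data format:

```python
# Minimal sketch of HumanEval-style scoring: each problem has a prompt,
# a candidate completion (hard-coded here in place of a model call),
# and unit tests; pass@1 is the fraction of problems whose completion
# passes its tests. Field names are illustrative only.

problems = [
    {
        "prompt": "def add(a, b):\n",
        "completion": "    return a + b\n",
        "test": "assert add(2, 3) == 5\nassert add(-1, 1) == 0\n",
    },
    {
        "prompt": "def is_even(n):\n",
        "completion": "    return n % 2 == 0\n",
        "test": "assert is_even(4)\nassert not is_even(7)\n",
    },
]

def passes(problem):
    """Execute prompt + completion, then run the unit tests against it."""
    namespace = {}
    try:
        exec(problem["prompt"] + problem["completion"], namespace)
        exec(problem["test"], namespace)
        return True
    except Exception:
        return False

pass_at_1 = sum(passes(p) for p in problems) / len(problems)
print(f"pass@1 = {pass_at_1:.2f}")  # prints "pass@1 = 1.00"
```

In the real benchmark the generated code is sandboxed before execution, since running untrusted model output with `exec` is unsafe.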
Hi! You can find a specialized coding benchmark here.
Sure, that's a great leaderboard for coding models. But models on the Open LLM Leaderboard are capable of coding too; if we could benchmark them as well for comparison, that would be great.
@HoangHa I think it's best if the leaderboard only uses non-specialized tests, like reasoning (ARC), knowledge (MMLU), language skills (WinoGrande & HellaSwag), math skills (GSM8K), and comprehension (DROP).
Specialized skills like coding, medical knowledge, etc. are best evaluated outside of the leaderboard. This is primarily because making a general-purpose LLM notably better at coding requires feeding it tons of code, and this makes it a poor-performing general-purpose LLM (it starts producing nonsense outside of coding). In short, once you turn a non-coding LLM into a proficient coder, that's basically all it should be used for.
So letting a bump in the coding score overshadow the scores from general-ability tests like ARC & MMLU would mislead people into thinking a model is worth trying (because of the high average score), when in reality it performs worse at most things than other LLMs much lower on the leaderboard.
As @Phil337 said, the Open LLM Leaderboard only focuses on more general benchmarks. While it would be possible to add more specialized tasks, doing so would require a lot of compute and time, so we have to choose carefully which tasks to add next. The best approach would be to have many different leaderboards for specialized tasks (coding, multilingual, chat, etc.).
Great, understood. Thank you both.