Current eval doesn't reflect what we should be seeking from AI, and parameter count isn't accounted for

#551
by fblgit - opened

For a long while, I have found myself dissatisfied with the current state of our board. My perspective comes not from a place of arrogance, but from the observed potential of what it could be.

I question the value and efficiency of our current systems: What is the true meaning of 'merging'? How effective, truly, is a Mixture of Experts (MoE)? Is a combined model significantly better than a single one?

Considering the numbers, yes, a MoE might show higher performance than a single model even when it is just two copies of the same model stitched together, but is that the best way to gauge a model's effectiveness? A proportionally fair scoring formula that takes the model's parameter count into account seems a more equitable way to rank performance on the board. Nobody will prefer to run a 300B model for a 0.03 AVG improvement... and despite its merit and numerical success on GSM512K or ARC_NOFOOL, as a model, in terms of magnitude, it is a failure.
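
Just to make the idea concrete, a parameter-aware score could look something like the sketch below; the reference size, penalty strength, and function name are illustrative assumptions of mine, not a formula I'm claiming the board should adopt as-is:

```python
import math

def param_adjusted_score(avg_score: float, n_params_b: float,
                         ref_params_b: float = 7.0, alpha: float = 0.05) -> float:
    """Scale the plain leaderboard average by a mild penalty on parameter count.

    avg_score    -- leaderboard average (0-100)
    n_params_b   -- model size in billions of parameters
    ref_params_b -- reference size that carries no penalty (assumed 7B here)
    alpha        -- penalty per doubling of parameters (illustrative value)
    """
    # Log-scaled penalty: each doubling of parameters beyond the reference
    # size removes a fixed fraction of the score.
    penalty = alpha * max(0.0, math.log2(n_params_b / ref_params_b))
    return avg_score * (1.0 - penalty)

# A 7B single model at 73.89 AVG vs a 14B MoE at 73.95 AVG:
print(param_adjusted_score(73.89, 7))   # 73.89 -- no penalty at the reference size
print(param_adjusted_score(73.95, 14))  # ~70.25 -- the 0.06 AVG gain doesn't pay for 2x the parameters
```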

Ranking a 14B MoE at 73.95 AVG above a 7B single model at 73.89 feels unjust, and neither of those models is mine... BUT scale does matter: a Pythia 410M model that can hold its own amid billion-parameter models without sacrificing efficiency is a kickass model, far better than any franken-MoE stitched from thousands of replicas of the same model downloaded from the Hub while pretending it has "secretly" reinvented the wheel. This misrepresentation is neither fair nor accurate, nor is it algorithmic; it perpetuates a fundamental distortion in the scoring system.

I suggest we extend the evaluation beyond average scoring, size, and efficiency to include datasets like MATHQA, PUBMED, LAMBADA, and challenging sets like BigBenchHard, and more, as well as setting all evals to 0-shot, the way the real world works: we don't repeat the same thing five times, so the model must be compared against what we really consume. By doing so, we're not merely vying for leaderboard prestige but striving to enhance 'Human-Designed Compute Intelligence', or AI, thereby delivering actual benefits to humanity through more truly representative figures, so that this whole trend can lead people to focus on genuinely useful goals rather than a meaningless AVG number that has no implementation viability in the current landscape.

At its core, my objective is to strengthen and enhance the OpenLLM community. If we're collectively aiming at progressing humanity, it seems both prudent and necessary to emphasise 'true intelligence' over extravagant solutions to trivial problems, or tiny decimal gains that go nowhere...

The hallmark of an intelligent model is its ability to understand many kinds of prompts in a truly zero-shot fashion. Its evaluations should align with the responsibilities and areas we genuinely care about: not merely mathematical puzzles, but also medical data, real-world number crunching, global facts, updated facts, and much more.

@fblgit In theory a lot of the changes you're suggesting make sense, but I don't think they're doable.

For example, keeping the full set of standardized tests multi-shot rather than 0-shot allows for objectively comparing foundation models with fine-tunes, which is crucial for determining the effectiveness of fine-tuning.

Plus, if we add specialty knowledge tests that can easily be fine-tuned for, like PUBMED, then weaker models focusing on them would see an artificial bump on the leaderboard that doesn't reflect their true strength (intelligence, comprehension, adaptability...). Each added test also adds another source of contamination, and as it stands, contamination is an almost insurmountable problem when testing LLMs.

The current standard set of tests makes perfect sense (except TruthfulQA, which, although important, is a contamination nightmare and doesn't reflect true performance). ARC tests "intelligence", MMLU tests general knowledge, WinoGrande and HellaSwag test language skills, and GSM8K tests math.

This is why I use the leaderboard to get an idea of the general performance of LLMs, which I've found to be surprisingly accurate, and then do personal testing for things that can't be tested for, such as poetry, story and joke writing. For example, I use scripts to randomly combine 5 parts of a prompt into trillions of unique combinations, then subjectively grade how well the stories, poems and jokes adhere to the generated instruction, how few contradictions there are, and so on.
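
The idea behind those scripts is roughly the sketch below; the category pools and wording are simplified placeholders for illustration, not my actual lists:

```python
import random

# Illustrative placeholder pools -- the real lists are far longer, which is
# where the huge number of combinations comes from (e.g. five pools of
# ~300 entries each already give ~2.4 trillion unique prompts).
characters = ["a retired astronaut", "a grumpy librarian", "a street magician"]
settings = ["on a generation ship", "in a flooded city", "at a county fair"]
forms = ["a limerick", "a noir short story", "a stand-up joke"]
constraints = ["without using the letter 'e'", "in exactly six sentences", "ending on a question"]
twists = ["where the narrator is lying", "where time runs backwards", "featuring an unreliable map"]

def random_prompt(rng: random.Random) -> str:
    """Pick one element from each of the five pools and join them into a single instruction."""
    return (f"Write {rng.choice(forms)} about {rng.choice(characters)} "
            f"{rng.choice(settings)}, {rng.choice(constraints)}, {rng.choice(twists)}.")

rng = random.Random(42)
for _ in range(3):
    print(random_prompt(rng))
```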

There are places you can go to find specialty LLMs, such as for medicine and coding, after which you can come to the HF leaderboard to determine their true general performance. Some of the best coding LLMs are astonishingly dumb and litter my prompted stories with absurd errors, hallucinations and contradictions, revealing their lack of general intelligence (e.g. comprehension, language skills, logic and memory).

In short, adding specialty tests like medicine and coding can only serve to make the leaderboard scores less representative of the true general performance of LLMs.

I'd like to see more models tested on a multi-turn benchmark like MT bench.

Also, if training models to pass a particular test poses little challenge, then that test carries less weight as a measure of overall success. Simple tests that often see inflated scores due to training on the test data could be assigned a reduced weight when computing an overall combined score, as in the sketch below.
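
For illustration, such a down-weighted combined score could be computed along these lines; the specific weights and scores are arbitrary examples, not suggested values:

```python
# Per-benchmark weights: tests that are easy to train for get reduced weight.
# These numbers are arbitrary examples, not suggested values.
WEIGHTS = {
    "ARC": 1.0,
    "HellaSwag": 0.8,
    "MMLU": 1.0,
    "TruthfulQA": 0.5,  # contamination-prone, per the discussion above
    "WinoGrande": 0.8,
    "GSM8K": 0.6,       # heavily targeted by fine-tuning
}

def weighted_average(scores: dict[str, float]) -> float:
    """Weighted mean of benchmark scores, normalised by the total weight used."""
    total_weight = sum(WEIGHTS[name] for name in scores)
    return sum(score * WEIGHTS[name] for name, score in scores.items()) / total_weight

example = {"ARC": 65.0, "HellaSwag": 84.0, "MMLU": 63.0,
           "TruthfulQA": 78.0, "WinoGrande": 79.0, "GSM8K": 82.0}
print(round(weighted_average(example), 2))  # 73.74, vs a plain average of ~75.17
```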

Look, maybe not everything can be done... but we can surely replace GSM8K with something else and stop the witch hunt.

Open LLM Leaderboard org

Hi!
Thanks for this interesting debate, which I had missed.

Regarding the points raised by @fblgit on merging, this is precisely why we have added filters to hide merges from the view by default. I agree that merges tend to have inflated scores, either through crystallisation of knowledge or self-contamination.

We are going to extend the leaderboard in the coming months, and we will try to add benchmarks that are relevant to general model capabilities while being harder than the ones we currently have, as we did in our last update. However, I agree with @Phil337 that this leaderboard should remain general: adding specialized datasets (or zero-shot-only evaluations) would downgrade the results of a number of good models just because they don't have those capabilities.

If people, however, want to start specialized dataset leaderboards, we'll be happy to give them a hand, and they can ping me here or on twitter :)

clefourrier changed discussion status to closed
