Brainstorming: Suggestions for improving the leaderboard

#477
by xxyyy123 - opened

The top ranks on the leaderboard (not just 7B, but all sizes) are now occupied by models that have undergone merging and DPO, which completely undermines the leaderboard's purpose of judging the merits of open-source models. We now need to think seriously about the rules that data on the leaderboard should follow.

In my opinion, no fine-tuned model trained on TruthfulQA data should be allowed. GSM8K requires deeper consideration; in fact, many well-known models, such as GPT-4, have used GSM8K data as part of their training.

Regarding GSM8K, fine-tuning is more about training skills than direct knowledge injection, which makes the concept behind datasets like MetaMath reasonable. However, it is still necessary to distinguish between the data used for training and the data used for testing.

For example, if the training set contains questions very similar to those in the test set (such as only changing the numbers in the questions), then such data is highly inappropriate.
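A minimal sketch of one way such near-duplicates could be flagged, assuming a simple normalization step that masks numbers before comparing the questions (the threshold and normalization here are illustrative, not a prescribed method):

```python
import re
from difflib import SequenceMatcher

def normalize(question: str) -> str:
    """Lowercase and mask digits so 'only the numbers changed' variants collapse to the same form."""
    text = question.lower()
    text = re.sub(r"\d+(\.\d+)?", "<num>", text)   # replace every number with a placeholder
    return re.sub(r"\s+", " ", text).strip()

def is_near_duplicate(train_q: str, test_q: str, threshold: float = 0.9) -> bool:
    """Flag a training question that matches a test question once numbers are ignored."""
    ratio = SequenceMatcher(None, normalize(train_q), normalize(test_q)).ratio()
    return ratio >= threshold

# Example: the same GSM8K-style question with only the numbers changed
train = "Tom has 12 apples and buys 7 more. How many apples does he have?"
test  = "Tom has 25 apples and buys 13 more. How many apples does he have?"
print(is_near_duplicate(train, test))  # True
```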

xxyyy123 changed discussion title from The LeaderBoard is totaly a mess to The LeaderBoard is now totaly a mess

We mustn't forget that benchmarks are a proxy for overall model capability. I believe training on datasets that are rephrases of the benchmarks we use is quite disingenuous. When fine-tuning a model, the main goal should be to improve its overall capability, NOT its benchmark scores; otherwise you are preying on the limitations of our current testing methods to achieve a higher score, which lets you present your model as the top #1 (implying that it is the most capable in all the tested areas) and then disappoint users when it is tested subjectively.

Many people have been suggesting black-box benchmarks as an alternative, but I don't think this approach is sufficient: model creators whose mentality is to improve benchmarks instead of overall capability will become optimizers, not for real-world performance, but for benchmarks. The community will gradually and unknowingly converge on datasets that align more and more with the test sets, iteratively improving scores while subjective evaluation flat-lines.

I consider this destructive mentality to be a direct product of the leaderboard's nature, where people justify their research based on their placement here. We celebrate and covet being amongst the top scorers, which further promotes this kind of behavior.

However, the search for an infallible, objective method, while ultimately impossible, does bear fruit. I believe there are a few things that can be done to make this whole thing more genuine:

  1. Larger, more diverse benchmarks. (With no train/validation sets, these would be harder to 'evolve towards'.)
  2. Periodically changing benchmarks over time. (This would invalidate models fitted to the benchmarks and force overall model capability.)
  3. Implementing https://github.com/swj0419/detect-pretrain-code-contamination/tree/master or some form of contamination detection.

#2 is a throwback to when the HF team added Winogrande and GSM8K as benchmarks for this leaderboard, severely rearranging the top scorers' placements; many of them have never been seen near the top again to this day.
#3 is something I was working on before I became aware that SaylorTwift (part of the HF team) was doing the exact same thing (this would be a separate HF Space where models can be tested for contamination).
#1, while necessary, is difficult to produce.
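For context, the linked repository builds on the Min-K% Prob idea: score a text by the average log-probability of its least likely tokens under the model, where suspiciously high values suggest the text was seen during training. The sketch below is only an illustration of that idea using Hugging Face transformers, not the repository's actual implementation, and the model name and threshold are placeholders:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def min_k_percent_score(text: str, model, tokenizer, k: float = 0.2) -> float:
    """Average log-probability of the k% least likely tokens of `text` under `model`."""
    enc = tokenizer(text, return_tensors="pt")
    input_ids = enc.input_ids
    with torch.no_grad():
        logits = model(input_ids).logits
    # log P(token_t | tokens_<t): shift logits against the targets
    log_probs = torch.log_softmax(logits[:, :-1, :], dim=-1)
    target_ids = input_ids[:, 1:]
    token_log_probs = log_probs.gather(2, target_ids.unsqueeze(-1)).squeeze(-1)[0]
    n_lowest = max(1, int(len(token_log_probs) * k))
    lowest = torch.topk(token_log_probs, n_lowest, largest=False).values
    return lowest.mean().item()

# Placeholder model; a real tool would compare scores across benchmark samples.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
score = min_k_percent_score("The quick brown fox jumps over the lazy dog.", model, tokenizer)
print(f"Min-K% score: {score:.3f}")  # higher (less negative) = more likely memorized
```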

Those are my thoughts on the leaderboard at the moment. I really love this place and would like to see it succeed in its goals.

very funny

Hugging Face H4 org

Hi! Thanks for your feedback; there is indeed an issue with data contamination on the leaderboard.

The top ranks on the leaderboard (not just 7B, but all sizes) are now occupied by models that have undergone merging and DPO, which completely undermines the leaderboard's purpose of judging the merits of open-source models. We now need to think seriously about the rules that data on the leaderboard should follow.

Unfortunately, it is difficult to know what data a model has been trained on. That is why we are trying to build tools to detect data contamination.

In my opinion, no fine-tuned model trained on TruthfulQA data should be allowed. GSM8K requires deeper consideration; in fact, many well-known models, such as GPT-4, have used GSM8K data as part of their training.

One way of discerning models that have been trained on data including TruthfulQA or other benchmarks would be a data contamination tool (we are indeed working on one with Weijia Shi).

Regarding GSM8K, fine-tuning is more about training skills than direct knowledge injection, which makes the concept behind datasets like MetaMath reasonable. However, it is still necessary to distinguish between the data used for training and the data used for testing.
For example, if the training set contains questions very similar to those in the test set (such as only changing the numbers in the questions), then such data is highly inappropriate.

Again, this can be addressed using tools that evaluate a model's contamination on test sets. However, the issue is that a contaminated model is not necessarily a bad model and could still be considered for real-world use.

Overall, while I agree that fixed benchmarks alone do not give complete information about a model's real-world capabilities, I think they give an accurate quantitative metric for quality on certain tasks. They should, however, be used alongside a more qualitative metric that reflects real-world usage, like the Chatbot Arena. Moreover, there should be a way to test whether a model was contaminated on a fixed benchmark, to give an even better understanding of its capabilities.

Finally, we are working hard to provide a tool that would allow people to test models for contamination on different benchmarks (as well as on copyrighted materials and sensitive information). We will run tests on those 7B models and remove them if needed. Thanks for your concern and patience! :)

Does detecting data contamination alone prevent 7B models from dominating the leaderboard?

I have mentioned multiple times here that people will spare no effort to cater to the leaderboard. For example, if you don't support custom input formats and system prompts, then people train according to the benchmark format. And then there are those meaningless, homogeneous 'top' model fusions; we all know this lacks scientific basis and resembles alchemy from the Middle Ages.

I believe that the philosophy of the leaderboard has fallen into the trap of formalism. If it cannot be corrected, I am confident that we will soon see models voluntarily withdrawing from the rankings, and further boycotts... This leaderboard has unjustifiably harmed the reputation of some really good models, influencing investor opinions and societal perceptions.

Or, please consider this: when did potential benchmark leakage (where training was not done on the original text) become a criterion for evaluating the quality of a model? Is training on extremely similar tasks, especially in mathematics, inherently condemnable? Is there truly no overlap between the test and train splits of these datasets in terms of rewrites or similar tasks?

We must acknowledge the reality that there are many tasks a model can never autonomously perform if it hasn't seen one among a series of similar tasks. If we label this process as data contamination, could it be an overcorrection? The benchmark test data itself is sometimes a rewrite of derived tasks that serve as seeds for self-instruction.

"When a measure becomes a target, it ceases to be a good measure." - Goodhart's law

I certainly believe that is where we are now: training with techniques that may limit real-world use, and training on test data or on reworded test data, all to chase higher benchmark scores.

Touching on the human element here: unfortunately, while it would be nice to believe that everyone is acting in good faith, having one's model score highly and gain popular traction is a means to a number of self-serving ends: influence, job opportunities, and money (investors, private services, etc.). Given those potential benefits, there will be bad actors who create models that play to the benchmarks despite any real-world deficiencies.

Yes, some contaminated models are still good models; the problem is that they get undue recognition and credit for cheating (knowingly or not). They might be no better than a dozen other models, but the inflated scores let them stand out.

So either everyone has to play the game to keep up with those who don't care or who knowingly cheat, or people stop taking the leaderboard seriously, or honest people stop bothering to contribute, knowing that someone manipulating the benchmarks could outshine their equal, or perhaps better, model.

Possible leaderboard solutions that I can think of:

  1. Switch benchmarks to ones that are more resistant to fine-tuning.
    Best Mistral-7b-0.1 fine-tune of each benchmark compared to base Mistral-7b-0.1:
    ARC: +13.1, Hellaswag: +5.1, MMLU: +1.1, TruthfulQA: +30.6, Winogrande: +4.2, GSM8K: +35.3.
    There seem to be some obvious culprits behind hugely increased scores. Can't we just switch to using more benchmarks like MMLU, which are harder to fine-tune for?

  2. Implement contamination checks.
    I just don't see how this wouldn't create loads of arguments and go against the point of the leaderboard. How accurate does the contamination check have to be, how confident? If it is true that MMLU is the most trustworthy benchmark on the leaderboard, then contamination checks may remove the two best models on here, Yi-34b (shown to have a 94% chance of contamination) and Qwen-72b. If HF removes the best models due to contamination, then the leaderboard no longer fulfills its purpose of showing what the best models are.
    Note: The chatbot arena shows that people prefer Mixtral 8x7B over Yi-34b, so MMLU doesn't capture everything.

  3. LLM-as-a-judge which scores based on trained human preference.
    MT-Bench seems pretty decent. Maybe use something like that?

  4. New black box benchmarks
    I like how this solution gives the added bonus of getting to design new benchmarks that align more with what people actually want from a model. Though I understand this would be a lot of work (if no black-box benchmarks already exist) and it would decrease transparency.

  5. Human voting
    Probably not going to happen since lmsys chatbot arena exists.

  6. Benchmarks that generate new questions for the LLMs to answer
    Probably not, since you want all the models to have a fair and equal comparison, plus there is the difficulty of ensuring correct answers.

Benchmarks that generate new questions for the LLMs to answer

To enhance the fairness and accuracy of language model leaderboards, I propose exploring the use of a dual-seed approach in evaluations. This would involve generating test prompts using both a public seed, for transparent and comparable results, and a secret seed, to assess whether models are truly learning and not just memorizing.
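As a rough illustration of the dual-seed idea (the question template and seed values below are made up for the sketch; the point is only that the same generator can run with a published seed for reproducible comparisons and a withheld seed to probe memorization):

```python
import random

def generate_arithmetic_prompts(seed: int, n: int = 3) -> list[str]:
    """Deterministically generate simple test prompts from a seed."""
    rng = random.Random(seed)
    prompts = []
    for _ in range(n):
        a, b = rng.randint(100, 999), rng.randint(100, 999)
        prompts.append(f"What is {a} + {b}? Answer with the number only.")
    return prompts

PUBLIC_SEED = 42          # published: anyone can reproduce these prompts
SECRET_SEED = 20240117    # placeholder: held by the evaluator, rotated periodically

public_set = generate_arithmetic_prompts(PUBLIC_SEED)
secret_set = generate_arithmetic_prompts(SECRET_SEED)

# A large gap between a model's accuracy on the public set and on the secret set
# would hint at memorization of the published variant rather than real skill.
print(public_set)
print(secret_set)
```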

Setting costs aside, one could easily hire 30 people to work full-time on improving LLM evaluation for two years or so.

clefourrier changed discussion title from The LeaderBoard is now totaly a mess to Suggestions for improving the leaderboard

While I share the concern about the 'less than helpful' nature the leaderboard risks taking on, I tentatively view the situation from a different angle.

I agree that the benchmarks are being gamed for many reasons, as mentioned above. At present, this appears to be a bad thing, much like a student cramming for a final exam who knows beforehand that it will only contain a question on a particular battle and the ability to add two 3-digit numbers. Such a student may not necessarily be good at much else, even if they complete the exam satisfactorily.
However, with humans, we tend to assume, "If you understand that bit, you can probably also apply it to something else over here." I don't see a reason, yet, to assume 'machines' should be granted this assumption as well.

Is this the fault of the student for 'training on the test data,' or is the test too narrow?
I am here tentatively advocating for (hugely) broadening the scope of benchmarks to capture 'more of what is equatable to knowledge/skill/useful LLM characteristics' that we humans appreciate and recognize.

This seems less relevant if we can determine how, if at all, LLMs and other AI implementations are actually learning to recall and synthesize or merely learning to recall and probabilistically string together words. If they start synthesizing, and we understand how, perhaps we don't need to coax out all the cases but teach general knowledge, reasoning, and just force-feed facts (I'll leave the problem of force-feeding Wikipedia 'false facts' for another post ;-) ).

I suppose we could create a huge open-source, crowd-sourced project where domains, subdomains, etc. are collected, and then questions are broadened (by crowd-sourced questions and voting on answers), with the hope of reaching such breadth that model knowledge mimics synthesis/intelligence through the sheer volume of 'overfitting'... i.e., using brute force as a stop-gap until we figure out the more difficult part: a friendly way of doing on-the-fly inference with retention of conclusions. To paraphrase the old Apple Store catchphrase, maybe... 'there's a prompt for that.'

I appreciate any thoughts, as I also feel a bit downbeat about the increasing un-usefulness of benchmarks and my own inability to create my own fool-proof benchmarking framework.

clefourrier changed discussion title from Suggestions for improving the leaderboard to Brainstorming: Suggestions for improving the leaderboard

@Olofp Looks like someone else had a very similar idea. See the new discussion.

https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard/discussions/481

I agree with the notion that trust is a crucial requirement for benchmarks.

Many people point to black-boxed benchmarking, for example here: https://www.reddit.com/r/LocalLLaMA/comments/18kn2pf/we_need_more_blackboxed_benchmarks/, but I don't think black-boxed benchmarks are a real solution. They just move the trust problem from model authors to benchmark conductors and to the now hidden benchmark, on top of the phenomenon that "the community will gradually and unknowingly converge on datasets that align more and more with the test sets, iteratively improving scores while subjective evaluation flat-lines" as suggested by @Yeyito . There already seems to be some distrust towards the current open, reproducible benchmarks, but how can you trust a black-boxed benchmark that you cannot examine? How will users and customers know whether that benchmark reflects their use case? They cannot know without having tried the model, and they cannot trust it unless the black-boxed benchmarks are released to the public at regular intervals.

IMHO, black box benchmarks are not the way.

There is one thing that is bugging me: contamination of base models.

While not gaining much traction on reddit, https://www.reddit.com/r/LocalLLaMA/comments/18ikb0e/a_more_intuitive_way_to_score_mmlu_benchmarks/ was an interesting read.

They propose the following measures to improve MMLU, which at least partially address base-model contamination by manually adjusting scores.

Introducing DZPAS, an intuitive way to score MMLU that more closely aligns with how people interact with LLMs. DZPAS stands for Decontamination Zero-shot Penalty Adjusted Score. It is an adjustment to the 5-shot MMLU benchmark scores with the following changes:

Penalty Adjusted Score - Since MMLU is a multiple-choice benchmark, random guessing will get you a score of 25%. Here we introduce PAS to penalize wrong answers by 0.25x, such that random guessing will, on average, get a score of 0%. This is more intuitive when trying to understand how often the LLM will get a question correct when there are not multiple choices.

Contamination Adjustment - From this fantastic paper (https://arxiv.org/pdf/2311.04850.pdf) we learn Llama2 has approximately 10% pre-training contamination on MMLU. We use this contamination to adjust scores down (relative to how accurate they are). For example, if a model scores 50%, they will lose (0.5*10% contamination) = 5% of their score for an adjusted value of 45%.

0-Shot Adjustment - While LLMs used to be 'Few Shot Learners', with recent advances in LLMs (instruction fine-tuning, RLHF) it no longer makes sense to evaluate benchmarks in a few-shot setting. MMLU is generally always evaluated 5-shot, but when people interact with LLMs they almost never give additional in-context learning examples. 5-shot inflates scores and creates a disconnect between a benchmark and real-world use. Here we use the original MMLU paper (https://arxiv.org/pdf/2009.03300.pdf) to find 0-shot and 5-shot score differences (from GPT-3 175B) and use that to create a 2nd-order polynomial to determine an adjustment factor (which varies depending on the benchmark accuracy). Hopefully in the future all benchmarks will be re-evaluated as 0-shot, which should make this adjustment unnecessary.

Finally, DZPAS is all 3 of the above adjustments combined in a mathematically robust manner. You can see a larger variation in the scores of common LLMs, which more accurately reflects a model's capabilities in real-world use. For example, Llama-7B goes from 35.7% to 14.0%, which means that when asking MMLU-type questions, Llama-7B is only likely to answer 1 out of 7 correctly. With Llama2-70B you see a much less dramatic change, going from 69.8% to 57.2%. (The original post also plots the individual contributions of each adjustment factor described above.)
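A minimal sketch of how the first two adjustments could be computed. The exact formulas are not given in the quoted post, so this is one plausible reading: the penalty is a parameter, and 1/3 is the value that makes random guessing on a 4-way multiple choice average exactly zero; the 0-shot polynomial adjustment is omitted here.

```python
def penalty_adjusted_score(accuracy: float, penalty: float = 1/3) -> float:
    """Reward correct answers, penalize wrong ones.

    With penalty = 1/(choices - 1) = 1/3 for a 4-way multiple choice,
    random guessing (accuracy = 0.25) averages out to 0.0.
    """
    return accuracy - penalty * (1 - accuracy)

def contamination_adjusted_score(accuracy: float, contamination: float) -> float:
    """Scale the score down in proportion to estimated pre-training contamination.

    Matches the quoted example: 50% accuracy with 10% contamination loses
    0.5 * 0.10 = 5 points, giving 45%.
    """
    return accuracy * (1 - contamination)

print(penalty_adjusted_score(0.25))              # 0.0  (chance level)
print(contamination_adjusted_score(0.50, 0.10))  # 0.45
```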

I quote the following critique by Full_of_bad_Ideas:

0-Shot Adjustment - While LLMs used to be 'Few Shot Learners', with recent advances in LLMs (instruction fine-tuning, RLHF) it no longer makes sense to evaluate benchmarks in a few-shot setting.

I absolutely can't agree with this. Base models are not fine-tuned to handle zero-shot tasks, therefore you shouldn't compare raw models against fine-tuned ones using zero-shot benchmarks. After pre-training is done, a model is not instruction-tuned and requires multiple shots! You've included Llama 1 7B in this test, which is not instruction fine-tuned. This is why it scores so badly! Llama 2 models, on the other hand, are not really base models; they have already been fine-tuned on instructions. I see this as nothing more than contamination of a base model. We can't encourage that! We, as a community mostly taking the crumbs that fall from the pockets of corporations (GPT-3.5 datasets, Facebook's contributions), can't allow zero-shot benchmarks to gain adoption. Since companies target those benchmarks, they might stop releasing base models altogether and just stick to one chat model, taking a route similar to OpenAI's. We really can't afford for this to happen, or this community could die.

We as a community want base models to be released, so maybe we should find ways to make them stand out more, without putting them in direct competition with fine-tunes and instruction-tuned models.

I was also thinking about the current practice of flagging models with contaminated datasets. While having public datasets makes it much easier to detect contamination, I think that if contaminated datasets are flagged, malevolent actors might move towards using hidden datasets to make contamination harder to detect. The next step would be for the leaderboard administration to only allow models with public datasets onto the leaderboard, but that in turn would disincentivise some actors from adding their model to the leaderboard at all, if they are hard set on not sharing their "secrets". So, how to deal with that? Maybe one solution would be to only show, by default, models that publish their datasets, and hide models that do not. People could still have a look at the ones with secrets, but there would be an incentive to publish the dataset to gain a higher share of publicity.

All this under the assumption that dataset contamination is actually a bad thing. Another approach would be to throw as many tests and benchmarks at models as possible. We do want models that can do EVERYTHING, don't we?

While articles such as https://arxiv.org/pdf/2311.04850.pdf (Rethinking Benchmark and Contamination for Language Models with Rephrased Samples) call for greater decontamination efforts and detection of overfitting, is it really so wrong to have a model that is really good at all the benchmarks? Does it not just mean we need A LOT more and better benchmarks? (Although DROP famously showed how a single benchmark can have huge effects on the average score: with a low number of benchmarks, model authors are incentivised to specialize in a single benchmark, which increases their average disproportionately, but to reach the very top you need to be good at all of them, otherwise a single low score pushes your average down disproportionately.) I assume having a benchmark with a low number of questions and tasks is simply the sad result of being under a budget constraint and not having compute resources available?

However, none of these techniques can solve the problem of contamination, because the benchmark datasets are public. We need a benchmark that prevents any possibility of leakage, or some other creative technique, something to stop cheating on the LLM Leaderboard. It's crucial for Hugging Face's reputation.

Hear me out: we need a Hugging Face specific benchmark, protected by cryptographic means so that it is NOT included on the web and the chance of any contamination is ZERO. Then we can go about ranking models with the new benchmark. I know it's a lot of work, but I think this is the only way forward; the rate of contamination is so high that we might just be benchmarking contamination at this point.
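One way to read "protected by cryptographic means" is a commitment scheme: publish only salted hashes of the benchmark items now, keep the plaintext offline, and reveal it later so anyone can verify the questions existed before the evaluated models were trained. This is only a sketch of that idea; the salt handling and item format are placeholders:

```python
import hashlib
import json
import secrets

def commit_benchmark(items: list[dict], salt: str) -> list[str]:
    """Return salted SHA-256 commitments, one per benchmark item."""
    commitments = []
    for item in items:
        payload = salt + json.dumps(item, sort_keys=True)
        commitments.append(hashlib.sha256(payload.encode("utf-8")).hexdigest())
    return commitments

# Placeholder benchmark items, kept offline until the reveal date.
items = [
    {"question": "What is the capital of France?", "answer": "Paris"},
    {"question": "What is 17 * 23?", "answer": "391"},
]
salt = secrets.token_hex(16)          # kept secret until the reveal
public_commitments = commit_benchmark(items, salt)

# Published today: only the hashes. At reveal time, publishing `items` and `salt`
# lets anyone recompute the hashes and confirm nothing was changed after the fact.
print(public_commitments)
```

This keeps the test set off the web while it is in use, though contamination after each reveal would still have to be handled by rotating in fresh items.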

@clefourrier @SaylorTwift

Hello,

I would suggest something in addition to the ideas above, which would reduce compute cost in general and cut down on every no-name open-source model getting added.

Right now, when someone submits a model, very little is required of them, and I think this should not be allowed anymore.

Let's require that they fill in Type, Architecture, Precision, Hub License, Params, and Model SHA, and, below each of those, an additional field with the source link where they found the information.
Let's say something like this:
fine-tuned, LlamaForCausalLM, float16, llama2, 10.73, 81236eb57b5f265bc965b860015533f73babdcd4b62ea4548c3db7a99949fea7
mistral.ai, mistral.ai/news/mixtral-of-experts, mistral.ai, mistral.ai, mistral.ai, mistral.ai

If you check the model and the information provided is not immediately verifiable, the evaluation stays pending and does not start until the submitter has corrected the data; after two days it is removed, so another submitter can try again later.

I hate it when people add models that most of the time are not good, or are pretty much fraud, and compute gets wasted on them, while bigger research companies or people who have good models and all the information available have to wait.

I think just requiring those details and their sources will deter most casual spammers. It won't stop someone determined to troll the community regardless, but it will at least make them work for it.
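A minimal sketch of what such a required-fields check could look like on the submission side (the field names follow the suggestion above; the validation rules and data shape are illustrative, not the leaderboard's actual logic):

```python
REQUIRED_FIELDS = ["type", "architecture", "precision", "hub_license", "params", "model_sha"]

def validate_submission(form: dict) -> list[str]:
    """Return a list of problems; an empty list means the submission can be queued."""
    problems = []
    for field in REQUIRED_FIELDS:
        entry = form.get(field, {})
        if not entry.get("value"):
            problems.append(f"missing value for '{field}'")
        if not entry.get("source"):
            problems.append(f"missing source link for '{field}'")
    return problems

# Hypothetical submission: every field carries both the value and where it came from.
submission = {
    "type": {"value": "fine-tuned", "source": "model card"},
    "architecture": {"value": "LlamaForCausalLM", "source": "config.json"},
    "precision": {"value": "float16", "source": "config.json"},
    "hub_license": {"value": "llama2", "source": "model card"},
    "params": {"value": "10.73", "source": "model card"},
    # "model_sha" deliberately left out to show the check firing
}
print(validate_submission(submission))
# ["missing value for 'model_sha'", "missing source link for 'model_sha'"]
```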

A versioned leaderboard.
It's not scaling right now, and the pain of changing scores is big...
I suggest a new dropdown that users can select (pointing to the latest version by default) with different revisions/versions of the leaderboard and its tests. Let's assume we are on v1 now, with these old-timer evaluations. We simply create v2 with new evals; the leaderboard points to the latest stable revision by default but also has a -dev variant with newer experimental tests. This could be implemented with a git-branch strategy, in theory a dozen lines of code changes...
When people submit a model for evaluation, they submit to latest-stable by default. Models that want to be part of the new revision can raise an evaluation request themselves, though leaderboard maintainers could aim to at least evaluate the top 100 whenever a new board revision is released.
Whoever is on the -dev board must understand that recalculation can happen at any time, while those on stable can expect exactly that: stable score marking and evaluations. But eventually the board and its tests evolve in line with the industry and community.

How about that?
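As a sketch of how revision selection could work if the result files lived in a versioned Hugging Face dataset repo (the repo name, branch names, and file path here are placeholders; huggingface_hub does support fetching a file at a specific git revision):

```python
from huggingface_hub import hf_hub_download

# Hypothetical results repo with one branch per leaderboard revision.
RESULTS_REPO = "open-llm-leaderboard/results"  # placeholder
REVISIONS = {"v1-stable": "v1", "v2-stable": "v2", "v2-dev": "v2-dev"}  # branch names, illustrative

def load_results(revision_label: str, path: str = "aggregated_results.json") -> str:
    """Download the results file for the selected leaderboard revision (git branch)."""
    branch = REVISIONS[revision_label]
    return hf_hub_download(repo_id=RESULTS_REPO, filename=path,
                           repo_type="dataset", revision=branch)

# The UI dropdown would default to the latest stable revision:
local_file = load_results("v2-stable")
print(local_file)
```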

Side note: we have to be realistic, resources are not unlimited (manpower and compute). But I think there are many people around who deeply care about the community and about the leaderboard and what it represents. Fewer words, more commits... but let's get clear about which commits are actually needed, or maybe this is not even the right place for that. It seems like a pain, though this needs more transparency if it wants to carry the stamp of "Community" or even "Open". I'm sure we are not walking into ClosedAI chapter 2, and anyone like HF would be deeply interested in free code, but it would be under our terms. I'm sure there is no issue with getting a couple of big nodes funded for this; IBM would likely be very happy to have "OpenLLM Leaderboard (sponsored by IBM)" as their first contribution to the open LLM community, right?

What does the text search field do? I tried a model name and a partial model path; it always just shows all models.

Hugging Face H4 org

@jensdraht we added a lot of metadata checks since :)
@fblgit Completely missed your suggestion of having more or less 2 tracks for the leaderboard - it's an interesting one! I'll add it to the list of things we brainstorm about.
For the rest of the suggestions (black-box evaluation or adding more benchmarks), we are working on them, mostly with partners, so that we can have a list of interesting leaderboards covering different use cases.
I'm going to close this discussion as no additional input was added in over a month, but thanks a lot to you all who contributed to the discussion. We wrote down your ideas and will see what we can realistically implement in the next months.

@2dts it only searches the model names (org/name), and you can look for several models by using a comma as a separator.
Next time please open your own issue to avoid spamming other discussions.

clefourrier changed discussion status to closed

There are a few more things:

  • "Submit your model here" should feature a short section about decontamination efforts
  • "Submit your model here" should feature "Read the FAQ and "Explanation of icons" in the "About" section (or in the "knowledge base") of the leaderboard, before you submit a model".
  • The "About" section or the "Submit your model here" section of the leaderboard should be updated with more information about the main selectable sections of the leaderboard, but in particular some information about how the type of finetuning or model architecture (Mergers, Moergers, Pretrained, etc.) will have an effect on the leaderboard scores and hence requires differentiation. E.g. Mergers scoring consistently higher, often because of (intentional/unintentional) data contamination and its compounding effects and therefore a "merge" label is required, as comparison is exceedingly difficult.

In short: I believe at least SOME model authors upload their models to Hugging Face without following the discussion section, hence are unaware of model contamination, and even more unaware of which models might potentially be contaminated. They just pick the top-scoring models on the leaderboard, do some merging, and get a higher score... in those cases the "Submit your model here" section is the easiest and most crucial place to distribute knowledge. The "About" section is important too, but less so than "Submit your model here". In general, I call for the expansion of the existing knowledge base, or a newly introduced and very easy to access one.

Hugging Face H4 org

Thanks for the suggestion, good idea to put info there!
Can you put them in a dedicated issue? I'll try to do the edits this week :)

Hi @clefourrier @SaylorTwift , thanks for opening up the discourse on contamination in open source models. Is there a particular place we can contribute, discuss, or learn more about the effort to add contamination checks to the leaderboard?

Hugging Face H4 org

Hi @czhu2 ,
Thanks a lot for your interest!
We've put the efforts on hold at the moment, as we focused on the release of lighteval - it's likely we'll start them up again soon, and we'll open a new discussion :)
