Brainstorming: Call for a Time-Sensitive, Rolling-Update Benchmark Crowdsourced by the Community

#481
by JosephusCheung - opened

The existing benchmarks with fixed questions are destined to become obsolete due to Goodhart's Law. If we can collect multiple-choice QA benchmarks through community crowdsourcing, similar to OpenAssistant, and have GPT evaluate and classify the submissions, we could create a new benchmark that is updated on a seasonal or monthly basis, mitigating issues such as cheating and potential data contamination.

Crowdsourced questions can be assigned as tasks to the participants in the evaluation, building an automated annotation pipeline in which the only management burden is selecting the benchmark questions that are ultimately released.

I believe that in this sort of benchmark, data contamination is no longer a critical factor for the results.

Why do I oppose data contamination detection as the sole negative indicator? Because it's challenging to equate data contamination with the quality of the model. Having no contamination is good, but its absence has no inherent connection to the model's capabilities.

Furthermore, we cannot detect all types of cheating. We know that direct training on the test set, and training on rewritten versions of it, can be detected, even though the process is somewhat complex. However, if a model is deliberately pre-trained on benchmark-like questions retrieved by semantic and keyword search over a large corpus (CC, RefinedWeb), I believe the resulting 'unfair' improvement, and the underlying bias in the dataset distribution, cannot be detected.

Hugging Face H4 org
edited Dec 18, 2023

(Hahaha I hadn't seen your issue when I created this one! Moving my suggestions here :) )

Following discussions on Twitter with @JosephusCheung and the above points, I've been thinking about a hard-to-game leaderboard that we could implement.

Let's imagine we get a community-sourced dataset of 500 non-trivial multiple-choice QA questions (because loglikelihood evals are less costly), that is user-built but not public.

We could split it into 40 questions per month (20 questions every two weeks) and, either every month or every two weeks, evaluate our models on this "vibes" dataset. The questions and model scores would only become public at the end of an evaluation period (so once every two weeks, or once a month).

It would allow us to have a rolling score on all models and to see when models are trying to game the benchmark: if a model submitted in March retrospectively gets very good results on the January-to-March questions but bad ones from April onward, it's probably cheating. We could display the average best + min best at the end of the year.
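To make that retrospective signal concrete, here is a minimal sketch of the check; the score layout, dates, and the 0.15 gap threshold are all hypothetical, not an agreed design.

```python
# Hypothetical sketch: flag models whose scores on evaluation periods released
# *before* their submission date are much higher than on periods released after it.
from datetime import date

def flag_suspicious(period_scores: dict[date, float], submitted: date, gap: float = 0.15) -> bool:
    """period_scores maps each evaluation period's release date to the model's accuracy."""
    before = [s for d, s in period_scores.items() if d < submitted]
    after = [s for d, s in period_scores.items() if d >= submitted]
    if not before or not after:
        return False  # not enough history on either side to judge
    # A large drop on questions the model could not have seen suggests gaming.
    return sum(before) / len(before) - sum(after) / len(after) > gap

# Example: strong on Jan-Mar questions, weak from April onward -> flagged.
scores = {date(2024, 1, 1): 0.90, date(2024, 2, 1): 0.88,
          date(2024, 3, 1): 0.91, date(2024, 4, 1): 0.55}
print(flag_suspicious(scores, submitted=date(2024, 3, 15)))  # True
```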

The compute should be less costly if we have very few questions, but we risk having some months where the signal is bad (because a question or two are broken, for example) - that would still be an issue to figure out.

Tbh, this would have to be a leaderboard parallel to the Open LLM one, and I'm not sure how we could manage the compute atm; plus, I would only have bandwidth for this from February at the earliest. But it's a draft of a direction we could go in. Wdyt folks? Any suggestions?

clefourrier changed discussion title from Call for a Time-Sensitive, Rolling-Update Benchmark Crowdsourced by the Community to Brainstorming: Call for a Time-Sensitive, Rolling-Update Benchmark Crowdsourced by the Community

Here is an example of a subtle, high-quality bench, and I like it very much: https://benchmarks.llmonitor.com/prompts (some seemingly easy questions turned out to be really hard; few OSS models passed, but all the commercial ones did.)

But I think there are some things we should consider:

  1. Perhaps we don't need to keep the quantity so small. The number of users submitting evaluations, and of those paying attention to them, can be enormous. If we can assign tasks to those who submit models while also collecting questions from volunteers who only observe, we can obtain both a substantial quantity of questions and real community engagement.

  2. The types of tasks could possibly be further subdivided. I believe OpenAssistant is a good example, but its form can be simplified. We can use GPT or other open-source models to assess user-submitted questions and answers. The platform can introduce variety by providing seed tasks or randomly assigning task categories, collecting a wide range of tasks labeled by category.

  3. We require a certain level of expertise in domain tasks, which is a challenge compared to existing evaluations. Especially in non-CS-related fields such as social sciences and medicine, specialized tasks may require dedicated pipelines to ensure quality. It's challenging to rely entirely on volunteers, and perhaps consideration should be given to paid annotators.

This is not an easy task, but I believe it can form a paradigm. All things are difficult before they become easy.

Hugging Face H4 org

Regarding 1, the biggest problem is the compute cost - unless there is a possibility to crowdsource that too.

Alternatively, do we really need historical participants? If it's a brand new leaderboard, it could alleviate the burden. Things change rapidly, and people should be encouraged to proactively submit models. Then the amount of data we can collect is also proportional to the level of participation. I believe this is reasonable.

I propose an idea to improve the evaluation of natural language models within the Hugging Face leaderboard. My proposal is based on the following points, which I leave here for the community to consider:

  • Create an internal benchmark with questions from various domains, of great value and originality to the community, provided by the community;
  • Use a trusted model, such as GPT-4, to classify the questions by domain and select which questions to evaluate per domain;
  • The Hugging Face team evaluates and analyzes the selected questions, verifying their quality and difficulty;
  • Provide a brief description of the dataset and the evaluation criteria to the community, without revealing the questions;
  • Implement a security or encryption system to protect the internal benchmark from possible leaks;
  • Avoid security errors that would let the benchmark leak, as this may influence or bias the models;
  • Use a selection or random-draw algorithm over the questions by domain to test the candidate models, reducing the computational cost (a rough sketch follows below);
  • Allow the candidate models to be evaluated on the internal benchmark only after passing the current evaluation.
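On the random-draw point, here is a minimal sketch of a domain-stratified draw; the question schema and the per-domain count are assumptions for illustration, not part of the proposal above.

```python
# Hypothetical sketch: draw a fixed number of hidden questions per domain for each
# evaluation period, so no run needs the full pool and every domain stays represented.
import random
from collections import defaultdict

def sample_by_domain(questions: list[dict], per_domain: int, seed: int) -> list[dict]:
    """Each question is assumed to look like {"domain": "medicine", "prompt": ..., "answer": ...}."""
    by_domain: dict[str, list[dict]] = defaultdict(list)
    for q in questions:
        by_domain[q["domain"]].append(q)
    rng = random.Random(seed)  # fix the seed per period so every model sees the same draw
    drawn: list[dict] = []
    for pool in by_domain.values():
        drawn.extend(rng.sample(pool, min(per_domain, len(pool))))
    return drawn
```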

I don't think a closed benchmark would change the status quo; people's over-optimization, per Goodhart's Law, won't change this way - they will still merge the top-performing models and then pick a better one to merge again. Either the benchmark is inherently immune to illegitimate optimization due to its complexity, or it undergoes extensive and comprehensive testing, or it adapts through a progressive rolling benchmark - these are the solutions I can currently think of.

I thought that @Olofp made a good point in this thread. We should broaden the scope of benchmarks so that they test many different questions across many different subjects, making them harder to overfit to. MMLU already does this, and maybe that's why it is practically immune to fine-tune overfitting. Just make an MMLU that tests all the skills people actually want from an LLM and you've got a great benchmark. For good measure, you could make it a black box.

@JosephusCheung Yeah I think the complex extensive testing makes more sense. This leaderboard tests what, like 30 models a day? No way would a rolling benchmark work, needing to have the compute to retest like 4,000 models every 2 weeks/month.

Labeling LLMs appropriately (e.g. as mergers), providing the ability to filter them out (e.g. hide mergers), and testing suspicious LLMs for contamination should be good enough to keep the leaderboard tidy. Perfect scoring isn't possible or required, and the leaderboard is already looking much better.

A rolling, hidden, community contributed... benchmark is good in theory, but it's more of a stage 2 option if the aforementioned fails.

And if a rotating offline test is made the results probably shouldn't be made public. Instead, the test should be used internally by HF to identify LLMs with suspiciously high scores on MMLU, Arc, TruthfulQA... triggering an automatic investigation and contamination testing.

@Phil337 I agree with this. I know it's a hard measure, but it's a card on the table. Not making the results public is a good point, too. And a quarantine period for testing suspicious LLMs for contamination could bring more stable results to the leaderboard.

this is a nice solution, factor in crowd bias and we're game.

I very much like the idea of a crowdsourced benchmark, but who is fact checking the questions and their answers?

If questions were user-verified, the benchmark would cease to be closed. A rolling benchmark like clefourrier mentioned would indeed catch models trying to game the benchmark, but it does not take away the fact that we'd nonetheless be creating a very interesting dataset to train on, which could be captured by pretending to verify questions. (Is this something we want?)

If they are not user-verified, how do we verify them? Or should we just let erroneous questions slide, claiming they won't impact performance that much? Manual verification by a team is either time-consuming or costly, and still prone to errors when questions fall outside the verifier's expertise.

To avoid running all the thousands of models, I would make model creators/users specifically request that a model be tested in the next benchmarking run and for the following 3 months or so.

Another note, and I hate to be that guy, but what about the legal aspects of this benchmark? The users create it, and it's now Hugging Face's dataset? Meaning they could do what they want with it. (It's up to the individual to decide whether or not that's fine for them.) I think we should, however, be aware of what we are getting into. This benchmark would be useful only here on this site, and if it is ever compromised, we wouldn't have an alternative. It's still a good solution for now, but I don't think that it's fool- or futureproof. I guess we'll get there when we get there.

I'd like to have GPT do the quality inspection and only require a small amount of manual management. Maybe there should be duplicate filtering and anti-spam measures to prevent someone from constructing fake data.
As for whether the benchmark is closed, my idea is to make the outdated test data public while updating the benchmarks on a rolling basis. If someone wants to submit their model, they must also make a small contribution to the next benchmark, maybe one or two test entries. I believe that, with the combined influence of user moral standards and GPT scrutiny, erroneous or intentionally harmful data will be kept to a minimum, requiring only minimal effort from the mods.
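A duplicate filter does not need to be sophisticated to catch the obvious cases; here is a minimal sketch based on word-shingle overlap (plain-string questions and the 0.8 threshold are assumptions for illustration).

```python
# Hypothetical sketch: reject a submitted question if it is an exact or near
# duplicate of anything already accepted, using Jaccard similarity of word shingles.
import re

def _shingles(text: str, n: int = 3) -> set[tuple[str, ...]]:
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}

def is_near_duplicate(candidate: str, accepted: list[str], threshold: float = 0.8) -> bool:
    cand = _shingles(candidate)
    for prev in accepted:
        prev_sh = _shingles(prev)
        union = cand | prev_sh
        if union and len(cand & prev_sh) / len(union) >= threshold:
            return True
    return False

print(is_near_duplicate("What is the capital of France?",
                        ["what is the capital of france"]))  # True
```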

we're data scientists, right? Let's train a strong model for fact-checking and bias control of the questions people create; that model could be updated frequently to match Hugging Face's needs. From there we could have a pipeline that takes the data and checks it: if it's good, it goes through; if it's bad, it doesn't (a rough sketch follows below). A human supervisor can then choose the best related items for a small, handy dataset covering a certain time period.
This is a really rough idea, but I think automating some of the work in this regard would go a long way.
I don't think we have any other way than a "dynamic benchmark", as capabilities are forever growing and changing.
We can further set out the proper rules for this "dynamic benchmark".
As for the people who send in these questions, it could either be the people who submit models for benchmarking, as their contribution, like @JosephusCheung said, or it could be crowdsourced on the condition that users give their contributions under a CC0 license; Hugging Face can later choose whether to open-source the extra unused data from older months, and can ask the users again if it needs to be picky about the CC0 license.
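As a rough sketch of that gate (the fact-checking score function and the 0.8 threshold are hypothetical placeholders, not an existing pipeline):

```python
# Hypothetical sketch: an automated gate keeps only submissions a checker model scores
# highly; a human supervisor still picks the final periodic set from the survivors.
from typing import Callable

def triage(submissions: list[dict], score_fn: Callable[[str], float],
           accept_at: float = 0.8) -> tuple[list[dict], list[dict]]:
    """score_fn is any callable returning a 0-1 quality/factuality score for a question."""
    kept, rejected = [], []
    for item in submissions:
        (kept if score_fn(item["question"]) >= accept_at else rejected).append(item)
    return kept, rejected  # `kept` goes to human review; `rejected` is dropped or re-queued
```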

why not GPT?! I don't trust closed source AI. lol. sorry.

Hugging Face H4 org
edited Dec 20, 2023

I think there are several directions being explored:

  • should we extend the capability measures of the models, or just build a well-rounded (but brand-new) dataset on tasks that are already well known?
  • should the evaluation be generative or even multi-turn (which would allow us to actually test what interests people, like the llm monitor benchmark does), or multiple-choice (which is considerably easier computationally speaking)?
  • should we rely on GPT-like models to rank community-provided questions (then you rely on a closed API), or on people from the community (then some people know what is in the dataset)?

My personal answers to these questions would be: build a well-rounded but new dataset, with multiple-choice evals so they are not too costly, manually checked by people from the community, covering evals for a scope of less than a year, but ofc in a rolling fashion.
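For context on why multiple-choice is the cheap option: each item needs only one forward pass per answer option (no sampling), scoring the loglikelihood of each option and taking the argmax. Below is a minimal sketch; `gpt2` is just a stand-in model, this is not the leaderboard's actual harness, and it assumes each choice tokenizes cleanly after the prompt.

```python
# Hypothetical sketch of loglikelihood-based multiple-choice scoring.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal LM works the same way
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def choice_loglikelihood(question: str, choice: str) -> float:
    prompt_len = tok(question, return_tensors="pt").input_ids.shape[1]
    full_ids = tok(question + " " + choice, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)  # prediction for each next token
    targets = full_ids[0, 1:]
    sl = slice(prompt_len - 1, None)  # positions that predict the choice tokens only
    return log_probs[sl].gather(1, targets[sl, None]).sum().item()

def answer(question: str, choices: list[str]) -> str:
    return max(choices, key=lambda c: choice_loglikelihood(question, c))

print(answer("The capital of France is", ["Lyon", "Paris", "Marseille", "Nice"]))
```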

I think doing a v0 of such a project would allow us to see:

  • if more people want to participate in such an initiative on a rolling basis
  • if it actually gives the signal we expect and is worth a bigger/collective effort

@Yure-Pleb re trusting HF: that's a good point, it's always important to think about who the actors are. The goal would be to release the dataset after each eval period, so it would be public in the end - but yes, that means trusting HF (or another partner, if you'd like that better) for a year
@JosephusCheung I really like the "submit your model, submit your question" idea, but we would have to store who submits what to avoid bad faith people
@Phil337 re labelling correctly - yep, we are adding more filters on model cards :)
@Maani (me neither)

I believe you should trust the enthusiasm of the community. Such crowdsourced data could not only fulfill the evaluation task but also serve as a valuable source of synthetic training data - using an approach similar to MetaMathQA's. We could then naturally focus training on specific tasks with this data and avoid accusations of possible contamination.

This way, we can treat all outdated test sets as trainable content that has entered the public domain and is also retrievable through data collection channels like web crawlers. This avoids stigmatizing data contamination excessively - I have always believed that it's the benchmark's outdatedness that should be corrected, rather than something to be actively avoided during data collection. Emphasizing the exclusion of contaminated data inevitably means some high-quality data cannot be actively trained on for reasons that have nothing to do with copyright, which is regrettable.

In other words, our goal is not to identify who should be ashamed of data contamination, but to eliminate the impact of data contamination on evaluations - by using new, continually updated benchmarks.

This has also been brought up in the LocalLlama subreddit https://www.reddit.com/r/LocalLLaMA/comments/18mnm7n/creating_a_blackbox_leaderboard/

Can we have the best LLM of the day generate question-answer pairs? It can be template-based: take existing questions and make templates out of them. The LLM will change the words and numbers around, but it will still essentially be the same question.

Contamination is not something anyone can really control, especially if you didn't craft your dataset yourself or used crawlers and automation to generate it; that's why eval datasets must be totally closed-source to ensure the highest possible degree of true examination of a model's capabilities. @JosephusCheung
just airing out my ideas on @clefourrier 's questions:

  • should we extend the capability measures of the models, or just build a well-rounded (but brand-new) dataset on tasks that are already well known?
  • we should use extended capability measures based on the model architecture and the capabilities presented by the authors; for current models, a brand-new dataset would suffice. For a multimodal model, they will not work as well. Rolling updates to the eval dataset fix both of these situations.
  • should the evaluation be generative or even multi-turn (which would allow us to actually test what interests people, like the llm monitor benchmark does), or multiple-choice (which is considerably easier computationally speaking)?
  • it must not be generative at the beginning, as there's still a chance of contamination for the model that's going to generate these evals: either we build such a model from scratch on an empty transformer, or it still has a chance of contamination. To do so, we could use data generated by users, cleaned and prepared by professional data scientists, to train the model; that model MUST be specific to Hugging Face only, and that's alright, since Hugging Face is the open-source AI platform and we're all going to use it at some point anyway.
  • should we rely on GPT-like models to rank community-provided questions (then you rely on a closed API), or on people from the community (then some people know what is in the dataset)?
  • no sir, not at all =) as explained above.
  • if more people want to participate in such an initiative on a rolling basis
  • a reward scheme for participating researchers - evals of their models, or even a fair amount of compute points - would be very helpful.
    just trying to help and give back to the hugging face community here, nothing more. peace.
    PS: oh btw, if we build such a model, it becomes an attack vector for bad actors; making sure a strong cryptographic scheme protects the model, or even keeping it entirely offline at Hugging Face HQ, would be a good solution. Please do add any other solutions I might be missing here; imho this is actually very important.

I believe it might be useful to have a few different categories of questions, perhaps including multi-turn. But maybe we should split them up, and create each category when the previous one is fully functional. My thoughts:

  • I'd start off with only multiple choice. I do suggest using a variant which lets the model reason before answering; this might be the best-performing approach while remaining relatively easy.
    i.e. \n <model's reasoning> \n <model's final answer>, where it should follow a certain (simple) template for the answer, like "Answer: A" or "Result: 10"
    If you just take the last line of the model's response, it should be fairly straightforward to get the answer back; a rough parsing sketch follows after this list. (I think this idea is very similar to something AI Explained mentioned in one of his videos.) This would mean no arbitrary ranking.

  • We should establish what to trust from models. I do not trust models grading each other, for example. Perhaps I could trust a verifying model to grade a response 0 or 1 based on very simple given criteria, which could look like this:

    • Given a question where we are comparing 2 elements of a list, we could ask the verifying model: "If the answer prevents a value x from getting compared to itself, grade it 1, else grade it 0; what is the grade for this answer: {answer}". This criterion checks whether the answering model has picked up on the nuance of not comparing a value to itself. (This assumes a naive double for-loop was used, but I hope I got the point across.)

    Attaching these grading questions to a benchmark question does make it harder to create, but I believe this more accurately shows the capabilities of a model. It also does not require a stronger model to verify, as the creator of the question has done the hard work by writing the criteria. (These questions will be hard to verify, as grading criteria can sometimes be debatable.) Ideally, this checks that the steps the model takes actually make sense.

  • I would hold off on multimodal for now, I think the same principles would apply, just with more steps.

  • Regarding a reward-based scheme, of course this would be nice. But (if, as @clefourrier mentioned, they release the dataset) HF is not really gaining anything from this (only, like, credibility in model eval?), and I doubt they have much compute to just give away to users for creating a benchmark that they themselves will also run for nothing. (But maybe someone wants to sponsor? 😉)
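Here is the parsing sketch referred to in the multiple-choice point above; the "Answer:"/"Result:" prefixes are just the template suggested there, so the regex would be adjusted to whatever template is finally agreed.

```python
# Hypothetical sketch: let the model reason freely, then read only the last
# non-empty line and extract the templated final answer, so no judge model is needed.
import re

def extract_answer(response: str) -> str | None:
    last_line = next((line for line in reversed(response.strip().splitlines()) if line.strip()), "")
    match = re.search(r"(?:Answer|Result)\s*:\s*(.+)", last_line, flags=re.IGNORECASE)
    return match.group(1).strip() if match else None

reply = "The list has four items, and we should not compare a value to itself...\nAnswer: A"
print(extract_answer(reply))  # "A"
```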


(Just a thought I had) While the goal of this thread is to create a benchmark, it could easily expand into a full-fledged dataset.
For example:

  • "What is the capital of {Country}? {Options}" And replace that with values from something like:
    • [{ "Country": "France", "Options": ["Paris", "Lyon", "Marseille", "Nice"], "Answer":"Paris"},
    • {"Country": "Germany", "Options": ["Cologne", "Munich", "Berlin", "Hamburg"], "Answer": "Berlin"}]
      If you then ask a model to rephrase the question, you get many more slightly different questions. (A sanity check would be to ask a different model if the question still makes sense.) Now, if you could somehow easily generate these entries for fields like math (just a function), relatively little work gives you a large amount of reasonably high-quality, (kinda) synthetic data. (I don't know how useful this is; it is probably already being done.) (Pretty much like @DaniCar mentioned.)
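A minimal sketch of that template expansion, using only the field names from the example above:

```python
# Hypothetical sketch: one parametrised question plus a small value table
# expands into many concrete multiple-choice items.
template = "What is the capital of {Country}? Options: {Options}"
rows = [
    {"Country": "France", "Options": ["Paris", "Lyon", "Marseille", "Nice"], "Answer": "Paris"},
    {"Country": "Germany", "Options": ["Cologne", "Munich", "Berlin", "Hamburg"], "Answer": "Berlin"},
]

questions = [
    {"prompt": template.format(Country=r["Country"], Options=", ".join(r["Options"])),
     "answer": r["Answer"]}
    for r in rows
]
for q in questions:
    print(q["prompt"], "->", q["answer"])
```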

As mentioned before, a CC0 license on the data provided by users means it's in the public domain, but since it is provided for Hugging Face, there could be a term specifying that this data belongs to Hugging Face after being collected. Whether the data is actually used is up to the human experts who evaluate it for model training.
As for a "sponsor" for compute or evaluation points, I don't think any is needed, because data collection would be periodic, and for each data point collected a fraction of compute or evaluation points could be assigned.

There's been a few suggestions about generating some questions from existing LLMs.

A hard truth: LLMs are just public datasets with some jitter. A black-box evaluation must arise from a novel distribution; otherwise we're just guessing to what extent the evaluation is polluted.

Hugging Face H4 org

For those interested, last week saw the introduction of the NPHardEval leaderboard, which uses a dynamic benchmarking system with automated question generation.

clefourrier changed discussion status to closed
