Potential data contamination in the ultrafeedback-binarized and Nectar datasets

#474
by killawhale2 - opened

ultrafeedback-binarized and Nectar both contain data using TruthfulQA prompts.
For ultrafeedback, an issue was already raised in the past, https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard/discussions/361.
For Nectar, you can check the source attribute for data samples from TruthfulQA.
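If anyone wants to verify this themselves, here is a minimal sketch using the datasets library; it assumes the berkeley-nest/Nectar repo exposes a source column tagging where each prompt came from (e.g. truthful_qa), which may be stored as a string or as a list:

```python
from datasets import load_dataset

# Load the Nectar preference dataset (train split).
nectar = load_dataset("berkeley-nest/Nectar", split="train")

def from_truthfulqa(example):
    src = example["source"]
    # The "source" field may be a single string or a list of dataset names.
    if isinstance(src, str):
        src = [src]
    return any("truthful" in s.lower() for s in src)

tqa_rows = nectar.filter(from_truthfulqa)
print(f"{len(tqa_rows)} of {len(nectar)} Nectar samples are tagged as coming from TruthfulQA")
```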

Unfortunately, it seems many models on the leaderboard (particularly high-scoring 7B models) are affected as they either used the contaminated datasets or are merged (and further fine-tuned) from models that used contaminated datasets.

I hope the HF team can figure out a way to properly look into this issue.

What, you don't think a 7b Mistral created by a newbie with 2 followers is the best open-source model currently available? I think you're just jealous.

https://huggingface.co/rwitz2/go-bruins-v2.1.1

If this is true it would require flagging the following models:

jan-hq/trinity-v1 - merge contains viethq188/LeoScorpius-7B-Chat-DPO which was trained on Nectar
janai-hq/trinity-v1 - same as above

rwitz2/go-bruins-v2.1.1 - Based on jan-hq/trinity-v1 which is contaminated as mentioned above

rwitz2/go-bruins-v2.1 - Contains viethq188/LeoScorpius-7B-Chat-DPO which is trained on Nectar

GreenNode/GreenNodeLM-v3olet-7B - Contains rwitz2/go-bruins-v2.1.1 which is contaminated as mentioned above

GreenNode/LeoScorpius-GreenNode-7B-v1 - Model page no longer exists but I believe it contained the LeoScorpius or GreenNode models mentioned above. It should probably be removed regardless since it no longer exists.

upstage/SOLAR-10.7B-Instruct-v1.0 - Trained with Ultrafeedback_binarized
(They used a cleaned version)

rishiraj/meow - Finetuned from upstage/SOLAR-10.7B-Instruct-v1.0 which is contaminated as mentioned above
(Source model used a cleaned version)

viethq188/LeoScorpius-7B-Chat-DPO - Finetuned with Nectar

GreenNode/GreenNodeLM-7B-v2leo - Model page removed but I believe it contained LeoScorpius which is contaminated as mentioned above.

There may be more; these are only the ones with traceable roots that I found in the top 20.

Note: https://huggingface.co/upstage/SOLAR-10.7B-Instruct-v1.0 is trained using ultrafeedback-binarized-cleaned from AllenAI, which removed the TruthfulQA prompts from ultrafeedback.

Hugging Face H4 org

Hi! Thanks @HDiffusion for your work! I tested the models for contamination and found that they are indeed contaminated, at least on TruthfulQA. To do so, we used a tool that makes contamination testing easy; it is still a WIP and therefore not yet available, but here is a screenshot of the results:

[Screenshot: contamination test results, 2023-12-17]

Hugging Face H4 org

The models mentioned will be flagged, we will also start hiding flagged models by default to avoid losing readability on the Leaderboard.

@SaylorTwift Will it be possible to test all existing leaderboard models retroactively? Could this also become a default, automated procedure for every new model submitted?
It would be very nice if this test were run on every new model during evaluation. A score >0.9 would trigger a manual check and dismissal if contamination is confirmed. This would be a huge gain for the leaderboard's reliability. Sadly, at this moment, the leaderboard has lost its meaning. There is just no way that a 7B model is generally better than a 70B model. It is possible for very specific tasks, but not overall! It would be revolutionary if that were the case.
Likewise, it would also mean that all current 70B models are basically trash, which I doubt.

@Wubbbi If it helps, SaylorTwift said a little more about it in this discussion.

https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard/discussions/477

Hugging Face H4 org
•
edited Dec 17, 2023

Hi @Wubbbi , testing all models at the moment would require a lot of compute as we need individual logits which were not saved during evaluation. However, a way to do it would be to have a space where users could test suspicious models and report results by opening a discussion. If the model is in fact contaminated, we will flag it, and it will no longer appear on the leaderboard by default.
I opened a PR here implementing this and flagging models that we already know are contaminated.
I believe that with the help of the community we can make the leaderboard a great resource. As @Phil337 said, we need clear ways to inform users about the models that are displayed; all the discussions we had with you and the PR I just opened are a step toward this, so thank you for all the ideas and feedback :)

@SaylorTwift
I wonder if it would be a good idea to force model cards to clearly indicate their model lineage (e.g. which model was it fine-tuned/merged from) before submitting to the leaderboard.

Reasons:

  • From what I could gather, the current contamination detection technique requires a reference model, which means the model lineage information is needed for accurate testing.
  • It would make sense for models to clearly indicate their lineage, not only to give credit where it's due, but also to make contamination detection easier: one could test a single model from which many others were fine-tuned/merged (as is the case right now, I believe). Thus, one could test just a handful of popular models instead of having to test every suspicious model individually.
Hugging Face H4 org

@killawhale2 I think adding model lineage to model cards is a great idea - we could probably filter on this if it's added to the model metadata.

Tagging @davanstrien here as you worked on taxonomy and @Ezi as you worked on model cards, do you have inputs on this? (Is it already in model cards, could we add it easily?)

How about hiding models by default that were trained on a contaminated dataset? Basically in addition to flagging models, also flagging or black-listing contaminated datasets? Is there a way?

@clefourrier I merged two models together with mergekit (go-bruins-v2.1.1 with OpenChat). As go-bruins-v2.1.1 is flagged now, I want to delete my model that is currently being submitted: Fredithefish/Oryx-7B.
Is there a way to stop it from appearing on the leaderboard?

Also this one should also be flagged then:
[Screenshot of the model in question]

@Fredithefish That's just a drop in the bucket. There are a lot of models that need to be flagged. Thanks to the push-button ease of merge kits, people started merging the highest performers (a.k.a. the contaminated models) to get a slightly higher-scoring model, and so on. Plus, many applied additional fine-tuning with contaminated data to boost the scores a bit more, and those models were then merged as well. There are now mergers of mergers (8+ different models). Thankfully HF is taking steps to address this issue.

Hugging Face H4 org

@Fredithefish please open a dedicated issue for model removal :)

I wonder if it would be a good idea to force model cards to clearly indicate their model lineage (e.g. which model was it fine-tuned/merged from) before submitting to the leaderboard.

This can be done for single models via the base_model metadata. We're working on a better way of doing this for merges which should be ready fairly soon!
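As a rough illustration (not an official tool), something like the sketch below could already walk a declared lineage via base_model with huggingface_hub; it only works where submitters have actually filled in the metadata, and the repo id at the end is just an example:

```python
from huggingface_hub import model_info

def trace_lineage(repo_id, depth=0, max_depth=5):
    """Recursively print the base_model ancestry declared in a model card."""
    print("  " * depth + repo_id)
    if depth >= max_depth:
        return
    try:
        info = model_info(repo_id)
    except Exception:
        print("  " * (depth + 1) + "(model page unavailable)")
        return
    # Attribute name differs across huggingface_hub versions.
    card = getattr(info, "card_data", None) or getattr(info, "cardData", None)
    bases = card.get("base_model") if card else None
    if not bases:
        return
    if isinstance(bases, str):
        bases = [bases]
    for base in bases:
        trace_lineage(base, depth + 1, max_depth)

trace_lineage("rishiraj/meow")  # example repo; prints whatever base_model chain is declared
```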

rwitz2/go-bruins-v2.1.1 works better than all the 7B models I've tried. I wonder how much the results would have changed if there were no traces of TruthfulQA prompts in this model. It's sad that the only solution is to remove it from the leaderboard.

Hugging Face H4 org

@Slayery technically, we don't need to remove it from the leaderboard; we can just flag it so that users know the scores are likely better than they should be.

Hi guys, can someone explain to me, as if to a four-year-old, what this contamination means? I am trying to follow the discussions, but I only understand that some models are pre-trained in a way that lets them cheat the leaderboard tests?
I use the models mostly for chit-chat in some fantasy games, mostly in SillyTavern. I have tested many models up to 30B. Mostly I use 13B models because they are fast for me.
But now I have downloaded many of the 7B models and they perform better than the 20B models. This rwitz2/go-bruins-v2.1.1 is the best of all (for my needs!), followed by quantumaikr/quantum-dpo-v0.1.
Solar is slightly worse in its phrasing and follows the story text less closely. The other models I used to use are much worse than the current 7B models.
It's hard for me to understand what's bad now, given that these contaminated models are better than anything I've known before.
Or are these problems more relevant for people who merge the models and thus create a cascading, ever-increasing error?
Thanks.

Hugging Face H4 org
•
edited Dec 18, 2023

Hi @LeChuck76 ,
Contamination is when people use (some or all of) the test set to train their models (deliberately or accidentally) - this is bad because it gives them inflated scores on the evaluations we use for the leaderboard.
It's as if, in class, you had learned the answers to a specific math test by heart - it does not mean you'll be able to generalize, that is, do as well on similar problems you have not seen before.

This is a problem, because

  • we use the scores of models as proxies for their performance on specific tasks: GSM8K, for example, is a proxy for math abilities - but if the model is repeating answers it has already seen, it proves nothing other than "models are good at repeating what they have already seen", which we know.
  • some people also want to compare models using their scores, and if models are "cheating", then the ranking is useless.

Depending on your use case (as you might have seen), it may not be a big deal, if the capabilities that the tests measure are not of interest to you - for fantasy game dialog, I assume you don't really care about how well your models do at math (GSM8K), or how well they understand, say, real-world economics or physics (MMLU).

@LeChuck76 They usually aren't cheating. Unless you're very careful, some of the hundreds of thousands of examples used for fine-tuning LLMs overlap with some of the questions in standardized tests (contamination), artificially inflating their scores.

If you're just chatting with LLMs then 7b Mistrals are usually just as good, assuming they're fine-tuned for chatting. Chatting almost never requires the underlying power of LLMs (e.g. coding, math and reasoning).

And in short, what's bad is that the contaminated mergers are far less powerful than Yi-34, Llama 70b and Mixtral LLMs, yet are burying them on the leaderboard.
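To make the accidental-overlap point concrete, here is a rough sketch of the kind of check a fine-tuner could run before training; the dataset names are just examples, and the exact-match comparison would need to be loosened to n-gram matching to catch near-duplicates and paraphrases:

```python
from datasets import load_dataset

# Benchmark prompts to guard against (TruthfulQA questions).
tqa = load_dataset("truthful_qa", "generation", split="validation")
tqa_questions = {q.strip().lower() for q in tqa["question"]}

# Example fine-tuning set; swap in whatever dataset you actually train on.
train = load_dataset("HuggingFaceH4/ultrafeedback_binarized", split="train_prefs")
train_prompts = {p.strip().lower() for p in train["prompt"]}

hits = train_prompts & tqa_questions
print(f"{len(hits)} training prompts match a TruthfulQA question verbatim")
```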

@clefourrier thanks for the answer. That's bad for automated testing. In that case you can only use, like in school, some randomized black-box tests, and that's not what "open" means. It's a shame that there are always people who make you have to think about things like this.

@clefourrier @SaylorTwift In my opinion more models need to be flagged. I might be wrong but here is a list along with mapping the root cause:

  1. EmbeddedLLM/Mistral-7B-Merge-14-v0.2
    Merged version of janai-hq/trinity-v1 which is already flagged and contaminated.

  2. AIDC-ai-business/Marcoroni-7B-v3
    Mysterious DPO of Q-bert/MetaMath-Cybertron-Starling, which is a merge of fblgit/una-cybertron-7b-v2-bf16, a version of which is already flagged as contaminated. Also check https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard/discussions/444. I also suspect GSM8K contamination.

  3. Toten5/Marcoroni-neural-chat-7B-v1
    Merge version of AIDC-ai-business/Marcoroni-7B-v3

  4. Toten5/Marcoroni-neural-chat-7B-v2
    Merge version of AIDC-ai-business/Marcoroni-7B-v3

  5. mindy-labs/mindy-7b
    Merge version of Toten5/Marcoroni-neural-chat-7B-v2

  6. jan-hq/supermario-v2
    Merge version of AIDC-ai-business/Marcoroni-7B-v3

  7. jan-hq/supermario-slerp
    Merge version of AIDC-ai-business/Marcoroni-7B-v3

@rishiraj Yes, Marcoroni and everything that includes it needs to be removed, or at least tested. It's already been flagged a while back.

https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard/discussions/471

It scores 5 points higher on TruthfulQA than the model it was mysteriously DPO fine-tuned on. That's simply not possible. Plus, it scored worse in my personal testing than the model it was built from, MetaMath-Cybertron-Starling, which, as you said, is itself contaminated (Cybertron).

@LeChuck76 yes, I also noted that for my tasks this model is the most suitable. It copes best with tasks like these for me - creating surveys on a given topic, dialogues (including understanding the emotions a person experiences while writing), understanding humor (it's the best model of all time at this), and that's not even everything.

@Slayery This isn't about Bruins not performing. It also performed well for me, but Trinity v1 did better (also contaminated). All this means is that the scores are notably higher than they should be. For example, it isn't unusually smart (ARC of 72), truthful (TruthfulQA of 70) or good at math (GSM8K of 71). Other Mistrals do better in these areas despite having scores that are 5+ points lower on the respective tests. Its real score is not 75, but closer to 66, which is still very good for a Mistral 7b.

@Phil337 Well, this is your experience, but my experience is that the model I mentioned above was and is superior to those that were released before and have a lower rating. For my purposes, this model is really unusually smart at understanding emotional context or humor, as I already wrote. Previously, I used the MetaMath-Cybertron-Starling model, which also showed excellent results compared to models with lower test results, but it was also mentioned here as "contaminated". It would be interesting to know at least what percentage of these TruthfulQA prompts were in the dataset.

@Slayery You're right. They are better. The previous leaders like Dolphin, Zephyr, OpenHermes, Neural and Starling are too one-dimensional (they focus way too much on SFT, DPO, multi-turn chat, RLAIF/RLHF...). Each one fails miserably at certain things, like storytelling, humor and multi-turn conversations. Many mergers like Bruins don't have notable blind spots, but they also lose something (e.g. they can't solve logic problems or code as well as their parent models). In short, Bruins is more balanced, but technically isn't any better, and it's not nearly as good as Mixtral Instruct (72.6), which got most of my hard questions right, while all Mistrals, including Bruins, got nearly all of them wrong.

It's not about whether a model performs well or not; I have no doubt benchmark data also makes excellent training data. Still, in a system of (implied) absolute ranking, we must at least make sure that the models compete on even ground. Contamination takes away from that objectivity, and therefore warrants a disclaimer.

I also agree that flagging a model and hiding them behind a filter may gather too much suspicion and negativity around them. I struggle to think of a fairer way to represent them, however.

I'd personally propose a filter by model family, tracing back and tagging the earliest fine-tunes that were NOT merges, and/or those debuting novel techniques published in self-released papers (merges should otherwise carry these base fine-tunes as tags). Models like Starling or OpenChat, despite being contaminated, are very robust and respectable. They should, ideally, receive some visibility on the leaderboard. I feel strongly about the lack of credit we've given those teams on this platform, burying their painstaking efforts under layers upon layers of derivations. The visibility from having your models acknowledged upfront in the UI would go a long way.

Nice, excellent stuff. Self-serving contamination tests... and most importantly, we know one that is not contaminated: SOLAR. Enjoy UNA-SOLAR, already available on the Hub.

@Phil337 > Presented with 10 photos of UFOs, 1 of them real, 9 of them fake... all 10 are fake :) Basics of manipulation and information. There is one guy who wrote something like "I don't care about scores or whatever; I have a task, I have my tests based on my tasks, and that's what matters to me".

I personally think that if everyone around contributed a few rows, we could compose a decent evaluation dataset. But most importantly, we need to come up with SOLID mechanisms to disarm contaminated models during evals. Believe it or not, there are many victims here, and those responsible aren't even on the board... 10 UFOs :)

@clefourrier @SaylorTwift Any update on flagging the additional models I listed above?

Hugging Face H4 org
•
edited Dec 20, 2023

Hi @rishiraj ,

Please don't re-ping us for a question you asked less than a day ago, we're doing our best to do things fast but we don't own timeturners (yet ^^).

@SaylorTwift , leading our contamination detection efforts, is off on Tuesdays, so he'll come back to it when he has the time after catching up.

Hugging Face H4 org

@Mino24she I really like your idea about displaying "parents" of model lines more prominently - not sure how feasible it is but I'll keep it in mind

I feel like the discussion has somewhat expanded beyond the original context.
Thus, I would like to close this discussion and centralize the conversation on general data contamination checks in discussion 472 or discussion 265.

killawhale2 changed discussion status to closed


@SaylorTwift Great work.
I tried running the gsm8k feature fork of Detect Pretrain Code Contamination on jan-hq/trinity-v1 with mistralai/Mistral-7B-v0.1 as the reference model:

GSM8K
result < 0.1, %: 0.95
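For anyone unfamiliar with the tool, my understanding is that it builds on the Min-K% Prob idea: score each benchmark example by the average log-probability of its least likely tokens under the suspect model, compare against a clean reference, and count how often the suspect scores suspiciously higher. A simplified sketch of that per-example scoring (not the tool's exact statistic; the sample string is just a stand-in, not a real GSM8K item):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def min_k_logprob(model, tokenizer, text, k=0.2):
    """Mean log-prob of the k% least-likely tokens of `text` (Min-K% Prob heuristic)."""
    ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)
    with torch.no_grad():
        logits = model(ids).logits
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)          # distribution over the next token
    token_lp = logprobs.gather(1, ids[0, 1:, None]).squeeze(-1)   # log-prob of each actual next token
    n = max(1, int(k * token_lp.numel()))
    return torch.topk(token_lp, n, largest=False).values.mean().item()

suspect = AutoModelForCausalLM.from_pretrained("jan-hq/trinity-v1", torch_dtype=torch.float16, device_map="auto")
reference = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1", torch_dtype=torch.float16, device_map="auto")
tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

sample = "Question: A baker bakes 12 loaves a day for 7 days. How many loaves? Answer: 84."
print(min_k_logprob(suspect, tok, sample), min_k_logprob(reference, tok, sample))
```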

Can someone here flag the following model:
dillfrescott/trinity-medium
It's based on the flagged Trinity model from Jan HQ. The creator wasn't aware that the original model had been flagged.

CultriX/MistralTrix-v1 is apparently based on zyh3826/GML-Mistral-merged, which is a merge of quantumaikr/quantum-v0.01 and mncai/mistral-7b-dpo-v5, both of which are flagged with a link to this discussion.
Cf.: https://huggingface.co/CultriX/MistralTrix-v1/discussions/6
Hence, CultriX/MistralTrix-v1 should probably also be flagged?

Hugging Face H4 org

Hi @Venkman42 and @kno10 ,
Can you open specific discussions for model flagging so it's easier to trace flags?

@clefourrier I do not understand where to properly propose flags. I could not even find a working search to check whether a model has already been mentioned. Hugging Face "discussions" are a mess and not up to this task, in my opinion.

Contamination detection should be automated anyway; then this would no longer be necessary.

Hugging Face H4 org

@kno10 The simplest for us is if you open a new discussion called "[FLAG] Model name for reason X"; that way we can redirect users to sub-discussions depending on the flag.
I agree that searching in discussions is not trivial ^^""

For contamination detection, we have an effort ongoing, but it's still an open research problem - we are exploring different strategies with different research teams in order to find the most efficient tool.
