💬 Discussion thread: Model contamination techniques 💬

#472
by clefourrier HF staff - opened
Hugging Face H4 org

This is a thread to share resources and discuss model contamination techniques.

@clefourrier Do you know if https://github.com/swj0419/detect-pretrain-code works out of the box? I'm aware you used it for the flagging of fblgit/una-xaberius-34b-v1beta using the GSM8K benchmark.
I'm working on a space where people can submit models to be tested for contamination; however, I'm unsure whether the tool is reliable out of the box, or whether tweaking its settings would do any good.
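
For context, my understanding is that this repo implements the Min-K% Prob membership test from Shi et al. (2023). Here's a minimal sketch of the core idea I'm working from, not the repo's actual CLI or scoring; the k value, the cutoff, and the model names below are placeholders for illustration.

```python
# Minimal sketch of the Min-K% Prob idea (Shi et al., 2023): a sample looks
# suspicious if the target model assigns unusually high likelihood even to its
# rarest tokens, compared to a reference model. Not the repo's actual
# interface; k, the cutoff, and the model names are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def min_k_prob(model, tokenizer, text, k=0.2):
    """Mean log-prob of the k% least likely tokens of `text` under `model`."""
    enc = tokenizer(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        logits = model(**enc).logits
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    token_lp = log_probs.gather(1, enc.input_ids[0, 1:].unsqueeze(-1)).squeeze(-1)
    n = max(1, int(len(token_lp) * k))
    return torch.topk(token_lp, n, largest=False).values.mean().item()

def load(name):
    tok = AutoTokenizer.from_pretrained(name)
    mdl = AutoModelForCausalLM.from_pretrained(
        name, torch_dtype=torch.float16, device_map="auto")
    return mdl, tok

target, target_tok = load("someuser/model-to-check")   # hypothetical placeholder name
ref, ref_tok = load("mistralai/Mistral-7B-v0.1")       # reference model (placeholder)

sample = "Natalia sold clips to 48 of her friends in April ..."  # a benchmark item
delta = min_k_prob(target, target_tok, sample) - min_k_prob(ref, ref_tok, sample)
print("suspicious" if delta > 0.3 else "ok")  # 0.3 is an arbitrary illustrative cutoff
```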

I'm planning on testing this idea with Trurl and some other models known to be contaminated, to check its reliability once I get the space up and running.

Hugging Face H4 org

I'm quite sure that @SaylorTwift is doing the same thing at the moment (space + investigation), you might want to discuss it with him :)

Oh wow, I wasn't aware of that. In that case I'll just polish up what I have at the moment and post any interesting contamination results I might get, I think those will be useful for anyone looking for models to finetune. I'll probably be testing the top 20 models or so.

Hugging Face H4 org

@Yeyito this sounds great, thank you!

clefourrier pinned discussion

@clefourrier @SaylorTwift Are Go Bruins or Go Bruins v2 safe to use? Can you check whether they're contaminated on TruthfulQA? Thanks!

Also, can you check https://huggingface.co/v1olet/v1olet_merged_dpo_7B_v3? I just want to know which models are safe to fine-tune on.

Hugging Face H4 org

Hi @rwitz! This thread is for discussing new contamination detection techniques we will add to the leaderboard. Can you open a dedicated issue for the models you'd like checked and tag @SaylorTwift on it?

Hugging Face H4 org

Thank you so much for these resources!

Hello, I've implemented the GitHub repo I was talking about earlier as a Hugging Face space where people can submit models to be evaluated for dataset contamination. I've also opened a discussion to talk about some of the findings I've made using it.

The implementation may not be entirely correct at the moment; I'd love it if someone else could audit the code and notify me of any mistakes I might've made along the way. I'll be leaving this up until @SaylorTwift's implementation is complete (I'm amazed you guys are even working with the authors on this!). Hope this helps!

This space should be integrated into the leaderboard: Hugging Face space llm_contamination_detector

Hugging Face H4 org

@Yeyito super cool, thank you! I'll let @SaylorTwift come back to you on that :)

Great work.
I tried running the Detect Pretrain Code Contamination gsm8k feature fork on jan-hq/trinity-v1 with ref model mistralai/Mistral-7B-v0.1:

GSM8K
result < 0.1, %: 0.95
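
For anyone unfamiliar with that output format: as I understand it, the line reports the fraction of benchmark samples whose per-sample statistic falls below 0.1, so values near 1.0 are the red flag. A rough sketch of that aggregation, with made-up per-sample scores:

```python
# Rough sketch of how that aggregate line could be produced: the share of
# benchmark items whose per-sample statistic is below 0.1. Per-sample values
# here are made up; the fork's exact statistic may differ.
def aggregate(per_sample_scores, threshold=0.1):
    return sum(s < threshold for s in per_sample_scores) / len(per_sample_scores)

scores = [0.02, 0.05, 0.40, 0.08, 0.03]             # hypothetical per-sample values
print(f"result < 0.1, %: {aggregate(scores):.2f}")  # -> result < 0.1, %: 0.80
```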

@tjtanaa Awesome! This fork for gsm8k is definitely more correct than the one used in my space; I'll re-run the gsm8k evals with it. Regarding the ref_model, I've been considering switching to Mistral too, although I'm concerned that llama-7b gives Mistral a score of 0.91 on gsm8k; maybe that's just a product of llama-7b scoring really badly on this test.

With that being said, any chance we can get a Winogrande fork?

@clefourrier I have literally been trying all day to replicate the 0.96 truthful_qa contamination score on jan-hq/trinity-v1, but I have never gotten above 0.75. Can you help me replicate the original contamination score? I want to know that I'm doing it right so I can test future models for contamination before I publish them.

@clefourrier Can you please share what ref_model you used and share the contents of the saves folder for that ref_model so I can accurately do the contamination test?

Testing with ref_model mistralai/Mistral-7B-v0.1 and num_z=50 (half precision) yields the following results for me:

| Type | Model | ARC | HellaSwag | MMLU | TruthfulQA | Winogrande | GSM8K |
|---|---|---|---|---|---|---|---|
| 🔶 | jan-hq/trinity-v1 | 0.07 | 0.16 | 0.18 | 0.35 | 0.0 | 0.95 |

The model doesn't appear to be any more contaminated than the average Mistral-7B model on the leaderboard. Check the scores I've been getting here.

I really don't know what's going on with the GSM8K scores on almost every single Mistral finetune. I've been using the same fork as @tjtanaa to get the gsm8k scores for the models.

Contamination testing isn't going to help much because 'cheating' isn't the primary issue.

The primary issue is that some degree of test contamination after fine-tuning with methods like RLAIF, DPO pairs, and MetaMath is inevitable. And since the resulting artificial boosts in test scores get added together when LLMs are merged, the average HF scores become progressively more inflated.

For example, the CatPPT merger did everything by the book. It merged quality parent models (Neural and OpenChat) before applying additional fine-tuning with carefully curated data. Yet it still achieved a score of 72.3, vs 72.6 for the far more powerful Mixtral Instruct.

The parent models' true scores (sans contamination) are ~65, and CatPPT's true score is ~67. It's the additive effect of merging and additional fine-tuning that inflated the scores. For example, if you combine an LLM with an artificial TruthfulQA boost of 1.5 with another LLM that also has a 1.5 TruthfulQA boost, you get closer to a +3 artificial boost than a +1.5 one.

In conclusion, merging numerous LLMs together, especially with additional fine-tuning, will artificially inflate their scores by around 5-8 points because of the additive nature of contamination, allowing them to climb past clearly superior LLMs on the leaderboard (e.g. Mixtral Instruct). This isn't because of 'cheating', or even half-hearted attempts to avoid contamination, but simply because merging and additional fine-tuning keep adding the contamination-driven boosts together.
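
To make the additive-boost argument concrete, here is a toy calculation with entirely made-up numbers (the boost sizes, merge weights, and fine-tuning bump are illustrative assumptions, not measurements):

```python
# Toy illustration of the argument above, with made-up numbers: if each parent
# memorized a *different* subset of benchmark items, a merge can retain both
# contamination-driven bumps, and further fine-tuning adds its own.
parent_a = {"true_score": 65.0, "contamination_boost": 1.5}
parent_b = {"true_score": 65.0, "contamination_boost": 1.5}

merged_true = 0.5 * parent_a["true_score"] + 0.5 * parent_b["true_score"]          # 65.0
merged_boost = parent_a["contamination_boost"] + parent_b["contamination_boost"]   # up to ~3.0
finetune_boost = 2.0  # hypothetical extra bump from additional fine-tuning

print(f"reported ~{merged_true + merged_boost + finetune_boost}, 'true' ~{merged_true}")
# reported ~70.0, 'true' ~65.0
```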

@Phil337
I agree with you on that. I'm not sure what the move going forward would be regarding the evaluation of LLMs. Subjective evaluation doesn't convince me, as it takes in all of the inherent human bias in weighing responses instead of being a test of pure, objective capabilities.

Clefourrier mentioned a rolling, community-sourced QA type test for LLMs as a separate space entirely, which sounds like a good idea.

I'm aware you've been testing models for some time now, what do you think can be done about this issue?

@Yeyito I couldn't think of anything that could be done.

All I know is that mergers clearly outperform their parent models in my testing (fewer blind spots), so it's good that they exist. However, they also clearly don't earn the scores they're getting, because of the additive impact of contamination (i.e. Mistral 7b mergers often score higher than undeniably superior Mixtral Instruct and Llama 2 70b fine-tunes).

Even if high-precision contamination testing can be devised, it isn't reasonable to block mergers, since most aren't 'cheating' and their performance is in fact a little better. The increased contamination is just an inevitable side effect of merging plus any additional fine-tuning. And while in theory factoring in rotating black-box testing should dampen the artificial boost in scores, it's hard to say how much it would help and whether it's worth the effort.

Consequently, my personal opinion is that mergers should be immediately apparent when looking at the leaderboard (e.g. a distinctly colored icon by their name) and able to be toggled on and off with a single click. This way a user can hide the mergers and get a good indication of the performance of all listed LLMs, or show them with the understanding that LLMs labelled as mergers commonly score ~5 points higher due to the additive impact of contamination.

https://github.com/swj0419/detect-pretrain-code-contamination/pull/2: all tests supported, self-service it :)
Thanks to @Yeyito, most of the code comes from his space. What's next?

Are we sure about the contamination in viethq188/LeoScorpius-7B-Chat-DPO? I believe a whole lot of flags went out for that model as a base but I don't think anyone has been able to replicate a contamination failure for that model or for any of the child models. The contamination dashboard doesn't seem to show those models as any more contaminated than the Mistral base as per Yeyito. If the flag is because of the Nectar dataset, doesn't this also mean that Starling-7B and all models that are derived from it must also be flagged?

Low-key, I think it's far more likely that metrics like TruthfulQA simply do not correlate with real world LLM performance, hence why we see models like Starling-7B do so well in real world testing and MT-Bench but come up short in the Open LLM dashboard numbers.

I think that after a Salem-style witch hunt with a clear target, the "contamination" scandal no longer seems relevant. I guess it was just important to flag a couple of models for some reason and to ignore the fact that there is major contamination all around.

Detecting contamination without access to the pre-training data is a very challenging task. The experiments of Shi et al. (2023) were evaluated on a balanced benchmark (200 positives and 200 negatives); in reality, however, you could have far more uncontaminated datasets than contaminated ones. Additionally, they continued pretraining an already-trained model (LLaMA 7B), introducing the contamination in the last steps of the process, which could potentially make it easier to detect.

What I want to emphasize here is that these methods could potentially produce a lot of false positives and therefore make contamination-impact analyses suggest that contamination has little effect on performance. It is not that easy to conclude that "the contamination scandal no longer seems relevant".
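
To illustrate the base-rate point with a quick calculation (the detector's true/false positive rates and the prevalence below are assumptions chosen for illustration, not measured values):

```python
# A detector that looks strong on a balanced benchmark (200 contaminated /
# 200 clean) can still produce mostly false positives when contamination is
# rare in the wild. TPR, FPR, and prevalence are assumed values.
def precision(tpr, fpr, prevalence):
    true_pos = tpr * prevalence
    false_pos = fpr * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

print(round(precision(tpr=0.9, fpr=0.1, prevalence=0.50), 2))  # balanced eval: 0.9
print(round(precision(tpr=0.9, fpr=0.1, prevalence=0.05), 2))  # rare contamination: 0.32
```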

Hugging Face H4 org

Hi all!
Happy new year to all!

Yes, it's definitely not a trivial problem, and we are still investigating several a posteriori contamination detection methods, as we want to balance being fair to evaluated models and being efficient (we can't add a method that would take several days of compute for one model for example).

We are also discussing what precisely constitutes contamination:

  • is rephrasing contamination? (probably not)
  • is "using the same sentence but with abbreviations as full words, or numbers written in letters" contamination? (probably yes)
  • do we want to change evaluations or remove contaminated examples only for evals such as GSM8K? (quite sure this is going to end up as a manual examination of the full dataset ^^")

As we are also working on improving the front and backend in parallel, & discussing options for private benchmarks and rolling benchmarks with partners, and since there is no clear cut answer, the contamination issue is not going to be solved soon - I suspect we'll reach something acceptable by end of Q1, realistically.

Please keep on sharing resources and insights we can use here to investigate these issues :) and thanks a lot for your continued interest in this topic!

I think the question has to be asked: what is a cheater? In my opinion, a lot of the high-end models are basically human-level intelligent, i.e. AGI for the most part. If you think those flagged models are dumb, you should actually try playing with them; in some cases they can produce flawless code in one prompt. One of the top models was, on its own, able to code me an extremely advanced transformer model that impressed GPT-4 Turbo when I fed it to it. So I'm not sure whether people are actually playing with these models, because they're certainly not dumb.

Coming back to my original question of what counts as a cheater: if we fed a dumb fine-tuned model varying degrees of contaminated data, what would the results be? (This could, by the way, be a way to test levels of contamination.) How much would it seriously boost the performance of the lowest-scoring models on the Hugging Face leaderboard average? I would wager not much at all; the model would probably remain just as dumb even if it were fed the test itself alongside all its other data, because it still has to be smart enough to actually use the cheating data.

A truly overfitted model of the dumbest kind, fed on pure 'cheats' with no additional generative pretraining data, would most likely spout total garbage: it would have little understanding of language, let alone be able to apply the memorized answers. A cheater has to know how to apply the cheat, and that requires intelligence; a dumb model is not smart enough to apply a cheat. We're looking at this very wrongly here.

Validation/training splits are all well and good, and the best we have. But if we fed the validation and training data to a really smart model, it would still boost its performance. You could then add another, separate validation set that you didn't feed in and notice it does very well; then feed that set in too, and the model would perform better again. In fact, you could keep doing this and it would keep improving, because you're increasing its dataset. Anyway, it's still our best method scientifically despite the paradoxes outlined above, but it's something seriously worth considering going forward.

A good old-fashioned statistical survey approach might be a solution: play with model X for a set number of days, then give it a score from 0 to 100 on what you thought of it in certain respects.

@MarxistLeninist Any solution needs to be objective. You can't just have people play with the models and report back.

Plus I extensively played with most of the Mistral 7bs that approached an aggregate score of 75 (higher than any Llama 2 70b), including Bruins and Trinity, and they were very easy to trip up. You're seeing intelligence that simply isn't there.

Using short prompts to write things like sonnets, code, and jokes can produce impressive results. But this is an illusion: the LLMs are basically just parroting the examples used in their training and fine-tuning. If you then add a couple of sentences to the prompt that contradict their momentum and force them to think, they start performing far worse than any human.

For example, ask them to write a joke about two disparate things that starts with a given phrase (e.g. write a joke about a cat and a telescope that begins with "a cat jumped on a telescope"). Or simply ask one to write a story, then, based on the story it told, add a couple of sentences of directives that contradict it. This will result in a new story filled with absurd contradictions no human would make. Not a single Mistral shows any true intelligence. They aren't "AGI for the most part", or even close to the performance of GPT-3.5.

Is it fair/allowed to fine-tune a model on the train splits of benchmarked tasks?

Hi all, you're welcome to check out our latest work on benchmark contamination detection.

Title: Benchmarking Benchmark Leakage in Large Language Models
Homepage: https://gair-nlp.github.io/benbench/
Paper: https://huggingface.co/papers/2404.18824
Code and data: https://github.com/GAIR-NLP/benbench
HuggingFace Demo: https://huggingface.co/spaces/GAIR/BenBench
Tweet: https://twitter.com/SinclairWang1/status/1785298912942948633

Abstract:

Amid the expanding use of pre-training data, the phenomenon of benchmark dataset leakage has become increasingly prominent, exacerbated by opaque training processes and the often undisclosed inclusion of supervised data in contemporary Large Language Models (LLMs). This issue skews benchmark effectiveness and fosters potentially unfair comparisons, impeding the field's healthy development. To address this, we introduce a detection pipeline utilizing Perplexity and N-gram accuracy, two simple and scalable metrics that gauge a model's prediction precision on a benchmark, to identify potential data leakage. By analyzing 31 LLMs in the context of mathematical reasoning, we reveal substantial instances of training-set and even test-set misuse, resulting in potentially unfair comparisons. These findings prompt us to offer several recommendations regarding model documentation, benchmark setup, and future evaluations. Notably, we propose the "Benchmark Transparency Card" to encourage clear documentation of benchmark utilization, promoting transparency and the healthy development of LLMs. We have made our leaderboard, pipeline implementation, and model predictions publicly available, fostering future research.
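
For readers who want a feel for the N-gram accuracy metric mentioned above, here is a rough sketch of the idea; the exact prompt format, choice of n, and number of starting points in BenBench may differ, so treat this as an illustration only.

```python
# Rough sketch of the n-gram accuracy idea: feed the model a prefix of a
# benchmark item and check whether greedy decoding reproduces the next n
# tokens verbatim. A model that memorized the benchmark completes far more
# n-grams exactly. Illustrative only; BenBench's exact setup may differ.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def ngram_accuracy(model, tokenizer, text, n=5, num_points=3):
    ids = tokenizer(text, return_tensors="pt").input_ids[0]
    # evenly spaced prefix lengths, leaving room for n continuation tokens
    # (assumes the benchmark item is longer than a handful of tokens)
    starts = [int(s) for s in torch.linspace(5, len(ids) - n - 1, num_points)]
    hits = 0
    for s in starts:
        prefix = ids[:s].unsqueeze(0).to(model.device)
        with torch.no_grad():
            out = model.generate(prefix, max_new_tokens=n, do_sample=False)
        if torch.equal(out[0, s:s + n].cpu(), ids[s:s + n]):
            hits += 1  # the model reproduced the next n-gram verbatim
    return hits / len(starts)
```

Averaged over a benchmark split, a markedly higher value on the test set than on comparable held-out or rewritten problems would be the leakage signal.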

@SinclairWang Great work! My technical understanding is very limited, but bringing focus to the contamination issue by combining N-gram and perplexity testing with disclosure cards is clearly a step in the right direction and should reduce the incidence of contamination going forward.

However, universal and involuntary contamination testing is likely required to enact meaningful change.

In the past, EVERY model that scored unusually high for its class (e.g. its parameter count and base model) ended up being contaminated. It would be easy to put all models on graphs according to their class and automatically test the outliers for contamination.

For example, it's not even theoretically possible for a 34b dense LLM like 34-beta to have an MMLU of 85, yet it remains on the leaderboard. Why? Why force users like me to be the bad guys and report them? They're extreme outliers, so they should be automatically flagged and tested for contamination. Another example is luxia-21.4b-alignment-v1.0, with a theoretically impossible HellaSwag of 91.9.

Again, contamination testing is a dirty job. Please consider automating it by implementing a system of outlier detection, testing, and involuntary stamping of model cards with the results.
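
A rough sketch of what such automated outlier flagging could look like; the grouping key, the robust z-score, and the threshold of 3 are assumptions for illustration, not an existing leaderboard feature.

```python
# Hypothetical sketch: flag leaderboard entries whose benchmark score is an
# extreme outlier within their class (e.g. same parameter-count bucket and
# base model). The robust z-score threshold of 3 is an arbitrary choice.
from statistics import median

def flag_outliers(entries, metric="mmlu", z_thresh=3.0):
    """entries: list of dicts with 'name', 'class' (e.g. '34b-dense') and scores."""
    flagged = []
    for cls in {e["class"] for e in entries}:
        scores = [e[metric] for e in entries if e["class"] == cls]
        med = median(scores)
        mad = median(abs(s - med) for s in scores) or 1e-9  # robust spread estimate
        for e in entries:
            if e["class"] == cls and (e[metric] - med) / (1.4826 * mad) > z_thresh:
                flagged.append(e["name"])  # candidate for contamination testing
    return flagged
```

Anything flagged this way would still need an actual contamination test; the point is only that the initial screening doesn't have to rely on user reports.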

Yes, I agree! I think it would be valuable to implement a system of outlier detection, testing, and involuntary stamping of model cards with the results. That said, I believe it would also need more contributions from the open-source community, given the heavy workload involved.
