Tool: Space to test model contamination

#486
by Yeyito - opened

Hello!
I've been using an implementation of this GitHub repo as a Hugging Face Space to test some models for dataset contamination. This is all based on this paper.

The scores I get may not be entirely accurate, as I'm still working out the inaccuracies of my implementation. For instance, I'm confident the code is currently not doing a good job of testing for Winogrande contamination, given the abnormally low scores every model gets; I'll have to look into that.

I'm slowly testing more and more models and thought it'd be a good idea to make the code available to everyone, such that people can easily contribute and make their own versions of this.

Please do not use this to accuse models of cheating. Instead, try to reproduce whatever you see here yourself, or use the scores from these tests as justification for a more holistic analysis of dataset contamination in models. The code I use for obtaining these scores is exactly the same as the one found in the Space; you only need to uncomment two lines inside worker_thread() in app.py.

That being said, some interesting findings:

  1. mistralai/Mistral-7B-v0.1 scores very high on the GSM8K benchmark: 0.91.
  2. So does Q-bert/MetaMath-Cybertron-Starling, at 0.99.
  3. rishiraj/meow is unrelated (it's a finetune of SOLAR) but still scores very high on GSM8K, at 0.95.

According to the authors of the original implementation: "The output of the script provides a metric for dataset contamination. If the result < 0.1 with a percentage greater than 0.85, it is highly likely that the dataset has been trained."
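To make that rule concrete, here is how I read it, as a rough sketch over the per-sample scores the script produces (the function and variable names below are mine, not the repo's):

```python
# Illustrative only (my reading of the rule above, not the repo's actual code):
# the script produces one score per benchmark sample; a sample counts as
# "suspicious" if its score is below 0.1, and the benchmark is flagged as
# likely contaminated if more than 85% of its samples are suspicious.

def flag_contamination(sample_scores, score_threshold=0.1, fraction_threshold=0.85):
    suspicious = sum(1 for s in sample_scores if s < score_threshold)
    fraction = suspicious / len(sample_scores)
    return fraction, fraction > fraction_threshold

# e.g. a benchmark where 91 of 100 samples fall below 0.1 -> (0.91, True)
print(flag_contamination([0.05] * 91 + [0.5] * 9))
```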

These findings are consistent with: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard/discussions/444#657b2dfd69a46ce96bd59887

The eval is open for submissions. They might take a while to appear, as I still don't have the process automated; I'll be performing evaluations depending on how much free time I have on a given day. Hope this helps!


Which models should I test next? (At the moment I can only do up to 10.7B)

Yeyito changed discussion title from "Model contamination scores from tests" to "A Hugging Face space to test for model contamination!"

> Which models should I test next? (At the moment I can only do up to 10.7B)

Yi-6B, phi-2 2.7B, and Qwen 1_8B 7B, what do you think?

@Yhyu13 sounds good! I'll make sure to update the eval with those scores tomorrow, as it's getting quite late for me.

Open LLM Leaderboard org
edited Dec 20, 2023

Very good job!


Confusing results from the upstage/SOLAR-10.7B-Instruct-v1.0 official report.

Yeyito changed discussion status to closed
Yeyito changed discussion status to open
deleted

@JosephusCheung A few things could be at play:

  1. Which implementation did they use for the GSM8K benchmark? I'm currently using this one: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard/discussions/444#657b287ef634e69165cbcd75

  2. They don't say which ref_model they're using. (The eval requires two models: the one you want to test and a reference model you compare scores against; a sketch of this comparison is below.) It's possible they're using a different ref_model, like Mistral, which would skew all benchmarks downward. Currently, I'm using huggyllama/llama-7b, which is the one provided by the original implementation.

  3. I can run the test at a higher precision to get more accurate answers, but that would take double the compute, and from the tests I've done internally it's only a ±0.04 difference.


I believe the second option is at play here, but I'm not sure if they posted which ref_model they used to get those scores. @clefourrier, do you have any idea?
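To illustrate what the ref_model is doing here, this is a simplified sketch of the per-sample, two-model comparison (a Min-K%-style illustration using transformers, not the repo's actual code; the sample text is just the start of a GSM8K item):

```python
# Simplified illustration (not the repo's actual code): score one benchmark
# sample under a target model and a reference model with a Min-K%-style
# statistic (mean log-prob of the least likely k% of tokens). The final
# contamination signal comes from comparing the two scores, which is why
# swapping the reference model shifts every result.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def min_k_logprob(model, tokenizer, text, k=0.2):
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        logits = model(**enc).logits
    # log-prob assigned to each actual next token
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    token_log_probs = log_probs.gather(1, enc["input_ids"][0, 1:, None]).squeeze(-1)
    lowest = torch.sort(token_log_probs).values[: max(1, int(k * len(token_log_probs)))]
    return lowest.mean().item()

target = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
target_tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
ref = AutoModelForCausalLM.from_pretrained("huggyllama/llama-7b")
ref_tok = AutoTokenizer.from_pretrained("huggyllama/llama-7b")

sample = "Natalia sold clips to 48 of her friends in April ..."
print(min_k_logprob(target, target_tok, sample), min_k_logprob(ref, ref_tok, sample))
```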

@Phil337 Of course! I'll be testing the models in the queue in a few hours, as I just woke up. I'd say don't jump to any conclusions unless you see something like 0.98 or 0.99 on the scores, as I'm still working out this implementation and checking whether huggyllama/llama-7b is indeed the best ref_model for this.


Happy to see the queue is filling up quite fast! 🥳 (you can check it by clicking on 'logs' at the top left for now)

This is an impressive contribution! It's remarkable that support is also provided for HellaSwag and Winogrande!

> This is an impressive contribution! It's remarkable that support is also provided for HellaSwag and Winogrande!

I'm pretty sure the Winogrande implementation is incorrect; I'll try to sort that out today if I've got the time. I'd love it if someone could open a PR for it (detect-pretrain-contamination/src/eval.py -> process_winogrande()); a rough sketch of one possible approach is below.

I've got a lot of work to do before this space is 100% reliable, hope you guys understand if evals are a bit slow right now. Thank you for the kind words!
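For anyone who wants to take a stab at that PR, here is a rough, unverified guess at what the processing step could look like (substituting the correct option into the blank so the scorer sees the full sentence; the function name just mirrors the one in eval.py):

```python
# Unverified sketch of a possible process_winogrande(): turn each Winogrande
# example into the full sentence the model may have memorized, by substituting
# the correct option into the "_" blank.
from datasets import load_dataset

def process_winogrande(split="validation"):
    data = load_dataset("winogrande", "winogrande_xl", split=split)
    texts = []
    for ex in data:
        correct = ex["option1"] if ex["answer"] == "1" else ex["option2"]
        texts.append(ex["sentence"].replace("_", correct))
    return texts

print(process_winogrande()[:3])
```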

deleted

@Yeyito Is there a reason why the Llama 2 7b foundational model makes for a better reference model than the Mistral 7b foundational model when testing Mistral fine-tunes like Marcoroni?

That is, how is potential contamination during the fine-tuning of Marcoroni distinguished from potential contamination detected in Mistral base? Can the two results be compared to detect the additional contamination during fine-tuning? Or would using the Mistral base as a reference model when testing Mistral fine-tunes more accurately detect contamination added during the fine-tuning process?

@Phil337
I was using llama-7b since it was the default ref_model provided by the paper's implementation. There's no particular reason one ref_model should be used over another; since it is a comparative evaluation, it all depends on what we're testing for. We generally assume the ref_model has no contamination, and in this regard I trust Llama more. However, I agree that a Mistral ref would be more accurate for detecting contamination in Mistral finetunes over the base model.

This gives me the idea of adding a column to the eval specifying which ref_model was used for a particular evaluation. From now on, I'll also be testing finetunes against their base models and base models against Llama-2 7b. I'll try to have these changes ready by tonight. Thank you for pointing that out! 😄

> Mistral ref would be more accurate for detecting contamination in Mistral finetunes over the base model.

I don't think benchmark contamination is additive in that way. Maybe you could just annotate the name of each model's base model, and the contamination would be attributed automatically.

@JosephusCheung Can you elaborate on what you mean by that?

@Yhyu13 @Phil337 @JosephusCheung
Most of the models you guys submitted now have their evals up.

A few things that have changed:

  1. I implemented this fork for testing GSM8K. Scores on that test should be more accurate.
  2. I've been using the Mistral base model as the default ref_model, but you can now change which ref_model the model you submit gets tested against, as it is now a submission parameter. What I believe @JosephusCheung tried to convey was that contamination scores are not a direct comparison to the ref_model: if we compare a contaminated model with a contaminated ref_model, we should still get a positive result.
  3. I'm invalidating the Winogrande dataset contamination test for now, as I believe the implementation is broken. I was unsure of it from the beginning, and seeing most models score 0.01-0.02 on it convinced me that I should work on a better implementation of this test before displaying scores publicly.

With that being said, a few interesting things:

  1. Most Mistral finetunes score around 0.95 on the GSM8K test; this is a direct product of Mistral-7B scoring 0.91 and people further finetuning on GSM8K data and its paraphrases, like MetaMath.
  2. Orca-2 scores incredibly high on every test, which I find very peculiar.
  3. Some Mistral finetunes, like amazon/MistralLite and HuggingFaceH4/zephyr-beta, score as less contaminated than Mistral-7B on the GSM8K benchmark, at around 0.7-0.8.

If for some reason a model you've submitted doesn't appear in the queue or on the leaderboard, don't be afraid to re-submit it, as most of the process is still manual and mistakes can happen.

Hope you make good use of this information! 🤗

deleted

@Yeyito Thanks! Like you said, Orca-2 is interesting, especially with its very high TruthfulQA. However, it seems kind of strange to make a test for the most popular and obvious falsehoods (e.g. that the world is flat) and then not fine-tune that nonsense into oblivion.

@Yeyito I noticed you used Mistral 7b as the reference to test Orca 2, which is based on Llama 2 7b. Perhaps using Llama 2 7b as the reference model would produce different scores.

Edit: I wonder what would happen if Llama 2 7b was tested using Mistral 7b as the reference?

Edit: Thanks Yeyito. I re-submitted both with non-gated LLMs and they're in the queue. Sorry about submitting one twice.

@Phil337 Submit them! You can now change the reference model that is used. It's now one of the submission parameters.

I don't think I'll have the time to test them today, though; it's getting quite late. Besides, I've been testing models all day 😵‍💫

Edit:
Could not get the model config from the hub.: You are trying to access a gated repo.
Make sure to request access at https://huggingface.co/meta-llama/Llama-2-7b-hf and pass a token having permission to this repo either by logging in with huggingface-cli login or by passing token=<your_token>.

^ Sorry, I don't have gated repos working at the moment 💔. I think NousResearch has a non-gated clone of Llama 2; maybe try that?
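For reference, loading the reference model yourself should work either way, roughly like this (the mirror repo name below is the one I have in mind; double-check it before relying on it):

```python
from transformers import AutoModelForCausalLM

# Option 1: a non-gated mirror of Llama-2-7b (repo name to double-check)
model = AutoModelForCausalLM.from_pretrained("NousResearch/Llama-2-7b-hf")

# Option 2: the gated official repo, after requesting access, passing your HF token
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf", token="hf_...")
```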

Hey everyone! All the MetaMathQA data is sourced from the GSM8K and MATH train sets. Our data augmentation process does not involve any data from the test sets.
We have transparently disclosed the source of each data point, and you can verify it here: https://huggingface.co/datasets/meta-math/MetaMathQA
Additionally, we have disclosed the code for obtaining MetaMathQA data, which can be checked at: https://github.com/meta-math/MetaMath/tree/main/code_for_generating_data

The MetaMath project is entirely transparent and open-source, encompassing code (for both data augmentation and model training), models (a range of models), and data (comprising all our data along with its sources). Anyone interested in contributing is welcome to join our Hugging Face organization. Again, we have never utilized any test data, and all our data and models are transparently and openly available: https://huggingface.co/meta-math

MetaMath is always eager to make more contributions to open-source LLMs. If you have any questions, we would be more than happy to help!

@Longhui98 I would like to hear your suggestions for contamination detection, based on your experience. I think adapting data to the format of a specific task would lead to results with high similarity; what do you think?

Unfortunately, under current conditions, detecting intentional contamination does not seem to be practically possible.

Open LLM Leaderboard org

Hi @Longhui98 ,
Super cool to get this feedback from you, the transparency is great! :)

Side question if you have the time, did you account for self contamination in MATH when building your dataset?
(Like lmsys reported in their contamination blog post)

Open LLM Leaderboard org

Edit:
To all users: let's move the discussion about MetaMath to #265, since this discussion is specifically for the cool space @Yeyito created.

clefourrier changed discussion title from "A Hugging Face space to test for model contamination!" to "Tool: Space to test model contamination"

Closing this discussion, as the compute costs to test models are prohibitively expensive for me. I'll leave the already-tested scores up indefinitely, but won't be testing models until I can access more compute in a sustainable way. Hope the space helped you in any way, shape, or form! 🤗

Yeyito changed discussion status to closed

Hi all, feel free to check out our latest work on benchmark contamination detection.

Title: Benchmarking Benchmark Leakage in Large Language Models
Homepage: https://gair-nlp.github.io/benbench/
Paper: https://huggingface.co/papers/2404.18824
Code and data: https://github.com/GAIR-NLP/benbench
HuggingFace Demo: https://huggingface.co/spaces/GAIR/BenBench
Tweet: https://twitter.com/SinclairWang1/status/1785298912942948633

Abstract:

Amid the expanding use of pre-training data, the phenomenon of benchmark dataset leakage has become increasingly prominent, exacerbated by opaque training processes and the often undisclosed inclusion of supervised data in contemporary Large Language Models (LLMs). This issue skews benchmark effectiveness and fosters potentially unfair comparisons, impeding the field's healthy development. To address this, we introduce a detection pipeline utilizing Perplexity and N-gram accuracy, two simple and scalable metrics that gauge a model's prediction precision on a benchmark, to identify potential data leakage. By analyzing 31 LLMs in the context of mathematical reasoning, we reveal substantial instances of training set misuse, and even test set misuse, resulting in potentially unfair comparisons. These findings prompt us to offer several recommendations regarding model documentation, benchmark setup, and future evaluations. Notably, we propose the "Benchmark Transparency Card" to encourage clear documentation of benchmark utilization, promoting transparency and the healthy development of LLMs. We have made our leaderboard, pipeline implementation, and model predictions publicly available, fostering future research.
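As a rough illustration of the n-gram accuracy idea (a simplified sketch, not the exact BenBench pipeline): prompt the model with the start of a benchmark example and check whether greedy decoding reproduces the next n tokens verbatim; consistently high accuracy across a test set suggests the text was seen during training.

```python
# Simplified sketch of an n-gram accuracy probe (not the exact BenBench pipeline):
# feed the model the first `prefix_tokens` tokens of each benchmark example and
# check whether greedy decoding reproduces the next n tokens verbatim.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def ngram_accuracy(model, tokenizer, texts, n=5, prefix_tokens=32):
    hits, total = 0, 0
    for text in texts:
        ids = tokenizer(text, return_tensors="pt").input_ids[0]
        if len(ids) < prefix_tokens + n:
            continue  # skip examples too short to probe
        total += 1
        prefix = ids[:prefix_tokens].unsqueeze(0)
        with torch.no_grad():
            out = model.generate(prefix, max_new_tokens=n, do_sample=False)
        predicted = out[0, prefix_tokens:prefix_tokens + n]
        hits += int(torch.equal(predicted, ids[prefix_tokens:prefix_tokens + n]))
    return hits / max(total, 1)

# Usage (illustrative; model and data choices are placeholders):
# tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
# lm = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
# print(ngram_accuracy(lm, tok, [ex["question"] for ex in gsm8k_test_examples]))
```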
