The Unintended Consequences of Model Merging in the AI Landscape: A Case Study of the Marcoroni-7B-v3 Model

#544
by rishiraj - opened

In the dynamic world of AI language modeling, the HuggingFace Open LLM Leaderboard has long stood as a benchmark for evaluating the efficacy of various models. Traditionally, researchers and developers dedicated considerable effort to Supervised Fine-Tuning (SFT) and even Direct Preference Optimization (DPO) to enhance their models' performance. These methods, rigorous and time-consuming, often yielded significant improvements, as reflected in the Leaderboard's metrics.

However, the landscape began to shift with the advent of SLERP (spherical linear interpolation) for merging language models. This technique generally produced models superior to their individual constituents, a development that was initially welcomed by the community. Unfortunately, this period also saw the emergence of models contaminated with test data, notably some 7B models that scored unusually high on the Leaderboard.
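For readers unfamiliar with the technique, here is a minimal sketch of what SLERP does at the tensor level. This is my own illustration, not the implementation any particular model used; real merge tooling (e.g., mergekit) additionally handles per-layer interpolation factors, dtypes, and tokenizer alignment.

```python
import torch

def slerp(w0: torch.Tensor, w1: torch.Tensor, t: float, eps: float = 1e-8) -> torch.Tensor:
    """Spherical linear interpolation between two weight tensors, treated as flat vectors."""
    v0 = w0 / (w0.norm() + eps)                      # unit direction of model A's weights
    v1 = w1 / (w1.norm() + eps)                      # unit direction of model B's weights
    dot = torch.clamp((v0 * v1).sum(), -1.0, 1.0)
    theta = torch.acos(dot)                          # angle between the two directions
    if theta.abs() < eps:                            # nearly parallel: plain lerp is fine
        return (1 - t) * w0 + t * w1
    s = torch.sin(theta)
    return (torch.sin((1 - t) * theta) / s) * w0 + (torch.sin(t * theta) / s) * w1

def merge_state_dicts(sd_a: dict, sd_b: dict, t: float = 0.5) -> dict:
    """Merge two architecturally identical checkpoints parameter by parameter."""
    return {name: slerp(sd_a[name], sd_b[name], t) for name in sd_a}
```

The point is that the merged weights are not a simple average: interpolating along the hypersphere tends to preserve the geometry of both parents, which is part of why merges so often outscore their constituents, and why contamination in either parent silently propagates to every descendant.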

The Open LLM Leaderboard team responded swiftly, flagging and removing these contaminated models. Their efforts are commendable and detailed in this discussion. However, amidst this cleanup, one model, AIDC-ai-business/Marcoroni-7B-v3, slipped through the cracks. This model is a DPO tuning of Q-bert/MetaMath-Cybertron-Starling, which in turn was merged from a version of fblgit/una-cybertron-7b-v2 (the v3 version of which is already flagged for contamination, and both versions were trained on a common dataset), yet it somehow avoided detection. Further suspicions of GSM8K contamination were raised in another discussion, prompting me to raise an issue, particularly as most other contaminated models had already been flagged. My concerns can be found here.

The delayed response to this particular model had far-reaching consequences. The community, unaware of the contamination, used AIDC-ai-business/Marcoroni-7B-v3 extensively for further merges. The model has now been removed from Hugging Face (though, notably, it was never officially flagged), yet its legacy continues: numerous models derived from it still populate the Leaderboard, creating a cascading effect that undermines the integrity of these metrics.

This situation highlights a critical issue: the overshadowing of genuine, contamination-free efforts like openchat/openchat-3.5-0106, argilla/distilabeled-Hermes-2.5-Mistral-7B and others. These models represent the diligent, ethical approach to AI development, yet risk being lost in a sea of contaminated derivatives.

The key takeaway from this episode is not to cast blame on those who unknowingly used contaminated models but to raise awareness about the critical importance of vigilance and ethical practices in AI model development and sharing. As the field grows, so does the responsibility of each participant to ensure the integrity of their work, for the betterment of the entire community.

On a side note, back when the first Marcoroni models (Llama models) were released, they were also called out for deleting community discussions, failing to credit users, and resetting the Git history to hide their work. This doesn't prove that the v3 model should not be trusted (IMO it shouldn't be, but that's just my personal opinion); it is just a side note for the community.

Neither of the models you mentioned is currently flagged. fblgit/una-cybertron-7b-v3-OMA was flagged, but the v2 version was not. This isn't to say that they aren't contaminated, but just be careful making accusations. Personally, I think Marcoroni-v3 is very suspicious, and the fact that it's been deleted doesn't give them a lot of credibility.

(the v3 version of which is already flagged for contamination, and both versions were trained on a common dataset)

@HDiffusion thanks for pointing this out, I've made the required changes to my original text. Also, the goal here is not to make accusations; it's just to raise awareness.

I've just browsed the best-scoring 7B models, and it turns out that most are merges of contaminated models. Unfortunately, the leaderboard looks more and more like a Kaggle leaderboard with a leaked test set. Hopefully the authors are aware of the limitations of the models they create, only wanted to see once or twice how merges perform, and will not keep doing it, so as to avoid polluting the leaderboard.


Hopefully fblgit/UNA-TheBeagle-7b-v1 is not contaminated, because that would mean mlabonne/Beagle14-7B and mlabonne/NeuralBeagle14-7B are too, along with many more models on the leaderboard, since @mlabonne has built amazing course materials on building LLMs and many people are therefore building on his models.

After discovering how many models on the leaderboard appear to be contaminated, I was actually surprised to find that one of the pinned discussions is entitled "Discussion thread: Model contamination techniques", which probably contributes to the contamination instead of preventing it.

I'll read that discussion in the coming days to see what it says. Most probably, Hugging Face would need to add a contamination evaluation to the leaderboard, both to build trust and to be fair to authors who are transparent about the contamination of their models and see them flagged, while other contaminated models escape flagging because we don't know what they were built on.

https://www.reddit.com/r/LocalLLaMA/comments/19acvq2/huge_issue_with_truthfulqa_contamination_and/
mlabonne/Beagle14-7B and mlabonne/NeuralBeagle14-7B are definitely contaminated via argilla/distilabeled-Marcoro14-7B-slerp, which is contaminated via mlabonne/Marcoro14-7B-slerp, and then via EmbeddedLLM/Mistral-7B-Merge-14-v0.1 and AIDC-ai-business/Marcoroni-7B-v3.
There is no proof yet that fblgit/UNA-TheBeagle-7b-v1 is contaminated, but it looks suspect.
Also, as HuggingFaceH4/ultrafeedback_binarized is contaminated (I do not see any notice on the dataset's card yet), fblgit/una-cybertron-7b-v1-fp16 is too. There are also several models fine-tuned on Nectar, including HF's HuggingFaceH4/zephyr-7b-beta.
There are also allegations that Intel/neural-chat-7b-v3-3 is contaminated: https://huggingface.co/Intel/neural-chat-7b-v3-3/discussions/4
All of this would mean that potentially hundreds of models are contaminated.

Look guys, the contamination tool works in a wonky way. We have this dilemma about the math, but a bunch of samples from a GSM test doesn't give you two whole points on ARC. So let's look at the models in a holistic way.
To be fair, if you flag Intel neural-chat-7b-v3, you will have to flag every model between position #1 and position #50. GSM8K is just dead. Now look at the ARC scores and the others: the solution is not flagging, but replacing the GSM8K test with something else, rerunning it across the models, and following that. If you flag the top 50 7B models because of that math contamination, you will do serious damage, because there is a real improvement in AI intelligence across these models, and it's not just "numeric". Users and consumers know this.

From my perspective, I have no issue at all: I just take another great model, UNAfy it, and we are back on top, as usual. But IMHO it's about time to stop the Salem-style witch hunting of models and fix the eval that seems so problematic :)

Peace.

Thanks @MichaelKarpe!

Personally, I think that all of these models (including mine, of course) should be flagged. I'm embarrassed that I inadvertently spread contaminated models, and I'm trying to rebuild this merge pyramid on healthier foundations (e.g., mlabonne/Darewin-7B).

I do agree with @fblgit though: these benchmarks (GSM8K and TruthfulQA) are bad and shouldn't be used. The current merges seem to perform better than the non-contaminated models too. But we don't want to remove these models from the Hub, just from the default leaderboard view.

@mlabonne I appreciate your willingness to accept the issue. It can happen to anyone, including, apparently, the Intel team with neural-chat. There are also documented cases of base models being contaminated.
Fine-tuning datasets should be thoroughly vetted and clearly labeled when contaminated. I think some effort should be directed to making automated tools for checking datasets for contamination; a minimal sketch of one approach follows.
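As a rough illustration (my own sketch, not an existing tool), even a simple n-gram overlap scan catches verbatim leakage:

```python
from datasets import load_dataset

def ngrams(text: str, n: int = 8) -> set:
    """All word-level n-grams of a text, lowercased."""
    toks = text.lower().split()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def overlap_rate(train_texts, test_texts, n: int = 8) -> float:
    """Fraction of test examples sharing at least one n-gram with the training data."""
    train_grams = set()
    for text in train_texts:
        train_grams |= ngrams(text, n)
    hits = sum(1 for text in test_texts if ngrams(text, n) & train_grams)
    return hits / max(len(test_texts), 1)

# Example: screen a fine-tuning corpus against the GSM8K test split.
test_texts = [ex["question"] for ex in load_dataset("gsm8k", "main", split="test")]
# train_texts would be the prompts/completions of the dataset under scrutiny:
# rate = overlap_rate(train_texts, test_texts)  # anything well above zero deserves a look
```

A real tool would also need to normalize punctuation and handle paraphrases, which verbatim n-gram matching misses, but even this much would flag the crudest cases.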
HF is at least partly to blame this time, as the issue was known for a long time, but no action was taken and it was allowed to spread.
I think all contaminated models should carry a warning in the model card explicitly stating the issue and their ineligibility for the leaderboard. Also, for fine-tunes and merges, the datasets and base models should be disclosed as a condition of leaderboard eligibility.
I think future datasets could include another split, test-private, which is never published. Results on this split could be checked against results on the public test split, and if there is a discrepancy, possible contamination should be investigated, as in the sketch below.
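A minimal sketch of that discrepancy check (the five-point tolerance is an arbitrary example, not a calibrated threshold):

```python
def suspicious_benchmarks(public_scores: dict, private_scores: dict, tol: float = 0.05) -> list:
    """Benchmarks where the published test-split score exceeds the never-released
    private-split score by more than `tol`; a large gap hints at test-set leakage."""
    return [
        bench for bench in public_scores
        if bench in private_scores and public_scores[bench] - private_scores[bench] > tol
    ]

# e.g., suspicious_benchmarks({"gsm8k": 0.74}, {"gsm8k": 0.52}) returns ["gsm8k"]
```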

Open LLM Leaderboard org

Hi all!
Thanks a lot for this interesting discussion, and for taking the time to write this analysis, @rishiraj.

It's a good highlight of the problems that can arise when a well-performing but possibly contaminated model tops the leaderboard and is adopted almost instantaneously by the community.
In my opinion, this is a problem of metadata: model adoption should be linked to correct model information sharing practices, not only leaderboard scores.
We added more filters at submission time (forcing the existence of a model card, a license tag, ...), but there is currently no good way to enforce the declaration of fine-tuning datasets or parent models in the metadata. If you have good ideas about this, we'd love to add more (technically possible) filters to the leaderboard!
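To make the metadata idea concrete, here is roughly what a declared lineage and a submission-time filter could look like. This sketch uses huggingface_hub's ModelCardData; the eligibility check itself is hypothetical, and the model and dataset names are only examples drawn from this thread.

```python
from huggingface_hub import ModelCardData

# What declared lineage could look like in the model card's YAML header.
card_data = ModelCardData(
    license="apache-2.0",
    base_model="mistralai/Mistral-7B-v0.1",
    datasets=["HuggingFaceH4/ultrafeedback_binarized"],
)

def declares_lineage(data: ModelCardData) -> bool:
    """Hypothetical submission filter: reject models whose cards omit lineage fields."""
    return bool(getattr(data, "base_model", None)) and bool(getattr(data, "datasets", None))

print(card_data.to_yaml())  # renders the metadata block that tops the model card
```

Declared lineage is also what would let the leaderboard automatically flag descendants of a dataset or parent model later found to be contaminated.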

Open LLM Leaderboard org

As a side note, we usually don't flag models that have been deleted a posteriori, since they no longer appear on the leaderboard. If you feel like a specific model needs to be flagged a posteriori, feel free to open a discussion where you tag the authors, and we'll investigate.
In general, it's good to keep in mind that flagging is an extra effort for the leaderboard maintainers, as we have to individually check every reported model; we always have to strike a balance between the speed at which we investigate difficult flagging cases and the features and leaderboard upgrades that we have to delay in the meantime.
We're really hoping we go towards more and better metadata checks, so flagging becomes less and less of a need :)

I think @clefourrier has already provided a very detailed and sensible way out of this problem. Hence I feel closing this discussion would be right.

rishiraj changed discussion status to closed

Just like the other half dozen conversations about this topic: nothing has been achieved, and there is no next step. Salem's.
Too much shouting about how important it is to keep the board clean, and blaming metadata, while at the same time willingly and knowingly letting contaminated models, flagged by their own owner, spread like a virus across other models.
This contamination seems to benefit certain figures, so ambiguity and witch hunting are diverted to serve the agenda of a few.
