Foundational Model?

#1
by deleted - opened

This shows up as a foundational model on the HF leaderboard with a green icon next to its name.

Am I missing something? A foundational model, as defined everywhere, including at HF, isn't a fine-tune, let alone a SLERP merge. A foundational model is pretrained without supervision on a large corpus of data, using millions of dollars of hardware over months, such as Mistral, or is a non-fine-tuned modification of one, like Solar.

This is just a SLERP merge of two Mistral fine-tunes. It's no more a foundational model than the hundreds of other Mistral merges. Just because you did additional fine-tuning when making your CatPPT version doesn't change the fact that CatPPT-base is just a SLERP merge.

Edit: Also, you should make clear which neural-chat and openchat versions you used. And while I like them both, they are anything but free of contamination.

Yes, maybe the category has been wrongly marked. Do you know how it can be corrected? I feel part of the reason it is marked as pretrained is that the resulting merged model was not fine-tuned for instruction/chat on any dataset and acts more like a text-completion model if you try using it.

The versions of the models being merged are Intel/neural-chat-7b-v3-3 and openchat/openchat-3.5-1210, both of which are free of contamination to my knowledge. If you feel otherwise, please raise an issue against them in their own discussions, and if either of them is found to be contaminated, I'll personally mark this model as contaminated as well.
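
For context on what a SLERP merge actually does: it interpolates two checkpoints' weights along a spherical arc rather than a straight line. Below is a minimal sketch, assuming both models share an architecture and identical state-dict keys; real merges of models like these are typically done with a tool such as mergekit, which supports per-layer interpolation schedules rather than the single global `t` used here.

```python
# Minimal sketch of a SLERP (spherical linear interpolation) weight merge.
# Assumes both checkpoints share an architecture and identical state-dict
# keys; the single global t is an illustration-only simplification.
import torch

def slerp(t: float, a: torch.Tensor, b: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    a_flat, b_flat = a.flatten().float(), b.flatten().float()
    a_unit = a_flat / (a_flat.norm() + eps)
    b_unit = b_flat / (b_flat.norm() + eps)
    # Angle between the two weight vectors, clamped for numerical safety.
    omega = torch.acos(a_unit.dot(b_unit).clamp(-1.0 + 1e-7, 1.0 - 1e-7))
    sin_omega = torch.sin(omega)
    if sin_omega.abs() < eps:
        # Nearly parallel weights: fall back to plain linear interpolation.
        merged = (1 - t) * a_flat + t * b_flat
    else:
        merged = (torch.sin((1 - t) * omega) / sin_omega) * a_flat \
               + (torch.sin(t * omega) / sin_omega) * b_flat
    return merged.reshape(a.shape).to(a.dtype)

def slerp_merge(sd_a: dict, sd_b: dict, t: float = 0.5) -> dict:
    """Merge two state dicts key by key (keys must match exactly)."""
    return {key: slerp(t, sd_a[key], sd_b[key]) for key in sd_a}
```

Loading the two parents, merging their `state_dict()`s with `slerp_merge`, and saving the result produces exactly the kind of model under discussion: no new pretraining takes place at any point.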

deleted

@rishiraj My concern about the pretrained label is not with this particular merge, but with the floodgates it may open.

HF has a merge problem. The hub is being flooded with merges, and they're hard to identify at first glance, or even when reading their model cards. If merges start getting labeled as pretrained, the confusion will only get worse. So my advice would be to avoid labeling merges as pretrained going forward. Also, this model inherits SFT, DPO, RLAIF... from neural-chat and openchat, so it isn't just completing text like a foundational model. It can be used like any other fine-tune in zero-shot contexts (e.g. Q&A).

Regarding the contamination: I have never personally identified contamination, nor do I know how to, but others have (see the link below). For example, neural-chat 3.3 is contaminated with GSM8K (via MetaMath), and almost certainly with TruthfulQA data as well. This matches my experience with both models (openchat as well) and my knowledge of how they were made (e.g. DPO and RLAIF); such training methods inevitably pick up TruthfulQA contamination. MetaMath is respectable and transparent, but contaminated with GSM8K data. In short, the models you merged aren't deliberately "cheating," but there's a >95% chance they carry a significant amount of contamination: an artificial ~5-point boost in both TruthfulQA and GSM8K scores that doesn't reflect their true performance. This is the main problem with merges. They only perform about 2 points better than their parent models, yet score ~5 or more points higher due to the build-up of test contamination, putting them above higher-performing models like Mixtral on the leaderboard.

https://huggingface.co/spaces/Yeyito/llm_contamination_detector
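
To give a rough idea of what a contamination check involves: one common heuristic is to look for long n-gram overlap between a benchmark's test split and a candidate training set (several published studies use 13-grams). The sketch below is a generic illustration of that idea, not the method the linked detector uses; the dataset repos and field names (`question`, `query`) are assumptions based on the public GSM8K and MetaMathQA datasets.

```python
# Generic sketch of an n-gram overlap contamination check. Dataset and
# field names are assumptions; the linked HF Space uses its own method.
from datasets import load_dataset

def ngrams(text: str, n: int = 13) -> set:
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

# Collect 13-grams from the GSM8K test split (the benchmark at issue above).
test_grams = set()
for row in load_dataset("gsm8k", "main", split="test"):
    test_grams |= ngrams(row["question"])

# Count MetaMathQA training rows that share a long n-gram with the test set.
train = load_dataset("meta-math/MetaMathQA", split="train")
hits = sum(1 for row in train if ngrams(row["query"]) & test_grams)
print(f"{hits} of {len(train)} training rows overlap a GSM8K test 13-gram")
```

Verbatim overlap like this only catches the crudest leakage; paraphrased or augmented test questions (the kind of contamination alleged above) require embedding- or likelihood-based checks.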

deleted changed discussion status to closed
