[FLAG] CultriX/MistralTrix-v1 is based on quantumaikr/quantum-v0.01 and mncai/mistral-7b-dpo-v5

#556
by kno10 - opened

CultriX/MistralTrix-v1 is apparently based on zyh3826/GML-Mistral-merged, which is a merge of quantumaikr/quantum-v0.01 and mncai/mistral-7b-dpo-v5, both of which are flagged with a link to https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard/discussions/474.

Open LLM Leaderboard org

@CultriX could you add more details about your model so we can know if there are contamination risks?

Hi,

It's basically as described on the model page! I've added that description below in this post.
TL;DR: It's zyh3826/GML-Mistral-merged-v1 fine-tuned with Intel's DPO dataset (Intel/orca_dpo_pairs).

Hope that clarifies things!
If you need/require anything else, let me know and I'll try to answer your questions to the best of my abilities.
I am, however, an undeniable amateur when it comes to all of this, so I might not have all the answers!
I'm basically just messing around, and this time it happened to turn out decently enough and produced a pretty nice model, I suppose :)!

With regards,
-CultriX-

######## DESCRIPTION #########
MistralTrix-v1 is a zyh3826/GML-Mistral-merged-v1 model that has been further fine-tuned with Direct Preference Optimization (DPO), using Intel's dataset for neural-chat-7b-v3-1. It surpasses the original model on several benchmarks (see results).

It is directly inspired by the RLHF process described by Intel/neural-chat-7b-v3-1's authors to improve performance. I used the same dataset and reformatted it to apply the ChatML template.
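For readers unfamiliar with that step, the reformatting can be sketched roughly as follows. The field names (`system`, `question`, `chosen`, `rejected`) follow the public Intel/orca_dpo_pairs dataset; the exact template used for MistralTrix is an assumption on my part.

```python
# Sketch: reformat one Intel/orca_dpo_pairs record into ChatML-style
# prompt/chosen/rejected fields as expected by trl's DPOTrainer.
# The precise template used for MistralTrix is assumed, not confirmed.

def chatml_format(example):
    # System turn (some records in the dataset have an empty system field)
    system = ""
    if example.get("system"):
        system = f"<|im_start|>system\n{example['system']}<|im_end|>\n"

    # User turn plus the opening of the assistant turn
    prompt = f"<|im_start|>user\n{example['question']}<|im_end|>\n<|im_start|>assistant\n"

    # Preferred and rejected completions, each closed with the end-of-turn token
    return {
        "prompt": system + prompt,
        "chosen": example["chosen"] + "<|im_end|>\n",
        "rejected": example["rejected"] + "<|im_end|>\n",
    }
```

With the `datasets` library this would typically be applied as `dataset.map(chatml_format, remove_columns=dataset.column_names)`.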

The code to train this model is available on Google Colab and GitHub. Fine-tuning took about an hour on a Google Colab A100 GPU with 40 GB of VRAM.

TRAINING SPECIFICATIONS
```python
import torch
from transformers import AutoModelForCausalLM, TrainingArguments
from peft import LoraConfig
from trl import DPOTrainer

# LoRA configuration
peft_config = LoraConfig(
    r=16,
    lora_alpha=16,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=['k_proj', 'gate_proj', 'v_proj', 'up_proj',
                    'q_proj', 'o_proj', 'down_proj']
)

# Model to fine-tune
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    load_in_4bit=True
)
model.config.use_cache = False

# Reference model
ref_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    load_in_4bit=True
)

# Training arguments
training_args = TrainingArguments(
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    gradient_checkpointing=True,
    learning_rate=5e-5,
    lr_scheduler_type="cosine",
    max_steps=200,
    save_strategy="no",
    logging_steps=1,
    output_dir=new_model,
    optim="paged_adamw_32bit",
    warmup_steps=100,
    bf16=True,
    report_to="wandb",
)

# Create DPO trainer
dpo_trainer = DPOTrainer(
    model,
    ref_model,
    args=training_args,
    train_dataset=dataset,
    tokenizer=tokenizer,
    peft_config=peft_config,
    beta=0.1,
    max_prompt_length=1024,
    max_length=1536,
)
```

For my understanding:

  1. You first stated the model was contaminated because it used a dataset (https://huggingface.co/datasets/Intel/neural-chat-dataset-v2), which, as somebody pointed out in response to your post, it did not actually use (see: https://huggingface.co/CultriX/MistralTrix-v1/discussions/6).

  2. However, you then stated that the model might still be contaminated, as it's a fine-tune of zyh3826/GML-Mistral-merged-v1, which in turn is a merge of quantumaikr/quantum-v0.01 and mncai/mistral-7b-dpo-v5. If I understand you correctly, you are saying that quantumaikr/quantum-v0.01 and mncai/mistral-7b-dpo-v5 are both flagged (which is true, I checked) and that my model is therefore contaminated as well.

To be clear: I am not saying that I know for sure that my model is not contaminated, and I am certainly not calling you a liar!
I'm just trying to understand how you seem to know for sure that the models my model is based on are indeed contaminated, so that I can look out for similar problems myself should I try to make a new model in the future :)!

Thanks for your time!

-CultriX-

Open LLM Leaderboard org

Hi to you both!

@CultriX I understand where you're coming from! I'm trying to trace back the original flag on both of these models - it's very likely that either the wrong conversation was linked there, or that a message flagging these models was edited and we lost the information that way. I should make an Excel sheet or a dataset of the flags, with the model name, reason, and flag author.

@kno10 Would you be so kind as to open discussions on the model pages for both these models, so we can ask the authors directly if their models were accidentally trained or fine-tuned on possibly contaminated data?

I am not accusing @CultriX of any intentional contamination. My interest is solely in reliable evaluation; contamination harms the evaluation, and the performance of a lot of the current "merge" models appears to be due to contamination, not actual improvements...
He thankfully shared a Colab notebook that made it possible to see how the model was trained.

The initial assumption with neural-chat-dataset-v2 was wrong, as it had not been used. But the base model is zyh3826/GML-Mistral-merged-v1.

According to https://huggingface.co/zyh3826/GML-Mistral-merged-v1, this is a merge of quantumaikr/quantum-v0.01 and mncai/mistral-7b-dpo-v5. Hence the link to #474 above. This is all I could find, because the forums here have neither a working search nor any kind of tracking for these flags!

(screenshots: the flag notices on quantumaikr/quantum-v0.01 and mncai/mistral-7b-dpo-v5)

All I know about these two models is that they are "flagged" with a link to #474 in the leaderboard. I cannot provide further information. In #474 I was asked by @clefourrier to make a new issue for this model...

In #474 you can find a post https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard/discussions/474#6580425f16eb2b758e63c5c3 that shows that quantum-dpo-v0.1 is based on rwitz2/go-bruins-v2.1.1, which apparently is based on jan-hq/trinity-v1, which is apparently a merge of viethq188/LeoScorpius-7B-Chat-DPO, which apparently was trained on Nectar, which apparently contains TruthfulQA.

I could not see why mncai/mistral-7b-dpo-v5 is flagged; my assumption is that it was flagged along with mncai/mistral-7b-dpo-merge-v1.1 (which is described as "merge mncai/mistral-7b-dpo-v6, rwitz2/go-bruins-v2.1.1, ignos/LeoScorpius-GreenNode-Alpaca-7B-v1, janai-hq/trinity-v1", two of which appear to be known to be contaminated). Given the extremely close performance of the two, I assume that it was also contaminated; otherwise the merge would have made a larger difference?

Hence I suggest testing this model for contamination. Maybe it would even be best to pause the leaderboard (and hide all merged/fine-tuned models by default) until this mess is sorted out somewhat. The leaderboard has ceased to be useful without contamination testing.
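For illustration, one common contamination check is word n-gram overlap between training samples and benchmark items. This is a minimal sketch with arbitrary choices (8-gram window, lowercase whitespace tokenization), not the leaderboard's actual detection method, which is more elaborate:

```python
# Sketch: flag training samples that share any word n-gram with a benchmark
# item. Window size and tokenization are arbitrary illustrative choices;
# real contamination checks (e.g. GPT-3-style 13-gram filtering) differ.

def ngrams(text, n):
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def contaminated(train_samples, benchmark_items, n=8):
    # Build the set of all benchmark n-grams once
    bench = set()
    for item in benchmark_items:
        bench |= ngrams(item, n)
    # Return indices of training samples that overlap the benchmark
    return [i for i, s in enumerate(train_samples)
            if ngrams(s, n) & bench]
```

Usage: `contaminated(train_texts, truthfulqa_questions)` would return the indices of suspicious training samples for manual review.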

Open LLM Leaderboard org
edited Jan 25, 2024

Opening a discussion here was definitely the correct thing to do, and thanks for having done that @kno10 !
As you highlighted, some of the parent models have been flagged, but there is no easy way to go back to the why, as they are not mentioned in the current version of the linked discussion. As explained above, this can be due to either an error on our side where we linked the wrong discussion, or to someone editing their comment afterwards.

To investigate if we should keep the flag of the parent model (while we look in the discussions archives), I opened a discussion here to ask the quantum authors about their model.

I want to say thanks to the both of you for the respectful tone of this conversation!
For the record, I fully agree with you @kno10, and I hope my initial response did not come across as me feeling attacked by you, because I don't.

Let me be clear: if my model is indeed contaminated, it's only fair that it gets flagged as such, and it would indeed be bad for the general community and the way forward if that were overlooked. I am therefore thankful that you are actually looking into this stuff, as I must honestly admit that I had absolutely no idea the data might have been contaminated in the first place (it's becoming really hard to keep track, especially with all the merges and the apparent recent increase in confusion amongst the userbase).

That said, I must also admit that it's a little annoying that the model is now flagged (hence not appearing) based on the assumption that it's based on a model that is itself based on models presumed to have used contaminated data in one way or another, even though the "proof" for that (namely, that those models are flagged) links to a discussion that does not mention either model a single time.

Now, as @clefourrier so generously pointed out, there could be multiple reasons why that is, and that could very well be the case here.
However, until it is sorted out, it kind of seems like we are using a "guilty until proven innocent" approach, whereas I personally would have preferred an "innocent until proven guilty" one.

Thanks for looking into this to the both of you though!
With regards,
-CultriX-.

Open LLM Leaderboard org

Hi @CultriX,

We are following an "innocent until proven guilty" approach, by the way - your model has not actually been flagged; the only reason you don't see it appear in the main view is that it's a moerge, and we don't display these by default (as they don't interest the broader part of the community) :)
You'll find the results by unchecking the "hide merges and moerges" checkbox.

(screenshot: the leaderboard filter checkboxes)

@CultriX you cannot use Ctrl+F to find the reference to quantum-dpo-v0.1, because it is only in a screenshot: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard/discussions/474#6580425f16eb2b758e63c5c3

As far as I can tell from the linked threads, there is a documented inheritance via fine-tuning/merging:
CultriX/MistralTrix-v1 -> zyh3826/GML-Mistral-merged -> quantum-dpo-v0.1 -> rwitz2/go-bruins-v2.1.1 -> jan-hq/trinity-v1 -> viethq188/LeoScorpius-7B-Chat-DPO -> Nectar/ultrafeedback-binarized -> TruthfulQA.

That is probably what would be considered "guilty of data contamination"?
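A chain like that can in principle be traced semi-automatically, since many model cards record their parents in the `base_model` metadata field (older cards often omit it). The helper below is a hypothetical sketch, not leaderboard tooling; in practice `fetch_parents` could read live cards via `huggingface_hub.ModelCard.load(repo_id).data` instead of the toy dictionary used here:

```python
# Sketch: trace a model's ancestry through `base_model` card metadata.
# `fetch_parents` is a hypothetical injection point so the walk can be
# demonstrated without network access.

def trace_lineage(repo_id, fetch_parents, seen=None):
    """Depth-first walk over base_model links; returns all ancestors found."""
    if seen is None:
        seen = set()
    for parent in fetch_parents(repo_id):
        if parent not in seen:
            seen.add(parent)
            trace_lineage(parent, fetch_parents, seen)
    return seen

# Toy card graph mirroring the start of the chain discussed above
cards = {
    "CultriX/MistralTrix-v1": ["zyh3826/GML-Mistral-merged-v1"],
    "zyh3826/GML-Mistral-merged-v1": ["quantumaikr/quantum-v0.01",
                                      "mncai/mistral-7b-dpo-v5"],
}

ancestors = trace_lineage("CultriX/MistralTrix-v1",
                          lambda r: cards.get(r, []))
```

Of course this only works as far as the metadata is filled in, which is exactly the gap in cases like the one above.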

Thanks again for the continued time invested and the explanations provided! If that is the case, I agree with the stated assumption. However, it does make one curious just how many of the models on the board would contain some sort of contaminated data if one dug deep enough into the model's heritage...

@clefourrier thank you for clarifying that, but I was aware that merges do not show up unless you uncheck that box! It seems, however, that I was misled by another issue I am experiencing, where pretty much none of my models show up unless I also uncheck the "hide deleted/private" checkbox, even though the models are very much publicly available?

(Note: I actually made a separate post about that issue, so it's a little off-topic for this thread. But since I caught you here and it's not actually resolved yet, I figured you might know why this is occurring? The reason I'm mentioning it now is that it gave me the false impression that my models weren't showing up due to being flagged, even though I now realise that was not the case. Sorry for my grumpy "guilty until proven innocent" remark, which was a bit uncalled for in hindsight... As I don't want to change the topic of this discussion any more than this, here's the link to the post I opened on the issue: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard/discussions/560.)

Cheers!

Open LLM Leaderboard org

@kno10 I think I agree with you, and will flag this model to avoid people building on top of it accidentally.

@CultriX I'm still investigating this other issue - I'm hoping to resolve it soon, thanks for your patience.

clefourrier changed discussion status to closed
