[FLAG] CultriX/MistralTrix-v1 is based on quantumaikr/quantum-v0.01 and mncai/mistral-7b-dpo-v5

#556
by kno10 - opened

CultriX/MistralTrix-v1 is apparently based on zyh3826/GML-Mistral-merged, which is a merge of quantumaikr/quantum-v0.01 and mncai/mistral-7b-dpo-v5, both of which are flagged with a link to https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard/discussions/474.

Open LLM Leaderboard org

@CultriX could you add more details about your model so we can know if there are contamination risks?

Hi,

It's basically as described on the model page! I've added that description below in this post.
TL;DR: It's zyh3826/GML-Mistral-merged-v1 fine-tuned with Intel's DPO dataset (Intel/orca_dpo_pairs).

Hope that clarifies things!
If you need/require anything else, let me know and I'll try to answer your questions to the best of my abilities.
I am, however, an undeniable amateur when it comes to all of this, so I might not have all the answers!
I'm basically just messing around, and this time it happened to turn out decently enough and produced a pretty nice model, I suppose :)!

With regards,
-CultriX-

######## DESCRIPTION #########
MistralTrix-v1 is a zyh3826/GML-Mistral-merged-v1 model that has been further fine-tuned with Direct Preference Optimization (DPO), using Intel's dataset for neural-chat-7b-v3-1. It surpasses the original model on several benchmarks (see results).

It is directly inspired by the RLHF process described by Intel/neural-chat-7b-v3-1's authors to improve performance. I used the same dataset and reformatted it to apply the ChatML template.
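For readers unfamiliar with that step, the reformatting can be sketched roughly as follows. The field names (`system`, `question`, `chosen`, `rejected`) follow the public Intel/orca_dpo_pairs dataset; the exact template used for MistralTrix is an assumption on my part.

```python
# Sketch: reformat one Intel/orca_dpo_pairs record into ChatML-style
# prompt/chosen/rejected fields as expected by trl's DPOTrainer.
# The precise template used for MistralTrix is assumed, not confirmed.

def chatml_format(example):
    # System turn (some records in the dataset have an empty system field)
    system = ""
    if example.get("system"):
        system = f"<|im_start|>system\n{example['system']}<|im_end|>\n"

    # User turn plus the opening of the assistant turn
    prompt = f"<|im_start|>user\n{example['question']}<|im_end|>\n<|im_start|>assistant\n"

    # Preferred and rejected completions, each closed with the end-of-turn token
    return {
        "prompt": system + prompt,
        "chosen": example["chosen"] + "<|im_end|>\n",
        "rejected": example["rejected"] + "<|im_end|>\n",
    }
```

With the `datasets` library this would typically be applied as `dataset.map(chatml_format, remove_columns=dataset.column_names)`.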

The code to train this model is available on Google Colab and GitHub. Fine-tuning took about an hour on a Google Colab A100 GPU with 40 GB of VRAM.

TRAINING SPECIFICATIONS
```python
import torch
from transformers import AutoModelForCausalLM, TrainingArguments
from peft import LoraConfig
from trl import DPOTrainer

# LoRA configuration
peft_config = LoraConfig(
    r=16,
    lora_alpha=16,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=['k_proj', 'gate_proj', 'v_proj', 'up_proj',
                    'q_proj', 'o_proj', 'down_proj']
)

# Model to fine-tune
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    load_in_4bit=True
)
model.config.use_cache = False

# Reference model
ref_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    load_in_4bit=True
)

# Training arguments
training_args = TrainingArguments(
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    gradient_checkpointing=True,
    learning_rate=5e-5,
    lr_scheduler_type="cosine",
    max_steps=200,
    save_strategy="no",
    logging_steps=1,
    output_dir=new_model,
    optim="paged_adamw_32bit",
    warmup_steps=100,
    bf16=True,
    report_to="wandb",
)

# Create DPO trainer
dpo_trainer = DPOTrainer(
    model,
    ref_model,
    args=training_args,
    train_dataset=dataset,
    tokenizer=tokenizer,
    peft_config=peft_config,
    beta=0.1,
    max_prompt_length=1024,
    max_length=1536,
)
```

For my understanding:

  1. You first stated the model was contaminated because it used a dataset (https://huggingface.co/datasets/Intel/neural-chat-dataset-v2), which, as somebody pointed out in response to your post, it did not actually use (see: https://huggingface.co/CultriX/MistralTrix-v1/discussions/6).

  2. However, you then stated that the model might still be contaminated, as it's a fine-tune of zyh3826/GML-Mistral-merged-v1, which in turn is a merge of quantumaikr/quantum-v0.01 and mncai/mistral-7b-dpo-v5. If I understand you correctly, you are saying that quantumaikr/quantum-v0.01 and mncai/mistral-7b-dpo-v5 are both flagged (which is true, I checked) and that my model is therefore contaminated as well.

To be clear: I am not saying that I know for sure that my model is not contaminated, and I am certainly not calling you a liar!
I'm just trying to understand how you seem to know for sure that the models my model is based on are indeed contaminated, so that I can look out for similar problems myself should I try to make a new model in the future :)!

Thanks for your time!

-CultriX-

Open LLM Leaderboard org

Hi to you both!

@CultriX I understand where you're coming from! I'm trying to trace back the original flag on both of these models - it's very likely that either the wrong conversation was linked there, or that a message flagging these models was edited and we lost the information that way. I should make an Excel sheet or a dataset of the flags, with the model name, reason, and flag author.

@kno10 Would you be so kind as to open discussions on the model pages for both these models, so we can ask the authors directly if their models were accidentally trained or fine-tuned on possibly contaminated data?

I am not accusing @CultriX of any intentional contamination. My interest is solely in reliable evaluation; contamination harms the evaluation, and the performance of a lot of the current "merge" models appears to be due to contamination, not actual improvements...
He thankfully shared a Colab notebook that made it possible to see how the model was trained.

The initial assumption with neural-chat-dataset-v2 was wrong, as it had not been used. But the base model is zyh3826/GML-Mistral-merged-v1.

According to https://huggingface.co/zyh3826/GML-Mistral-merged-v1, this is a merge of quantumaikr/quantum-v0.01 and mncai/mistral-7b-dpo-v5. Hence the link to #474 above. This is all I could find, because the forums here have neither a working search nor any kind of tracking for these flags!

(screenshots: the flag notices on quantumaikr/quantum-v0.01 and mncai/mistral-7b-dpo-v5)

All I know about these two models is that they are "flagged" with a link to #474 in the leaderboard. I cannot provide further information. In #474 I was asked by @clefourrier to make a new issue for this model...

In #474 you can find a post https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard/discussions/474#6580425f16eb2b758e63c5c3 that shows that quantum-dpo-v0.1 is based on rwitz2/go-bruins-v2.1.1, which apparently is based on jan-hq/trinity-v1, which is apparently a merge of viethq188/LeoScorpius-7B-Chat-DPO, which apparently was trained on Nectar, which apparently contains TruthfulQA.

I could not see why mncai/mistral-7b-dpo-v5 is flagged; my assumption is that it was flagged along with mncai/mistral-7b-dpo-merge-v1.1 (which is described as "merge mncai/mistral-7b-dpo-v6, rwitz2/go-bruins-v2.1.1, ignos/LeoScorpius-GreenNode-Alpaca-7B-v1, janai-hq/trinity-v1", two of which appear to be known to be contaminated). Given the extremely close performance of the two, I assume that it was also contaminated; otherwise the merge would have made a larger difference?

Hence I suggest testing this model for contamination. Maybe it would even be best to pause the leaderboard (and hide all merged/fine-tuned models by default) until this mess is sorted out somewhat. The leaderboard has ceased to be useful without contamination testing.
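For illustration, one common contamination check is word n-gram overlap between training samples and benchmark items. This is a minimal sketch with arbitrary choices (8-gram window, lowercase whitespace tokenization), not the leaderboard's actual detection method, which is more elaborate:

```python
# Sketch: flag training samples that share any word n-gram with a benchmark
# item. Window size and tokenization are arbitrary illustrative choices;
# real contamination checks (e.g. GPT-3-style 13-gram filtering) differ.

def ngrams(text, n):
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def contaminated(train_samples, benchmark_items, n=8):
    # Build the set of all benchmark n-grams once
    bench = set()
    for item in benchmark_items:
        bench |= ngrams(item, n)
    # Return indices of training samples that overlap the benchmark
    return [i for i, s in enumerate(train_samples)
            if ngrams(s, n) & bench]
```

Usage: `contaminated(train_texts, truthfulqa_questions)` would return the indices of suspicious training samples for manual review.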

Open LLM Leaderboard org
edited Jan 25, 2024

Opening a discussion here was definitely the correct thing to do, and thanks for having done that @kno10 !
As you highlighted, some of the parent models have been flagged, but there is no easy way to go back to the why, as they are not mentioned in the current version of the linked discussion. As explained above, this can be due to either an error on our side where we linked the wrong discussion, or to someone editing their comment afterwards.

To investigate if we should keep the flag of the parent model (while we look in the discussions archives), I opened a discussion here to ask the quantum authors about their model.

I want to say thanks to the both of you for the respectful tone of this conversation!
For the record, I fully agree with you @kno10, and I hope my initial response did not come across as me feeling attacked by you, because I don't.

Let me be clear: if my model is indeed contaminated, it's only fair that it gets flagged as such, and it would indeed be bad for the general community and the way forward if that were overlooked. I am therefore thankful that you are actually looking into this stuff, as I must honestly admit that I had absolutely no idea the data might have been contaminated in the first place (it's becoming really hard to keep track, especially with all the merges and the apparent recent increase in confusion amongst the userbase).

That said, I must also admit that it's a little annoying that the model is now flagged (hence not appearing) based on the assumption that it's based on a model that is itself based on models presumed to have used contaminated data in one way or another, even though the "proof" for that (namely, that those models are flagged) links to a discussion that does not mention either model a single time.

Now, as @clefourrier so generously pointed out, there could be multiple reasons why that is, and that could very well be the case here.
However, until it is sorted out, it kind of seems like we are using a "guilty until proven innocent" approach, whereas I personally would have preferred an "innocent until proven guilty" one.

Thanks for looking into this to the both of you though!
With regards,
-CultriX-.

Open LLM Leaderboard org

Hi @CultriX,

We are following an "innocent until proven guilty" approach, by the way - your model has not actually been flagged; the only reason you don't see it appear in the main view is that it's a moerge, and we don't display these by default (as they don't interest the broader part of the community) :)
You'll find the results by unchecking the "hide merges and moerges" checkbox.

(screenshot: the leaderboard filter checkboxes)

@CultriX you cannot use Ctrl+F to find the reference to quantum-dpo-v0.1, because it is only in a screenshot: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard/discussions/474#6580425f16eb2b758e63c5c3

As far as I can tell from the linked threads, there is a documented inheritance via fine-tuning/merging:
CultriX/MistralTrix-v1 -> zyh3826/GML-Mistral-merged -> quantum-dpo-v0.1 -> rwitz2/go-bruins-v2.1.1 -> jan-hq/trinity-v1 -> viethq188/LeoScorpius-7B-Chat-DPO -> Nectar/ultrafeedback-binarized -> TruthfulQA.

That is probably what would be considered "guilty of data contamination"?
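A chain like that can in principle be traced semi-automatically, since many model cards record their parents in the `base_model` metadata field (older cards often omit it). The helper below is a hypothetical sketch, not leaderboard tooling; in practice `fetch_parents` could read live cards via `huggingface_hub.ModelCard.load(repo_id).data` instead of the toy dictionary used here:

```python
# Sketch: trace a model's ancestry through `base_model` card metadata.
# `fetch_parents` is a hypothetical injection point so the walk can be
# demonstrated without network access.

def trace_lineage(repo_id, fetch_parents, seen=None):
    """Depth-first walk over base_model links; returns all ancestors found."""
    if seen is None:
        seen = set()
    for parent in fetch_parents(repo_id):
        if parent not in seen:
            seen.add(parent)
            trace_lineage(parent, fetch_parents, seen)
    return seen

# Toy card graph mirroring the start of the chain discussed above
cards = {
    "CultriX/MistralTrix-v1": ["zyh3826/GML-Mistral-merged-v1"],
    "zyh3826/GML-Mistral-merged-v1": ["quantumaikr/quantum-v0.01",
                                      "mncai/mistral-7b-dpo-v5"],
}

ancestors = trace_lineage("CultriX/MistralTrix-v1",
                          lambda r: cards.get(r, []))
```

Of course this only works as far as the metadata is filled in, which is exactly the gap in cases like the one above.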

Thanks again for the continued time invested and the explanations provided! If that is the case, I agree with the stated assumption. However, it does make one curious just how many of the models on the board would contain some sort of contaminated data if one dug deep enough into the model's heritage...

@clefourrier thank you for clarifying that, but I was aware that merges do not show up unless you uncheck that box! It seems, however, that I was misled by another issue I am experiencing, where pretty much none of my models show up unless I also uncheck the "hide deleted/private" checkbox, even though the models are very much publicly available?

(Note: I actually made a separate post about that issue, so it's a little off-topic for this thread. But since I caught you here and it's not actually resolved yet, I figured you might know why this is occurring? The reason I'm mentioning it now is that it gave me the false impression that my models weren't showing up due to being flagged, even though I now realise that was not the case. Sorry for my grumpy "guilty until proven innocent" remark, which was a bit uncalled for in hindsight... As I don't want to change the topic of this discussion any more than this, here's the link to the post I opened on the issue: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard/discussions/560.)

Cheers!

Open LLM Leaderboard org

@kno10 I think I agree with you, and will flag this model to avoid people building on top of it accidentally.

@CultriX I'm still investigating this other issue - I'm hoping to resolve it soon, thanks for your patience.

clefourrier changed discussion status to closed
