[FLAG] AIDC-ai-business/Marcoroni-7B-v3

#471
by Q-bert - opened

I found some new things about the AIDC-ai-business/Marcoroni-7B-v3 model. I think they used an eval dataset, because they don't explain anything and they changed the README at one point. The whole thing has me confused.

I mentioned it here too.
https://huggingface.co/AIDC-ai-business/Marcoroni-7B-v3/discussions/7

https://huggingface.co/AIDC-ai-business/Marcoroni-7B-v3/commit/564cc988785422fe16a240e42dcacc45b4691df8

@clefourrier

My model's scores:
[screenshot of my model's scores]
Their model's scores:
[screenshot of their model's scores]

And there are many merged models based on Marcoroni; I can't dig into all of them.

Open LLM Leaderboard org

Hi! Thanks for the flag.

FYI, the authors have asked us to remove the bfloat16 version of their model from the leaderboard here because of a dataset problem, but did not request the float16 model to be removed - I'm assuming it's not contaminated (though if it's trained on a variant of MetaMath it could be).

Let's wait for the authors to answer as we don't flag models without more proof.

Tagging @AIDC-ai-business - did you conduct contamination analysis on the float16 model too?

deleted

We used three widely used open-source preference-ranking datasets, removed possible duplicates, and finally trained on a subset of them with no additional data.

The data set contains:

-- Intel/orca_dpo_pairs: https://huggingface.co/datasets/Intel/orca_dpo_pairs

-- argilla/ultrafeedback-binarized-preferences: https://huggingface.co/datasets/argilla/ultrafeedback-binarized-preferences

-- berkeley-nest/Nectar: https://huggingface.co/datasets/berkeley-nest/Nectar
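
(For illustration only, here is a rough sketch of how three preference datasets like these could be combined and deduplicated by prompt with the Hugging Face `datasets` library. This is not our exact pipeline, and the column names are assumptions that differ per dataset.)

```python
# Hypothetical sketch: combine the three preference datasets and drop duplicate prompts.
# Not the actual training pipeline; column names vary per dataset and are assumed here.
from datasets import load_dataset

def extract_prompts(name, prompt_field):
    ds = load_dataset(name, split="train")
    return [ex[prompt_field] for ex in ds]

sources = [
    ("Intel/orca_dpo_pairs", "question"),
    ("argilla/ultrafeedback-binarized-preferences", "instruction"),
    ("berkeley-nest/Nectar", "prompt"),
]

seen, merged = set(), []
for name, field in sources:
    for prompt in extract_prompts(name, field):
        key = prompt.strip().lower()
        if key not in seen:  # simple exact-match dedup on the prompt text
            seen.add(key)
            merged.append({"source": name, "prompt": prompt})

print(f"{len(merged)} unique prompts after deduplication")
```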

deleted

If there are no problems with the base model Q-bert/MetaMath-Cybertron-Starling or with the open-source datasets we use, then there is no eval-dataset problem with our model as a whole.

Just adding my two cents. I happened to run my personal test on both of these models, and MetaMath-Cybertron-Starling scored notably higher than Marcoroni v3 (61 vs 57).

This of course isn't evidence of contamination, but what stands out is the 5-point gain on TruthfulQA. I don't understand how DPO fine-tuning of a model that was already heavily DPO fine-tuned (e.g. Cybertron) can raise TruthfulQA at all, let alone by 5 points, without being contaminated with TruthfulQA data.

I also tested well over 20 Mistrals, and the large >1 point bump on HF over M-C-S, combined with the notable drop in my testing, is the most drastic gap I came across. Even if Marcoroni v3 isn't contaminated, I am 100% sure it doesn't belong higher on the leaderboard. Somehow it found a way to compromise real-world performance in order to do better on the tests, especially TruthfulQA. And the constant bragging about being number one on the model card is more than a little suspect. Again, I am 100% certain this LLM's score is not earned, whether or not it's contaminated.

Hello, I remember replying to your issue and updating our weights. I'd like to know whether you tested with the updated weights, and also whether you could open-source your evaluation method so the respective models can be compared more directly.

viethq188/LeoScorpius-7B-Chat-DPO is the DPO version of LeoScorpius-7B (a merge of Marcoroni-7B-v3 and MetaMath-Cybertron-Starling), now ranking 1st on the leaderboard. They used DPO to improve their TruthfulQA score from 63.95 to 68.83.

deleted

@xxyyy123

  1. I didn't even bother testing the LLM you mentioned because 68.83 was obvious nonsense.

  2. Yes, I re-tested and it did MUCH better. Something went so wrong with the prior version I couldn't even complete the test because it was outputting random things.

  3. You have every right to be skeptical of my test because it's private. But it needs to remain private, because it's my way of validating LLMs that may have found a way to game the HF tests. So your request is perfectly valid, but the test can't be made public. Plus it requires a lot of subjective analysis (e.g. scoring stories on the frequency and egregiousness of contradictions).

deleted

@xxyyy123 To clarify, I'm not saying this is about contamination. Why would contamination alone cause any drop in performance?

Perhaps the very act of fine-tuning a slerp merge causes the performance issues I've observed. All fine-tunes of slerps, not just yours, perform worse, and in a similar way. Simply put, they become more stubborn. That is, the nuances in the user's prompt and the knowledge in the underlying foundation model are increasingly ignored in favor of the fine-tuning data, resulting in things like a spike in story contradictions.
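
(For anyone unfamiliar with what a slerp merge actually does to the weights, here is a toy sketch of spherical linear interpolation between two parameter tensors. It's only an illustration of the general idea, not the recipe any particular merge used.)

```python
# Toy illustration of SLERP (spherical linear interpolation) between two weight tensors,
# the interpolation used by "slerp" model merges. Illustrative only, not a real merge script.
import torch

def slerp(w0: torch.Tensor, w1: torch.Tensor, t: float, eps: float = 1e-8) -> torch.Tensor:
    a, b = w0.flatten(), w1.flatten()
    # Angle between the two weight vectors.
    cos_omega = torch.dot(a, b) / (a.norm() * b.norm() + eps)
    omega = torch.acos(cos_omega.clamp(-1.0, 1.0))
    if omega.abs() < eps:  # nearly parallel: fall back to plain linear interpolation
        return (1 - t) * w0 + t * w1
    sin_omega = torch.sin(omega)
    return (torch.sin((1 - t) * omega) / sin_omega) * w0 + (torch.sin(t * omega) / sin_omega) * w1

# Example: blend a layer's weights halfway between two parent models.
merged = slerp(torch.randn(1024, 1024), torch.randn(1024, 1024), t=0.5)
```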

And thinking about it now, ignoring the foundation model's data more should increase TruthfulQA (more of the falsehoods it contains get ignored). Plus, performance on simple, objectively graded LLM tests like MMLU wouldn't be impacted, because almost everything in the fine-tuning data (PPO) is factually correct, so ignoring the foundation model should, if anything, increase scores on those tests.

We used three widely used open-source preference-ranking datasets, removed possible duplicates, and finally trained on a subset of them with no additional data.

The data set contains:

-- Intel/orca_dpo_pairs: https://huggingface.co/datasets/Intel/orca_dpo_pairs

-- argilla/ultrafeedback-binarized-preferences: https://huggingface.co/datasets/argilla/ultrafeedback-binarized-preferences

-- berkeley-nest/Nectar: https://huggingface.co/datasets/berkeley-nest/Nectar

You are lying.

[screenshot of the README]

You wrote this in the README.

I think there might be some misunderstandings, so let me clarify the timeline:

First, we completed the DPO training and uploaded the model.

When submitting for the bf16 evaluation, we realized that we needed to have a ReadMe. To save time, we directly copied the ReadMe from AIDC-ai-business/Marcoroni-7B-v2, modified the version number, and submitted the evaluation.

After the evaluation results came out, we found that they were very low. Additionally, after looking into the issue @Phil337 raised on the model, we discovered that there were problems with the model weights we had uploaded.

We updated the weights and raised an issue on the leaderboard, hoping to withdraw our bf16 submission.

After updating the weights, we noticed that others had submitted our model for the fp16 evaluation and that it was ranking high on Hugging Face. Realizing that others would see our model, we quickly finished refining the model's ReadMe.

[screenshot of the updated ReadMe]

The issue raised on the leaderboard was resolved, and we re-evaluated our model on bf16.

I hope this can clear up your confusion.

@Q-bert @clefourrier

In fact, the weight and ReadMe updates happened within just a few hours of each other, and that was 4 to 5 days ago, so I believe most people will have seen the updated information.

FYI, there is a discussion on ultrafeedback-binarized and Nectar with regard to data contamination.
https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard/discussions/474
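
(For anyone who wants to spot-check this themselves, here is a rough sketch of a simple overlap check between a preference dataset's prompts and TruthfulQA questions. It is only a crude substring match, not a proper n-gram contamination analysis, and the Nectar column name is an assumption.)

```python
# Crude contamination spot-check: do TruthfulQA questions appear among a
# preference dataset's prompts? Illustrative only; a real analysis would use
# n-gram overlap or embedding similarity rather than exact substring matching.
from datasets import load_dataset

truthfulqa = load_dataset("truthful_qa", "generation", split="validation")
nectar = load_dataset("berkeley-nest/Nectar", split="train")

# Column name "prompt" is assumed for Nectar; adjust if the schema differs.
prompts = [row["prompt"].lower() for row in nectar]

hits = 0
for q in truthfulqa["question"]:
    q_norm = q.lower().strip()
    # Slow (full scan per question), but fine for a one-off spot check.
    if any(q_norm in p for p in prompts):
        hits += 1

print(f"{hits}/{len(truthfulqa)} TruthfulQA questions found verbatim in the prompts")
```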

Open LLM Leaderboard org
edited Jan 5

Since this model was removed from the Hub, I'm going to make sure its results are flagged as "deleted" and close this issue.

clefourrier changed discussion status to closed