How did you get the MMLU so high?

#2
by distantquant - opened

The MMLU score is about 7 points above even 70B models, and the MT-Bench score beats Miqu 70B, which was the top "open" leader until now.

What methods were used for training this model for such performance?

CausalLM org

One possible reason for the model's impressive performance is the use of LLMs with large context windows to process plain-text pre-training data, creating massive synthetic datasets for continual pre-training of the base model. As for the MMLU score, though, it is more likely a cumulative effect: non-subjective data contamination is inevitable. Therefore, I believe benchmark scores, especially MMLU, which is strongly correlated with a model's knowledge capacity, do not necessarily reflect the model's general capabilities directly. As such, I am hesitant to tout this high MMLU score, since I do not see any tangible benefit it brings; instead, I present evidence of non-subjective contamination to avoid unfounded accusations and debates.
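Purely as an illustration of that idea, a minimal sketch of such a data-synthesis step might look like the following; the chunk size, prompt wording, and the `rewrite_with_llm` helper are placeholder assumptions, not the actual pipeline:

```python
# Minimal, purely illustrative sketch: use a long-context LLM to turn plain-text
# corpora into synthetic passages for continual pre-training. The helper
# `rewrite_with_llm`, the chunk size, and the prompt are placeholder assumptions.

def rewrite_with_llm(prompt: str) -> str:
    # Stand-in for whatever long-context inference stack is actually used.
    raise NotImplementedError("call your long-context LLM here")

def chunk_text(text: str, max_chars: int = 100_000) -> list[str]:
    """Split a document into pieces that fit the model's context window."""
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

def synthesize(documents: list[str]) -> list[str]:
    """Produce rewritten, self-contained training passages from raw documents."""
    synthetic = []
    for doc in documents:
        for chunk in chunk_text(doc):
            prompt = (
                "Rewrite the following source text as clean, self-contained "
                "training passages, preserving facts and terminology:\n\n" + chunk
            )
            synthetic.append(rewrite_with_llm(prompt))
    return synthetic
```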

On the other hand, I have no intention of making biased comparisons with other models, as we all fall far short of GPT-4's capabilities, even in specific tasks and niche areas. My purpose in releasing this model is more to showcase what we can currently achieve: it is not simply a matter of scaling training data and model size to attain OpenAI-level improvements.

Ah, cool, thanks for the response.

The math:
Yi-34B scores 0.30 in the contamination contrast tool, the same as the 7B.
0.38 - 0.30 = 0.08 (8 points)
85.6 / 100 * 8 ≈ 6.8
85.6 - 6.8 = 78.8

"Real" MMLU score: 78.8, which is in line with how this model actually performs. I got it up to 79.06.
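For reference, the adjustment applied above can be written out explicitly. This is only a restatement of the post's own arithmetic, assuming (as the post does) a 1:1 linear relationship between extra contamination and MMLU score:

```python
def linearly_adjusted_mmlu(reported: float, contamination: float, baseline: float) -> float:
    """Linear adjustment as posited above: treat each extra point of
    contamination probability as one percent of the reported score."""
    delta_points = (contamination - baseline) * 100      # 0.38 - 0.30 -> 8 points
    return reported - reported * delta_points / 100

# This model: reported MMLU 85.6, contamination 0.38 vs. a 0.30 baseline (Yi-34B)
print(round(linearly_adjusted_mmlu(85.6, 0.38, 0.30), 1))  # -> 78.8
```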

@fblgit
I have no intention of arguing with you. However, your approach of subtracting the contamination probabilities and then applying the difference directly to the MMLU score assumes a linear relationship between contamination and the score. To me, this feels like an accusation that I trained on the MMLU test set, as your calculation seems to hold only under that assumption.

And here is the recalculated equivalent MMLU score for microsoft/Orca-2-7b based on your assumption, which is interesting:

0.77 - 0.22 = 0.55 (55 points)
56.37 / 100 * 55 ≈ 31
56.37 - 31 = 25.37

Reported MMLU score for microsoft/Orca-2-7b: 56.37
"Real" MMLU score for microsoft/Orca-2-7b: 25.37

In other words, according to your methodology, the actual performance of Orca-2-7b is similar to the accuracy of random choice.
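Plugging the Orca-2-7b figures quoted above into that same hypothetical linear adjustment reproduces the numbers:

```python
# The same linear adjustment as above, applied to microsoft/Orca-2-7b's quoted figures.
reported = 56.37       # reported MMLU score
contamination = 0.77   # contamination probability quoted above
baseline = 0.22        # baseline contamination quoted above

delta_points = (contamination - baseline) * 100          # ~55 points
adjusted = reported - reported * delta_points / 100
print(round(adjusted, 2))  # -> 25.37, i.e. roughly random choice on 4-option MMLU
```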

Share the dataset so it can be analyzed...
Orca is known to be extremely contaminated.
There is an 8% increase in contamination, aligned with an 8% increase in this model's MMLU performance.
Contamination often takes place without notice.

@fblgit We have shared a subset of our training data: CausalLM/Refined-Anime-Text

Due to the significant cost involved, and as a self-funded project without explicit revenue or competitive targets at the moment, we will initially only consider releasing a subset of the data. I believe this will still be an order of magnitude larger than some other synthetic datasets publicly available on Hugging Face. However, we will release other subsets in the future, so stay tuned.

As for the computation you mentioned, where the probability figures are subtracted directly to arrive at a difference in probability: this is only valid if these are two mutually exclusive and independent events, which is patently not the case here.

And there is no evidence to suggest a linear relationship between the data contamination probability and the MMLU score. I have already provided a sufficient counter-example above with Orca.

I am willing to attribute this to an honest mistake. I trust you did not intentionally introduce such a fallacious red herring in an attempt to stir controversy, as anyone with even a rudimentary understanding of probability theory could quickly see through it.

"Orca is known to be extremely contaminated."

I would appreciate it if you could refrain from making further disparaging comparisons and unsubstantiated accusations here about other models' performance. I also made it clear in my response above that I do not believe the increase in MMLU score actually translates into improved performance on downstream tasks, that we are still far from OpenAI's GPT-4, and that victories in narrow subdomains are not very meaningful.

Hi,

The contamination tests behind these 34B models are large and diverse; they do provide a contrastive way to compare models.
A model with the same base has an MMLU contamination of 0.3 and scores 78+.
This model has an MMLU contamination of 0.38 (8% more) and shows an 8% increase in the MMLU score. That makes sense to me; it is simply common-sense reasoning: an 8% improvement in score alongside an 8% increase in contamination, 1:1.
You don't need to release anything you don't want to; you can run these contamination analyses over your own datasets and verify it yourself.

Regarding Orca, let's look at the numbers provided by the same contrastive contamination tool you are using; I believe the Orca models are among the top detections for contamination.
The truth is that rephrased elements are also detected by the tool.

JosephusCheung changed discussion status to closed
JosephusCheung locked this discussion