[FLAG] fblgit/una-xaberius-34b-v1beta

#444
by XXXGGGNEt - opened

It was trained from Yi-34B, yet it sits between two flagged models, despite being half their size.
image.png

compared with its base model:

ARC 64.59->70.39 (↑9.0%)
TruthfulQA 56.23->61.45 (↑9.3%)
GSM8K 50.64->63.38 (↑25.2%)

The maker claimed that he used a method called UNA, but he has never said a word about what it actually is. He said he would share details later when he had time, but I doubt he ever will. And it is suspicious one-man work.

and there are discussions about its hallucinations, suggesting that it was overfit on benchmarks: https://huggingface.co/fblgit/una-xaberius-34b-v1beta/discussions/2 and https://www.reddit.com/r/LocalLLaMA/comments/18ejwm1/what_is_fblgituna_unified_neural_alignment_looks/

Either he explains what he did, or this model should be flagged.

These endless cheats are killing the credibility of the leaderboard. @clefourrier

Not sure if I'm convinced, but transparency on UNA would be nice, especially since it's popping up more and more (e.g. in merges).

Edit: Something's not right. People keep taking the Mistral leader, adding some mysterious DPO to it, and now they score higher than any Llama 70B, e.g. the current leader Marcoroni 7B v3 with a score of 72.5 (https://huggingface.co/AIDC-ai-business/Marcoroni-7B-v3).

UNA models are not like non-UNA models; their properties are unique and this is now known. This FLAG is nonsense, and I won't release something that can indeed be dangerous to society, especially under bigcorp.com... U can say whatever u want, bring contamination evidence.. u are making it look like the models are not extremely intelligent :)

Either he explains what he did, or this model should be flagged.

What nonsense... u should avoid typing under the influence.

UNA models are not like non-UNA models, their properties are unique and this is now known.

So you are not going to say a word about the method, let alone the suspicious 'good' performance?
Fair enough; I doubt you can tell us the truth, since you cheated, either by training directly on the benchmarks or on rewrites of them.
Or you could say something before we examine the differences in loss across the train/test/dev sets and prove it is seriously contaminated on purpose.

You have been evading direct answers about your methods. Combined with these questionable results, if you cannot explain them, an observer other than yourself can only draw one conclusion: you have not maintained the most basic integrity.

So you are not going to say a word about the method, let alone the suspicious 'good' performance? LOL...
IDK man, the thousands of ppl highlighting the uniqueness of UNA and its models... the hundreds of people training their model a second time without degradation.. the scores.. the performance on real tasks.. is not enough.. What is needed is for you to validate the code... hah...
this really looks like a kid whimpering on the supermarket floor for source code he doesn't have.. not many arguments, no rationale... Talk is free, commits are not..

Go find some evidence to support your fairy tale.. So far, UNA models are SOLID.. Juanako, Cybertron, Xaberius.. they are Unique UNA Kings.. no contamination.
The only contamination resides in your mouth.. And to confirm, I'm not sharing the source code with you, kid.

UNA: Uniform Neural Alignment. It goes into the attention and the multilayer perceptrons, and it does what it says. There are multiple phases... Juanako = UNAv1, only implemented at the perceptron level. Cybertron = UNAv2, applied to both the MLPs and attention.. Xaberius = UNAv1.. meaning I can release a much more powerful version of it. It's based on platypus-34b, and if u compare the performance.. it's not too distant from it. And if u compare what UNA increases (rationale/logic capacity).. u'll see the pattern across them.

End of the story, and if I feel I'm being extorted for source code.. I'll do magnets :)

cringe.
cuz there is NO method, just cheating on bench data. and no one is begging for your code, boy.
Kejserens nye Klæder (The Emperor's New Clothes).
You're just too afraid to say a word about your mythical method.

Go find some evidence to support your fairy tale

The only contamination resides in your mouth.. And to confirm, I'm not sharing the source code with you, kid.

I'm sorry but someone as rotten as this is clearly not mature enough to be taken seriously lol

Yah.. it's ur professional and expert assessment...

Present the evidence, looneyboy.. the models are there ... What stops U from supporting words with evidence ?

What are you afraid of?
Of course we will prove that soon.

I'm looking forward to you proving what doesn't exist :D
The sad part will be that once you arrive at that conclusion, you may not have what it takes to come back here and close this flag :)

deleted

@fblgit I don't know squat about the inner workings of the transformer architecture, but if there's something special about UNA's design that has nothing to do with the weights themselves, but rather the transformer layers, then why can't it be applied and not applied to the same foundation model and fine-tuning datasets in order to quantify the performance gain? And can it be done to an existing LLM like Starling or Dolphin to give them a boost so the change in performance can be observed, or does it have to be applied during fine-tuning?

I'm honestly curious and am not trying to accuse anyone of anything. But others have a right to be cynical. If you witnessed someone else's LLMs getting performance boosts and taking over the leaderboard thanks to an unexplained technique that they refuse to explain in even a general way, would you blindly accept it? Is keeping your discovery secret so important that it's worth causing tension in this community?

Perhaps reviewing fblgit's previous paper will provide some insights - https://github.com/fblgit/hypothetical-frameworks. And don't worry, it's preserved in the Wayback Machine, wouldn't want to lose this valuable work after all!

Thanks @deus-ex-machina , this was too technical for me, but to my eyes it came off professional and legit. However, what stood out the most was that despite the lack of details (just vague ideas) it was littered with IP warnings, such as the warning at the end about ICL not being for commercial use and PATENTED stamped on all the vague readmes from 9 months ago. If there's actually something to UNA, and you're in it for the money, then submit it to the patent office.

I tried the UNA models and they, ironically, went their own way and didn't respect the nuance of my prompts as well as other LLMs do (e.g. story prompts). So UNA appears to have failed at its stated objective of better aligning AI outputs with humans, at least for now.

@Phil337 Indeed, it is highly unusual to publish a paper so lacking in details or citations. Personally, I find the text lacking in...substance, almost like it's generated filler that's written to look like it means something, but it really doesn't. But, that couldn't be the case, right? Right?

copyleaks.com Scans
copyleaks_com_scan.png

gptzero.me Scans
gptzero_me_scans.png

While I will admit false positives are a possibility, these numbers seem a bit... too high for that. The ICL one seems to be a false negative with Copyleaks: picking it apart in chunks, it did find patterns matching what it believed to be AI-generated text.

Open LLM Leaderboard org

Hi @fblgit , just to clarify something: flagging a model indicates that its results on the evals we are using are dubious - could you please indicate what is in your training set?

deleted

@deus-ex-machina I sensed that the substance was lacking but couldn't comment on that since I don't have the requisite expertise. Although he deserves some credit for making it sound smart and legit to someone like me with only a basic understanding of LLMs & fine-tuning.

@clefourrier it's in a private repo under my user; I allow you to look at it yourself and share it with the community, hope the telenovela crisis ends. And please, if you guys have run some other extra tests .. share the results as well.. I think HF has the compute to run the UNAs against every test ever created, or even make a new one that no UNA could possibly have inside its corpus.

I think what is true is the damage done by contamination.. Make a private hf-eval that gets released every year; it could also help to evaluate the cut-offs.. Meanwhile, if the dataset has a checksum stated on the results and matching its release, that should provide an audit trail of transparency while guaranteeing the scoreboard's safety.
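To illustrate the audit-trail half of that idea, a minimal sketch (the directory path is made up; the point is that the published hash pins the exact eval files behind the scores):

```python
import hashlib
from pathlib import Path

def dataset_checksum(root: str) -> str:
    """SHA-256 over every file in a dataset directory, walked in sorted
    order so the same release always hashes to the same value."""
    h = hashlib.sha256()
    for f in sorted(Path(root).rglob("*")):
        if f.is_file():
            h.update(f.relative_to(root).as_posix().encode())  # names count too
            h.update(f.read_bytes())
    return h.hexdigest()

# Publish this alongside the scores; anyone can re-verify once the set is released.
print(dataset_checksum("./private-eval-2024"))
```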

Regarding Hypothetical Frameworks: it is a skeleton, a boilerplate, and there shouldn't be content, sorry to disappoint.. But if u felt compelled by it, it is the boilerplate of a paradigm similar to Simonyi's Intentional Programming, and it exists and works.

deleted

I still don't think @fblgit's test scores are all that suspicious, regardless of what UNA is or is not.

The careful pairing of SFT and DPO, stripped of heavy-handed alignment (minimizing the alignment tax), should be able to achieve scores around those.

I'm seeing the same ~ bump from others like Intel when they combine fine-tuning methods like RLAIF, DPO & SFT (Orca). And when I run my personal test on them, their combined fine-tunes perform better. For example, Intel's neural 3.3 slerp with MetaMath and Slim Orca SFT scores higher not only on the leaderboard, but also in my own personal testing. If there is contamination in UNA models, it's got to be very subtle.

Go find some evidences to support your fairy tale

The only contamination resides in your mouth.. And to confirm, Im not sharing with you the source code kid.

I'm sorry but someone as rotten as this is clearly not mature enough to be taken seriously lol

This is awfully guilty-until-proven-innocent. There is no evidence of cheated scores, and until you find some, everything you are saying is superfluous.

Just as some folks graciously asked, I found it fine to help on that aspect:

0.5 AVG increase in Phase 1, crappy training.. no SFT, no nothing... just UNA... running Phase 2 now.. likely to yield higher results..
https://huggingface.co/datasets/open-llm-leaderboard/results/commit/b6fecd25d067d715b41ea330e0b5cbd4f5eae2f0

Results for Phase 2 have been released by HF.

0.9 AVG increase in Phase 2 compared to the original model. And it's just the epoch-2 checkpoint..
https://huggingface.co/datasets/open-llm-leaderboard/results/tree/main/one-man-army/una-neural-chat-v3-3-P2-OMA

Most likely people will be asking.. can this be done to any model to achieve a boost ? ... wdyt..

@clefourrier do we have any update on this? I do understand the wish to discuss UNA widely across the community, and it can be done in a properly moderated thread, but I don't feel comfortable having such an argument over a superfluous accusation.

blud just got beaten by a 10.7b fr 💀💀💀🤡🤡🤡

Nice, happy to hear.. I'm not sure whether ppl will notice the MetaMathQA scandal.. the wino<>gsm|tqa|arc disparity :D jajaja

blud just got beaten by a 10.7b fr 💀💀💀🤡🤡🤡

You are a child.

@distantquant lol cope

Open LLM Leaderboard org

Hi!
@fblgit We ran a contamination detection tool on this model, and it's got a 99% chance of being contaminated on GSM8K.
(Code base is here if you want to reproduce, and @SaylorTwift is working with the authors to make it an easier to use tool).

I'm therefore going to flag this model as contaminated on GSM8K - I suggest you look at your fine-tuning sets in detail, as a lot of common ft sets integrate known benchmarks, and accidents can happen.

clefourrier changed discussion status to closed

Well @clefourrier .. there is no fine-tuning, SFT, or any step in UNA that adds any sort of data to the corpus.. so I can't look at something that doesn't exist..

Please run your same test on the base model of Xaberius..
https://huggingface.co/bhenrym14/platypus-yi-34b
and
https://huggingface.co/01-ai/Yi-34B

Because for a model that does not have SFT and is flagged for contamination.. it can only come from its base model :)
In any case, we are running this on platypus-yi-34b; I would suggest you guys run Yi-34B.. expect a larger set of impacted models; as said, this contamination ain't coming from us.

Also, the code you provided.. doesn't support GSM8K at all.. please share the code that flagged the contamination so it can be reproduced on this model and other models.

If they find those models are contaminated, then this model should remain flagged as such, as should any others that used the contaminated models as any basis.

Looks like you'll need to find a new base model that isn't contaminated to work from; it's you or them. Take your pick.

Hi!
@fblgit We ran a contamination detection tool on this model, and it's got a 99% chance of being contaminated on GSM8K.
(Code base is here if you want to reproduce, and @SaylorTwift is working with the authors to make it an easier to use tool).

I'm therefore going to flag this model as contaminated on GSM8K - I suggest you look at your fine-tuning sets in detail, as a lot of common ft sets integrate known benchmarks, and accidents can happen.

You should also run this on all MetaMath models

Open LLM Leaderboard org

Hi @distantquant , do you want to do it and tell us the results?

deleted

@distantquant I thought about this as well. But wouldn't it be easier to scan the MetaMath dataset?

After reading the paper and spot-checking the dataset, everything appears to be legit. Some degree of accidental contamination/overlap is inevitable, especially since MetaMath is built by rephrasing questions from math benchmarks like GSM8K. However, the concept and application of MetaMath are sound. Math is an emergent property of LLMs at progressively larger sizes, and carefully converting mathematical expressions into natural-language proofs, which mesh better with LLMs, does in fact increase the mathematical abilities of smaller LLMs.

https://huggingface.co/datasets/meta-math/MetaMathQA
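A crude version of that scan is easy to sketch. This checks only verbatim 8-gram overlap with the GSM8K test questions, so it would miss rephrasings, and it assumes MetaMathQA's question column is named `query`:

```python
from datasets import load_dataset

def ngrams(text: str, n: int = 8) -> set:
    toks = text.lower().split()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

# Build the set of 8-grams appearing in GSM8K test questions.
gsm8k_test = load_dataset("gsm8k", "main", split="test")
test_grams = set().union(*(ngrams(ex["question"]) for ex in gsm8k_test))

# Count sampled MetaMathQA items sharing any 8-gram with the test set.
meta = load_dataset("meta-math/MetaMathQA", split="train").select(range(5000))
hits = sum(bool(ngrams(ex["query"]) & test_grams) for ex in meta)
print(f"{hits}/{len(meta)} sampled MetaMathQA items share an 8-gram with GSM8K test")
```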

Hi @distantquant , do you want to do it and tell us the results?

@clefourrier
I will. Share the code U used in full to evaluate contamination on Xaberius; do not ask regular users who can't spin up 1000 H100s.

The transparency of the platform as well as the leaderboard is right now in jeopardy until a truly reproducible contamination eval can be performed and the bloodline of Xaberius can be evaluated fairly.. telling a user to run it if he wants, while not providing the code.. is already suspicious.
You guys opened Pandora's box, and we will all have to walk that road and bear the consequences of it.. unless this is just a move against single individuals, with the goal of favouring AdWords and marketing deals between labs and HF.

Until that code is out, the transparency.. and overall trustworthiness of this platform is nearly 0.. either HF shows itself to be impartial.. or we can assume this is just like bing.com

@fblgit I believe the repo he linked supports using any dataset from huggingface. You would just replace truthful_qa with gsm8k and the target model with your own models.

If we want to talk transparency and trustworthiness, I'd say all submitted models should be checked for contamination up front, and incomplete / insufficient model cards claiming vague methods without evidence, citations, or explanation should not be allowed on the board. There is too much opportunity for cheating and fraud when people can claim some "new" method they made up as a cover, and evade ever providing an explanation.

And gaslighting, how lovely, a real piece of work.

@HDiffusion tried, bro.. GSM8K is not supported :) so I'm asking for the code that was run, since GSM8K contamination eval is not in the code at all.. it only supports TruthfulQA, MMLU, ARC.. not GSM8K :)
Release the GSM8K contamination test so we can run it on this model as well as others.. don't point to a repo that cannot perform the GSM8K contamination evaluation.. or if u do, provide a git patch/gist, anything that can be patched and run..

Open LLM Leaderboard org

@fblgit I'm unsure what you are looking for, as I already linked to the repo.
You just need to modify this line of the script to load GSM8K instead of TruthfulQA, and this file (the TruthfulQA processing) to print the prompt followed by the correct examples (following the provided examples). I assumed that you were someone who codes, but I'll ask the repo owner to upload the modified files if you need them, it's of course not a problem.
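For reference, the data-loading half of that change could look something like the following. This is a sketch rather than the repo's actual diff, but GSM8K's test split really does expose "question" and "answer" columns:

```python
from datasets import load_dataset

# GSM8K's test split has "question" and "answer" columns.
gsm8k = load_dataset("gsm8k", "main", split="test")

for ex in gsm8k.select(range(3)):
    # Print the prompt followed by the reference answer, mirroring the
    # "prompt + correct example" format the contamination script expects.
    print(ex["question"])
    print(ex["answer"])
    print("---")
```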

Side note, I would appreciate it if you could stay civil, as we are doing our best to make the leaderboard a good reference.

Open LLM Leaderboard org

@deus-ex-machina I agree on the model cards, we added a check a couple months ago on the text length - I agree it's not enough for now, but it's a start!
We've been brainstorming better solutions and checks to add on model cards before submissions but haven't reached something more conclusive.

Sorry, don't get me wrong, U need to be in the shoes of someone who has not done SFT or anything similar on the model being flagged :) I also don't want people to think that we ran code different from yours, thanks for confirming the code placement. Here is the result for the first blood of Xaberius, which is.. platy-34b.. which you guys over at HF know due to its tokenizer issue.. why this hasn't been traced or internally challenged fairly is something I won't speculate about.. just leaving it there.. nor about the obvious boost in all the other tests, which you guys also tested and found nothing..

Screenshot 2023-12-15 at 12.21.46 AM.png

The community is demanding more tests, and I don't think it's right to tell them to run it themselves.. at the end of the day, you can see such a thing is not possible without code.. and in the code, it's quite simple to flesh out.. Regarding the repo changes, I think everyone deserves the possibility of checking contamination on the current Leaderboard tests. I have no problem pushing a PR; talk is free.. commits are not.. and I prefer commits that speak for themselves :)

A basic blood-tie traceback of the contamination indicates:

Screenshot 2023-12-15 at 12.40.57 AM.png

I think the decrease in contamination from its parent is a marker.

But also it seems to be present here:
Screenshot 2023-12-15 at 1.11.15 AM.png

I guess this gives a better idea of what happened here.

You're so shady, bruh!! To the point where this is funny..

Hi @distantquant , do you want to do it and tell us the results?

inb4 leaderboard implosion

inb4 leaderboard implosion

inb4 all the 7B models that outdo 70Bs mysteriously get deleted overnight

inb4 leaderboard implosion

inb4 all the 7B models that outdo 70Bs mysteriously get deleted overnight

Good, then maybe we can actually find out which ones are good

UNA models are not like non-UNA models; their properties are unique and this is now known. This FLAG is nonsense, and I won't release something that can indeed be dangerous to society, especially under bigcorp.com... U can say whatever u want, bring contamination evidence.. u are making it look like the models are not extremely intelligent :)

Either he explains what he did, or this model should be flagged.

What nonsense... u should avoid typing under the influence.

You should probably apologize, since it has become clear from your own testing that the flag is indeed NOT nonsense.

UNA does not involve SFT or training like that.. there is no data.. This doesn't originate in this model. And I do not challenge the contamination test; as you can see, I provided the result.. because it has to be fair.
The contamination exists, but it's not made by me.

I wonder if adding an LLM decontaminator test as described here https://lmsys.org/blog/2023-11-14-llm-decontaminator/ would help keep the leaderboard clean

You mentioned MetaMathQA, but that is reproducible: open method and open data.
You could actually use the same approach to get exonerated if you are not consciously deceiving. But at present it seems you were simply caught cheating, and you still cannot, or will not, tell the details of a fictitious method that does not exist.
Disclosure is fair to everyone, and I am concerned that the long-term downstream contamination caused by further merging of your model will harm the entire leaderboard.
As with the earlier TigerBot cheating incident, if the model is widely used in merges, this will become a long-term scandal. I recommend comparing homologous mergeable models on the low-rank features of the weight differences between the suspect model and the baseline model, and evicting merged cheating models (a sketch of the idea follows below).
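To make the low-rank comparison concrete, here is a rough sketch. The model ids and the Llama-style layer key are placeholders, and for real 34B checkpoints you would stream the weight shards instead of loading two full models at once:

```python
import torch
from transformers import AutoModelForCausalLM

def delta_spectrum(base_id: str, suspect_id: str, layer_key: str, k: int = 16):
    """Top-k singular values of the weight difference for one layer.
    A similarly concentrated (low-rank) spectrum in the same layers across
    several merged descendants would point to a shared inherited edit."""
    base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.float32)
    susp = AutoModelForCausalLM.from_pretrained(suspect_id, torch_dtype=torch.float32)
    w0 = dict(base.named_parameters())[layer_key].detach()
    w1 = dict(susp.named_parameters())[layer_key].detach()
    return torch.linalg.svdvals(w1 - w0)[:k]  # singular values, descending

# Hypothetical usage; the ids and layer key are illustrative only.
print(delta_spectrum("some-org/base-model", "some-org/suspect-model",
                     "model.layers.0.mlp.down_proj.weight"))
```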

I wonder if adding an LLM decontaminator test as described here https://lmsys.org/blog/2023-11-14-llm-decontaminator/ would help keep the leaderboard clean

If I understand correctly, this seems to apply only to filtering training data, not to testing a finished model.

I'm in favor of unflagging this model. While fblgit might not be comfortable sharing his UNA training method yet, it is highly unlikely to introduce benchmark data, and I don't doubt that he checked all of his training data thoroughly. He also went through a lot of effort to show that the contamination checker also finds contaminated data in the base Yi-34B model, disproving the evidence that he cheated. Please unflag the model @clefourrier , or flag all models based on Yi-34B.

I know this is a sentiment that has been repeated again and again and again, but...

we really need a black-box benchmark. And not just black-box, but specifically for chat models, using each model's preferred chat format for multi-turn evaluation, similar to MT-Bench.

And, just my opinion, but Huggingface is in a unique position to step in and do this, because we can trust Huggingface, and this very leaderboard already serves as the de facto standard benchmark suite for open models. Whether this was the original intention of the leaderboard or not, people check it daily to see what the new SOTA of open chat models is, but it encourages dubious behavior like "UNA", strange model-merging voodoo, etc.

This is not about chat models; it is about open language models. Besides, there is nothing dubious about UNA; the only reason for the flag was the suspicion of contaminated training data in the model. If true (very likely), the flag should be applied to all Yi-34B models, since the contamination was inherited from the base model, or to no model that doesn't introduce extra contamination (like this one).

Edit: Besides, model merging is a VERY open technique and there is no reason to see it as dishonest "voodoo".

deleted

@nlpguy I haven't the foggiest idea whether or not this is a valid flag, but he does seem to be cooperating and making an effort. He was mean at times, but so were a lot of people towards him.

I'm in favor of unflagging this model. While fblgit might not be comfortable sharing his UNA training method yet, it is highly unlikely to introduce benchmark data, and I don't doubt that he checked all of his training data thoroughly. He also went through a lot of effort to show that the contamination checker also finds contaminated data in the base Yi-34B model, disproving the evidence that he cheated. Please unflag the model @clefourrier , or flag all models based on Yi-34B.

it's got a 99% chance of being contaminated on GSM8K.

What's the problem with you? Contamination with 99% confidence should be... unflagged?

image.png

Considering how new the account is, I think it may be another sockpuppet of fblgit.

My name is Leo, and I'm not a made-up persona of fblgit. Please stop antagonizing people on the internet and treat them like actual people. fblgit showed that the base models he fine-tuned on exhibit similar contamination, with around 95% confidence. So there is no reason to put the blame on him and suspect that he added additional contamination.

I am against unflagging the model.

Intentional or not, if it is contaminated, indirectly or directly, it should be indicated as such. No one should get a pass for using contaminated base models, datasets, or models in merges and claiming ignorance; otherwise you are just opening the doors to people doing so intentionally (using those that exist, or using alts to upload and use them) and flooding the board with even more inaccurate scores.

Anyone suggesting unflagging has NOT thought through the implications. Model trainers and mergers are responsible for vetting the models and datasets they use; ignorance is not a pass for allowing such models to be left on the leaderboard as though they are completely above board.

So convince me: why should contamination with a confidence of 99% be unflagged?

No unflagging. This is too good to be true.

@deus-ex-machina

I respect your opinion; that's why I suggested unflagging this model OR flagging all models (including this one) based on Yi-34B, which likely contains contaminated data itself.

Edit: The same goes for @XXXGGGNEt and @migtissera , and anybody else pointing this out.

@Phil337 sorry if I was mean to you; it's not easy being publicly finger-pointed when in reality I have not done anything wrong.
The ways and means some folks use to slam me for the code are something I don't appreciate. I do try to support the community in many ways.

The flag must stay, and the extra evidence provided is the foundation of a larger, ongoing, latent problem that is much wider than how this started.

@nlpguy

I'd agree with flagging all verified contaminated models, as they too are distorting and diluting the usefulness of the leaderboard.

I think the process as it stands needs to be improved, with contamination checks happening up front instead of people suspecting and flagging models after the fact. Doing it after the fact results in drama, and damage is done the longer models sit out and are leveraged by others for further merging or training; there are probably many cases where contaminated models slip through because no one flagged them.

Testing them all up front may not be cheap compute-wise, but stricter standards for cards and submission information, explaining exactly what the model is and how it was created, could help filter out a lot of noise and load.

deleted

@fblgit I'm glad you think the flag must stay, and agree that there's a broader contamination issue at play.

No one wants your code, we can never get gold from brass, just keep your secret methods rotting in your stomach, cheater.

I wonder if adding an LLM decontaminator test as described here https://lmsys.org/blog/2023-11-14-llm-decontaminator/ would help keep the leaderboard clean

If I understand correctly, this seems to apply only to filtering training data, not to testing a finished model.

Correct, sorry, not the depth I was thinking of. This video, https://www.youtube.com/watch?v=dxH1GFCfdF0 , put me onto the concept and linked to that page. The more important part is this paper, https://arxiv.org/abs/2311.04850 , referenced in the video, which discusses various methods of detecting contamination both in training data and in models, even without access to the training data.
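One family of checks surveyed in that line of work needs only the model itself: score the benchmark items under the model and compare against paraphrases of them. A toy sketch of the idea, not the paper's exact method (gpt2 stands in for the model under suspicion):

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "gpt2"  # stand-in; swap in the model under suspicion
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).eval()

def mean_nll(text: str) -> float:
    """Mean negative log-likelihood of `text` under the model."""
    ids = tok(text, return_tensors="pt", truncation=True, max_length=512).input_ids
    with torch.no_grad():
        return model(ids, labels=ids).loss.item()

test = load_dataset("gsm8k", "main", split="test").select(range(50))
scores = [mean_nll(ex["question"] + "\n" + ex["answer"]) for ex in test]
# Markedly lower loss on verbatim test items than on paraphrases of the same
# items is a memorization signal - suggestive, not proof by itself.
print(sum(scores) / len(scores))
```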

edit
This thread moves really fast, there's no way to delete a comment, and this one is not relevant anymore...

@mantafloppy see my comment towards deus-ex-machina.

image.png

image.png

Funny, many empty/new accounts in this thread.

@XXXGGGNEt my account was created approximately 6 months ago; that it has existed for at least 3 can be seen from the sleeping Space I created. You cannot tell the age of accounts by their activity.

The paper https://arxiv.org/pdf/2311.04850.pdf, "Rethinking Benchmark and Contamination for Language Models with Rephrased Samples", includes a summary diagram:
rethinking benchmark and contamination.jpg

For a sweaty old man you have a lot of rage, to the point that you have become useless in this conversation: you are too biased, throwing out wild accusations without a compelling argument, beyond wanting the code to do something with it and reproduce it.

Despite @fblgit doing valuable research, which you may not believe in, or may think is a hoax, dunno why?? you show no remorse about affecting thousands like me who want the source code to continue doing experiments and having fun. My name is vAlerio and I'm not a sockpuppet.

If you don't want the code, then it's because the code is not the problem, but I thought this thread was started that way, can you clarify ?

fblgit loves a space before a question mark, and he is the only one doing this in the thread. Call it arbitrary on my part, but their speech patterns are consistent.

Remember to let GPT polish your sockpuppet speech next time.

image.png

image.png

image.png

You wouldn't focus on this minor detail if you had any other evidence. As said, please stop antagonizing people. fblgit showed no sign of ill intent and apologized for his partly justified outrage at this unwarranted accusation, yet you keep labelling him a cheater, and me and Valerio bots.

we have a John Nash here.. u bro are crossing some line of sanity XD
I'm not sure if the thread is about the model, the algo, the dozen other flagged models, or what ?

Note to Huggingface staff: Maybe clean up this comment section for the sake of civil discussion. This back and forth with useless space-consuming screenshots of profile pages and comments doesn't add anything to the discussion.

This thread is about this model and whether all of the others based on Yi-34B should be flagged too. Most of us (except one) don't doubt your technique right now.

That Google had to invent a new evaluation method (CoT@32) to nudge their MMLU above GPT-4's for marketing purposes has me seriously doubting the integrity of the myriad "fine-tuned" models here that claim obnoxious increases from hours' worth of compute.

And I will not point fingers in this new benchmark-rigging venture, but for those in the know, the Leaderboard depreciated in utility long ago. There's little doubt in my mind that EVERY popular dataset lately has been contaminated in some way (datasets may even contain rephrasings of the same samples from the test sets). People take contaminated models, merge or alter them in some way (sometimes unwittingly); rename them openchat, go-bruins, marcoroni, leoscorpius or whatever they please; then someone else takes those and merges them again. The cycle has continued for so long, and scores kept creeping up, obscuring the actually good models, to the point where literally no one can trace back the source of contamination anymore. These merges simply increase in score because they increase in test-data saturation, nothing more.

I am for a holistic audit of all top models, when and if the technology allows. Right now, the Leaderboard is of help to no one but the people who want to perpetuate this charade.

This thread was not discussed in good faith at all. However, I guess it does take fiery dissent to catalyse changes.

I agree that some of the models on the leaderboard were rather lazily put together, and that the effort taken doesn't always reflect the actual improvement. But to say that every popular dataset has been contaminated to the point where it massively influences benchmarks is an exaggeration imo. The leaderboard is not an automatic benchmark pipeline designed to blindly select the top model from, but if you take a closer look at the models put on there, you can select the best one with relative ease (given enough patience).
Or just select the big ones released by Mistral, Llama, Qwen. (Although we can't be 100% sure of those either, as seen with the Yi-34B base model.)

TLDR: The Leaderboard is of help for me, when carefully analysing the models presented.

Many Chinese models are known not to reflect a quality proportional to their numbers. I'm talking about Xverse, Skywork, Baichuan, DeepSeek, and a whole collection of other models that have fallen by the wayside. I don't see why Yi should be exempt from such suspicions. It's made by a startup, as obscure as any.

That Yi is contaminated is no surprise to me. The community was hurting for a proper 34B for too long, and for a moment, Yi seemed as decent as any. But anyone with eyes who tried the model could tell: no, it is not better than Claude, nor Turbo, nor L2-70B, despite the benchmarks indicating so.

Now, I don't doubt one could pick a top model from the Leaderboard and have relatively better success than with a bottom-ranked one. However, if the Leaderboard misrepresents the actual ranking of the models' quality, then it's doing more harm than good. If I were a layman navigating the page, I would readily be misled by the impression.

Regarding datasets, it's a well-known fact that all of the high-profile ones are, to a degree, contaminated. SOLAR, the top-ranked model atm, specifically claimed to filter out tasks that overlap with the test data, for example. I think we can all appreciate such openness, at least. To clarify, I don't believe fblgit intentionally "cheated", nor do I believe his UNA tuning method will put Google's DeepMind to shame. He merely presented himself so informally that people suspected his fair efforts, and then stumbled upon contamination so jarring that it bred animosity.

That said, root or symptom, the model is compromised and should be flagged as such.

In my opinion, contamination tests need to be run before models are displayed on the leaderboard. While this leaderboard could be a gold standard for tracking the progress of local LLMs, currently it's only good for browsing through for hours before finding a model of genuine quality.

Uh, how much of the flaming in this thread was bot-generated? Looking at fblgit's LinkedIn and some of his model cards, it's clear that English is not his first language, but he seems unaware of how bad his English is: whatever the intent behind his statements, they present -- no offense, fblgit -- as childish, hostile, and provocative to a native English speaker, and probably as aggressively hostile and offensive to a non-native English speaker.

https://huggingface.co/fblgit/una-cybertron-7b-v2-bf16

What is NOT UNA? Its not a merged layers model. Is not SLERP or SLURP or similar.

What is UNA? A formula & A technique to TAME models
When will be released the code and paper? When have time, contribute and it'll be faster.

Model Description
Developed by: juanako.ai
Author: Xavier M.
Investors CONTACT HERE

@fblgit Your English is very ... confusing, in a way that is guaranteed to be near impossible to read for others who do not speak English natively. You seem capable of writing complex domain-specific sentences legibly but deteriorate into nonsense when dealing with less specialized writing. If you were a model, I'd guess you were a Yi that had been fine-tuned on translated-to-English subtitles.

Many Chinese models are known not to reflect a quality proportional to their numbers. I'm talking about Xverse, Skywork, Baichuan, DeepSeek, and a whole collection of other models that have fallen by the wayside. I don't see why Yi should be exempt from such suspicions. It's made by a startup, as obscure as any.

That Yi is contaminated is no surprise to me. The community was hurting for a proper 34B for too long, and for a moment, Yi seemed as decent as any. But anyone with eyes who tried the model could tell: no, it is not better than Claude, nor Turbo, nor L2-70B, despite the benchmarks indicating so.

Now, I don't doubt one could pick a top model from the Leaderboard and have relatively better success than with a bottom-ranked one. However, if the Leaderboard misrepresents the actual ranking of the models' quality, then it's doing more harm than good. If I were a layman navigating the page, I would readily be misled by the impression.

Regarding datasets, it's a well-known fact that all of the high-profile ones are, to a degree, contaminated. SOLAR, the top-ranked model atm, specifically claimed to filter out tasks that overlap with the test data, for example. I think we can all appreciate such openness, at least. To clarify, I don't believe fblgit intentionally "cheated", nor do I believe his UNA tuning method will put Google's DeepMind to shame. He merely presented himself so informally that people suspected his fair efforts, and then stumbled upon contamination so jarring that it bred animosity.

That said, root or symptom, the model is compromised and should be flagged as such.

I too was disappointed by Yi. Real-world usage tells a different story, with Yi having difficulty following instructions and not being much better than the top Mistral fine-tunes. Subjectively, it feels worse than most 70b models.
I can't prove the Yi contamination was purposeful; however, I too have been burned by many Chinese models, and unfortunately I am now suspicious of any Chinese model.

Guys, I'd encourage you to test the newest UNA model released by the creator. Even if it's faked, it's still a DAMN good model, and nearly on par with or better than Mixtral in my evaluations of logical questions and writing skills

deleted

@dillfrescott I agree. The highest-scoring Mistral in my testing was SuperMario v2, which is a SLERP merge that includes a UNA model. The UNA model by itself had blind spots (e.g. storytelling), but was a solid performer in certain areas. And the bump in scores was minor, so the contamination was nothing to write home about. It's at least a dozen times smaller than the artificial bump Tigerbot chat got.

At this point I would like to thank Huggingface for the leaderboard and the platform and everyone contributing to the community!

BIG THANKS!

Even though the leaderboard is not perfect and can be misleading, combined with the facts you guys share in discussions (like this one) it's helping me get a better view of open-source models. Open source can give you a hard time, but with awesome guys like you, who contribute to the community, it's worth it.

Just looking at a leaderboard won't let you pick the perfect model for your very specific use case anyway. So you have to dig deeper. Dig into discussions like this to pick out the useful facts for yourself - facts that you guys generously share with the community. So again, thank you!

Open LLM Leaderboard org

It's super kind @r000bin :)

Open LLM Leaderboard org

This discussion has become very long, so I'll try to post a summary and discuss the next steps we are considering, leaderboard-wise.
In the meantime, I just wanted you folks to know that you can flag improper comments (especially when they are just ad hominem attacks that disrupt the conversation flow): it's on the left of any given comment, under "report" - it calls our moderators to check out the message.
I'll also tag @lunarflu , who might want to look at this discussion from a moderation point of view .

Open LLM Leaderboard org

The original discussion was opened because of suspicions of contamination of fblgit/una-xaberius-34b-v1beta with GSM8K data; the model was fine-tuned using the UNA method from the base models bhenrym14/platypus-yi-34b and 01-ai/Yi-34B.
We used a contamination detection tool and found out that the above model needed to be flagged, because it has a really high probability of being contaminated by GSM8K (0.99).
The author then ran the contamination code on the base models too, and found that they could be contaminated as well, though with a lower score (0.94) (we are in the process of double-checking that internally and will of course take action).

Open LLM Leaderboard org

A lot of other good points were raised, that I will try to address below.

Open LLM Leaderboard org
•
edited Dec 15, 2023

What about the MetaMath models?

  • There was actually a very nice analysis done here by a user using the contamination tool. Digging deeper into how MetaMath was created (part of the method is GSM8K prompt rephrasing), I would assume that all models trained with MetaMath are contaminated (at least in part) on GSM8K.

Can we add the LMSYS decontaminator ?

  • Their technique is actually a technique for the training data, rather than something you can apply to models once they are trained.

Can we add contamination checks to the leaderboard?

  • At the moment, we are talking in depth with the authors of the contamination detection tool we used to better test the possible limits of the tool - if it works in practice as well as it seems it should in theory, we'll likely add it or add a way for people to run it easily on submitted models. We are also investigating several other contamination detection methods, so if you have any reference that you find interesting, I created a thread for it here, add your resources!

Can we be stricter on model cards?

  • We have already added constraints which require models 1) to have a model card of a minimum length and 2) to have a model license. For the moment, we have not found a satisfactory way to constrain the submissions more.

We need blackbox benchmarks

  • That is definitely true, and we are working with partners on this. However, if you belong to a lab and have a cool dataset with a val/test set that you'd like to set up as a leaderboard, feel free to ping me and I'll help you set it up ! I'm also working on making setting up leaderboards on the Hub considerably easier, so you'll find leaderboard templates available if you want to give it a try.
Open LLM Leaderboard org
•
edited Dec 15, 2023

And, last but not least, given all the things evoked above, you might wonder: what is the leaderboard for?

Our initial vision for the leaderboard was the following:

  1. to make it entirely reproducible, so it becomes an objective source of information (that's why it's not blackbox) - anyone can reproduce our results at home
  2. to allow people to discriminate easily between pretrained models, on high quality benchmarks used by researchers and engineers alike
  3. to allow people doing research or developing models to get an idea fast about how well method 1 is working compared to method 2
  4. to make it a good resource for research - we publish absolutely every output generated by absolutely every model we evaluated, so I was kind of expecting people to do big meta-analyses of model family performance and trends on different benchmarks.

I think all of the above points are still valid.

However, in the meantime, the leaderboard became a worldwide recognized ranking (which we did not expect 🤯). We welcome all suggestions on how to make it better (though we don't have time turners yet, so features will take some time to arrive!) - feel free to create discussions for new important features you need.
I think it must be combined with other resources (such as the chatbot arena for chat models) to provide a full view of what models can do, and that's also why we think it would be great if other leaderboards existed to give a more complete view of models.

As the Leaderboard grew in size and repute, its rankings have been advertised as a tool beyond its initial design. Many startups and AI ventures have built entire sales pitches centered on their placement here. It's prudent, then, that we update its methodologies to better serve the community, with more transparency, legibility, and differentiability.

I think proper analyses of model families are a good start, for in my opinion, family is the largest and most obvious discriminant between models. People are jaded by the bi-weekly 7B models that claim to "beat" top 70B offerings, only to find out the hard way that no such unicorn exists; and if and when such a unicorn eventually turns up, it will inevitably be lost in the muddle.

Argumentation aside, I'd like to mention my love for the Leaderboard, flawed as it yet is, and extend my thanks to every community staff member who had a hand in maintaining it. Hobbyists, enthusiasts, and professionals alike have flocked to it, like children to a toy store; and I have no doubt it has played no small part in spurring real, fiery competition within the LLM sphere.

Proof: the 0.99 screenshots don't match. Pandora's box ain't my fault... This was known and allowed for AdWords

Are we still going? Indulging in conspiracy and threats of doomsday because of more scrutiny, finding base models are contaminated, and may end up flagged? Absolute nonsense.

It is entirely reasonable for model creators, however they claim to create anything, even via magical means that don't require fine-tuning or merging supposedly, to vet any models or datasets they use. Ignorance isn't an excuse to evade flagging or critique.

And I'm just going to say this: all models going forward from past questionable actors should be subject to additional up-front scrutiny, pending a solution to automate it for all submissions.

Many Chinese models are known not to reflect a quality proportional to their numbers. I'm talking about Xverse, Skywork, Baichuan, DeepSeek, and a whole collection of other models that have fallen by the wayside. I don't see why Yi should be exempt from such suspicions. It's made by a startup, as obscure as any.

That Yi is contaminated is no surprise to me. The community was hurting for a proper 34B for too long, and for a moment, Yi seemed as decent as any. But anyone with eyes who tried the model could tell: no, it is not better than Claude, nor Turbo, nor L2-70B, despite the benchmarks indicating so.

Now, I don't doubt one could pick a top model from the Leaderboard and have relatively better success than with a bottom-ranked one. However, if the Leaderboard misrepresents the actual ranking of the models' quality, then it's doing more harm than good. If I were a layman navigating the page, I would readily be misled by the impression.

Regarding datasets, it's a well-known fact that all of the high-profile ones are, to a degree, contaminated. SOLAR, the top-ranked model atm, specifically claimed to filter out tasks that overlap with the test data, for example. I think we can all appreciate such openness, at least. To clarify, I don't believe fblgit intentionally "cheated", nor do I believe his UNA tuning method will put Google's DeepMind to shame. He merely presented himself so informally that people suspected his fair efforts, and then stumbled upon contamination so jarring that it bred animosity.

That said, root or symptom, the model is compromised and should be flagged as such.

I too was disappointed by Yi. Real-world usage tells a different story, with Yi having difficulty following instructions and not being much better than the top Mistral fine-tunes. Subjectively, it feels worse than most 70b models.
I can't prove the Yi contamination was purposeful; however, I too have been burned by many Chinese models, and unfortunately I am now suspicious of any Chinese model.

If you know anything about Kai-Fu Lee, then you're not surprised. This guy is a master of self-promotion. Yi is nothing more than a story fabricated for billions of dollars in investment. The same thing often happens with these Chinese models that claim incredible quality with very few parameters.

@rinoa Yi-34b isn't a sham, even if it's compromised in some ways.

  1. It only has 34b parameters, yet answered some of my esoteric knowledge questions correctly (similar knowledge performance to Llama 2 70b).

  2. It's crippled by not only being bilingual, but by the second language being Chinese, which requires an immense token dictionary and is a notable handicap.

  3. Its English-language skills are a little lacking, but this is reflected in the appropriate LLM tests. For example, both Winogrande and HellaSwag are lower than Llama 2 70b's. If they were going to cheat, this is where they would most likely have done so.

Overall, Yi-34b scores are about where they should be and certainly aren't way off.

An open-source community also means cherishing other people's success as your own.
If the model is a fake, it will be clear from its performance anyhow.
Stop chasing fame

Sorry for this off topic response, but I have to say it.

With all the plot twists, protagonists, antagonists, flame wars,... etc, this thread could be made into a pretty good movie...
I'd watch that 😅

On topic, I think that a lot of us are not being objective here. Also, I really value the respectful role the Hugging Face team is playing.

@rinoa Yi-34b isn't a sham, even if it's compromised in some ways.

  1. It only has 34b parameters, yet answered some of my esoteric knowledge questions correctly (similar knowledge performance to Llama 2 70b).

  2. It's crippled by not only being bilingual, but by the second language being Chinese, which requires an immense token dictionary and is a notable handicap.

  3. Its English-language skills are a little lacking, but this is reflected in the appropriate LLM tests. For example, both Winogrande and HellaSwag are lower than Llama 2 70b's. If they were going to cheat, this is where they would most likely have done so.

Overall, Yi-34b scores are about where they should be and certainly aren't way off.

Sure. I didn't mean to imply it was a sham. I am just pointing out that Yi doesn't live up to the initial hype of besting the 70b models.

Yi Lab is a victim. This goes way up in the food chain.
Misinformation.

I am just pointing out that Yi doesn't live up to the initial hype of besting the 70b models.

Strange. I mean, I personally don't have the specs to try either of them, so I don't know, but I thought Yi-34b was better than Llama-2-70b. The LMSYS chatbot arena says that people prefer base Yi a good amount more than base Llama-2-70b. Is it a different story with fine-tunes?

Also, people prefer Yi-34b over Gemini Pro lmao. RIP Google.

I am just pointing out that Yi doesn't live up to the initial hype of besting the 70b models.

Strange. I mean, I personally don't have the specs to try either of them, so I don't know, but I thought Yi-34b was better than Llama-2-70b. The LMSYS chatbot arena says that people prefer base Yi a good amount more than base Llama-2-70b. Is it a different story with fine-tunes?

Also, people prefer Yi-34b over Gemini Pro lmao. RIP Google.

No one prefers Chinese models over Gemini Pro or even Mistral.
While individual researchers are making absolute masterworks, we have these models that we are all divided about. I also don't believe in the quality of these Chinese models; they always perform really weakly, even Starling, even though they are apparently so good on evals. Well, guess what: if you tell a model how to solve something, even if it has 2 neurons, it will solve it perfectly.
I'd like to remind you we didn't have such division and weird problems with other models. It's bad for the health of open source to have fake numbers, fake models, and fake players; it's not a third-rate product anymore, and we will review and voice our opinions. Don't like it? Keep it inside China. Nobody likes your half-worked models.

The only leaderboard I really trust is the chatbot arena, since you can't really cheat it. The issue is that it only has so many models, and the quality of the data depends on how many people provide input in the arena. It's already been said here a couple of times, but we really do need more black-box tests, sooner rather than later, since there are so many models to test and retest.

no one prefers chinese models over Gemini pro or even mistral.

I do. I have fine-tuned 7B Mistral models many times for my custom reasoning task, and I was blown away by the latent reasoning capabilities of Yi.

Pallas-0.4 is miles better on my custom task than my 7B Mistral fine-tunes (not to mention non-fine-tuned ones).

Yi Lab is a victim. This goes way up in the food chain.
Misinformation.

Really? It's the Yi investors who are the victims!

In the last few hours a lot of 7B models have surpassed the previous #1; the top average is now 74.95. Maybe it is time to use new benchmarks for the leaderboard? People seem to have found a way to cheat the current system; how can the top 3 models all be 7B-parameter models?

deleted

@tarruda It's not the fault of the current benchmarks. And I don't think adding new benchmarks will help. But it's definitely time to act. Perhaps it may help to filter out the merges and piggy-backing fine-tunes by default, then provide an option users can click to include them in the results. And perhaps also punish accounts that are flooding the system with obvious nonsense, such as enacting a temporary freeze on uploads by them and any freshly created accounts.

It is simply time to flag all models that have been trained on MetaMath data. It has now been publicly communicated that MetaMath rephrased the GSM8K dataset...
Or what is the problem with the decision to flag this dataset as contaminated?

deleted
This comment has been hidden

I question many things in this flagged thread and sense the heavy presence of bots at play. There is a certain 4chan childish rhetoric to many of the "accusations", which are clearly intentionally emotionally charged. We could all see that Yi, if improved through training, had the potential to compete with GPT-4. Every human reading this should seriously contemplate why there might be some good reason for certain entities to put Yi to rest. Yi was not created by some random Chinese startup; the guy belonged to Google China. Use your big brains, and if you have big GPU compute, use it to continue to see if Yi can be improved with developing methods. Flag or no flag, never 'trust the science', in this case benchmarks; question everything. I have put Yi to solid use over many weeks for iterative error correction in combination with a programming environment and other research scenarios, and found it to be exceptional, better than the new 7B models. I'm less concerned about the 95% probability of contamination and more interested in the remaining 5% possibility.

@tarruda the 7B models are easier to train, so more people are motivated to do so. Furthermore, all the 7Bs have a hefty tilt thanks to GSM8K. Yi would be miles ahead if it had the same GSM8K score trained into it. A shame if Yi was put to bed and that training never happened (a shame for the open-source community, that is; not such a shame for big tech).

I question many things in this flagged thread and sense the heavy presence of bots at play. There is a certain 4chan childish rhetoric to many of the "accusations", which are clearly intentionally emotionally charged. We could all see that Yi, if improved through training, had the potential to compete with GPT-4. Every human reading this should seriously contemplate why there might be some good reason for certain entities to put Yi to rest. Yi was not created by some random Chinese startup; the guy belonged to Google China. Use your big brains, and if you have big GPU compute, use it to continue to see if Yi can be improved with developing methods. Flag or no flag, never 'trust the science', in this case benchmarks; question everything. I have put Yi to solid use over many weeks for iterative error correction in combination with a programming environment and other research scenarios, and found it to be exceptional, better than the new 7B models. I'm less concerned about the 95% probability of contamination and more interested in the remaining 5% possibility.

Hi bot. Yi's boss, Kai-Fu Lee, was shown the door by Google a long time ago. Lee has never been an honest man; of course, that's also a common trait of businessmen. His Yi model was found to use Llama's architecture, and his team renamed some tensors to cover up the fact. You never use your brain, because a bot never has one.

Okay, don't get personal here. I like Yi and will be releasing my models trained on it. Synthia and Tess are both my models. I think this is probably due to some negative sentiment towards China. Honestly, even though they're peer competitors, in open source we embrace competition because it advances us all.

On a serious note, how the hell are 7Bs topping the leaderboard?! This has to be addressed to restore faith in this leaderboard..

I just don't see any way to fix this leaderboard other than having new black-box benchmarks. If Hugging Face tries to remove all contaminated models, they're literally going to remove half or more of the models, which would probably include Yi (already shown to be contaminated), Qwen, and DeepSeek, some of the best base models we have.

deleted

@TNTOutburst Even GPT-3.5, GPT-4, Gemini.. are contaminated. It's inevitable despite sincere attempts to avoid it. But there should be a reasonable effort to clean public fine-tuning datasets like Nectar.

@rinoa , again on the level of 4chan mentality. This resolves any remaining doubt for myself and every other human present that there is some very targeted agenda at play. A very typical response: attack the character of the person, not the subject matter. All humans reading this have no doubt taken note. @TNTOutburst hmmmm, a seemingly valid reason to remove and discredit many of the highest-performing models . . . nothing to see here folks, carry on (open-source models burning in the background)

deleted

I don't even think GPT-5 could extract anything useful from this discussion.

@Phil337 yes, everyone look away and don't give this discussion a second thought. Nothing to see here, not worth your time and effort; turn around and look the other way.

deleted

@catalystman You're just a troll. Your account has no activity other than this thread. There's no conspiracy. Open-source models are progressing astonishingly fast. Just last year they couldn't even solve simple problems, yet some are now solving problems that 175B GPT-3.5 usually gets wrong. They wouldn't be able to do this, along with countless other things like fixing user code, writing prompted stories, and summarizing random papers, if their test scores were a sham. The proof is in the pudding.

My guess is you got caught cheating and came back here with a different account to angrily accuse everyone else of cheating.

Close this discussion please, it's spamming my inbox..

The discussion will be closed when fair treatment is applied and all contaminated models get flagged, not just one. I'm waiting for HF to do something before I open hundreds of flag requests...

deleted
This comment has been hidden

@phil337, I'm a concerned citizen at best; I have no models of my own, nor has anything I have said displayed anger. I'm but a proactive user. I'm concerned because I use Yi and DeepSeek with tremendous success, and in my use cases (advanced signal-processing programming and design) they have outperformed all others of similar or smaller size (however, the Mistral base is still impressive). The metrics presented by the creators of these two models match my experience. I fear some in this thread are focused on discrediting Chinese models and possibly destabilizing the open-source community. I shall keep eating my pudding, and hope the chefs keep improving these dishes and aren't demotivated by propositions in this thread that seemingly have no real bearing on practical application. If I were into conspiracy, Phil, I would probably be talking about the significance of the number 4 in Chinese numerology (bad luck) and the coincidence of this discussion being #444.

@fblgit Do you want me to test your model using my own tests? If your model is really as great as you claim it to be, the tests will show it. Or if you are a dirty cheater, you will get dunked on.

Is there a way to unsubscribe from a HF thread so that new messages don't show in the inbox? I posted a message here and it seems there's no escape from notifications. Would be a good feature to have @clefourrier

julien-c locked this discussion

Hey all! Please read @clefourrier replies above (e.g., https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard/discussions/444#657c12befcba5f698c2e3fed).

This discussion is going beyond the initial scope of the report, with lots of counterproductive comments, so we're proceeding to lock this conversation. As discussed above, we have ongoing work to analyze contaminants at scale. Feel free to flag contaminated models and open new discussions about concrete issues or ideas.

Thanks, everyone, for contributing to improving the Open LLM Leaderboard, and have a llamastic day! 🤗
