Where did all of the Llama-1-65b fine tunings go???

#122
by spaceman7777 - opened

This is kinda crazy, and hurts the credibility of this leaderboard, given that llama-2-70b appears on par with llama-1-30b fine tunings.

Please put them up as soon as possible, as this looks disreputable.

If Llama-2 scores could be posted overnight, then Llama-1 65b fine tunings could be re-posted almost immediately as well, correct?

I'm assuming there is a new MMLU configuration that led to them being taken down? But that is no explanation for why they should be excluded during such a seminal and consequential moment as the release of Llama 2.

Please repost/recalculate POST HASTE!!!!! This is bad.

And what is going on with this? How in the world have guanaco 65b, wizardLM 30b, and wizard 13b dropped to the wayyyyy bottom of the list? These are some of the highest-quality models on Hugging Face.

Were the leaderboard's eval tests tuned to evaluate and boost the scores of llama-2?

Truly baffling. I'm not sure why this dashboard has even been posted with such grievous and obvious errors on it.
[Screenshot: Selection_1108.png]

Unfortunately, the leaderboard is all over the place and, it seems, currently halted.
I just hope we get current and accurate results sometime soon. I really would like to use this leaderboard to get a look at LLM progression over the last few months.

But @clefourrier knows better what the current status of the leaderboard is and how far along the fixes are.

It just seems really disingenuous to update the leaderboard as soon as llama-2 is released, put it at the top, and then turn it back off. @Wubbbi @clefourrier

I don't think any other venue would accept leaving such prominently displayed inaccurate data up, in need of a retraction, without a notice that there are currently severe issues with it.

The same should likely have happened after the MMLU incident, but right at the launch of llama 2... that's really a bridge too far, and looks really bad.

This certainly shouldn't be advertised on the Hugging Face website as accurate in its current state, considering that it discriminates significantly against the academic and commercial work of a massive number of individuals and is biased toward Meta's interests (for whatever reason).

@clefourrier Please consider retracting this leaderboard until the issues are resolved, as it is having an outsized effect on the future of Artificial Intelligence, given HF's prominence.

@spaceman7777 I agree with you.

Open LLM Leaderboard org
edited Jul 24, 2023

@spaceman7777 Thank you for your interest!
Which models are you specifically talking about? We have re-run a lot of the llama models due to a tokenization problem which was identified a week ago, which is why they were removed from the leaderboard, and we communicated about this here. However, they should all be back; please let us know if you find that a specific model has disappeared!
FYI, LLaMA 2 scores were not posted "overnight"; we actually partnered with Meta to have their model on the leaderboard on the day of the release, and the evaluations took way longer than just a day :)

@SaylorTwift is investigating what happened with the Guanaco models - it seems to come from the last update we got from the harness - we'll keep the community posted!

Update: We did turn off the queue over the weekend because we're going to publish an update to the leaderboard options, which meant changing our entire file-saving system - it was badly timed with llama-2; we did not foresee that it could be interpreted this way.

@clefourrier I think they mean the two WizardLM models with the red arrow (and the other Vicuna and WizardLM models) that still score 31.6, which would be unreasonable for these models.

In my humble opinion, the biggest issue is just that you can't tell which model score is actually accurate or not, and when that score was taken.
As the developer of WizardLM pointed out in another thread here, it is actually pretty bad to show a wrong score for the model to the public.
I know this score is bogus, you know too. But some people come here, search for "WizardLM", and see an avg score of 31.6 and think the model must be "trash".

It's really bad for your name and research team when your advertised model is scored way lower than it actually should be. It's basically "bad business".

I'm sorry, I don't want to sound rude, by all means.

For the purpose of constructive criticism: I would add a new column with the date the score was taken.
I would also have a quick look through the list for any unreasonable scores (e.g. WizardLM 13B at 31.6) and just rerun those tests.

And for the purpose of a clean queue: if you are unable to run GPTQ models, just don't allow them to be submitted.
Every once in a while someone queues a GPTQ model that this test (apparently) can't run, and it sits there until someone intervenes manually. This just clogs up the queue and makes it a mess; a simple check at submission time, like the sketch below, would catch them.
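Only a rough sketch of what I mean; I'm guessing that GPTQ repos can be recognised by a quantize_config.json file or a "quantization_config" entry in config.json, and this is obviously not your actual submission code:

```python
# Sketch: screen out quantized (e.g. GPTQ) submissions before they enter the queue.
# Assumes such repos can be spotted by a quantize_config.json file or a
# "quantization_config" entry in config.json -- conventions vary, so this is only
# a heuristic, not the leaderboard's actual validation code.
import json
from huggingface_hub import HfApi, hf_hub_download

def looks_quantized(repo_id: str) -> bool:
    files = HfApi().list_repo_files(repo_id)
    if "quantize_config.json" in files:      # common AutoGPTQ convention
        return True
    if "config.json" in files:
        cfg_path = hf_hub_download(repo_id, "config.json")
        with open(cfg_path) as f:
            cfg = json.load(f)
        if "quantization_config" in cfg:     # transformers-style marker
            return True
    return False

# Placeholder repo id -- substitute whatever was actually submitted:
if looks_quantized("someone/some-model-GPTQ"):
    print("Quantized repo: reject at submission time instead of queueing it.")
```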

Thank you very kindly

Open LLM Leaderboard org

@Wubbbi Thank you for your comments! @SaylorTwift is investigating the WizardLM/Vicuna/Guanaco models, but we are quite sure these are the actual results you get when running the leaderboard's setup on these models. Feel free to double-check by running the Harness on these models (using the same commit as us) and tell us what you get!
We really understand how frustrating it can be, and we don't want researchers to feel like their work is badly evaluated.
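Concretely, the idea is roughly the following (just a sketch, not our exact pipeline: it assumes the harness's evaluator.simple_evaluate entry point and the hf-causal backend, and the exact argument names depend on the commit you pin):

```python
# Rough sketch of reproducing one leaderboard number locally.
# First pin the harness to the commit listed in the About tab, e.g.
#   pip install git+https://github.com/EleutherAI/lm-evaluation-harness@<commit>
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf-causal",                                   # plain HF causal-LM backend
    model_args="pretrained=WizardLM/WizardLM-13B-V1.0",  # example repo; swap in the model you care about
    tasks=["hellaswag"],                                 # one of the leaderboard's tasks
    num_fewshot=10,                                      # the leaderboard runs HellaSwag 10-shot
    batch_size=1,
)
print(results["results"])
```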

All models on the leaderboard should have been re-run with the version advertised in the About tab. Hence, we are not planning on adding a "last evaluated" column: scores that have already been computed should not change over time (now that the Harness has been fixed), whereas when we add new evaluations we will pull results from different files, so it would be confusing to know what the date referred to.

I don't see what you are talking about for the GPTQ models; do you have a concrete example?


So, if I'm understanding this correctly: the leaderboard was adjusted to change its scoring criteria; meanwhile, Meta reached out and personally collaborated with leaderboard management to make sure their scores were published, while no other entities have been allowed to jump the queue for this adjusted leaderboard. And the former top models have been either mislabelled, removed, or given erroneous scores?

And then posting new scores was halted?

Please issue a retraction @clefourrier . This doesn't look or sound good by any means, and is clearly listing what are essentially erroneous results, with Meta getting to score their model against the new criteria before everyone else.

A valid and reasonable test should also be evaluating questions in the format of the model it's evaluating.

It doesn't make sense to let one model's team work alongside you, tuning to the new scoring criteria, and then issue the test to everyone else (and not actually everyone: test coverage for the top echelon of models has been extremely limited or botched; guanaco was #3 or so two weeks ago, and now it's scoring less than alpaca 7b).

This leaderboard needs to be retracted @clefourrier , or a disclaimer needs to be issued.

The onus of issuing the test is on the leaderboard, not on the models to custom finetune themselves to understand the testing criteria.

The models are finetuned to be used in the format they designate, not in the format that a specific leaderboard has chosen.

It is common to cite evaluations that have been produced using the proper prompt formatting for a model (without proper prompt formatting, most models do not work). Skipping it is akin to passing C# instructions to a Python, Java, or Scala interpreter and blaming the problem on the models.
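To make the point concrete, here is roughly the same benchmark question raw versus wrapped the way these models were actually trained to be prompted (the templates below are the commonly published Alpaca- and Vicuna-style formats; exact wording varies per fine-tune, so check each model card):

```python
# Sketch: the same question rendered without and with a model's own instruction template.

QUESTION = "Which element has the chemical symbol 'Fe'?"

# What a raw few-shot harness-style prompt looks like (no model-specific wrapping):
raw_prompt = f"Question: {QUESTION}\nAnswer:"

# Alpaca-style instruction wrapping:
alpaca_prompt = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    f"### Instruction:\n{QUESTION}\n\n### Response:\n"
)

# Vicuna-style chat wrapping:
vicuna_prompt = (
    "A chat between a curious user and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the user's questions. "
    f"USER: {QUESTION} ASSISTANT:"
)

for name, prompt in [("raw", raw_prompt), ("alpaca", alpaca_prompt), ("vicuna", vicuna_prompt)]:
    print(f"--- {name} ---\n{prompt}\n")
```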


I wouldn't go as far as @spaceman7777 and call for a retraction or a disclaimer, but let me make sure I understand this correctly so that I don't sound foolish.
[Screenshot: Leaderboard.png]

So there are two WizardLM 13B models at the bottom of the screenshot, both uploaded by WizardLM (the developers themselves), and both scored 31.6 points on average. Okay, I understand that.
You also say that these results are correct and already reevaluated. I believe you. 100%. But it doesn't make any sense to me.
At the top of the screenshot, you can see the same models, just reuploaded by someone else, that score 20+ points more.
Now let's talk about the valid possibility that the WizardLM/WizardLM model is broken:
My question would be: how can the original model, uploaded by its developers, be broken when all the other models, which are basically copies of it, work?
Wouldn't it make more sense for all the other models to be broken too? I don't understand.

The next thing is the 15B Guanaco model with the "?" that I put in.
It scores about 30 points lower than expected. I don't doubt a word you say, but it just looks wrong to me.

I mean, I'm sorry, I really don't want to sound foolish or tell y'all how to do your job; I really appreciate all the hard work. But does that make sense to you?

I think maybe the config is wrong?! Are these models deltas or full models? Should they be run as deltas or as full models? Are they being evaluated as full models when they are actually deltas, and is that why their results are a mess?
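Just so we're talking about the same thing, here is roughly what "delta weights" means in practice (placeholder paths, and only a sketch of the idea rather than any project's actual apply_delta script):

```python
# Sketch: a delta-weight release only contains the difference from the base LLaMA
# weights; the usable model is base + delta, tensor by tensor. If a delta-only repo
# is evaluated as if it were the full weights, nonsense scores are exactly what you
# would expect. Follow the real model card's instructions for the actual procedure.
import torch
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("path/to/llama-13b", torch_dtype=torch.float16)
delta = AutoModelForCausalLM.from_pretrained("path/to/wizardlm-13b-delta", torch_dtype=torch.float16)

delta_sd = delta.state_dict()
with torch.no_grad():
    for name, param in base.state_dict().items():
        param += delta_sd[name]  # recover the fine-tuned weight in place

base.save_pretrained("wizardlm-13b-recovered")
```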

Regarding GPTQ: if you look at this leaderboard's queue, you can see several GPTQ models in it right now. Now, I am not saying this is wrong - I like that! But can they actually be run? Because if they can't, they probably shouldn't even be allowed to be submitted, so the queue doesn't get clogged up with models that can't be run and turn into a mess. If they can, then I'm happy.

Thank you so much for your time and for listening.

Open LLM Leaderboard org
edited Jul 24, 2023

@Wubbbi Thank you for your very helpful comments!
I just checked the "Finished Evaluations" queue (which is in the Submit tab, btw), and behold!
Many of the Wizard models have been submitted as Original weights, not Delta weights, which is likely the source of the bad results!
For those which were submitted both as Original and Delta, we likely have both versions of the files, and we'll just have to remove the non-delta results from the leaderboard.

[Screenshot: image.png]

I'll also add a weight-type column to the main leaderboard, to avoid these kinds of misunderstandings in the future.

Re: the GPTQ models, would you be so kind as to open a new issue so I can check this tomorrow?

@clefourrier great job on this! Attaching a screenshot from last night, showing that a bunch of top-of-the-line llama-1-65b fine-tunes are now back on the leaderboard!

I also did a cursory look over the rest of the leaderboard, and it looks like the other problems noted by me and other users have been resolved as well.

Again, great job - keeping this up to date and accurate as a resource for language model researchers is much appreciated :D

[Screenshot: Selection_1113.png]

Open LLM Leaderboard org

@spaceman7777 @Wubbbi
Glad this issue was resolved! Closing.

clefourrier changed discussion status to closed
