Queue is currently very slow

#676
by Walmart-the-bag - opened

Great, nothing is getting out of pending and everything is jammed again. Can you guys block models like this or something? It's always the big-ish models. Or at least put big models in a separate queue or something.

image.png

Everyone has a right to submit models to their leaderboard, first come, first serve. It wouldn't be fair to ban certain models or people, or favour certain models based on size or originality. This is the OPEN llm Leaderboard, regardless of how you may feel about it.

@Walmart-the-bag While everything @nlpguy wrote is true, I happen to agree with you in spirit.

It annoys me when people create experiment series and clog up the evaluation queue in an attempt to over-fit the leaderboard's tests, usually resulting in almost useless Mistrals that score higher on the leaderboard than far more powerful models like Mixtral 8x7b, or even Mixtral 8x22b.

However, in this case the user isolated each of the Mixtral 8x7b experts and is testing them individually. I think breaking up an MoE's experts is too destructive to ever prove useful, but it's an interesting experiment.

Pretty annoying that it clogs everything up, though. It's indeed an interesting experiment, but the pending queue was at ~110 and it was mainly those models.

There is also an issue where models are duplicating? Hopefully that doesn't mean they're going to be run twice.
image.png

Right now, still stuck above 100 lmao
image.png

@Walmart-the-bag I'm with you, it's very annoying, especially since I'm keeping an eye on several models after the release of 8x22. However, the cards cost tens of thousands of dollars, so it's perfectly reasonable to put evaluations on hold when resources are needed for things like inference and training.

Isn't there a maximum number of models one user can submit in a given time window?
At least I've gotten that message in the past (shameless admission here lol).

I would love to chime in here! I think it would be great to have some rules for submitting models that you don't own:

  • only the authors of the models should be able to submit them

reasons:

  • I create a lot of models, and I don't necessarily make them to be on any leaderboard. (There is no way to stop others from submitting them unless I remove the license from the model, and that breaks search/discovery.)
  • When I do want to put something on the leaderboard, I cannot, because someone just submitted 11 of my models in the past 7 days. Now I have to wait and hope nobody else does the same in the coming days.

If limiting submissions to the author(s) is too restrictive and unfair, maybe we can at least not allow others to (re)submit a model that is either already on the board or is way too old to bring any value:

image.png

This model is 7 months old! Nothing has changed and nothing ever will, so even if the queue were empty, why waste valuable resources? (Also, it comes from the people who invented the llm-eval we use here; they have detailed benchmarks in the model's card.)

Hugging Face H4 org

Hi all!
Our research cluster was very full this weekend (not going to spoil the surprise, but cool things are coming up soon from other teams :D). In those cases, the evaluation queue gets put on hold.

put big models in a separate queue or something.

Model size is mostly uncorrelated with queue speed - every model is allocated a single node, and we use as many nodes as are available. If more important jobs run on the cluster, evaluation jobs are cancelled and re-scheduled.

There is also an issue where models are duplicating? Hopefully that doesn't mean they're going to be run twice.

"Duplicates" are usually models submitted in different precision, which can be interesting for the community.

Everyone has a right to submit models to their leaderboard, first come, first serve.
Isn't there a maximum number of models one user can submit in a given time window?

We really agree that some people have been abusing the system - that's why we put in place the per-week submission limit based on the model organisation/user name, but people have found ways around it. It's annoying for us too, because all the time we lose "playing the police" is time we can't spend on more interesting things for the community, like the Open LLM Leaderboard v2 or lighteval features. We'll likely try to introduce a voting system for v2, so that the community can prioritize the most relevant models.

only the authors of the models should be able to submit them

We thought about it, but decided against it (a couple of months ago; I would be OK rediscussing), as in some cases the community wants to get the true evaluation numbers of a newly created model, and the model creator is... less interested in getting an accurate evaluation ^^

model creator is... less interested in getting an accurate evaluation ^^
Makes sense, at the end of the day this is a good problem to have. It means we still get new models on a daily basis and a desire to know their scores. :)

Thanks @clefourrier :)

clefourrier changed discussion title from Stuck. to Queue is currently very slow

Thanks @clefourrier for the detailed explanation!

I was wondering if you had by any chance already seen my comment on people donating hardware: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard/discussions/570#661cb5fb439dad90fc2065a5

Basically, I'm asking about the possibility of allowing distributed computing where users can donate hardware (like folding@home or StableHorde), but in this case for the purpose of completing specific (parts of) LLM-evaluation benchmarks handed out by a master server that distributes tasks and keeps track of incoming evaluation results. I'm wondering whether this would be really hard to build or not. It sounds like it could be fairly easy to do, but the fact that it hasn't already been done tells me it's probably a lot more difficult than I imagine.
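To make the idea concrete, here is a very rough sketch of the kind of master/worker loop I'm imagining (the Flask server, endpoint names, and task format are all made up for illustration; nothing like this exists today):

```python
# Rough sketch only: a hypothetical coordinator that hands out benchmark
# shards to volunteer workers and collects their results. The endpoints,
# task format, and use of Flask are assumptions for illustration.
from flask import Flask, jsonify, request

app = Flask(__name__)

# Pending work: (model, benchmark, shard) descriptions -- dummy examples.
pending_tasks = [
    {"task_id": 0, "model": "some-org/some-model", "benchmark": "arc_challenge", "shard": 0},
    {"task_id": 1, "model": "some-org/some-model", "benchmark": "arc_challenge", "shard": 1},
]
completed_results = {}  # task_id -> metrics submitted by a worker

@app.route("/task", methods=["GET"])
def get_task():
    """Hand the next pending shard to a volunteer worker."""
    if not pending_tasks:
        return jsonify({"task": None})
    return jsonify({"task": pending_tasks.pop(0)})

@app.route("/result", methods=["POST"])
def post_result():
    """Collect the metrics a worker computed for its shard."""
    payload = request.get_json()
    completed_results[payload["task_id"]] = payload["metrics"]
    return jsonify({"status": "ok", "received": payload["task_id"]})

if __name__ == "__main__":
    app.run(port=8000)
```

A worker would then just poll /task, run its shard locally with the eval harness, and POST the metrics back; the hard parts (dropped workers, mixed hardware, reproducibility) are exactly what this toy version ignores.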

Edit: Let me rephrase "easy" to "doable and therefore possibly worth the effort", as that is probably a more realistic description of reality lol.
Edit 2: For example, I probably benchmarked around 150 models for my own leaderboard (https://huggingface.co/spaces/CultriX/Alt_LLM_LeaderBoard), simply because I had no way of donating resources to the main one (and also because I liked being able to run another set of tests there, I admit, but that is just a nice extra; if I could have donated to the main leaderboard I would have preferred that, I think).
Edit 3 (final one, I promise): I don't know if I'm missing something, but if you read this, please add an export-to-CSV option to the Open LLM Leaderboard like the one I added to my own. It's really useful to be able to download performance data for a ton of models with a single click! I might just be blind and this option may already be there, but I couldn't find it.
FireShot Pro Webpage Capture 012 - 'Yet Another LLM Leaderboard - a Hugging Face Space by CultriX' - huggingface.co.png
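Something along these lines is all I mean by an export option - a toy Gradio sketch with a dummy dataframe and made-up numbers standing in for the real leaderboard table (not the actual leaderboard code):

```python
# Toy sketch of an "Export CSV" button for a Gradio leaderboard space.
# The dataframe contents are dummy values; a real space would load the
# actual results instead.
import gradio as gr
import pandas as pd

results_df = pd.DataFrame(
    {"model": ["org-a/model-x", "org-b/model-y"], "average": [60.1, 72.3]}
)

def export_csv(df: pd.DataFrame) -> str:
    """Write the currently displayed table to a CSV file and return its path."""
    path = "leaderboard_export.csv"
    df.to_csv(path, index=False)
    return path

with gr.Blocks() as demo:
    table = gr.Dataframe(value=results_df, label="Leaderboard")
    export_btn = gr.Button("Export CSV")
    csv_file = gr.File(label="Download")
    # Clicking the button converts the visible table to CSV and offers it for download.
    export_btn.click(export_csv, inputs=table, outputs=csv_file)

if __name__ == "__main__":
    demo.launch()
```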

Hugging Face H4 org

Hi @CultriX ,

People donating hardware SETI-style is not trivial to set up (how do you handle people who stop sharing in the middle of an eval? correct model parallelism/data parallelism across different hardware sizes? jumping from one machine to the next for a single evaluation? reproducibility?) and we absolutely do not have the bandwidth for this atm. The reproducibility aspect is a big part of it too, imo. At the moment, if your model is on the Open LLM Leaderboard, you know that it's been evaluated on the same hardware in exactly the same setup as all the other models.
Thanks for wanting to share your hardware though!

However, if community members want to explore setting up such a system, we could test it, and maybe discuss running part of the leaderboard with it.

There is a tool that does the export; you'll need to look under "Community Resources" in the discussions.

Closing the discussion to avoid it going off on tangents; if you want to discuss the hardware-sharing idea, please do so in the other conversation :)

clefourrier changed discussion status to closed

Been reading up; it seems solutions are being found and will hopefully work. Thanks.
