Model evaluation and submission stuck on LB.

#17
by abideen - opened

Hi, the evaluation queue of the Leaderboard has been stuck for a few days. Can you guys check it out and get it back up? Thank you.

It has been stuck since 2024-05-31
(~35 days as of 2024-07-05).

Previously it would run pretty quickly; now progress is frozen (the same numbers of finished/pending models for days).

Question: for a given model size, how much longer does float32 precision take to run, and how much more resources (VRAM or compute) does it need, compared with float16 or bfloat16?
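For the memory half of that question, a rough lower bound follows from the dtype sizes alone: float32 stores 4 bytes per parameter, while float16 and bfloat16 store 2. The sketch below only counts the weights; activations, KV cache and framework overhead come on top, and runtime is not captured at all since it depends on the hardware. The function name and the 9e9 parameter count are illustrative assumptions, not values taken from the leaderboard.

```python
# Rough weight-memory estimate from parameter count and dtype size only.
# Assumption: weights dominate; activations and KV cache are ignored.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2}


def weight_memory_gib(num_params: float, precision: str) -> float:
    """Memory needed just to hold the weights, in GiB."""
    return num_params * BYTES_PER_PARAM[precision] / 1024**3


if __name__ == "__main__":
    num_params = 9e9  # assumed parameter count for a "9B" model
    for precision in BYTES_PER_PARAM:
        gib = weight_memory_gib(num_params, precision)
        print(f"{precision}: {gib:.1f} GiB for weights alone")
```

So, under these assumptions, a float32 run needs about twice the weight memory of the same model in float16 or bfloat16.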

Could it be that too many float32 models are running at the same time, and that is why it is frozen like this?
Are there any logs about the current progress of the running models, e.g. what task/sub-test each one is on, whether progress is moving forward, and any indication of ETA?

By stuck, we mean that the leaderboard is stuck at a count of only 231 finished models, with no new ones being added to the results
(see logs for timeline).

@aaditya @aryopg , any updates?

I get how float32 is cool, if it were feasible, but is the difference enough to justify it? On the Hugging Face leaderboard, the difference between float16 and bfloat16 is often only a few tenths of a percentage point, which is something to keep in mind. Could the number of concurrently running float32 models be limited or de-prioritized, without restarting the progress (rerunning everything from scratch), to prevent clogging? Could there be some info/logging about progress status (the current sub-question/sub-test each task is on), to let us know the ETA and how well it is moving, if at all, and to help gauge whether it is worth continuing? Over this period of time, aren't newer and better models coming out? What is a good way to weigh this?
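To make the de-prioritization idea concrete, here is a minimal sketch of reordering a pending queue so that float32 requests run last while everything else keeps its submission order. The `EvalRequest` class, its fields and the `prioritize` function are hypothetical and only illustrate the idea; this is not how the leaderboard backend is known to work, and it does not touch jobs that are already running.

```python
from dataclasses import dataclass


@dataclass
class EvalRequest:
    # Hypothetical request record; field names are assumptions.
    model: str
    precision: str
    submitted_order: int


def prioritize(pending, deprioritized=("float32",)):
    """Return pending requests with de-prioritized precisions moved to the back.

    Within each group the original submission order is preserved.
    """
    return sorted(
        pending,
        key=lambda r: (r.precision in deprioritized, r.submitted_order),
    )


if __name__ == "__main__":
    queue = [
        EvalRequest("org/model-a", "float32", 0),
        EvalRequest("org/model-b", "bfloat16", 1),
        EvalRequest("org/model-c", "float32", 2),
        EvalRequest("org/model-d", "float16", 3),
    ]
    for request in prioritize(queue):
        print(request.model, request.precision)
```

Because only the order of pending requests changes, nothing already finished would need to be rerun under this scheme.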

https://huggingface.co/datasets/openlifescienceai/requests/blob/main/01-ai/Yi-1.5-9B-Chat-16K_eval_request_False_float32_Original.json
https://huggingface.co/datasets/openlifescienceai/requests/blob/main/01-ai/Yi-1.5-9B-Chat_eval_request_False_float32_Original.json

These two are closer to the core model, so it would make some sense to prioritise them, perhaps more than the others.
