Llama-3 70b model eval failed and can't submit again

#709
by DavidGF - opened

Hey,

our model failed on the eval process: https://huggingface.co/VAGOsolutions/Llama-3-SauerkrautLM-70b-Instruct

here is the log file: https://huggingface.co/datasets/open-llm-leaderboard/requests/blob/main/VAGOsolutions/Llama-3-SauerkrautLM-70b-Instruct_eval_request_False_bfloat16_Original.json

I suspect that something went wrong in the system, as many other models could not be evaluated at the same time.

Regards,
David

Open LLM Leaderboard org

Hi @DavidGF !

Thank you for providing the request file. I checked the log for your model, and there was an issue on our side with a recent backend update. We fixed it and resubmitted your model, so hopefully everything will be fine now!

I'm closing this issue; please reopen it or write here in case of any other problems with this model.

alozowski changed discussion status to closed

Hi @alozowski ,
Thank you, but unfortunately, after a few hours of running, the model failed again.
Hope it can be fixed soon.
Thanks in advance!

alozowski changed discussion status to open
Open LLM Leaderboard org

Hi @DavidGF ,

I checked the logs; it was a network issue, so I resubmitted your model.

Hope the evaluation will go well now!

alozowski changed discussion status to closed

Hey @alozowski ,
Unfortunately the evaluation failed again.
Could you look over it again?
Thank you in advance again!

Hey @alozowski and @clefourrier
We checked the model again and did not find any problems with the download or any corrupt files. We also downloaded it directly from the repo and subjected it to various benchmarks ourselves; here too, everything worked without any problems. If there are any further steps we can take to ensure the evaluation can be carried out, we would be happy to hear from you.
Thanks in advance!

Open LLM Leaderboard org

Hi!
I can confirm we managed to download it properly. I put it back in pending; it was launched on a faulty node.
Feel free to ping us if it fails again (hopefully not)

Hi, thanks for putting it in the queue again!
But something seems to be off: in the meantime, several 70B models have been benchmarked while ours is still hanging in the queue.
Could you please have a look?
Thank you!

Open LLM Leaderboard org

Hi @DavidGF ,

As I can see now, the model status is "running" according to the log. Let's wait for the job to finish and see whether it fails (hopefully it'll be fine).

Hey @alozowski and @clefourrier ,
Unfortunately, the evaluation for the model failed again after 10 days in the queue: logfile
I believe there is still a faulty node in your system because about 23 models were queued for 10 days until they finally failed. Meanwhile, many other large models have been evaluated (probably on a different node).
Our model has been struggling to be evaluated for almost 20 days now.
I am aware that the leaderboard is a free service you provide, and I really appreciate that! But I would also be happy if you could still evaluate our model. If I can help in any way, please let me know!
Many thanks in advance

Open LLM Leaderboard org

Hi!
I passed your model to pending - it was actually preempted by one of our trainings, which just started and is taking the full cluster, so it will be evaluated again when some nodes free up, which I fear won't be soon.

Hi @clefourrier
Unfortunately the evaluation failed again: logfile
I am aware that the cluster is quite full right now and that you are planning something new and exciting!
However, we have now been waiting a month for this model to be evaluated, and it would be great to have a model represented on the leaderboard that was trained in German and still achieves strong results in English benchmarks.
We would really appreciate it if you could put the model back into the running queue with a higher priority.
I think it brings a lot of added value for the community if you can show that models trained in foreign languages can continue to achieve strong results in the base language.

Many thanks in advance

Open LLM Leaderboard org

Hi @DavidGF !
I just passed your model to pending again. As models get evaluated by time of submission, the "older" a rescheduled model is, the higher its priority.

To give you an idea, however: a 70B model requires between 10 hours and a full day to be evaluated on 8 H100 GPUs. Our cluster has not had resources free for that long in a while, so there is no guarantee the evaluation will happen quickly.
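(The "evaluated by time of submission" rule above can be pictured as a priority queue keyed on the original submission timestamp, so a rescheduled model keeps its old timestamp and outranks newer submissions. A minimal illustrative sketch; the class and model names here are hypothetical and not the leaderboard's actual implementation:)

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class EvalRequest:
    # Timestamp of the *original* submission; a rescheduled model keeps it,
    # so it sorts ahead of anything submitted later.
    submitted_at: float
    model_id: str = field(compare=False)

queue: list[EvalRequest] = []
heapq.heappush(queue, EvalRequest(1714500000, "some-org/newer-70b-model"))
heapq.heappush(queue, EvalRequest(1713000000, "VAGOsolutions/Llama-3-SauerkrautLM-70b-Instruct"))

# The oldest submission is popped first, regardless of push order.
next_up = heapq.heappop(queue)
print(next_up.model_id)
```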

Hello @clefourrier ,
Thank you for putting it back in the running queue and I hope that we are lucky enough to have the model evaluated soon :)

As I said, I absolutely understand that if the resources are needed for another project, the computing resources for the free Open LLM Leaderboard service must be limited.
I just can't fully understand how your prioritization process works, given that other large LLMs that were submitted after our model were evaluated promptly without any problems.

As far as I can tell, all of these models were submitted after our submission or relatively close to it, while evaluation resources were already limited:

MaziyarPanahi/Llama-3-70B-Instruct-DPO-v0.1,
MaziyarPanahi/Llama-3-70B-Instruct-DPO-v0.2,
MaziyarPanahi/Llama-3-70B-Instruct-DPO-v0.3,
MaziyarPanahi/Llama-3-70B-Instruct-DPO-v0.4,
mmnga/Llama-3-70B-japanese-suzume-vector-v0.1,
tenyx/Llama3-TenyxChat-70B,
failspy/llama-3-70B-Instruct-abliterated,
abhishek/autotrain-llama3-70b-orpo-v1,
shenzhi-wang/Llama3-70B-Chinese-Chat,
NeverSleep/Llama-3-Lumimaid-70B-v0.1-alt,
ValiantLabs/Llama3-70B-Fireplace,
NeverSleep/Llama-3-Lumimaid-70B-v0.1,
abacusai/Llama-3-Giraffe-70B-Instruct,
jondurbin/airoboros-70b-3.3,
migtissera/Tess-2.0-Llama-3-70B,
gradientai/Llama-3-70B-Instruct-Gradient-262k,
migtissera/Tess-2.0-Llama-3-70B-v0.2,
gradientai/Llama-3-70B-Instruct-Gradient-524k,
nvidia/Llama3-ChatQA-1.5-70B.

So please don't get me wrong, I'm just wondering how this situation could have come about.

Open LLM Leaderboard org

Hi!

Several options are possible:

  1. some of these were submitted quantized (hence taking much less memory/time, because we can then play with the parallelism)
  2. some of these finished under the usual time (which could be the case for models that are not verbose)
  3. your model was cancelled at a time when the rescheduling failed, which was not the case for these models

Thanks for the information!
Unfortunately, the evaluation failed again, and many other models could not be evaluated alongside ours. I strongly suspect it has something to do with the 8x22B models: every time there is a Mixtral 8x22B in the queue, all models on the same node appear to crash.
Sorry to be so annoying, but could you please add the model to the queue again, @clefourrier?
Best regards!

Open LLM Leaderboard org

Hi!
I don't understand what you mean by "all models appear to crash on the same node". I just checked the log: your job was cancelled due to preemption and not rescheduled. It's not the first time this has happened; I think our rescheduling system broke. I'm relaunching your model.
