HuggingFaceH4/open_llm_leaderboard · 72b models eval failed

19 days ago

It seems the models submitted yesterday failed again.

https://huggingface.co/datasets/open-llm-leaderboard/requests/blob/main/TW3PartnersLLM/tw3jrglv3_eval_request_False_float16_Original.json

https://huggingface.co/datasets/open-llm-leaderboard/requests/blob/main/paloalma/TW3-JRGL-v1_eval_request_False_bfloat16_Original.json

https://huggingface.co/datasets/open-llm-leaderboard/requests/blob/main/paloalma/ECE-TW3-JRGL-V3_eval_request_False_bfloat16_Original.json

Could this one be given priority, in case submitting too many at once is what makes the system fail?

https://huggingface.co/datasets/open-llm-leaderboard/requests/commit/aa62e2c672268c5737dcb2cd554872ad26b2ccb6

We are currently benchmarking it on EQBench and also need the other benchmark scores for our research paper.

Is it possible to also get the logs ?

Thanks for your help,
Andre

clefourrier

Hugging Face H4 org 19 days ago

Hi! Next time, please reopen the issues instead of opening new ones - that way it tags everyone who was part of the convo.

Operation was aborted for all models when trying to assemble the shards. Are you sure your models are formatted properly?

Loading checkpoint shards: 100%|██████████| 82/82 [00:39<00:00,  2.10it/s]
[E ProcessGroupNCCL.cpp:475] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=3000000) ran for 3000225 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:916] [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=3000000) ran for 3000225 milliseconds before timing out.
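
For anyone reading the watchdog lines above, here is a minimal sketch that pulls the numbers out of the message. The helper name `parse_watchdog_timeout` and the regular expression are illustrative assumptions written for this exact log line; they are not part of PyTorch or of the leaderboard tooling, and other PyTorch versions may word the message differently. It shows that the configured timeout of 3000000 ms is 50 minutes, and that rank 1 waited on a single-element ALLREDUCE for that whole time. A collective timeout generally means at least one rank did not complete the operation in time; the log alone does not say why.

```python
import re

# Watchdog line copied from the log above.
LOG_LINE = (
    "[E ProcessGroupNCCL.cpp:475] [Rank 1] Watchdog caught collective operation "
    "timeout: WorkNCCL(SeqNum=4, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, "
    "Timeout(ms)=3000000) ran for 3000225 milliseconds before timing out."
)

# Pattern written for this specific message (an assumption, not a stable format).
WATCHDOG_RE = re.compile(
    r"\[Rank (?P<rank>\d+)\].*?OpType=(?P<op>\w+).*?"
    r"Timeout\(ms\)=(?P<timeout_ms>\d+)\) ran for (?P<elapsed_ms>\d+) milliseconds"
)


def parse_watchdog_timeout(line):
    """Return rank, op type, timeout in minutes and overrun in ms, or None if no match."""
    match = WATCHDOG_RE.search(line)
    if match is None:
        return None
    timeout_ms = int(match["timeout_ms"])
    elapsed_ms = int(match["elapsed_ms"])
    return {
        "rank": int(match["rank"]),
        "op": match["op"],
        "timeout_min": timeout_ms / 60_000,      # milliseconds to minutes
        "overrun_ms": elapsed_ms - timeout_ms,   # how far past the limit it ran
    }


info = parse_watchdog_timeout(LOG_LINE)
print(info)
```

Running the same helper over a full log file, line by line, would show whether every rank reports the same `SeqNum` and op type, which can help narrow down where the ranks stopped making progress.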

paloalma

19 days ago

Hi ! @clefourrier

Thanks for the logs !

We are surprised as we have successfully tested a lot of them without error.
Here are the benchmark results on EQBench for the following two models (run on 5 H100s):
https://huggingface.co/datasets/open-llm-leaderboard/requests/commit/aa62e2c672268c5737dcb2cd554872ad26b2ccb6
Benchmark Results
====================================
Id : ECE-TW3-JRGL-V1
Date : 2024-04-09 19:23:47
Success : True
Model : paloalma/ECE-TW3-JRGL-V1
this_score : 82.8
Bench tries : 0
Parseable : 171.0
EQ-Bench version : v2

and
https://huggingface.co/datasets/open-llm-leaderboard/requests/commit/7f30b5503a42c4a4174b3a3a9b081fcc061b9c4f
Benchmark Results
====================================
Id : TW3-JRGL-v2
Date : 2024-04-10 05:46:49
Success : True
Model : paloalma/TW3-JRGL-v2
this_score : 82.15
Bench tries : 0
Parseable : 170.0
EQ-Bench version : v2

Is it possible to submit these two again for evaluation ?

alozowski

Hugging Face H4 org 15 days ago

Thank you for sharing the EQBench results! We are investigating the problem with your models on our side. We'll keep you posted as we figure things out and will relaunch your models as soon as the problem is solved.

paloalma

7 days ago

Hello Guys !

Thanks for relaunching the models !
Unfortunately, they seem to have failed once again. Do you have any idea why they can't be evaluated, or any leads?
We're wondering whether it's because this is a merged model, since we haven't seen many merged models of this size.

Hope to receive good news from you soon !

Thanks, HF Team !
@alozowski @clefourrier

clefourrier

Hugging Face H4 org 5 days ago

Hi!
The merged aspect is an interesting hypothesis, but we've never had issues for merged models before.
However, since the CUDA issue above has appeared consistently on all the restarts of your models, and only for them specifically (so it's hard to pinpoint where the problem comes from since we have no comparison point), we'll close this issue as we don't have the bandwidth to investigate.

clefourrier changed discussion status to closed 5 days ago

paloalma

1 day ago

Hello guys !!!!

It's incredible !!! We saw that TW3-JRGL-V1 went through and is now top 1 on the leaderboard. We are so excited !!!!

Thank you for running the evaluation again !

Do you think you'll be able to run the other ones again ?

Also what was the final cause of all these errors when running the other models ?

Thank you for everything !

@alozowski @clefourrier

paloalma

1 day ago

Also, just FYI: since it came first, we changed its name from TW3-JRGL-V1 to Le_Triomphant-ECE-TW3. Will the change also appear on the leaderboard?

Thank you very much.

alozowski

Hugging Face H4 org about 9 hours ago

Hi @paloalma ,

Congrats! ✨

Regarding the renaming, I see that there is a separate request file for a model called Le_Triomphant-ECE-TW3. Is this a different model from TW3-JRGL-v1?