72b models eval failed

#689
by paloalma - opened
Hugging Face H4 org

Hi! Next time, please reopen the issues instead of opening new ones - that way it tags everyone who was part of the convo.

Operation was aborted for all models when trying to assemble the shards. Are you sure your models are formatted properly?

Loading checkpoint shards: 100%|██████████| 82/82 [00:39<00:00,  2.10it/s]
[E ProcessGroupNCCL.cpp:475] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=3000000) ran for 3000225 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:916] [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=3000000) ran for 3000225 milliseconds before timing out.
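
For anyone reading the log above: the shards load to 100%, and the failure comes afterwards, on a single-element ALLREDUCE on rank 1 that hits the configured timeout. Below is a minimal sketch, assuming you only want to pull those fields out of a watchdog line. `parse_watchdog_line` is a hypothetical helper written for illustration; it is not part of PyTorch or of the leaderboard code, and the regular expression is only matched against the line format shown in this thread.

```python
import re

# Watchdog line copied verbatim from the log in this thread.
LOG_LINE = (
    "[E ProcessGroupNCCL.cpp:475] [Rank 1] Watchdog caught collective operation "
    "timeout: WorkNCCL(SeqNum=4, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, "
    "Timeout(ms)=3000000) ran for 3000225 milliseconds before timing out."
)

# Assumption: the line format is the one shown above; other PyTorch versions
# may format this message differently.
PATTERN = re.compile(
    r"\[Rank (?P<rank>\d+)\].*?SeqNum=(?P<seq>\d+), OpType=(?P<op>\w+),"
    r".*?Timeout\(ms\)=(?P<timeout_ms>\d+)\) ran for (?P<elapsed_ms>\d+) milliseconds"
)


def parse_watchdog_line(line):
    """Return rank, op type, and timings from a watchdog line, or None."""
    match = PATTERN.search(line)
    if match is None:
        return None
    return {
        "rank": int(match["rank"]),
        "seq_num": int(match["seq"]),
        "op_type": match["op"],
        "timeout_min": int(match["timeout_ms"]) / 60000,  # ms -> minutes
        "elapsed_ms": int(match["elapsed_ms"]),
    }


info = parse_watchdog_line(LOG_LINE)
print(info)
```

Read this way, the configured timeout is 3,000,000 ms, i.e. 50 minutes, and the collective involved one element (`NumelIn=1`), so the log alone points to a rank waiting on a synchronization step rather than to a large data transfer.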

Hi @clefourrier!

Thanks for the logs!

We are surprised, as we have successfully tested a lot of them without error.
Here are the benchmark results on EQBench for the following two models (run on 5 H100s):
https://huggingface.co/datasets/open-llm-leaderboard/requests/commit/aa62e2c672268c5737dcb2cd554872ad26b2ccb6
Benchmark Results
====================================
Id : ECE-TW3-JRGL-V1
Date : 2024-04-09 19:23:47
Success : True
Model : paloalma/ECE-TW3-JRGL-V1
this_score : 82.8
Bench tries : 0
Parseable : 171.0
EQ-Bench version : v2

and
https://huggingface.co/datasets/open-llm-leaderboard/requests/commit/7f30b5503a42c4a4174b3a3a9b081fcc061b9c4f
Benchmark Results
====================================
Id : TW3-JRGL-v2
Date : 2024-04-10 05:46:49
Success : True
Model : paloalma/TW3-JRGL-v2
this_score : 82.15
Bench tries : 0
Parseable : 170.0
EQ-Bench version : v2


Is it possible to submit these two again for evaluation?

Hugging Face H4 org

Thank you for sharing the EQBench results! We are investigating the problem with your models on our side. We'll keep you posted as we figure things out and will relaunch your models as soon as the problem is solved.

Hello guys!

Thanks for relaunching the models!
Unfortunately, they seem to have failed once again. Do you have any idea why they can't be evaluated, or any leads?
We're wondering whether it's because these are merged models, as we haven't seen many of them at this size.

Hope to receive good news from you soon!

Thanks, HF Team!
@alozowski @clefourrier

Hugging Face H4 org

Hi!
The merged aspect is an interesting hypothesis, but we've never had issues with merged models before.
However, the CUDA issue above has appeared consistently on every restart of your models, and only for them, so we have no comparison point and it's hard to pinpoint where the problem comes from. We'll close this issue, as we don't have the bandwidth to investigate.

clefourrier changed discussion status to closed

Hello guys!

It's incredible! We saw that TW3-JRGL-V1 went through and is now top 1 on the leaderboard. We are so excited!

Thank you for running the evaluation again!

Do you think you'll be able to run the other ones again?

Also, what was the final cause of all these errors when running the other models?

Thank you for everything!

@alozowski @clefourrier

Also, just FYI: since it came first, we changed its name from TW3-JRGL-V1 to Le_Triomphant-ECE-TW3. Will the change also appear on the leaderboard?

Thank you very much.

Hugging Face H4 org

Hi @paloalma,

Congrats! ✨

Regarding the renaming, I see that there is a separate request file with a model called Le_Triomphant-ECE-TW3. Is this a different model from TW3-JRGL-v1?

alozowski changed discussion status to open

Hi @alozowski!

Thank you!

It is indeed the same model. We were afraid that it had been deleted and that the scores were lost, so we tried to resubmit it.

So Le_Triomphant-ECE-TW3 and TW3-JRGL-v1 are the same.

Thanks!

Hugging Face H4 org

Great! I renamed your model; please check out my screenshot.

Screenshot 2024-05-07 at 16.35.36.png

Hugging Face H4 org

@paloalma can I help you with something else?

Thank you very much, @alozowski!

Just a quick question: what were the issues with our models, and what was modified to allow them to get through the evaluation?

It would be very interesting for us to know.

Again, thank you for your work!

Paloalma
