Failed reason

#450
by Mihaiii - opened

@SaylorTwift Could you please rerun the pipeline? I can't resubmit them. :(

Hugging Face H4 org

Hi! There was a connection error when loading your models, so I added them back to pending.

clefourrier changed discussion status to closed
Mihaiii changed discussion status to open
Hugging Face H4 org

Hi! Thanks for the complete issue!
I checked the logs: they were all cancelled because of pre-emption, so I added them back to pending :)

clefourrier changed discussion status to closed

@clefourrier Thanks for adding them back to pending, but all of them failed again.

Is it something I did wrong? I don't think I did anything different compared to models that successfully ran.

Mihaiii changed discussion status to open
Hugging Face H4 org

Hi @Mihaiii,
I'm very sorry about the inconvenience! We're changing our backend from one cluster to another and had a bunch of env failures. I passed them back to pending again, and I hope we will have fixed everything after tomorrow.
cc @SaylorTwift

clefourrier changed discussion status to closed
Mihaiii changed discussion status to open
Hugging Face H4 org

Hi,
FYI, the new cluster is having serious connectivity problems. We are putting all evals on hold until it's fixed, and we'll relaunch all FAILED evals from the past 2 days.

I resubmitted them with bfloat16 precision instead of float16, and they were evaluated. This works for me, so I'm closing this thread.

Mihaiii changed discussion status to closed
