No good way to identify the number of activated parameters causes Mixtral evaluation failures

#680
by 0-hero - opened

Hey @clefourrier, I noticed all the 8x22B fine-tunes failed:
HuggingFaceH4/zephyr-orpo-141b-A35b-v0.1 (@lewtun?)
migtissera/Tess-2.0-Mixtral-8x22B (@migtissera)
0-hero/Matter-0.2-8x22B (mine)
and maybe a few more I missed.

Open LLM Leaderboard org

Hi all!
As you can see from the job ids (-1), the jobs were never launched - our backend assumes these models have 140B activated parameters (too big for the cluster, hence skipped), when they actually have 140B total parameters with considerably fewer activated. I'm not sure there's an easy way for us to make that distinction automatically at the moment, but we'll gladly update our backend and re-submit your models once we can get this information.
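
For reference, the distinction is recoverable from a Mixtral-style config: num_local_experts gives the total experts per layer, while num_experts_per_tok gives how many are actually routed per token. Below is a minimal sketch of an activated-parameter estimate along those lines - it assumes Mixtral's config field names and huggingface_hub's get_safetensors_metadata for the total count, and is not the leaderboard's actual backend logic:

```python
# Hedged sketch: estimate activated params for a Mixtral-style MoE.
# Assumes Mixtral config field names (num_local_experts, num_experts_per_tok);
# other MoE architectures name these differently.
from huggingface_hub import get_safetensors_metadata
from transformers import AutoConfig

def estimate_activated_params(model_id: str) -> int:
    cfg = AutoConfig.from_pretrained(model_id)
    meta = get_safetensors_metadata(model_id)
    total = sum(meta.parameter_count.values())  # total params from safetensors headers

    n_experts = getattr(cfg, "num_local_experts", 0)
    k = getattr(cfg, "num_experts_per_tok", 0)
    if not n_experts or not k:
        return total  # dense model: every parameter is activated

    # Expert FFN params per layer: 3 projections (gate/up/down) per expert.
    per_expert = 3 * cfg.hidden_size * cfg.intermediate_size
    all_experts = cfg.num_hidden_layers * n_experts * per_expert
    routed = cfg.num_hidden_layers * k * per_expert

    # Activated = everything outside the experts, plus the k routed experts.
    return total - all_experts + routed
```

For an 8x22B-style config (hidden_size 6144, intermediate_size 16384, 56 layers, 8 experts, 2 routed per token) this works out to roughly 39B activated out of ~141B total, consistent with the figures Mistral quotes for Mixtral-8x22B.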

clefourrier changed discussion title from 8x22B's failing to No good way to identify the number of activated parameters causes Mixtral evaluation failures

Hey, is this fixed now or still waiting?

Open LLM Leaderboard org

Hi everyone!

Thanks to @SaylorTwift, we can now submit MoE models bigger than 140B for evaluation, so I've resubmitted this one for @MaziyarPanahi.

Please provide me with the requests files for similar models and I'll resubmit them too.

Fantastic! Thanks @alozowski and @SaylorTwift

Open LLM Leaderboard org

Resubmitted both migtissera/Tess-2.0-Mixtral-8x22B and 0-hero/Matter-0.2-8x22B 👍

In that case, I'll close this discussion. If there are any problems with model evaluations, please open a new one for each model.

alozowski changed discussion status to closed

Says FAILED

I think all 3 failed again

Yes, I created a separate discussion for my models. Two of the failed models were 8B, so something else might have happened.

alozowski changed discussion status to open
Open LLM Leaderboard org

Hi everyone!

Hmm, I see - all these models have indeed failed. Let me investigate.

Hey, any update here on the Tess model? Do you want me to open a separate ticket to track it? This is the model: https://huggingface.co/datasets/open-llm-leaderboard/requests/blob/main/migtissera/Tess-2.0-Mixtral-8x22B_eval_request_False_float16_Original.json

There seems to be something going on with the LB eval cluster, at least for some large models. Even my Llama-3-70B submission has been running for the last 2 days. https://huggingface.co/datasets/open-llm-leaderboard/requests/blob/main/MaziyarPanahi/Llama-3-70B-Instruct-DPO-v0.1_eval_request_False_bfloat16_Original.json

Open LLM Leaderboard org

Hi!

We're still looking into ways to launch MoE models correctly on our backend - we also had network failures on our cluster last week. We'll keep you posted as soon as we have updates.

@MaziyarPanahi, what you are reporting is normal and unrelated to the current issue :) When the research cluster is full, evaluation jobs are cancelled and rescheduled, but we keep the status as "running" to keep things simple for end users. It's likely your model went through "running, cancelled, rescheduled, running, ..."

Hi @clefourrier

Thanks for the update regarding MoE models, appreciate it.

but we keep the status as "running" to keep things simple for end users. It's likely your model went through "running, cancelled, rescheduled, running, ..."

I didn't know that, it makes sense now. Thank you :)

Open LLM Leaderboard org
edited May 4

Doing it right now, tell me if it works.

To this day, the only 8x22B models on the Leaderboard are from MistralAI; I don't believe we have ever had a successful eval of an 8x22B fine-tune. @clefourrier, is the issue resolved, with finding free resources the only remaining limitation? Or do we still not know whether MoE models of this size might get rejected?

Open LLM Leaderboard org

Those we launched manually when they came out because they were important for the community.
Good question, I think @SaylorTwift took a look at the backend side so I'll let him answer.
(The main problem we had was (as indicated in the title) identifying the number of activated params in MoEs.)

Open LLM Leaderboard org

Hi! Your models failed during download and have been requeued. However, the cluster is really full atm, so it might take a while for your models to be run.

Hi! Your models failed during download and have been requeued. However, the cluster is really full atm, so it might take a while for your models to be run.

I think that's good news - at least it got an instance to start the download process :)
Thanks @SaylorTwift appreciate the help

Open LLM Leaderboard org
edited May 27

Since all three models failed, I've resubmitted all of them.

I'll keep an eye on them and check the evaluation status

Sorry @alozowski, but the model failed again.

Hey there!

Any update here?

Thanks!

Open LLM Leaderboard org
edited Jun 5

Hey there!

Edit: We're having a problem with the parameter estimation for these models and can't launch them at the moment: they are estimated as 140B models and therefore request multiple nodes. That wouldn't be too big a problem if we weren't so compute-tight atm.
I won't relaunch them for now, but once we do our big update we'll investigate this a bit more. You can also ping people on the safetensors side about the params estimation.
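
For anyone poking at this from the safetensors side: the headers do expose per-tensor shapes, so one hedged heuristic (a sketch, not the leaderboard's backend) is to count parameters while skipping all but num_experts_per_tok experts per layer. The block_sparse_moe.experts.N. name pattern below is Mixtral's; other MoE implementations name their expert tensors differently:

```python
# Hedged sketch: count activated params from safetensors tensor shapes,
# keeping the first k experts per layer as stand-ins for the k experts
# actually routed per token. Assumes huggingface_hub's metadata layout
# (files_metadata -> tensors -> shape) and Mixtral's tensor naming.
import math
import re

from huggingface_hub import get_safetensors_metadata

EXPERT_RE = re.compile(r"block_sparse_moe\.experts\.(\d+)\.")

def activated_param_count(repo_id: str, experts_per_tok: int = 2) -> int:
    meta = get_safetensors_metadata(repo_id)
    total = 0
    for file_meta in meta.files_metadata.values():
        for name, tensor in file_meta.tensors.items():
            m = EXPERT_RE.search(name)
            if m and int(m.group(1)) >= experts_per_tok:
                continue  # treat the non-routed experts as inactive
            total += math.prod(tensor.shape)
    return total
```

Since every expert in a layer has the same shape, keeping the first k experts gives the same count as a config-based estimate; the point is that tensor names plus shapes are enough, without downloading any weights.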

Yeah, they're definitely beefy! Okay, sounds good, Clementine!

clefourrier changed discussion status to closed
