Why is the Running Evaluation Queue so slow?

#176
by pankajmathur - opened

I submitted psmathur/model_42_70b 6 days ago, and I have been continuously watching the Running and Pending Evaluation Queues for the last few days, but they are really slow, like super slow...

I have run the same exact evals suggested in the "About" tab. I downloaded the harness with

wget https://github.com/EleutherAI/lm-evaluation-harness/archive/b281b0921b636bc36ad05c0b0b0763bd6dd43463.zip

and ran the commands below; all in all it took me at most 3 hours on 1 Azure node with 4x A100 80GB to run all 4 evals.
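
For reference, a rough sketch of setting the harness up from that archive (the directory name follows GitHub's archive naming convention; your exact environment steps may differ):

unzip b281b0921b636bc36ad05c0b0b0763bd6dd43463.zip
cd lm-evaluation-harness-b281b0921b636bc36ad05c0b0b0763bd6dd43463
pip install -e .  # assumes a Python environment with a CUDA-enabled torch already installed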

I am assuming Hugging Face is running this Space on more than 1 node and more GPU cards, so why is it so slow to clear the queue? Is there something wrong with the setup, or is it just a plain hardware capacity issue?

Here are the commands I used:

python main.py \
--model=hf-causal-experimental \
--model_args="pretrained=psmathur/model_42_70b,use_accelerate=True" \
--tasks=truthfulqa_mc \
--num_fewshot=0 \
--batch_size=auto \
--no_cache \
--output_path=results/model_42_70b/model_42_70b_v1_truthfulqa_mc_0-shot.json

############################################

python main.py \
--model=hf-causal-experimental \
--model_args="pretrained=psmathur/model_42_70b,use_accelerate=True" \
--tasks=arc_challenge \
--num_fewshot=25 \
--batch_size=auto \
--no_cache \
--output_path=results/model_42_70b/model_42_70b_v1_arc_challenge_25-shot.json


############################################

python main.py \
--model=hf-causal-experimental \
--model_args="pretrained=psmathur/model_42_70b,use_accelerate=True" \
--tasks=hellaswag \
--num_fewshot=10 \
--batch_size=auto \
--no_cache \
--output_path=results/model_42_70b/model_42_70b_v1_hellaswag_10-shot.json

############################################

python main.py \
--model=hf-causal-experimental \
--model_args="pretrained=psmathur/model_42_70b,use_accelerate=True" \
--tasks="hendrycksTest-*" \
--num_fewshot=5 \
--batch_size=auto \
--no_cache \
--output_path=results/model_42_70b/model_42_70b_v1_mmlu_5-shot.json
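
For convenience, the same four runs can also be scripted as a single loop. This is just a sketch; the task/shot pairs are exactly the ones above, but note that the last output file is named after the hendrycksTest-* task pattern rather than "mmlu":

for spec in truthfulqa_mc:0 arc_challenge:25 hellaswag:10 "hendrycksTest-*:5"; do
  task="${spec%:*}"    # text before the colon
  shots="${spec#*:}"   # text after the colon
  python main.py \
    --model=hf-causal-experimental \
    --model_args="pretrained=psmathur/model_42_70b,use_accelerate=True" \
    --tasks="$task" \
    --num_fewshot="$shots" \
    --batch_size=auto \
    --no_cache \
    --output_path="results/model_42_70b/model_42_70b_v1_${task}_${shots}-shot.json"
done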

Here is the current Pending Queue:
[Screenshot 2023-08-08 at 7.47.05 PM.png]

Here are all my models which have been pending evaluation for the last 5-6 days:
[Screenshot 2023-08-08 at 7.42.28 PM.png]

Open LLM Leaderboard org
edited Aug 9, 2023

Hi!
We are running all models in the order they have been submitted, and we have been re-running some models which failed last week (these usually do not appear in the pending queue).
By the way, we run each model 1) on one node (with a bit of optimization depending on size) and 2) on the spare nodes of our cluster, which means the leaderboard sometimes slows down a little.

Thank you for your patience!

clefourrier changed discussion status to closed

Thanks a lot @clefourrier for the swift response and the details about how models are run on each node, why it can sometimes be slow, and how models running in the background sometimes don't show up in the pending queue.
Highly appreciated.

I have another question. I think it may just be related to background eval runs not showing up in the Gradio UI pending evaluation view, but I want to confirm this theory.

I submitted "psmathur/orca_mini_v3_7b" before "psmathur/orca_mini_v3_13b" and It does show that I submitted this model already for evals. However, I am not seeing "psmathur/orca_mini_v3_7b" anywhere in pending, current or finished evaluations, whereas "psmathur/orca_mini_v3_13b" the one I submitted later is already in current evaluation. please see the attached screenshots.
Any response would be highly appreciated.

[Screenshot 2023-08-09 at 6.53.52 PM.png]

[Screenshot 2023-08-09 at 6.55.32 PM.png]

[Screenshot 2023-08-09 at 6.55.47 PM.png]

pankajmathur changed discussion status to open
Open LLM Leaderboard org

If a model has completely disappeared from all queues, then it likely failed; you can check this in the requests dataset.
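
For example, the status field in the raw request file can be checked directly. This is just a sketch: it assumes the usual /resolve/ raw-file URL on the Hub and the <model>_eval_request_..._Original.json filename pattern used in that dataset:

# exact filename assumed from the other request files linked in this thread
curl -sL "https://huggingface.co/datasets/open-llm-leaderboard/requests/resolve/main/psmathur/orca_mini_v3_7b_eval_request_False_float16_Original.json" \
  | python -c "import json, sys; print(json.load(sys.stdin).get('status'))"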

clefourrier changed discussion status to closed

Thanks @clefourrier, yes, the requests repo logs show that "psmathur/orca_mini_v3_7b" failed. However, when I ran all the evals locally after re-downloading from the HF repo psmathur/orca_mini_v3_7b, all evals worked fine.
I updated the model card with them too. Could you give me a little more background on what the failure was related to and what I can do? I can't resubmit the model, as it throws an error that the model was already submitted.

P.S. It is happening with 4 of my other models too; evals run fine locally after downloading from the HF repo, but the open LLM LB requests repo shows FAILED for these models:

psmathur/orca_mini_v3_70b
psmathur/model_007_v2
psmathur/model_101
psmathur/model_007_13b

Please advise 🙏

pankajmathur changed discussion status to open
Open LLM Leaderboard org

Hi!
It would appear to be on our end; I'm relaunching your models!

Thank you, much appreciated.

@clefourrier I also noticed another weird issue with my earlier model submissions: psmathur/model_007_13b was submitted on 2023-08-06 and was evaluated successfully with Finished status on 2023-08-09. It shows up in the Finished Evaluations tab, but it doesn't show up on the LB. Here is the "Finished" JSON from requests:
https://huggingface.co/datasets/open-llm-leaderboard/requests/blob/main/psmathur/model_007_13b_eval_request_False_float16_Original.json

Any idea what happened?

Open LLM Leaderboard org

Hi @psmathur
The model psmathur/model_007_13b is one of the models you had asked me to relaunch above, so it got evaluated successfully during the night and you can find the results here.
It's possible that the results have not been propagated to the leaderboard yet, as it's not an instantaneous process.
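
If you want to check whether the result files have landed, you can list the results dataset via the Hub API. A sketch, assuming the results live next to the requests dataset at open-llm-leaderboard/results and that the tree endpoint returns a JSON list of files:

curl -s "https://huggingface.co/api/datasets/open-llm-leaderboard/results/tree/main/psmathur" \
  | python -c "import json, sys; [print(f['path']) for f in json.load(sys.stdin)]"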

Gotcha, thanks for the explanation @clefourrier. I will wait for it to show up on the LB, although it's already been more than 12 hours, so agreed, not an instantaneous process.
Thanks again 🙏, I will close the ticket when all the submitted models show up on the LB.

Hi @clefourrier,

Hope you are doing well. Could you please give us any suggestions on what to do differently so that newly submitted models don't end up FAILED and instead go through the evaluation process smoothly? Around 40% of our model submissions FAILED, and we don't know what to do differently to avoid this.

We submitted these models last week, after successfully running the evals for ALL of them locally, but the open LLM LB requests repo showed FAILED on the first attempt. We then got a message that you would be running them in the background, but we don't know where that stands today.
The requests repo still shows FAILED in the logs. Should we resubmit? If so, please let us know how that is possible; and if we just have to wait, please advise how much longer.

Here is the list of models which show FAILED in the requests repo:
psmathur/orca_mini_v3_70b
psmathur/model_007_v2
psmathur/model_101
psmathur/model_007_13b_v2
psmathur/orca_mini_v3_7b
psmathur/model_009

I know this thread has been going back and forth for some time, but we really don't have a concrete and clear process here. I hope you understand our pain point.

Thanks,
Pankaj

Hi,

We've noticed that our model evaluations for the open_llm_leaderboard submission have been failing. For example, the psmathur/orca_mini_v3_7b entry in the requests repo shows FAILED again. Is this just us, or is it happening with other submissions too?
We can confirm that we have been able to successfully evaluate all of the models listed above remotely using the exact same script. Could you please investigate the issue and provide some insight into the cause?

Thank you.

Hi @clefourrier and team,

Could you please provide any updates on this? psmathur/orca_mini_v3_7b was submitted a week ago, and after multiple back-and-forth messages here, it is still not showing up on the LB. We really don't know what to do on our end; please advise.

Regards,
Pankaj

Open LLM Leaderboard org
edited Aug 16, 2023

Hi @psmathur !
Your model psmathur/orca_mini_v3_7b failed when offloading to disk - I relaunched it just in case, but you might have to check the weights on your side.

For most of the other requests, we had a connectivity issue over the weekend and results were not being sent back to the results dataset; it should be fixed now.
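
For the disk-offload failure, one way to check the weights locally is to load the model with accelerate's disk offload enabled. A minimal sketch, assuming transformers and accelerate are installed and using an illustrative offload directory:

python - <<'PY'
from transformers import AutoModelForCausalLM

# device_map="auto" with an offload_folder lets accelerate spill weights to disk
# when GPU/CPU memory runs out, which is the path the evaluation reportedly hit.
model = AutoModelForCausalLM.from_pretrained(
    "psmathur/orca_mini_v3_7b",
    device_map="auto",
    offload_folder="offload_tmp",  # hypothetical scratch directory
)
print(type(model).__name__, "loaded OK")
PY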

clefourrier changed discussion status to closed

Yeahh, looks like the relaunch worked. Thanks @clefourrier for your response; we really appreciate that the psmathur/orca_mini_v3_7b results are finally showing up on the LB.

Could you please relaunch these models too? They were also submitted last week, are still not showing up in the results dataset, and may have a similar issue:

  1. psmathur/orca_mini_v3_70b => https://huggingface.co/datasets/open-llm-leaderboard/requests/blob/main/psmathur/orca_mini_v3_70b_eval_request_False_float16_Original.json

  2. psmathur/model_101 => https://huggingface.co/datasets/open-llm-leaderboard/requests/blob/main/psmathur/model_101_eval_request_False_float16_Original.json

Thanks again for your time & hard work on resolving the issue.

Regards,
Pankaj

pankajmathur changed discussion status to open
Open LLM Leaderboard org

I changed their status!

clefourrier changed discussion status to closed
