Failed requests

#888
by LiteAI-Team - opened

Could you please tell me why our uploaded model keeps failing the evaluation? It has failed three times in the past half month. Below is the evaluation status file from the Requests dataset. Thank you for your reply.

https://huggingface.co/datasets/open-llm-leaderboard/requests/blob/main/LiteAI/Hare-1.1B-base-0.9v_eval_request_False_float16_Original.json
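For reference, this is roughly how we inspect the status recorded in that request file. It is only an illustrative sketch using huggingface_hub, not official leaderboard tooling, and the field names ("status", "precision", "model") are simply what we see in the JSON:

# Illustrative sketch: download the request file from the requests dataset
# and print the recorded status (typically PENDING, RUNNING, FINISHED, or FAILED).
import json
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="open-llm-leaderboard/requests",
    filename="LiteAI/Hare-1.1B-base-0.9v_eval_request_False_float16_Original.json",
    repo_type="dataset",
)
with open(path) as f:
    request = json.load(f)

print(request.get("status"), request.get("precision"), request.get("model"))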

Open LLM Leaderboard org

Hi @LiteAI-Team ,

Thank you for bringing this to our attention. I’ve reviewed the git commits, and it appears that LiteAI/Hare-1.1B-base-0.9v failed the evaluation once. I’ve resubmitted it, and it should be fine now.

Please don’t hesitate to open a discussion here if you notice any issues with a model evaluation in the future. We’re here to assist you with resubmissions or to investigate the reasons behind any model failures.

alozowski changed discussion status to closed

Thank you for your reply, but unfortunately it has failed again, and so have the other uploaded versions:
https://huggingface.co/datasets/open-llm-leaderboard/requests/blob/main/LiteAI/Hare-1.1B-base-0.5v_eval_request_False_float16_Original.json
https://huggingface.co/datasets/open-llm-leaderboard/requests/blob/main/LiteAI/Hare-1.1B-base-0.9v_eval_request_False_float16_Original.json
https://huggingface.co/datasets/open-llm-leaderboard/requests/blob/main/LiteAI/Hare-1.1B-base_eval_request_False_float16_Original.json

alozowski changed discussion status to open
Open LLM Leaderboard org

Hi @LiteAI-Team ,

The first model has finished, but I resubmitted Hare-1.1B-base-0.9v and Hare-1.1B-base. I'll keep an eye on them while they are running.

I really appreciate it.

Open LLM Leaderboard org

Could you please check whether the LiteAI/Hare-1.1B-base model was renamed? The request failed because the model can't be accessed. I would appreciate it if you could send a link to the model card.
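(For context, a repo's reachability under the submitted name can be checked with huggingface_hub; the sketch below is only illustrative and is not the leaderboard's actual check:)

from huggingface_hub import model_info
from huggingface_hub.utils import RepositoryNotFoundError

try:
    # Succeeds only if the repo still exists under this exact name
    # and is accessible (public, or visible to your token).
    info = model_info("LiteAI/Hare-1.1B-base")
    print("Reachable:", info.id)
except RepositoryNotFoundError:
    print("Not found - the repo may have been renamed, made private, or deleted.")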

I have already deleted it; it would be great if the others could pass the evaluation.

Open LLM Leaderboard org

I checked, and for LiteAI/Hare-1.1B-base there is already a finished evaluation in the details, please note it.

In fact, the latest checkpoint is https://huggingface.co/datasets/open-llm-leaderboard/requests/blob/main/LiteAI/Hare1.0-Beta_eval_request_False_bfloat16_Original.json. Due to the repeated evaluation failures, we uploaded multiple versions to make sure we get a result. If Hare1.0-Beta passes the evaluation, we will delete the other versions.

Thank you for your efforts. We acknowledge this result. After running local tests, we found that adding the token in the leaderboard evaluation code did not have the desired effect, resulting in a discrepancy between the scores and the model's actual performance. We have therefore removed the token and re-uploaded the model, hoping for a better outcome.

We have removed most of the redundant versions and would like to resubmit our latest evaluation request: https://huggingface.co/datasets/open-llm-leaderboard/requests/blob/main/LiteAI/Hare1.0-Beta_eval_request_False_bfloat16_Original.json. We sincerely appreciate the efforts of you and your team.

Open LLM Leaderboard org
edited Aug 23

Hi @LiteAI-Team ,

Please check the Leaderboard, LiteAI/Hare1.0-Beta has finished its evaluation. Can I help you with your other models?
[Screenshot: leaderboard entry, 2024-08-23]

Yes, if possible, I would like to change the precision in the LiteAI/Hare1.0-Beta evaluation JSON config from bf16 to fp16 and run it again, as the numerical-precision difference seems to cause discrepancies in the scores.

Additionally, while verifying the scores, we found that the GPQA scores do not match what the leaderboard displays after applying the regex. The raw scores from the backend are as follows:
"leaderboard_gpqa_diamond": {
"acc_norm,none": 0.26262626262626265,
"acc_norm_stderr,none": 0.031353050095330834,
"alias": " - leaderboard_gpqa_diamond"
},
"leaderboard_gpqa_extended": {
"acc_norm,none": 0.22344322344322345,
"acc_norm_stderr,none": 0.01784316739379994,
"alias": " - leaderboard_gpqa_extended"
},
"leaderboard_gpqa_main": {
"acc_norm,none": 0.29017857142857145,
"acc_norm_stderr,none": 0.021466115440571226,
"alias": " - leaderboard_gpqa_main"
}
After applying the regex, the result should be around 2.x, but the leaderboard shows 0.67.
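I may be misreading how the aggregation works, but here is one way both numbers could arise. The GPQA subset sizes (diamond 198, extended 546, main 448) and the size-weighted aggregation are my own assumptions, with 0.25 as the random-guess baseline:

# Assumptions: subset sizes of 198 / 546 / 448 and a 0.25 random-guess baseline.
acc = {
    "diamond": 0.26262626262626265,
    "extended": 0.22344322344322345,
    "main": 0.29017857142857145,
}
sizes = {"diamond": 198, "extended": 546, "main": 448}  # assumed question counts
BASELINE = 0.25  # chance accuracy for 4-option multiple choice

def normalize(score):
    # Rescale to 0-100, clamping at the random baseline.
    return max(0.0, (score - BASELINE) / (1.0 - BASELINE)) * 100

# Normalizing each subset separately and averaging gives roughly 2.3.
per_subset = sum(normalize(a) for a in acc.values()) / len(acc)

# Normalizing a size-weighted aggregate accuracy gives roughly 0.67.
weighted_acc = sum(acc[k] * sizes[k] for k in acc) / sum(sizes.values())
aggregated = normalize(weighted_acc)

print(f"per-subset average: {per_subset:.2f}, aggregate: {aggregated:.2f}")

If that is indeed what happens, the 0.67 on the leaderboard and our ~2.x figure would just reflect two different aggregation orders rather than a scoring bug.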

Finally, thank you once again for your efforts!

LiteAI-Team changed discussion status to closed
LiteAI-Team changed discussion status to open
alozowski changed discussion status to closed
