Failed requests
Could you please tell me why the uploaded model keeps failing the evaluation? It has failed three evaluations in the past two weeks. Below is the evaluation status file from the Requests dataset. Thank you for your reply.
Hi @LiteAI-Team,
Thank you for bringing this to our attention. I've reviewed the git commits, and it appears that LiteAI/Hare-1.1B-base-0.9v failed the evaluation once. I've resubmitted it, and it should be fine now.
Please don't hesitate to open a discussion here if you notice any issues with a model evaluation in the future. We're here to assist you with resubmissions or to investigate the reasons behind any model failures.
Thank you for your reply, but unfortunately it has failed again, along with the other uploaded versions:
https://huggingface.co/datasets/open-llm-leaderboard/requests/blob/main/LiteAI/Hare-1.1B-base-0.5v_eval_request_False_float16_Original.json
https://huggingface.co/datasets/open-llm-leaderboard/requests/blob/main/LiteAI/Hare-1.1B-base-0.9v_eval_request_False_float16_Original.json
https://huggingface.co/datasets/open-llm-leaderboard/requests/blob/main/LiteAI/Hare-1.1B-base_eval_request_False_float16_Original.json
Hi @LiteAI-Team,
The first model is finished, but I resubmitted Hare-1.1B-base-0.9v and Hare-1.1B-base – I'll keep an eye on them while they are running.
I really appreciate it.
Could you please check whether the LiteAI/Hare-1.1B-base model was renamed? The request failed because the evaluation can't access the model. I would appreciate it if you could send a link to the model card.
I have already deleted it, and it would be perfect if the others could pass the evaluation.
I checked, and for LiteAI/Hare-1.1B-base there is already a finished evaluation here in the details – please note it.
In fact, the latest checkpoint is https://huggingface.co/datasets/open-llm-leaderboard/requests/blob/main/LiteAI/Hare1.0-Beta_eval_request_False_bfloat16_Original.json. Due to repeated evaluation failures, multiple versions have been uploaded to ensure we get a result. If this request can pass the evaluation, we will delete the other versions.
Thank you for your efforts. We acknowledge this result. After conducting local tests, we found that the addition of the token in the leaderboard evaluation code did not produce the desired effect, resulting in a discrepancy between the scores and the actual performance. Therefore, we removed the token and re-uploaded the model, hoping to achieve a more favorable outcome.
We have removed the majority of the redundant versions and would like to re-upload our latest evaluation request: https://huggingface.co/datasets/open-llm-leaderboard/requests/blob/main/LiteAI/Hare1.0-Beta_eval_request_False_bfloat16_Original.json. We sincerely appreciate the efforts of you and your team.
Hi @LiteAI-Team,
Please check the Leaderboard, LiteAI/Hare1.0-Beta has finished the evaluation. Can I help you with your other models?
Yes, if possible, I would like to change the bf16 parameter in the LiteAI/Hare1.0-Beta evaluation JSON config to fp16 and run it again, as the accuracy issue has caused discrepancies in the scores.
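(For context, the requested change amounts to flipping the precision field in the eval request JSON; the sketch below is purely illustrative. The field names "precision" and "status", the status value used to queue a re-run, and the file handling are assumptions about the request-file schema, and as this thread shows, the leaderboard maintainers handle such resubmissions rather than submitters.)

```python
import json

# Illustrative sketch only: what the requested bf16 -> fp16 change would look like
# in the eval request JSON. "precision" and "status" are assumed field names, and
# "PENDING" is an assumed value used to queue a re-run; the real schema may differ.
with open("Hare1.0-Beta_eval_request_False_bfloat16_Original.json") as f:
    request = json.load(f)

request["precision"] = "float16"   # previously "bfloat16"
request["status"] = "PENDING"      # assumed value to put the request back in the queue

# The precision is also encoded in the request file name, so a new file is written.
with open("Hare1.0-Beta_eval_request_False_float16_Original.json", "w") as f:
    json.dump(request, f, indent=2)
```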
Additionally, while verifying the scores, we found that the scores produced by GPQA do not match the ones displayed on the leaderboard after applying regex. The scores provided by the backend are as follows:
"leaderboard_gpqa_diamond": {
"acc_norm,none": 0.26262626262626265,
"acc_norm_stderr,none": 0.031353050095330834,
"alias": " - leaderboard_gpqa_diamond"
},
"leaderboard_gpqa_extended": {
"acc_norm,none": 0.22344322344322345,
"acc_norm_stderr,none": 0.01784316739379994,
"alias": " - leaderboard_gpqa_extended"
},
"leaderboard_gpqa_main": {
"acc_norm,none": 0.29017857142857145,
"acc_norm_stderr,none": 0.021466115440571226,
"alias": " - leaderboard_gpqa_main"
}
After applying regex, it should be around 2.x, but the leaderboard shows 0.67.
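(For reference, the "around 2.x" figure can be reproduced with the minimal sketch below, assuming the leaderboard rescales each GPQA sub-task between the 0.25 random-guess baseline and 1.0, treats sub-baseline scores as 0, and then averages the three sub-tasks; the exact formula is an assumption based on the leaderboard's published normalization approach, not taken from this thread.)

```python
# Minimal sketch of the assumed GPQA score normalization: rescale acc_norm between
# the 4-choice random baseline (0.25) and 1.0, clamp sub-baseline scores to 0,
# then average the three sub-tasks. Not the official leaderboard code.

raw_scores = {
    "leaderboard_gpqa_diamond": 0.26262626262626265,
    "leaderboard_gpqa_extended": 0.22344322344322345,
    "leaderboard_gpqa_main": 0.29017857142857145,
}

def normalize(value, lower_bound=0.25, higher_bound=1.0):
    # Scores at or below the random baseline normalize to 0.
    return max(value - lower_bound, 0.0) / (higher_bound - lower_bound) * 100

normalized = {name: normalize(v) for name, v in raw_scores.items()}
gpqa_average = sum(normalized.values()) / len(normalized)

for name, score in normalized.items():
    print(f"{name}: {score:.2f}")
print(f"GPQA (normalized average): {gpqa_average:.2f}")
# diamond ≈ 1.68, extended = 0.00, main ≈ 5.36, average ≈ 2.35
```

Under these assumptions the average comes out near 2.3, consistent with the "around 2.x" expectation above.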
Finally, thank you once again for your efforts!