Failed requests

#888
by LiteAI-Team - opened

Could you please tell me why our uploaded model keeps failing the evaluation? It has failed three times in the past half month. Below is the evaluation status file from the Requests dataset. Thank you for your reply.

https://huggingface.co/datasets/open-llm-leaderboard/requests/blob/main/LiteAI/Hare-1.1B-base-0.9v_eval_request_False_float16_Original.json
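For reference, this is roughly how we inspect the status recorded in that request file. It is only an illustrative sketch using huggingface_hub, not official leaderboard tooling, and the field names ("status", "precision", "model") are simply what we see in the JSON:

# Illustrative sketch: download the request file from the requests dataset
# and print the recorded status (typically PENDING, RUNNING, FINISHED, or FAILED).
import json
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="open-llm-leaderboard/requests",
    filename="LiteAI/Hare-1.1B-base-0.9v_eval_request_False_float16_Original.json",
    repo_type="dataset",
)
with open(path) as f:
    request = json.load(f)

print(request.get("status"), request.get("precision"), request.get("model"))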

Open LLM Leaderboard org

Hi @LiteAI-Team ,

Thank you for bringing this to our attention. I’ve reviewed the git commits, and it appears that LiteAI/Hare-1.1B-base-0.9v failed the evaluation once. I’ve resubmitted it, and it should be fine now.

Please don’t hesitate to open a discussion here if you notice any issues with a model evaluation in the future. We’re here to assist you with resubmissions or to investigate the reasons behind any model failures.

alozowski changed discussion status to closed

Thank you for your reply, but unfortunately it has failed again, and so have the other uploaded versions:
https://huggingface.co/datasets/open-llm-leaderboard/requests/blob/main/LiteAI/Hare-1.1B-base-0.5v_eval_request_False_float16_Original.json
https://huggingface.co/datasets/open-llm-leaderboard/requests/blob/main/LiteAI/Hare-1.1B-base-0.9v_eval_request_False_float16_Original.json
https://huggingface.co/datasets/open-llm-leaderboard/requests/blob/main/LiteAI/Hare-1.1B-base_eval_request_False_float16_Original.json

alozowski changed discussion status to open
Open LLM Leaderboard org

Hi @LiteAI-Team ,

The first model has finished, but I resubmitted Hare-1.1B-base-0.9v and Hare-1.1B-base. I'll keep an eye on them while they are running.

I really appreciate it.

Open LLM Leaderboard org

Could you please check whether the LiteAI/Hare-1.1B-base model was renamed? The request failed because the model can't be accessed. I would appreciate it if you could send a link to the model card.
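(For context, a repo's reachability under the submitted name can be checked with huggingface_hub; the sketch below is only illustrative and is not the leaderboard's actual check:)

from huggingface_hub import model_info
from huggingface_hub.utils import RepositoryNotFoundError

try:
    # Succeeds only if the repo still exists under this exact name
    # and is accessible (public, or visible to your token).
    info = model_info("LiteAI/Hare-1.1B-base")
    print("Reachable:", info.id)
except RepositoryNotFoundError:
    print("Not found - the repo may have been renamed, made private, or deleted.")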

I have already deleted it; it would be great if the others could pass the evaluation.

Open LLM Leaderboard org

I checked, and for LiteAI/Hare-1.1B-base there is already a finished evaluation in the details, please note it.

In fact, the latest checkpoint is https://huggingface.co/datasets/open-llm-leaderboard/requests/blob/main/LiteAI/Hare1.0-Beta_eval_request_False_bfloat16_Original.json. Due to the repeated evaluation failures, we uploaded multiple versions to make sure we get a result. If Hare1.0-Beta passes the evaluation, we will delete the other versions.

Thank you for your efforts. We acknowledge this result. After running local tests, we found that adding the token in the leaderboard evaluation code did not have the desired effect, resulting in a discrepancy between the scores and the model's actual performance. We have therefore removed the token and re-uploaded the model, hoping for a better outcome.

We have removed most of the redundant versions and would like to resubmit our latest evaluation request: https://huggingface.co/datasets/open-llm-leaderboard/requests/blob/main/LiteAI/Hare1.0-Beta_eval_request_False_bfloat16_Original.json. We sincerely appreciate the efforts of you and your team.

Open LLM Leaderboard org
edited Aug 23

Hi @LiteAI-Team ,

Please check the Leaderboard, LiteAI/Hare1.0-Beta has finished its evaluation. Can I help you with your other models?
[Screenshot: leaderboard entry, 2024-08-23]

Yes, if possible, I would like to change the precision in the LiteAI/Hare1.0-Beta evaluation JSON config from bf16 to fp16 and run it again, as the numerical-precision difference seems to cause discrepancies in the scores.

Additionally, while verifying the scores, we found that the GPQA scores do not match what the leaderboard displays after applying the regex. The raw scores from the backend are as follows:
"leaderboard_gpqa_diamond": {
"acc_norm,none": 0.26262626262626265,
"acc_norm_stderr,none": 0.031353050095330834,
"alias": " - leaderboard_gpqa_diamond"
},
"leaderboard_gpqa_extended": {
"acc_norm,none": 0.22344322344322345,
"acc_norm_stderr,none": 0.01784316739379994,
"alias": " - leaderboard_gpqa_extended"
},
"leaderboard_gpqa_main": {
"acc_norm,none": 0.29017857142857145,
"acc_norm_stderr,none": 0.021466115440571226,
"alias": " - leaderboard_gpqa_main"
}
After applying the regex, the result should be around 2.x, but the leaderboard shows 0.67.
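I may be misreading how the aggregation works, but here is one way both numbers could arise. The GPQA subset sizes (diamond 198, extended 546, main 448) and the size-weighted aggregation are my own assumptions, with 0.25 as the random-guess baseline:

# Assumptions: subset sizes of 198 / 546 / 448 and a 0.25 random-guess baseline.
acc = {
    "diamond": 0.26262626262626265,
    "extended": 0.22344322344322345,
    "main": 0.29017857142857145,
}
sizes = {"diamond": 198, "extended": 546, "main": 448}  # assumed question counts
BASELINE = 0.25  # chance accuracy for 4-option multiple choice

def normalize(score):
    # Rescale to 0-100, clamping at the random baseline.
    return max(0.0, (score - BASELINE) / (1.0 - BASELINE)) * 100

# Normalizing each subset separately and averaging gives roughly 2.3.
per_subset = sum(normalize(a) for a in acc.values()) / len(acc)

# Normalizing a size-weighted aggregate accuracy gives roughly 0.67.
weighted_acc = sum(acc[k] * sizes[k] for k in acc) / sum(sizes.values())
aggregated = normalize(weighted_acc)

print(f"per-subset average: {per_subset:.2f}, aggregate: {aggregated:.2f}")

If that is indeed what happens, the 0.67 on the leaderboard and our ~2.x figure would just reflect two different aggregation orders rather than a scoring bug.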

Finally, thank you once again for your efforts!

LiteAI-Team changed discussion status to closed
LiteAI-Team changed discussion status to open
alozowski changed discussion status to closed
