Model benchmarks degraded after re-evaluation

#1018
by Etherll - opened

Hey, after re-evaluating the model with use_chat_template enabled, the performance degrades a lot.
The model: Etherll/Qwen2.5-7B-della-test
Can we undo this?

Open LLM Leaderboard org

Hi @Etherll ,

I see that the model Etherll/Qwen2.5-7B-della-test includes a chat_template, so it's expected to be evaluated with use_chat_template = True for proper alignment with its intended use. However, I understand your concern regarding the performance drop. We plan to update the request file naming conventions and introduce the ability to distinguish between chat-template-based and non-chat-template-based evaluations of the same model. Currently, the system only considers the most recent evaluation, regardless of whether it used a chat template, which can lead to scenarios like this one.
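For reference, a quick way to confirm that a checkpoint ships a chat template, and to see how it reshapes prompts before they reach the model, is to inspect the tokenizer with transformers. This is just a minimal sketch; the example message is illustrative, not part of the leaderboard's actual evaluation harness:

```python
from transformers import AutoTokenizer

# Load the tokenizer for the model in question
tok = AutoTokenizer.from_pretrained("Etherll/Qwen2.5-7B-della-test")

# A non-empty chat_template means the leaderboard evaluates
# the model with use_chat_template=True
print(tok.chat_template is not None)

# Show exactly how the template wraps a user turn; prompts formatted
# this way can score differently from raw-text prompts
messages = [{"role": "user", "content": "What is 2 + 2?"}]  # hypothetical example message
print(tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))
```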

Thanks for bringing this up – we’ll keep you updated on the progress!

I'm closing this issue now, feel free to ping me here in case of any questions on this topic or please open a new discussion!

alozowski changed discussion status to closed
