Is the score computed by lm-eval-harness normalized?

#1011
by chenxiaobooo - opened

As shown in the image below, lm-evaluation-harness reports metrics for each sub-task (e.g., leaderboard_gpqa_diamond/extended/main). Are these scores normalized using the method described at https://huggingface.co/docs/leaderboards/open_llm_leaderboard/normalization? If I am not mistaken, I only need to average the sub-task scores to get the metric for each task group, correct?

[Screenshot: per-sub-task scores reported by lm-evaluation-harness]

Thank you for your assistance.

chenxiaobooo changed discussion title from How to get the scores of task_group with sub-tasks? to Is the score computed by lm-eval-harness normalized?
Open LLM Leaderboard org

Hi @chenxiaobooo ,

Thank you for your question!

We normalise scores only for display on the Leaderboard, at the results-parsing stage, so lm-eval-harness itself doesn't apply the normalisation described there.
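For reference, here is a minimal sketch of the normalisation described in the linked docs, assuming the usual scheme for multiple-choice tasks (subtract the random-guess baseline and rescale to 0-100); the helper name and the GPQA example values are illustrative, not the exact Leaderboard parsing code:

```python
def normalize_within_range(raw_score: float, lower_bound: float) -> float:
    """Rescale a raw accuracy (0-1) so the random-guess baseline maps to 0
    and a perfect score maps to 100; scores below the baseline clamp to 0."""
    return max(0.0, (raw_score - lower_bound) / (1.0 - lower_bound)) * 100.0


# Example: GPQA has 4 answer choices, so the random baseline is 0.25.
# A raw accuracy of 0.40 from lm-eval-harness would display as 20.0.
print(normalize_within_range(0.40, lower_bound=0.25))  # -> 20.0
```

If I understand the pipeline correctly, the per-sub-task scores are normalised this way first and then averaged to produce the task-group score shown on the Leaderboard, so averaging the raw lm-eval-harness numbers will not match the displayed values.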

I'm closing this discussion, but feel free to ping me here if you have any other questions about normalisation, or open a new discussion!

alozowski changed discussion status to closed
