Is the score computed by lm-eval-harness normalized?
The lm-evaluation-harness computes metrics for sub-tasks (e.g., leaderboard_gpqa_diamond/extended/main). Are these scores normalized using the method introduced in https://huggingface.co/docs/leaderboards/open_llm_leaderboard/normalization? If I am not mistaken, I only need to average the scores of all sub-tasks to get the metric for each task, correct?
Thank you for your assistance.
Hi @chenxiaobooo ,
Thank you for your question!
We only normalise scores for display on the Leaderboard, at the results-parsing stage, so lm-eval-harness itself doesn't apply the normalisation described there.
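For reference, here is a minimal sketch of the rescaling described in the linked normalisation docs (subtract the random baseline, then rescale to 0-100), applied to the GPQA sub-tasks you mention. The helper name and the sub-task scores below are illustrative only, not actual Leaderboard code or results:

```python
def normalize_within_range(raw_score: float, lower_bound: float, higher_bound: float = 1.0) -> float:
    """Map the random baseline to 0 and a perfect score to 100; clamp scores below baseline to 0."""
    if raw_score <= lower_bound:
        return 0.0
    return (raw_score - lower_bound) / (higher_bound - lower_bound) * 100


# GPQA sub-tasks are 4-way multiple choice, so the random baseline is 0.25.
# Scores here are made up for illustration.
gpqa_subtask_scores = {
    "leaderboard_gpqa_diamond": 0.31,
    "leaderboard_gpqa_extended": 0.29,
    "leaderboard_gpqa_main": 0.33,
}

normalized = [normalize_within_range(s, lower_bound=0.25) for s in gpqa_subtask_scores.values()]
task_score = sum(normalized) / len(normalized)  # average the normalised sub-task scores
print(f"Normalized GPQA score: {task_score:.2f}")
```

So yes, averaging the sub-task scores gives the per-task metric, but the Leaderboard averages the normalised sub-task scores, not the raw lm-eval-harness output.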
I'm closing this issue; feel free to ping me here if you have any other questions about normalisation, or please open a new discussion!