Is the score computed by lm-eval-harness normalized?
The lm-evaluation-harness computes metrics for sub-tasks (e.g., leaderboard_gpqa_diamond/extended/main). Are these scores normalized using the method introduced in https://huggingface.co/docs/leaderboards/open_llm_leaderboard/normalization? If I am not mistaken, I only need to average the scores of all sub-tasks to get the metric for each task, correct?
Thank you for your assistance.
Hi @chenxiaobooo ,
Thank you for your question!
We only normalise scores for display on the Leaderboard, at the results-parsing stage, so lm-eval-harness itself doesn't apply the normalisation described there.
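For reference, here is a minimal sketch of the rescaling described in the linked normalisation docs (subtract the random baseline, then rescale to 0-100), applied to the GPQA sub-tasks you mention. The helper name and the sub-task scores below are illustrative only, not actual Leaderboard code or results:

```python
def normalize_within_range(raw_score: float, lower_bound: float, higher_bound: float = 1.0) -> float:
    """Map the random baseline to 0 and a perfect score to 100; clamp scores below baseline to 0."""
    if raw_score <= lower_bound:
        return 0.0
    return (raw_score - lower_bound) / (higher_bound - lower_bound) * 100


# GPQA sub-tasks are 4-way multiple choice, so the random baseline is 0.25.
# Scores here are made up for illustration.
gpqa_subtask_scores = {
    "leaderboard_gpqa_diamond": 0.31,
    "leaderboard_gpqa_extended": 0.29,
    "leaderboard_gpqa_main": 0.33,
}

normalized = [normalize_within_range(s, lower_bound=0.25) for s in gpqa_subtask_scores.values()]
task_score = sum(normalized) / len(normalized)  # average the normalised sub-task scores
print(f"Normalized GPQA score: {task_score:.2f}")
```

So yes, averaging the sub-task scores gives the per-task metric, but the Leaderboard averages the normalised sub-task scores, not the raw lm-eval-harness output.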
I'm closing this issue; feel free to ping me here if you have any other questions about normalisation, or please open a new discussion!