Raw results to normalized results
#825
by
Ilyasch2
- opened
I am trying to recover the normalized results from the raw results for some models on the leaderboard. For tasks that do not have subtasks, like GPQA and MMLU-PRO, it works by just subtracting the random-baseline score and remapping to (0, 1). However, for tasks with subtasks, like BBH and MUSR, I have tried several techniques, taking into account the number of samples per subtask, but I cannot find the right normalization. How can I recover MUSR from MUSR RAW?
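For reference, the simple single-task case described above can be sketched like this (the function name and the GPQA baseline of 1/4 are my assumptions, not from this thread):

```python
def normalize(raw_score, random_baseline):
    """Subtract the random-guess baseline and rescale to the [0, 1] range.

    Scores at or below the baseline are clamped to 0, since a model that
    does no better than random guessing should get no credit.
    """
    return max(0.0, (raw_score - random_baseline) / (1.0 - random_baseline))

# Hypothetical example: a 4-choice task (random baseline = 0.25)
normalize(0.4, 0.25)  # roughly 0.2
```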
Ilyasch2 changed discussion title from "Raw results to normalized results." to "Raw results to normalized results"
Hi @Ilyasch2 ,
To normalise results for tasks with subtasks in leaderboards like MUSR, you can follow this idea:
- Define a normalisation function, for instance:

def normalize_within_range(value, lower_bound, higher_bound):
    return (value - lower_bound) / (higher_bound - lower_bound)
- Calculate the lower bound for each subtask. The lower bound is the score a random baseline would get, i.e. the reciprocal of the number of choices for that subtask. For example, for MUSR:
MUSR murder mysteries: 2 choices (lower_bound = 0.5)
MUSR object placement: 5 choices (lower_bound = 0.2)
MUSR team allocation: 3 choices (lower_bound = 0.333)
You can find num_choices for other benchmarks here in the doc.
- For each subtask, normalise the raw scores. If the raw score is below the lower bound, it's normalized to 0. Otherwise, apply the normalisation function and scale it to a percentage:

if raw_score < lower_bound:
    normalized_score = 0
else:
    normalized_score = normalize_within_range(raw_score, lower_bound, 1) * 100
- Average the normalised scores across subtasks to obtain the overall normalised score for MUSR.
For more details, please check out our blog.
alozowski changed discussion status to closed