Disaggregated Data
Is the disaggregated data for this available at all? That is, just raw data of the format {model, task, question_id, correct/not_correct}. It seems like it would contain a lot of information beyond the per-model/per-task averages.
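To illustrate the kind of information per-question records would carry, here is a minimal sketch. The records, model names, and helper functions are all hypothetical and the values are made up; only the {model, task, question_id, correct} shape comes from the question above.

```python
from collections import defaultdict

# Hypothetical records in the {model, task, question_id, correct} format.
# Model names and outcomes are invented for illustration only.
records = [
    {"model": "model-a", "task": "anatomy", "question_id": 0, "correct": True},
    {"model": "model-a", "task": "anatomy", "question_id": 1, "correct": False},
    {"model": "model-b", "task": "anatomy", "question_id": 0, "correct": False},
    {"model": "model-b", "task": "anatomy", "question_id": 1, "correct": True},
]


def task_averages(rows):
    """Collapse per-question records into per-(model, task) accuracy."""
    totals = defaultdict(lambda: [0, 0])  # key -> [num_correct, num_questions]
    for row in rows:
        key = (row["model"], row["task"])
        totals[key][0] += int(row["correct"])
        totals[key][1] += 1
    return {key: correct / count for key, (correct, count) in totals.items()}


def disagreements(rows, model_x, model_y):
    """List (task, question_id) pairs where exactly one of two models is correct."""
    by_question = defaultdict(dict)
    for row in rows:
        by_question[(row["task"], row["question_id"])][row["model"]] = row["correct"]
    return sorted(
        question
        for question, outcome in by_question.items()
        if model_x in outcome
        and model_y in outcome
        and outcome[model_x] != outcome[model_y]
    )


averages = task_averages(records)
differing = disagreements(records, "model-a", "model-b")
print(averages)
print(differing)
```

In this toy data both models get the same task average, yet they disagree on every question, which is exactly the detail that the averages alone hide.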
Do you mean the model generations/predictions?
Not even that, just the accuracy (or whatever metric) per question, rather than the average per task. I think those are all dumped as JSON if you use the Eleuther eval harness...
I am also looking for data. Currently I would like to see MMLU broken down by task.
We are not saving the individual results per question, but I think we do have the MMLU results broken down by task - we could add this to a later version of the leaderboard
Is there any particular reason not to make the dataset of results public? It seems like it would be useful to the community as a whole to better understand these models. It also seems like a waste of resources for multiple groups to run the same evaluation harness and not share the full results.
I made a suggestion on EleutherAI's repo to collaborate with Hugging Face (https://github.com/EleutherAI/lm-evaluation-harness/issues/662). Their GitHub repo does have MMLU results by task, along with some other results, but the number of models covered is currently pretty limited (https://github.com/EleutherAI/lm-evaluation-harness/tree/master/results).
When you talk about the dataset of results, are you talking about what's behind the leaderboard display, or about the detailed results we could generate?
I am talking about the data that is generated when you run the evaluation harness. I am not sure how you all are running the harness, but it can generate a json output like this one https://github.com/EleutherAI/lm-evaluation-harness/blob/master/results/llama/llama-30B/llama-30B_mmlu_5-shot.json. Sharing that data would be useful. Currently, I am most interested in the MMLU data, but I imagine the other evaluations have useful information beyond a single combined score as well.
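As a rough sketch of how such a file could be broken down by MMLU subtask: the layout assumed here (a top-level "results" object keyed by task name, a "hendrycksTest-" prefix for MMLU subtasks, and an "acc" metric) is an assumption about the harness output, and the inline sample with its numbers is made up rather than taken from the linked file.

```python
import json


def per_task_accuracy(report, prefix="hendrycksTest-", metric="acc"):
    """Return {subtask: score} for every task whose name starts with `prefix`.

    Assumes `report` looks like {"results": {task_name: {metric: value, ...}}}.
    """
    results = report.get("results", {})
    return {
        name[len(prefix):]: metrics[metric]
        for name, metrics in results.items()
        if name.startswith(prefix) and metric in metrics
    }


# Tiny inline stand-in for a harness output file; the numbers are invented.
sample = json.loads("""
{"results": {
  "hendrycksTest-abstract_algebra": {"acc": 0.25, "acc_stderr": 0.04},
  "hendrycksTest-anatomy": {"acc": 0.5, "acc_stderr": 0.04},
  "arc_challenge": {"acc": 0.75}
}}
""")

by_task = per_task_accuracy(sample)
for task, acc in sorted(by_task.items()):
    print(f"{task}: {acc:.2f}")
```

For a real file, `json.load(open(path))` would replace the inline sample, with `path` pointing at a downloaded results JSON.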
Oh yes, this is on our todo list! We are working on making our results repos public very soon!
Great! Thanks! 😁😁
Hi @CoreyMorris and @alexpeys! We released an upgrade of the leaderboard, and our results repos are now public!
Results are here, and detailed prompts are here