Disaggregated Data

#73
by alexpeys - opened

Is the disaggregated data for this available at all? That is, just raw data of the format {model, task, question_id, correct/not_correct}. It seems like it would contain a lot of information beyond the per-model/per-task averages.
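To illustrate the point, here is a minimal sketch assuming rows of the form (model, task, question_id, correct). The per-task averages can be recomputed from such rows, but the rows cannot be recovered from the averages. All model names, task names, and values below are hypothetical, made up for illustration only.

```python
from collections import defaultdict

# Hypothetical disaggregated records: (model, task, question_id, correct).
# Every name and value here is invented for illustration.
records = [
    ("model_a", "task_x", 0, True),
    ("model_a", "task_x", 1, False),
    ("model_a", "task_y", 0, True),
    ("model_b", "task_x", 0, True),
    ("model_b", "task_x", 1, True),
    ("model_b", "task_y", 0, False),
]


def per_task_accuracy(rows):
    """Aggregate per-question correctness into per-(model, task) accuracy."""
    totals = defaultdict(lambda: [0, 0])  # (model, task) -> [n_correct, n_total]
    for model, task, _question_id, correct in rows:
        totals[(model, task)][0] += int(correct)
        totals[(model, task)][1] += 1
    return {key: n_correct / n_total for key, (n_correct, n_total) in totals.items()}


averages = per_task_accuracy(records)
```

With the per-question rows in hand, one could also ask things the averages hide, such as which questions two models disagree on.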

Hugging Face H4 org

Do you mean the model generations/predictions?

Not even that, just the accuracy (or whatever metric) per question (rather than the average per task). I think those are all dumped as JSON if you use the Eleuther eval harness...

I am also looking for data. Currently I would like to see MMLU broken down by task.

Hugging Face H4 org

We are not saving the individual results per question, but I think we do have the MMLU results broken down by task - we could add this to a later version of the leaderboard

Is there any particular reason not to make the dataset of results public? It seems like it would be useful to the community as a whole to better understand these models. It also seems like a waste of resources for multiple groups to run the same evaluation harness and not share the full results.
I made a suggestion on EleutherAI's repo to collaborate with Hugging Face: https://github.com/EleutherAI/lm-evaluation-harness/issues/662
They do have some MMLU results by task, along with other results, in their GitHub repo, but the number of models covered is currently pretty limited: https://github.com/EleutherAI/lm-evaluation-harness/tree/master/results

Hugging Face H4 org

When you talk about the dataset of results, are you talking about what's behind the leaderboard display, or about the detailed results we could generate?

I am talking about the data that is generated when you run the evaluation harness. I am not sure how you all are running the harness, but it can generate JSON output like this one: https://github.com/EleutherAI/lm-evaluation-harness/blob/master/results/llama/llama-30B/llama-30B_mmlu_5-shot.json
Sharing that data would be useful. Currently, I am most interested in the MMLU data, but I imagine the other evaluations have useful information beyond a single combined score as well.
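As a rough sketch of how such a file could be used, the snippet below pulls one metric per task out of a harness-style report. The layout is an assumption, not something confirmed in this thread: a top-level "results" object mapping task names to metric dictionaries. The task names and numbers in the sample are invented, and `task_metric` is a hypothetical helper.

```python
import json

# Assumed layout (not verified against the linked file): a top-level
# "results" object mapping task names to metric dictionaries.
# The values below are invented for illustration.
sample = """
{
  "results": {
    "hendrycksTest-abstract_algebra": {"acc": 0.25, "acc_stderr": 0.04},
    "hendrycksTest-anatomy": {"acc": 0.5, "acc_stderr": 0.04}
  }
}
"""


def task_metric(report, metric="acc"):
    """Return {task_name: metric_value} for every task that reports the metric."""
    return {
        task: metrics[metric]
        for task, metrics in report.get("results", {}).items()
        if metric in metrics
    }


by_task = task_metric(json.loads(sample))

# The single combined score is just the unweighted mean of the per-task values.
mean_acc = sum(by_task.values()) / len(by_task)
```

For a real file, `json.load` on the opened file would replace `json.loads(sample)`.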

Hugging Face H4 org

Oh yes, this is on our todo list! We are working on making our results repos public very soon!

Great! Thanks! 😁😁

Hugging Face H4 org

Hi @CoreyMorris and @alexpeys! We released an upgrade of the leaderboard, and our results repos are now public!
Results are here, and detailed prompts are here

clefourrier changed discussion status to closed
