
Disaggregated Data

#73

by alexpeys - opened Jun 17, 2023

Discussion

alexpeys

Jun 17, 2023

Is the disaggregated data for this available at all? That is, just the raw data in the format {model, task, question_id, correct/not_correct}. It seems like it would contain a lot of information beyond the per-model/per-task averages.
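To make the request concrete, here is a minimal sketch of what per-question records in that shape would allow. The field names follow the format proposed above; the model names, task name, and values are hypothetical placeholders, not real leaderboard data. It shows two models with the same per-task average that nonetheless get different questions right, which is exactly the information an average hides.

```python
from collections import defaultdict

# Hypothetical records in the {model, task, question_id, correct} shape.
# Names and values are made up for illustration only.
records = [
    {"model": "model-a", "task": "task-x", "question_id": 0, "correct": True},
    {"model": "model-a", "task": "task-x", "question_id": 1, "correct": False},
    {"model": "model-b", "task": "task-x", "question_id": 0, "correct": False},
    {"model": "model-b", "task": "task-x", "question_id": 1, "correct": True},
]


def per_task_accuracy(rows):
    """Collapse per-question rows into the usual (model, task) averages."""
    totals = defaultdict(lambda: [0, 0])  # (model, task) -> [n_correct, n_total]
    for row in rows:
        key = (row["model"], row["task"])
        totals[key][0] += int(row["correct"])
        totals[key][1] += 1
    return {key: n_correct / n_total for key, (n_correct, n_total) in totals.items()}


def disagreements(rows, model_1, model_2):
    """List the (task, question_id) pairs where the two models differ."""
    by_model = defaultdict(dict)
    for row in rows:
        by_model[row["model"]][(row["task"], row["question_id"])] = row["correct"]
    shared = by_model[model_1].keys() & by_model[model_2].keys()
    return sorted(q for q in shared if by_model[model_1][q] != by_model[model_2][q])


averages = per_task_accuracy(records)
differing = disagreements(records, "model-a", "model-b")
```

Here both models average 0.5 on the task, yet they disagree on every question, so the per-question view carries information the averages cannot.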

clefourrier

Hugging Face H4 org Jun 20, 2023

Do you mean the model generations/predictions?

alexpeys

Jun 20, 2023

•

edited Jun 20, 2023

Not even that, just the accuracy (or whatever metric) per question, rather than the average per task. I think those are all dumped as JSON if you use the Eleuther eval harness...

CoreyMorris

Jul 5, 2023

I am also looking for data. Currently I would like to see MMLU broken down by task.

clefourrier

Hugging Face H4 org Jul 6, 2023

We are not saving the individual results per question, but I think we do have the MMLU results broken down by task. We could add this to a later version of the leaderboard.

CoreyMorris

Jul 6, 2023

Is there any particular reason not to make the dataset of results public? It seems like it would be useful to the community as a whole to better understand these models. It also seems like a waste of resources for multiple groups to run the same evaluation harness and not share the full results.
I made a suggestion on EleutherAI's repo to collaborate with Hugging Face: https://github.com/EleutherAI/lm-evaluation-harness/issues/662
They do have some MMLU results by task in their GitHub repo, along with other results, but the number of models covered is currently pretty limited: https://github.com/EleutherAI/lm-evaluation-harness/tree/master/results

clefourrier

Hugging Face H4 org Jul 7, 2023

When you talk about the dataset of results, are you talking about what's behind the leaderboard display, or about the detailed results we could generate?

CoreyMorris

Jul 7, 2023

I am talking about the data that is generated when you run the evaluation harness. I am not sure how you all are running the harness, but it can generate JSON output like this one: https://github.com/EleutherAI/lm-evaluation-harness/blob/master/results/llama/llama-30B/llama-30B_mmlu_5-shot.json
Sharing that data would be useful. Currently, I am most interested in the MMLU data, but I imagine the other evaluations have useful information beyond a single combined score as well.
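As a rough sketch of how such a file could be broken down by task: the snippet below assumes the harness output has a top-level "results" object mapping each task name to a dictionary of metrics such as "acc". That layout, the task names, and the numbers in the inline sample are all assumptions for illustration; they are not copied from the linked file.

```python
import json

# Inline stand-in for a harness results file. The structure is an assumption
# and the numbers are placeholders, not real evaluation results.
sample_json = """
{
  "results": {
    "hendrycksTest-abstract_algebra": {"acc": 0.25, "acc_stderr": 0.04},
    "hendrycksTest-anatomy": {"acc": 0.5, "acc_stderr": 0.04}
  }
}
"""


def per_task_metric(raw, metric="acc"):
    """Return {task_name: metric_value} for every task that reports the metric."""
    data = json.loads(raw)
    return {
        task: metrics[metric]
        for task, metrics in data.get("results", {}).items()
        if metric in metrics
    }


scores = per_task_metric(sample_json)

# The single combined score is just the unweighted mean of the per-task values here.
mean_acc = sum(scores.values()) / len(scores)

for task, acc in sorted(scores.items()):
    print(f"{task}: {acc:.2f}")
print(f"mean: {mean_acc:.3f}")
```

To use it on a real file, the string would be replaced with the file's contents, for example via `open(path).read()`, after checking that the file actually follows the assumed layout.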

clefourrier

Hugging Face H4 org Jul 7, 2023

Oh yes, this is on our todo list! We are working on making our results repos public very soon!

CoreyMorris

Jul 7, 2023

Great! Thanks! 😁😁

clefourrier

Hugging Face H4 org Jul 13, 2023

Hi @CoreyMorris and @alexpeys! We released an upgrade of the leaderboard, and our results repos are now public!
Results are here, and detailed prompts are here

clefourrier changed discussion status to closed Jul 13, 2023
