This PR proposes to align the task name and type of the self-reported evaluation with the Hub taxonomy (i.e. the high-level tasks defined on hf.co/models).
The self-reported results will then become visible on this PwC leaderboard: https://paperswithcode.com/sota/summarization-on-samsum
Why don't you just group all the metrics under the same (task, dataset) tuple, then? Would be cleaner, no?
Yes, it would be cleaner that way, but self-reported evaluations rarely specify the dataset config / split that was used. This means you can't group the verified and self-reported metrics under a single (task, dataset) tuple.

A unique grouping would be something like (task, dataset_id, dataset_config, dataset_split) - I'll double-check whether the metadata_update() function from huggingface_hub that we use automatically groups along those fields.
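For reference, a rough sketch of what that grouping key could look like against the model-index metadata layout used on the Hub (the repo name, metric values, and the `grouping_key` helper are hypothetical; only the model-index field names follow the Hub's card metadata spec):

```python
# Hypothetical model-index payload, as it might be passed to
# huggingface_hub.metadata_update("username/my-model", metadata).
metadata = {
    "model-index": [
        {
            "name": "my-model",
            "results": [
                {
                    "task": {"type": "summarization", "name": "Summarization"},
                    "dataset": {
                        "name": "SAMSum",
                        "type": "samsum",    # dataset_id
                        "config": "samsum",  # dataset_config
                        "split": "test",     # dataset_split
                    },
                    "metrics": [
                        {"type": "rouge", "name": "ROUGE-1", "value": 45.0},
                    ],
                }
            ],
        }
    ]
}


def grouping_key(result: dict) -> tuple:
    """Candidate unique key: (task, dataset_id, dataset_config, dataset_split).

    config/split fall back to None for self-reported results that omit them,
    which is exactly why they can't be merged with verified results.
    """
    return (
        result["task"]["type"],
        result["dataset"]["type"],
        result["dataset"].get("config"),
        result["dataset"].get("split"),
    )


for result in metadata["model-index"][0]["results"]:
    print(grouping_key(result))
```

A self-reported result missing `config`/`split` would hash to `("summarization", "samsum", None, None)`, so it lands in a different group than the fully specified verified result above.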