Evaluating the same model on multiple datasets produces too many metrics, and results become difficult to read #18

by MoritzLaurer - opened

I've started evaluating an NLI model here: https://huggingface.co/MoritzLaurer/DeBERTa-v3-base-mnli-fever-anli
I've evaluated on 3 test splits so far, and for each split the automatic pull request adds evaluations on 11 different metrics. We have around 5 NLI datasets on the Hub, several of which have multiple test splits. If a good model is evaluated on all of them, this leads to more than 55 metrics, which clutters the interface and makes it hard to see performance in one view. Could the number of default metrics be reduced? I think most NLI datasets use accuracy as the default metric, so we don't really need precision, recall and F1 (each in macro/micro/weighted variants) plus loss for every one of them. I feel like letting the user choose a single metric (or a few) to evaluate on would be fine.
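
To make the count concrete, here is a rough sketch of the arithmetic. The metric names are my reconstruction of the 11 defaults from the list above (accuracy, three averaged variants each of precision, recall and F1, plus loss), not read from the evaluator itself, and `count_entries` / `keep_selected` are hypothetical helpers for illustration only.

```python
from itertools import product

# Assumed reconstruction of the 11 default metrics described in this post.
AVERAGED = [
    f"{name}_{avg}"
    for name, avg in product(("precision", "recall", "f1"),
                             ("macro", "micro", "weighted"))
]
DEFAULT_METRICS = ["accuracy", *AVERAGED, "loss"]


def count_entries(n_splits, metrics=DEFAULT_METRICS):
    """Number of metric rows shown when every metric is reported per split."""
    return n_splits * len(metrics)


def keep_selected(results, wanted=("accuracy",)):
    """Keep only the result rows whose metric the user asked for."""
    return [row for row in results if row["metric"] in wanted]


print(len(DEFAULT_METRICS))   # metrics per split
print(count_entries(3))       # the 3 splits evaluated so far
print(count_entries(5))       # 5 datasets with a single test split each
```

With one split per dataset, 5 datasets already give 55 rows, and extra test splits push it higher; restricting to accuracy would leave one row per split.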

Thanks for this really valuable feedback, @MoritzLaurer!

I agree that the UI gets quite cluttered when a model is evaluated on many datasets / splits. One thing we're discussing internally is whether this is best handled on the frontend. The main advantage of computing many metrics by default is that it enables broader downstream analysis in the future (e.g. if people want to compare models on metrics other than accuracy). I'll report back here if we have any updates on the UI side.