autoevaluate/model-evaluator · Evaluating the same model on multiple datasets produces too many metrics and makes the results hard to read

I've started evaluating an NLI model here: https://huggingface.co/MoritzLaurer/DeBERTa-v3-base-mnli-fever-anli
I've evaluated on 3 test splits so far, and for each split the automatic pull request adds results for 11 different metrics. We have around 5 NLI datasets on the hub, and several of them have multiple test splits. If a good model is evaluated on all of them, this leads to more than 55 metrics, which clutters the interface and makes it hard to see performance in one view. Could the number of default metrics be reduced? I think most NLI datasets use accuracy as the default metric, so we don't really need precision, recall and F1 (each in macro/micro/weighted variants) plus loss for every split. I feel like letting the user choose one or more metrics to evaluate on would be fine.
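
To make the numbers concrete, here is a rough sketch of the arithmetic and of what a user-selected metric filter could look like. The metric names, the `total_metrics` and `filter_metrics` helpers, and the placeholder values are all hypothetical and only illustrate the idea; this is not how model-evaluator is implemented.

```python
# Assumed breakdown of the 11 metrics per split: accuracy, loss, and
# precision/recall/F1 in macro, micro and weighted variants.
# These names are illustrative, not taken from the evaluator's output.
AVERAGES = ("macro", "micro", "weighted")
ALL_METRICS = ["accuracy", "loss"] + [
    f"{name}_{avg}" for name in ("precision", "recall", "f1") for avg in AVERAGES
]


def total_metrics(splits_per_dataset, metrics_per_split=len(ALL_METRICS)):
    """Count the metric entries added when every split is evaluated."""
    return sum(splits_per_dataset) * metrics_per_split


def filter_metrics(results, keep=("accuracy",)):
    """Keep only the metrics the user asked for (hypothetical helper)."""
    return {name: value for name, value in results.items() if name in keep}


# 5 datasets with a single test split each already give 55 entries;
# datasets with several test splits push the count higher.
print(total_metrics([1, 1, 1, 1, 1]))

# Placeholder values only, not real evaluation results.
dummy_results = {name: None for name in ALL_METRICS}
print(filter_metrics(dummy_results))
```

With a filter like this, each split would contribute a single accuracy entry instead of 11.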