Spaces: autoevaluate/model-evaluator

I can't choose a model to evaluate #11

by BDas - opened

Hello team, I can't choose a model to evaluate on my own dataset. Am I missing a metric that I need to add to the README file?
Can you support me on this?
screenshot:

Hi @BDas, thanks for trying out the evaluation tool!

To evaluate a model on your own dataset, the dataset needs to be listed under the datasets tag in the model card metadata. For example, here is how we linked the computer vision models to the dog_food dataset in the blog post:

The only exception to this rule is summarization and question answering datasets / models. There, we allow any dataset-model pair to be evaluated, as these tasks are more flexible than classification, NER, etc.
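The rule described above can be sketched roughly as follows. This is only an illustration of the stated behaviour, not the evaluator's actual code: the `ModelCard` class, the `is_compatible` and `compatible_models` helpers, the task names, and all model and dataset ids are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List

# Tasks where any dataset-model pair may be evaluated.
# The exact task names are an assumption made for this sketch.
FLEXIBLE_TASKS = {"summarization", "question-answering"}


@dataclass
class ModelCard:
    """Hypothetical, simplified view of a model card's metadata."""
    model_id: str
    # Mirrors the `datasets` tag in the model card metadata.
    datasets: List[str] = field(default_factory=list)


def is_compatible(card: ModelCard, dataset_id: str, task: str) -> bool:
    """Return True if the model should be offered for this dataset."""
    if task in FLEXIBLE_TASKS:
        return True  # exception: any pairing is allowed for these tasks
    return dataset_id in card.datasets  # otherwise the dataset must be listed


def compatible_models(cards: List[ModelCard], dataset_id: str, task: str) -> List[str]:
    """Filter a list of model cards down to the compatible model ids."""
    return [c.model_id for c in cards if is_compatible(c, dataset_id, task)]


# Hypothetical model cards and dataset ids, for illustration only.
cards = [
    ModelCard("user-a/model-with-tag", datasets=["my-org/my-dataset"]),
    ModelCard("user-b/model-without-tag", datasets=["some-other/dataset"]),
]

print(compatible_models(cards, "my-org/my-dataset", "text-classification"))
print(compatible_models(cards, "my-org/my-dataset", "summarization"))
```

Under these assumptions, only the model whose card lists the dataset shows up for a classification task, while both models show up for summarization.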

Hi @lewtun and team, great stuff!

I finally found some time to test this feature with Rubrix.

My idea was to evaluate how well (or most probably how badly) some models work with out-of-distribution data.

For that, I've labeled a small dataset for stance detection in tweets about climate change, specifically around wildfires (https://huggingface.co/datasets/rubrix/wildfire_tweets). This dataset follows the label scheme and general annotation guidelines for the stance-climate tweeteval dataset, but the dataset is potentially out of distribution for a number of reasons (recency, etc.).

Now I wanted to evaluate this model (https://huggingface.co/cardiffnlp/twitter-roberta-base-stance-climate/discussions/1). I understand I needed to add my dataset to the model card metadata (I opened a discussion there).

My questions are:

1. Are there any plans to allow for more flexibility so users can evaluate models with new datasets (such as the one I've just created), or is it by design that the dataset must be added to the model card? My thinking is that users might want to build their own datasets and evaluate different models.
2. At least for my use case, it feels weird to add the dataset to the list in the model card since, per the docs, these are datasets that have been used to train the model.

Anyway, not a big deal but I wanted to share some of my initial thoughts on this great feature!

Hey @dvilasuero, thanks for sharing this great feedback!

> Are there any plans to allow for more flexibility so users can evaluate models with new datasets (such as the one I've just created) or it is by design that there's a need to add the dataset to the model card? My thinking is that users might want to build their own datasets and evaluate different models.

Indeed, this is a limitation of the current design. We chose this approach to ensure that the model <--> dataset correspondence made sense for tasks like text classification or NER, where the labels are fixed by the dataset. In other words, since most text classifiers / NER taggers can't be evaluated on datasets they weren't trained on, we decided to exclude them from the list of compatible models associated with a dataset.

Having said that, your use case is a perfectly valid one! One simple solution would be to allow users to toggle the list of compatible models to something like "Show all models" and then leave it to the user's discretion to pick the appropriate model(s). Would that be a better UX in your opinion?

Thanks Lewis!

Absolutely agree that supervised text/token classification models are usually tied to their original training sets, and for sure to the original label scheme. I also think some more general (sentiment, stance, emotion, topic, etc.) or even few/zero-shot models could be evaluated more broadly with datasets other than their original training sets.

> One simple solution would be to allow users to toggle the list of compatible models to something like "Show all models" and then leave it to the user's discretion to pick the appropriate model(s). Would that be a better UX in your opinion?

That sounds good! I understand that this might raise some issues and leave the door open to loads of incompatible choices. At the very least, users should provide a mapping from model output labels to dataset label ids (as in the evaluator.compute function).
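A minimal sketch of what such a mapping could look like, in plain Python. The label names, the mapping, and the `map_predictions` and `accuracy` helpers are hypothetical and chosen only for illustration. As far as I know, the text classification evaluator in the evaluate library takes a similar dictionary through a `label_mapping` argument to `compute`, but this sketch does not call that library.

```python
from typing import Dict, List


def map_predictions(pred_labels: List[str], label_mapping: Dict[str, int]) -> List[int]:
    """Translate model output labels into dataset label ids."""
    unknown = sorted(set(pred_labels) - set(label_mapping))
    if unknown:
        # An incomplete mapping signals an incompatible model/dataset pair.
        raise ValueError(f"No dataset label id for model labels: {unknown}")
    return [label_mapping[label] for label in pred_labels]


def accuracy(pred_ids: List[int], gold_ids: List[int]) -> float:
    """Fraction of predictions that match the reference label ids."""
    if not gold_ids or len(pred_ids) != len(gold_ids):
        raise ValueError("Predictions and references must be non-empty and equal in length")
    return sum(p == g for p, g in zip(pred_ids, gold_ids)) / len(gold_ids)


# Hypothetical mapping from model output labels to dataset label ids.
label_mapping = {"LABEL_0": 0, "LABEL_1": 1, "LABEL_2": 2}

# Hypothetical model outputs and gold labels for four examples.
predictions = ["LABEL_2", "LABEL_0", "LABEL_1", "LABEL_2"]
references = [2, 0, 2, 2]

pred_ids = map_predictions(predictions, label_mapping)
print(accuracy(pred_ids, references))  # 3 of 4 match -> 0.75
```

Failing loudly on an unmapped label is one way a "Show all models" option could guard against obviously incompatible choices.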

Happy to discuss this further if you like!

I'd also like the 'show all models' feature. I've trained a translation model on ccmatrix (which does not have an official 'test' split) and would like to evaluate it with Helsinki-NLP/tatoeba_mt.