"already evaluated" Bug? #1

by Tristan - opened

It looks like, when I select a dataset and model that have already been evaluated, I get a message saying that the evaluation has already happened. I think this is generally a good thing to do. But I get this message even if the metrics I select are different from last time. Is this intentional?

Great question! The current hash I use for determining whether a model has been evaluated is based on:

  • The task
  • The model ID
  • The dataset name
  • The dataset config
  • The dataset split

So in your case, the model is considered to be "evaluated" even if the user-selected metrics differ across jobs.
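To illustrate why the message shows up, here is a minimal sketch of a hash built from only the five fields listed above. The function name `evaluation_hash`, the use of SHA-256 over a JSON list, and all field values are assumptions for illustration, not the actual backend code.

```python
import hashlib
import json


def evaluation_hash(task, model_id, dataset_name, dataset_config, dataset_split):
    """Hypothetical fingerprint of an evaluation job (metrics are not included)."""
    # Serialise the fields as a JSON list so field boundaries stay unambiguous.
    payload = json.dumps([task, model_id, dataset_name, dataset_config, dataset_split])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Placeholder values; two jobs that differ only in the selected metrics.
job_fields = ("text-classification", "user/model", "some_dataset", "default", "test")
first_job = {"fields": job_fields, "metrics": ["accuracy"]}
second_job = {"fields": job_fields, "metrics": ["accuracy", "recall"]}

# The metrics never reach the hash, so both jobs produce the same value.
already_evaluated = (
    evaluation_hash(*first_job["fields"]) == evaluation_hash(*second_job["fields"])
)
```

Because the selected metrics are not part of the input, the second job is flagged as already evaluated even though it asks for an extra metric.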

I think the natural fix here is to add the list of metrics to the hash. However, this adds some complexity because we would end up with Hub PRs containing duplicate evaluation info (i.e. the default metrics like accuracy and F1 score would be appended to the model card twice).

So to make this clean, I think we'd need a mechanism in the backend that:

  • Checks which metrics have already been computed in previous jobs
  • Filters for just the new metrics
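As a rough sketch of the two steps above, assuming the backend can look up the names of metrics computed in previous jobs (the helper `filter_new_metrics` and the metric names are hypothetical):

```python
def filter_new_metrics(requested, already_computed):
    """Return the requested metrics that no previous job has computed.

    Keeps the order of the request and drops repeated names.
    """
    seen = set(already_computed)
    new_metrics = []
    for metric in requested:
        if metric not in seen:
            new_metrics.append(metric)
            seen.add(metric)  # also guards against duplicates within the request
    return new_metrics


# Defaults from an earlier job, plus one newly requested metric.
previously_computed = ["accuracy", "f1"]
requested = ["accuracy", "f1", "recall"]
to_compute = filter_new_metrics(requested, previously_computed)
```

With this kind of filter, only the new metric would be computed and appended to the model card, and an empty result would mean the job can be skipped entirely.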

cc @abhishek

We now include the list of user metrics in the hash used to detect previously evaluated models. This will catch the majority of duplicate jobs, but extra work is needed in the backend to handle cases like:

  • User A selects metric X
  • User B selects metric Y
  • User C selects metrics [X,Y]

In the current approach, the last scenario will be treated as a new evaluation, distinct from the previous two, so metrics X and Y will be computed and reported a second time.
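The scenario can be reproduced with a small sketch in which the metric list is part of the hash input. The function name, the SHA-256/JSON scheme, the sorting of metric names, and the field values are all assumptions for illustration, not the deployed implementation.

```python
import hashlib
import json


def evaluation_hash_with_metrics(
    task, model_id, dataset_name, dataset_config, dataset_split, metrics
):
    """Hypothetical job fingerprint that also covers the selected metrics."""
    # Sort and de-duplicate so [X, Y] and [Y, X] map to the same hash (an assumption).
    normalised_metrics = sorted(set(metrics))
    payload = json.dumps(
        [task, model_id, dataset_name, dataset_config, dataset_split, normalised_metrics]
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Placeholder job description shared by all three users.
base = ("text-classification", "user/model", "some_dataset", "default", "test")

hash_a = evaluation_hash_with_metrics(*base, ["X"])       # User A
hash_b = evaluation_hash_with_metrics(*base, ["Y"])       # User B
hash_c = evaluation_hash_with_metrics(*base, ["X", "Y"])  # User C

# User C's hash matches neither earlier job, so the job looks new,
# even though X and Y have each been computed already.
looks_new = hash_c not in {hash_a, hash_b}
```

Catching this case needs a per-metric check against earlier jobs rather than a single hash comparison.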