Weighting of metrics and inclusion of balanced/macro accuracy

#4 · opened by jack-etheredge

Hi,

Unless I missed it, I don't see an explanation of the relative weight of the metrics.
Additionally, https://github.com/BohemianVRA/FGVC-Competitions/blob/main/FungiCLEF2023/evaluate.py includes "Track 4: Classification Error with Special Cost for Unknown" but no accuracy metric, while the public leaderboard includes balanced (?) accuracy, as the dataset page suggests, but not the Track 4 metric.

Can you clarify both points?

Thank you.

An additional point of clarification regarding evaluation:
"Note that if the species distribution was the same in the test set, this would be equivalent to macro-averaged accuracy; but it is not the case of the FungiCLEF 2023 test set."
Should competitors assume the class distribution in the test set is closer to the validation set than it is to the training set (other than the unknown class, of course), or should we not make such an assumption?
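For reference, here is a minimal sketch of the difference between plain and balanced (macro-averaged) accuracy, using scikit-learn on made-up labels:

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Toy example with heavy class imbalance: class 0 dominates.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 2]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]  # always predicts the majority class

print(accuracy_score(y_true, y_pred))           # 0.80 -- dominated by the frequent class
print(balanced_accuracy_score(y_true, y_pred))  # ~0.33 -- per-class recall, averaged
```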

Bohemian Visual Recognition Alliance org

Hi @jack-etheredge ,

About the distribution: We have been continuously collecting data for 20 years, so we took advantage of that and sampled train/val/test subsets from different years; train = 2000-2020, val = 2021, and test = 2022. Therefore, there is a considerable domain shift. However, I would agree that the validation set is closer (distribution-wise) to the test set as there is a smaller time gap in the data. You can read much more about it in our report from last year.
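In other words, the split is purely temporal, along the lines of this sketch (the metadata file and column names here are only placeholders, not our actual pipeline):

```python
import pandas as pd

# Placeholder file and column names; the actual FungiCLEF metadata schema may differ.
meta = pd.read_csv("observations_metadata.csv")

train = meta[meta["year"].between(2000, 2020)]
val   = meta[meta["year"] == 2021]
test  = meta[meta["year"] == 2022]
```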

About the metrics: Good eye. We removed Metric 4 at the last minute, and I did not remove it from the description; I will update it later today. We removed it because it was poorly designed and allowed a very simple solution. Also, keep in mind that Accuracy and F1 are on the leaderboard just for "standard score verification."

One important note: you do not have to prepare one solution for all the metrics; you can use 3 different ones.

Best,
Lukas

Hi Lukas,
Thank you for the reply. The data collection and time-series element makes sense to me, but I'm still a bit confused about the evaluation metrics.
My assumption is that the final rank is based on a weighted average of the different evaluation metrics. Is that not true?
I don't understand how we could use 3 different solutions for 3 different metrics, since my understanding is that we will generate a submission.csv using a script and model that we provide. That script will then be run in an evaluation container. Since all the evaluation metrics will be calculated against that single submission.csv, I don't know how to reconcile that with the idea that we could use one solution per metric. Presumably my understanding is incomplete.
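For illustration, here is how I picture the evaluation step: a single submission.csv from which all metrics are computed (the column names and metric choices below are just placeholders, not the actual evaluate.py):

```python
import pandas as pd
from sklearn.metrics import balanced_accuracy_score, f1_score

# Placeholder column names; the real submission/ground-truth format may differ.
sub = pd.read_csv("submission.csv")    # observation_id, class_id
gt = pd.read_csv("ground_truth.csv")   # observation_id, class_id
merged = gt.merge(sub, on="observation_id", suffixes=("_true", "_pred"))

y_true, y_pred = merged["class_id_true"], merged["class_id_pred"]

# Every track metric would be computed from this same single set of predictions.
print(balanced_accuracy_score(y_true, y_pred))
print(f1_score(y_true, y_pred, average="macro"))
```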
Am I to understand from your statement that "Accuracy and F1 are on the leaderboard just for 'standard score verification'" that only the Track 1-3 metrics will matter for the final ranking?
Thanks again,
Jack

Bohemian Visual Recognition Alliance org

Hi Jack,

Oh, I got it now. There is nothing like a weighted average. These are three separate tracks, i.e., three leaderboards; it's just that Hugging Face is not built to display them separately.
We want participants to be creative and propose different ideas for different scenarios. All three scenarios are scientifically interesting as they reflect real and existing problems.

Hope this helps.

Best,
Lukas

On a similar note, could you clarify how the metrics in the Dataset description (numbered from 1 to 4) correspond to the Track 1-3 metrics? It sounds like "Standard Classification with 'unknown' category" is not one of the tracks. So is it:
Track 1: "Cost for confusing edible species for poisonous and vice versa."
Track 2: "A user-focused loss composed of both the classification error and the poisonous/edible confusion."
Track 3: "Increasing the weight of performance on rare species."
Or am I misunderstanding something?
Thanks

Bohemian Visual Recognition Alliance org

@glhr,

I just updated the description. In short, the fourth metric should not have been included in the description.

Track 1 -- Standard Classification with "unknown" category.
Track 2 -- Cost for confusing edible species for poisonous and vice versa.
Track 3 -- A user-focused loss composed of both the classification error and the poisonous/edible confusion.

Even though I mentioned earlier that there is no "primary track," you could consider Track 3 the most relevant for evaluating "general" performance.

Best,
Lukas
