Which metric are you using to rank models?

#2
by OmarMorsli - opened

Hello, I would like to know which metric you are using to rank the models.

In the MTEB leaderboard we can see the ranking by task (e.g. retrieval, reranking, etc.) and the metric used.

For retrieval they use nDCG@10. Can you provide any data so I can get more detailed results for your benchmark?
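
For context, here is a minimal sketch of how nDCG@10 is usually computed (just an illustration of the metric, not the exact MTEB implementation):

```python
import numpy as np

def dcg_at_k(relevances, k=10):
    # Discounted cumulative gain over the top-k results (linear gain formulation).
    rel = np.asarray(relevances, dtype=float)[:k]
    discounts = np.log2(np.arange(2, rel.size + 2))
    return float(np.sum(rel / discounts))

def ndcg_at_k(relevances, k=10):
    # nDCG@k = DCG of the predicted ranking / DCG of the ideal (sorted) ranking.
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Graded relevance of the retrieved documents, in the order the model ranked them.
print(ndcg_at_k([3, 2, 3, 0, 1, 2], k=10))  # ~0.96
```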

Hi Omar,

From my understanding of how the MTEB benchmark was run, the evaluation was done only for the following tasks (classification, reranking, STS); cf. the Ordalie model card: https://huggingface.co/OrdalieTech/Solon-embeddings-large-0.1. It hasn't been benchmarked for retrieval tasks, as you mentioned.
They then averaged the scores to get the reported benchmark.
I think it is a good benchmark, but it would be better if we got a more comprehensive benchmark specifically for retrieval purposes.

Best,
Mahmoud

Hi Mahmoud,

Thanks for your input.
I think it would be appreciated if they could provide the benchmark results as raw data, or merge them into the MTEB leaderboard.

Omar,

Hi again,

I think they used the usual metrics for each task. From the MTEB leaderboard (https://huggingface.co/spaces/mteb/leaderboard), I extrapolated that they used the following (see the sketch after the list):

  • Accuracy for Classification
  • Mean Average Precision (MAP) for Reranking
  • Spearman correlation based on cosine similarity for STS
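
For illustration, here is a minimal sketch of these three metrics on toy data using scikit-learn and SciPy (my own example, not the actual MTEB evaluation code):

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.metrics import accuracy_score, average_precision_score

# Classification: plain accuracy over predicted labels.
y_true = [0, 1, 1, 2, 0]
y_pred = [0, 1, 2, 2, 0]
print("accuracy:", accuracy_score(y_true, y_pred))  # 0.8

# Reranking: average precision for one query, given binary relevance labels and
# the similarity scores the embedding model assigned to each candidate.
# MAP is the mean of this value over all queries.
relevance = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.4, 0.2]
print("AP:", average_precision_score(relevance, scores))  # ~0.83

# STS: Spearman correlation between gold similarity ratings and the cosine
# similarity of the two sentence embeddings of each pair.
def cosine(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

gold = [4.5, 1.0, 3.2, 2.0]          # human similarity ratings
embedding_pairs = [                  # toy 2-d embeddings for each sentence pair
    ([1.0, 0.0], [0.9, 0.1]),
    ([1.0, 0.0], [0.0, 1.0]),
    ([0.5, 0.5], [0.6, 0.4]),
    ([0.2, 0.8], [0.7, 0.3]),
]
pred = [cosine(a, b) for a, b in embedding_pairs]
rho, _ = spearmanr(gold, pred)
print("Spearman:", rho)
```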

Since there wasn't any benchmark on retrieval, nDCG@10 wasn't computed in the benchmark provided by Solon on their model card.

Please note that 5 of the 9 benchmarks are designed for classification, 2 of 9 for STS, and 2 of 9 for reranking. The scores were then averaged (equally or proportionally weighted?).
We can safely conclude that the benchmark shown on the model card indicates that Ordalie embeddings are, on average, better for these three tasks.
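
To illustrate why the weighting matters, here is a tiny example with made-up scores showing how averaging every dataset equally differs from averaging per task first (the numbers are hypothetical, not the actual results):

```python
# Hypothetical scores for 9 benchmarks grouped by task (made-up numbers).
scores = {
    "classification": [0.70, 0.72, 0.68, 0.75, 0.71],  # 5 datasets
    "sts":            [0.80, 0.78],                     # 2 datasets
    "reranking":      [0.65, 0.67],                     # 2 datasets
}

# Every dataset weighted equally (classification dominates with 5 of 9 datasets).
all_scores = [s for task_scores in scores.values() for s in task_scores]
per_dataset_avg = sum(all_scores) / len(all_scores)

# Every task weighted equally (average per task first, then across tasks).
per_task_avg = sum(sum(v) / len(v) for v in scores.values()) / len(scores)

print(f"per-dataset average: {per_dataset_avg:.3f}")  # ~0.718
print(f"per-task average:    {per_task_avg:.3f}")     # ~0.721
```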

It would be great if the Ordalie team, on top of the amazing work already done, could provide us with metrics on a retrieval benchmark (but I do not know if there are French datasets for that).

Regards,

Hi again, thanks.
I also figured it out and came to the same conclusion.
Yes, there are some French datasets for retrieval: Alloprof, BSARD, Mintaka, Syntec, and XPQA.
Best,
Omar

Ordalie Technologies org

Hi! Sorry for the late reply.
We used a few evaluation datasets from MTEB, MIRACL, and two of our own; for these, depending on the task, we use accuracy, MAP, and Pearson correlation.
However, we are now also referenced on the official French MTEB (by @lyon-nlp-group) at https://huggingface.co/spaces/mteb/leaderboard (the "french" tab), where the datasets and metrics are listed.
Cheers!
