Which metric are you using to rank models?

#2
by OmarMorsli - opened

Hello, I would like to know which metric you are using to rank the models.

In the MTEB leaderboard we can see the ranking by task (e.g. retrieval, reranking, etc.) and the metric used.

For retrieval they use nDCG@10. Can you provide any data so I can get more detailed results for your benchmark?
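
For context, here is a minimal sketch of how nDCG@10 is usually computed (just an illustration of the metric, not the exact MTEB implementation):

```python
import numpy as np

def dcg_at_k(relevances, k=10):
    # Discounted cumulative gain over the top-k results (linear gain formulation).
    rel = np.asarray(relevances, dtype=float)[:k]
    discounts = np.log2(np.arange(2, rel.size + 2))
    return float(np.sum(rel / discounts))

def ndcg_at_k(relevances, k=10):
    # nDCG@k = DCG of the predicted ranking / DCG of the ideal (sorted) ranking.
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Graded relevance of the retrieved documents, in the order the model ranked them.
print(ndcg_at_k([3, 2, 3, 0, 1, 2], k=10))  # ~0.96
```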

Hi Omar,

From my understanding of how the MTEB benchmark was run, the evaluation was done only for the following tasks (classification, reranking, STS); cf. the Ordalie model card: https://huggingface.co/OrdalieTech/Solon-embeddings-large-0.1. It hasn't been benchmarked for retrieval tasks, as you mentioned.
They then averaged the scores to get the reported benchmark.
I think it is a good benchmark, but it would be better if we got a more comprehensive benchmark specifically for retrieval purposes.

Best,
Mahmoud

Hi Mahmoud,

Thanks for your input.
I think it would be appreciated if they could provide the benchmark results as raw data, or merge them into the MTEB leaderboard.

Omar,

Hi again,

I think they used the usual metrics for each task. From the MTEB leaderboard (https://huggingface.co/spaces/mteb/leaderboard), I extrapolated that they used the following (see the sketch after the list):

  • Accuracy for Classification
  • Mean Average Precision (MAP) for Reranking
  • Spearman correlation based on cosine similarity for STS
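
For illustration, here is a minimal sketch of these three metrics on toy data using scikit-learn and SciPy (my own example, not the actual MTEB evaluation code):

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.metrics import accuracy_score, average_precision_score

# Classification: plain accuracy over predicted labels.
y_true = [0, 1, 1, 2, 0]
y_pred = [0, 1, 2, 2, 0]
print("accuracy:", accuracy_score(y_true, y_pred))  # 0.8

# Reranking: average precision for one query, given binary relevance labels and
# the similarity scores the embedding model assigned to each candidate.
# MAP is the mean of this value over all queries.
relevance = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.4, 0.2]
print("AP:", average_precision_score(relevance, scores))  # ~0.83

# STS: Spearman correlation between gold similarity ratings and the cosine
# similarity of the two sentence embeddings of each pair.
def cosine(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

gold = [4.5, 1.0, 3.2, 2.0]          # human similarity ratings
embedding_pairs = [                  # toy 2-d embeddings for each sentence pair
    ([1.0, 0.0], [0.9, 0.1]),
    ([1.0, 0.0], [0.0, 1.0]),
    ([0.5, 0.5], [0.6, 0.4]),
    ([0.2, 0.8], [0.7, 0.3]),
]
pred = [cosine(a, b) for a, b in embedding_pairs]
rho, _ = spearmanr(gold, pred)
print("Spearman:", rho)
```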

Since there wasn't any benchmark on retrieval, nDCG@10 wasn't computed in the benchmark provided by Solon on their model card.

Please note that 5 of the 9 benchmarks are designed for classification, 2 of 9 for STS, and 2 of 9 for reranking. The scores were then averaged (equally or proportionally weighted?).
We can safely conclude that the benchmark shown on the model card indicates that Ordalie embeddings are, on average, better for these three tasks.
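
To illustrate why the weighting matters, here is a tiny example with made-up scores showing how averaging every dataset equally differs from averaging per task first (the numbers are hypothetical, not the actual results):

```python
# Hypothetical scores for 9 benchmarks grouped by task (made-up numbers).
scores = {
    "classification": [0.70, 0.72, 0.68, 0.75, 0.71],  # 5 datasets
    "sts":            [0.80, 0.78],                     # 2 datasets
    "reranking":      [0.65, 0.67],                     # 2 datasets
}

# Every dataset weighted equally (classification dominates with 5 of 9 datasets).
all_scores = [s for task_scores in scores.values() for s in task_scores]
per_dataset_avg = sum(all_scores) / len(all_scores)

# Every task weighted equally (average per task first, then across tasks).
per_task_avg = sum(sum(v) / len(v) for v in scores.values()) / len(scores)

print(f"per-dataset average: {per_dataset_avg:.3f}")  # ~0.718
print(f"per-task average:    {per_task_avg:.3f}")     # ~0.721
```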

It would be great if the Ordalie team, on top of the amazing work already done, could provide us with metrics on a retrieval benchmark (but I do not know if there are French datasets for that).

Regards,

Hi again, thanks.
I also figured it out and came to the same conclusion.
Yes, there are some French datasets for retrieval: Alloprof, BSARD, Mintaka, Syntec, and XPQA.
Best,
Omar

Ordalie Technologies org

Hi! Sorry for the late reply.
We used a few evaluation datasets from MTEB, MIRACL, and two of our own; for these, depending on the task, we use accuracy, MAP, and Pearson correlation.
However, we are now also referenced on the official French MTEB (by @lyon-nlp-group) at https://huggingface.co/spaces/mteb/leaderboard (the "french" tab), where the datasets and metrics are listed.
Cheers!
