Compared with the "e5-base" model, what is the main update in this "e5-base-v2" version?

by Zihao - opened


Nothing fundamentally new: the v2 models are simply pre-trained on a larger and more diverse collection of text pair datasets.

Hi @intfloat, does this repo contain the unsupervised weights (Table 1 in the paper), or the weights from after fine-tuning on MS MARCO (Table 2)?

@bergum We do not have plans to release unsupervised weights for the v2 models. Embedding models without supervised fine-tuning do not perform very well and are not suitable for out-of-the-box use. If you'd like to fine-tune from the unsupervised ones, you can build on https://huggingface.co/intfloat/e5-base-unsupervised (small and large versions are also available).
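
If you do start from the unsupervised checkpoint, here is a minimal sketch of embedding text with it. It assumes the checkpoint follows the same conventions as the released E5 models ("query: "/"passage: " prefixes, average pooling over the last hidden state, L2 normalization); treat it as an illustration, not an official usage snippet:

```python
# Minimal sketch: embed text with the unsupervised E5 checkpoint.
# Assumes the same conventions as the released E5 models:
# "query: "/"passage: " prefixes, average pooling, L2 normalization.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("intfloat/e5-base-unsupervised")
model = AutoModel.from_pretrained("intfloat/e5-base-unsupervised")
model.eval()

def embed(texts):
    batch = tokenizer(texts, padding=True, truncation=True,
                      max_length=512, return_tensors="pt")
    with torch.no_grad():
        last_hidden = model(**batch).last_hidden_state
    # Average-pool over non-padding tokens, then L2-normalize.
    mask = batch["attention_mask"].unsqueeze(-1).float()
    emb = (last_hidden * mask).sum(dim=1) / mask.sum(dim=1)
    return F.normalize(emb, p=2, dim=1)

q = embed(["query: how do dense retrievers work"])
p = embed(["passage: Dense retrievers encode queries and documents into vectors."])
print((q @ p.T).item())  # cosine similarity, since embeddings are normalized
```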

Thanks for confirming, @intfloat .

I'm asking because I can't reproduce the BEIR results reported in the paper, or anything close to them. That could be explained if the wrong weights had been uploaded by mistake.

With e5-base-v2 on TREC-COVID, I get 0.69633 ndcg_at_10, which is well below the 0.79 reported in the paper (a very good result for a dense model on TREC-COVID).

Edit:
Note that this run was on CPU (I haven't tested on GPU yet), and I have only tested TREC-COVID.

```
python3 mteb_beir_eval.py --model-name-or-path intfloat/e5-base-v2
...
[2023-05-30 01:16:30,748 INFO] Evaluation for TRECCOVID on test took 89452.90 seconds
[2023-05-30 01:16:30,748 INFO] Scores: {'ndcg_at_1': 0.75, 'ndcg_at_3': 0.74397, 'ndcg_at_5': 0.73222, 'ndcg_at_10': 0.69633, 'ndcg_at_100': 0.52017, 'ndcg_at_1000': 0.48872, 'map_at_1': 0.00215, 'map_at_3': 0.00602, 'map_at_5': 0.00968, 'map_at_10': 0.01753, 'map_at_100': 0.09263, 'map_at_1000': 0.23437, 'recall_at_1': 0.00215, 'recall_at_3': 0.0065, 'recall_at_5': 0.01057, 'recall_at_10': 0.01961, 'recall_at_100': 0.12825, 'recall_at_1000': 0.46435, 'precision_at_1': 0.84, 'precision_at_3': 0.8, 'precision_at_5': 0.784, 'precision_at_10': 0.74, 'precision_at_100': 0.5326, 'precision_at_1000': 0.21844, 'mrr_at_1': 0.84, 'mrr_at_3': 0.91333, 'mrr_at_5': 0.91333, 'mrr_at_10': 0.91333, 'mrr_at_100': 0.91333, 'mrr_at_1000': 0.91333, 'evaluation_time': 89452.9}
```
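
For anyone else trying to reproduce this: the run above boils down to something like the sketch below. The wrapper class is illustrative (it is not the actual mteb_beir_eval.py script), and it assumes an mteb version whose retrieval evaluator dispatches to encode_queries/encode_corpus when a model provides them:

```python
# Illustrative sketch of the TRECCOVID evaluation, not the actual
# mteb_beir_eval.py script. Assumes mteb's retrieval evaluator uses
# encode_queries/encode_corpus when a model provides them.
from mteb import MTEB
from sentence_transformers import SentenceTransformer

class E5Wrapper:
    """Adds the "query: "/"passage: " prefixes that E5 models expect."""

    def __init__(self, model_name):
        self.model = SentenceTransformer(model_name)

    def encode_queries(self, queries, **kwargs):
        return self.model.encode(["query: " + q for q in queries], **kwargs)

    def encode_corpus(self, corpus, **kwargs):
        # BEIR corpus entries are dicts with "title" and "text" fields.
        texts = ["passage: " + (doc.get("title", "") + " " + doc["text"]).strip()
                 for doc in corpus]
        return self.model.encode(texts, **kwargs)

evaluation = MTEB(tasks=["TRECCOVID"])
evaluation.run(E5Wrapper("intfloat/e5-base-v2"), output_folder="results")
```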

@bergum The results in the paper correspond to https://huggingface.co/intfloat/e5-base, not to the v2 models.

Your results are consistent with ours; you can check them in the "Evaluation results" section of https://huggingface.co/intfloat/e5-base-v2. Note that software versions and hardware can cause very minor differences.

By the way, the TREC-COVID dataset is very small, so its scores fluctuate a lot when fine-tuning with different random seeds. We mainly focus on the average results across all BEIR datasets.

Perfect, @intfloat. Thank you for taking the time to explain this. I wrongly assumed v1 and v2 would perform similarly. I see now that the self-reported ndcg_at_10 is 69.596, which is close to my number, so the small gap is easily explained. Thank you for publishing this work and for making it easy to reproduce!
