Compared with the "e5-base" model, what is the main update in this "e5-base-v2" version?
Nothing fundamentally new; the v2 models are simply pre-trained on a larger and more diverse collection of text pair datasets.
@bergum We do not have plans to release its unsupervised weights. Embedding models without supervised fine-tuning do not perform very well and are not suitable for out-of-the-box use cases. If you'd like to fine-tune from the unsupervised weights, you can build upon https://huggingface.co/intfloat/e5-base-unsupervised (small and large versions are also available).
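For anyone who wants to build on that checkpoint, here is a minimal sketch of loading it with plain transformers and embedding text before any fine-tuning. The average pooling and the "query:"/"passage:" prefixes follow the e5 model cards; whether the unsupervised checkpoint itself expects prefixes is an assumption here, and the actual fine-tuning loop is left to your own setup.

```python
# Minimal sketch (not from the e5 repo): load intfloat/e5-base-unsupervised as a
# starting point for further fine-tuning. Pooling and prefixes follow the e5 model
# cards; treat the prefix convention for the unsupervised checkpoint as an assumption.
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

model_name = "intfloat/e5-base-unsupervised"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

def average_pool(last_hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    # Zero out padding positions, then average the token embeddings of each sequence.
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]

texts = [
    "query: how much protein should a female eat",
    "passage: The recommended daily protein intake for women is about 46 grams.",
]
batch = tokenizer(texts, max_length=512, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(**batch)
embeddings = F.normalize(average_pool(outputs.last_hidden_state, batch["attention_mask"]), p=2, dim=1)
print(embeddings @ embeddings.T)  # cosine similarities, since embeddings are L2-normalized
```

From there you would typically continue with a contrastive objective over query/passage pairs, but the exact recipe depends on your data.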
Thanks for confirming, @intfloat .
I'm asking because I can't reproduce the BEIR results reported in the paper, or anything close to them. This could be explained if, by mistake, the wrong weights were uploaded.
With e5-base-v2 on TREC-COVID, I get 0.69633 ndcg_at_10, well below the 0.79 reported in the paper (a very good result for a dense model on TREC-COVID).
Edit:
It should be noted that this is on CPU; I haven't tested on GPU yet, and I have only tested TREC-COVID so far.
python3 mteb_beir_eval.py --model-name-or-path intfloat/e5-base-v2
...
[2023-05-30 01:16:30,748 INFO] Evaluation for TRECCOVID on test took 89452.90 seconds
[2023-05-30 01:16:30,748 INFO] Scores: {'ndcg_at_1': 0.75, 'ndcg_at_3': 0.74397, 'ndcg_at_5': 0.73222, 'ndcg_at_10': 0.69633, 'ndcg_at_100': 0.52017, 'ndcg_at_1000': 0.48872, 'map_at_1': 0.00215, 'map_at_3': 0.00602, 'map_at_5': 0.00968, 'map_at_10': 0.01753, 'map_at_100': 0.09263, 'map_at_1000': 0.23437, 'recall_at_1': 0.00215, 'recall_at_3': 0.0065, 'recall_at_5': 0.01057, 'recall_at_10': 0.01961, 'recall_at_100': 0.12825, 'recall_at_1000': 0.46435, 'precision_at_1': 0.84, 'precision_at_3': 0.8, 'precision_at_5': 0.784, 'precision_at_10': 0.74, 'precision_at_100': 0.5326, 'precision_at_1000': 0.21844, 'mrr_at_1': 0.84, 'mrr_at_3': 0.91333, 'mrr_at_5': 0.91333, 'mrr_at_10': 0.91333, 'mrr_at_100': 0.91333, 'mrr_at_1000': 0.91333, 'evaluation_time': 89452.9}
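For context, the command above uses the repo's mteb_beir_eval.py script. A rough equivalent through the mteb package would look like the sketch below; note this is only an outline, since a plain SentenceTransformer wrapper does not add the "query:"/"passage:" prefixes that e5 models expect, which the repo's script takes care of.

```python
# Hedged sketch: evaluating TRECCOVID via the mteb package with sentence-transformers.
# Unlike the repo's mteb_beir_eval.py, this does not add the "query:"/"passage:"
# prefixes that e5 models expect, so the scores would not be directly comparable.
from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/e5-base-v2")
evaluation = MTEB(tasks=["TRECCOVID"])
evaluation.run(model, output_folder="results/e5-base-v2")  # per-task scores are written as JSON under output_folder
```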
@bergum The results in the paper correspond to https://huggingface.co/intfloat/e5-base, not the v2 models.
Your results are consistent with ours, which you can check in the "Evaluation results" section of https://huggingface.co/intfloat/e5-base-v2. Note that software versions and hardware could cause very minor differences.
By the way, the TREC COVID dataset is very small and shows large performance fluctuations when fine-tuned with different random seeds. We mainly focus on the average results across all BEIR datasets.