Some questions about reproducing this MTEB result

#3
by infgrad - opened

Hi, I cannot reproduce this MTEB result. Could you please re-check the model weights?
Also, what pooling does this model use? In your Transformers code example the pooling is last_token, but in your sentence-transformers code the pooling is mean (is the 1_Pooling module missing? The default pooling in sentence-transformers is mean.)

infgrad changed discussion title from Cannot reproduce this MTEB-result to Some questions about reproducing this MTEB result
Salesforce org
edited 3 days ago

Hi infgrad,

Thank you for bringing this to our attention. We have carefully reviewed your concerns and retested the model weights. We can confirm that these weights do indeed reproduce the MTEB results as expected.

Regarding the pooling method, we originally used last-token pooling. However, thanks to Tom's assistance, we have corrected the discrepancy in the sentence transformer implementation. The issue has been resolved, and the correct pooling method is now consistently applied.

Could you please retry using last-token pooling and let us know the results you get? We are keen to discuss further and ensure the reproducibility of our model.
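
For concreteness, here is a minimal sketch of last-token pooling in the style of the Transformers example on the model card (the helper name is illustrative; the key point is that with right padding you must index the last non-padding token via the attention mask):

import torch

def last_token_pool(last_hidden_states: torch.Tensor,
                    attention_mask: torch.Tensor) -> torch.Tensor:
    # With left padding, the final position already holds the last real token.
    left_padding = (attention_mask[:, -1].sum() == attention_mask.shape[0])
    if left_padding:
        return last_hidden_states[:, -1]
    # With right padding, locate each sequence's last real token from the attention mask.
    sequence_lengths = attention_mask.sum(dim=1) - 1
    batch_size = last_hidden_states.shape[0]
    return last_hidden_states[
        torch.arange(batch_size, device=last_hidden_states.device), sequence_lengths
    ]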

Best,
Ye

@yliu279 I cannot reproduce the results with the sentence-transformers code (using the latest version). When running the BEIR-FiQA evaluation on the curated, sentence-transformers-based pipeline, I get recall@10 = 17.111, whereas it should be 69.440. It looks like the pooling might be wrong somewhere; could you confirm that padding and last-token access work the same way across the transformers and sentence-transformers pipelines?
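
One quick way to see what the sentence-transformers pipeline actually loads (a sketch, assuming the usual Transformer + Pooling module layout of the repository):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Salesforce/SFR-Embedding-2_R")
pooling = model[1]                     # the Pooling module sits right after the Transformer module
print(pooling.get_pooling_mode_str())  # should report last-token pooling, not "mean"
print(model.tokenizer.padding_side)    # last-token pooling must agree with the padding side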

@yliu279 @mbien
Hi, I carefully reviewed my test code and still get the same results. Here is a minimal reproduction:


import functools
import os
from mteb import MTEB
from sentence_transformers import SentenceTransformer

if __name__ == "__main__":
    # load model
    model = SentenceTransformer("/mnt/hwdata/ip/nlp/public_models/SFR-Embedding-2_R", device="cuda")
    model.encode = functools.partial(
        model.encode,
        batch_size=8,
        show_progress_bar=True,
        prompt="Instruct: Retrieve semantically similar text.\nQuery: "  # only test STS
    )

    evaluation = MTEB(tasks=["STSBenchmark"], task_langs=["en"])
    evaluation.run(
        model,
        output_folder=f"sts_results",
        eval_splits=["test"],
        verbosity=2,
        overwrite_results=True,
    )

The result is:

{
  "dataset_revision": "b0fddb56ed78048fa8b90373c8a3cfc37b684831",
  "evaluation_time": 84.45004653930664,
  "kg_co2_emissions": null,
  "mteb_version": "1.12.48",
  "scores": {
    "test": [
      {
        "cosine_pearson": 0.701287466112608,
        "cosine_spearman": 0.7236247747370012,
        "euclidean_pearson": 0.7204492422443474,
        "euclidean_spearman": 0.7233661781589509,
        "hf_subset": "default",
        "languages": [
          "eng-Latn"
        ],
        "main_score": 0.7236247747370012,
        "manhattan_pearson": 0.7397991758538812,
        "manhattan_spearman": 0.7408110026601888,
        "pearson": [
          0.7012874418016998,
          1.2264194202428853e-204
        ],
        "spearman": [
          0.7236247747370012,
          5.439385473263533e-224
        ]
      }
    ]
  },
  "task_name": "STSBenchmark"
}

However, in your README.md, the result is:

  - task:
      type: STS
    dataset:
      type: mteb/stsbenchmark-sts
      name: MTEB STSBenchmark
      config: default
      split: test
      revision: b0fddb56ed78048fa8b90373c8a3cfc37b684831
    metrics:
    - type: cos_sim_pearson
      value: 83.55433725920493
    - type: cos_sim_spearman
      value: 83.60373857254014
    - type: euclidean_pearson
      value: 83.08086082334839
    - type: euclidean_spearman
      value: 83.6036864776559
    - type: manhattan_pearson
      value: 83.2232267589246
    - type: manhattan_spearman
      value: 83.78923946962664


@yliu279 could you give us a hint on how to reproduce the scores?

Hi @infgrad @mbien

We noticed a discrepancy in the Sentence Transformers evaluation. We are currently working on resolving this issue and will share the solution shortly. In the meantime, here is the process we used to produce the results; please feel free to try it if you are interested:
Use the E5 evaluation pipeline: https://github.com/microsoft/unilm/blob/master/e5/mteb_except_retrieval_eval.py
First, make two edits in utils.py:

  1. Add 'SFR-Embedding-2_R': 'instruction' to the MODEL_NAME_TO_PREFIX_TYPE dict and 'SFR-Embedding-2_R': 'last' to the MODEL_NAME_TO_POOL_TYPE dict in utils.py.
  2. Revise the create_batch_dict() function in utils.py as follows (a quick sanity check of this change is sketched right after the code):
        batch_dict = tokenizer(
            input_texts,
            max_length=max_length - 1,
            return_token_type_ids=False,
            return_attention_mask=False,
            padding=False,
            truncation=True
        )

        # append eos_token_id to every input_ids
        batch_dict['input_ids'] = [input_ids + [tokenizer.eos_token_id] for input_ids in batch_dict['input_ids']]

        return tokenizer.pad(
            batch_dict,
            padding=True,
            pad_to_multiple_of=8,
            return_attention_mask=True,
            return_tensors="pt",
        )
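
A quick sanity check of the revision above (a sketch: it assumes the revised utils.py from step 2 is importable, that create_batch_dict keeps the (tokenizer, input_texts, max_length) signature from the E5 repo, and that the model id is Salesforce/SFR-Embedding-2_R):

import torch
from transformers import AutoTokenizer
from utils import create_batch_dict  # the revised function from step 2

tokenizer = AutoTokenizer.from_pretrained("Salesforce/SFR-Embedding-2_R")
batch = create_batch_dict(tokenizer, ["short text", "a somewhat longer piece of text"], max_length=512)

# Find the last non-padding position in each row, regardless of padding side,
# and confirm it holds the EOS token -- the position last-token pooling reads.
mask = batch["attention_mask"]
last_positions = mask.shape[1] - 1 - mask.flip(dims=[1]).argmax(dim=1)
last_tokens = batch["input_ids"][torch.arange(mask.shape[0]), last_positions]
assert torch.all(last_tokens == tokenizer.eos_token_id)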

Second, in the main() function of e5_mteb_except_retrieval_eval.py:

model = DenseEncoder()
evaluation = MTEB(tasks=["STSBenchmark"], task_langs=["en"])
evaluation.run(
    model,
    output_folder="sts_results",
    eval_splits=["test"],
    verbosity=2,
    overwrite_results=True,
)

You will get results as below:

{
  "dataset_revision": "b0fddb56ed78048fa8b90373c8a3cfc37b684831",
  "evaluation_time": 14.305434942245483,
  "kg_co2_emissions": null,
  "mteb_version": "1.12.48",
  "scores": {
    "test": [
      {
        "cosine_pearson": 0.8355240450842275,
        "cosine_spearman": 0.8360701599480195,
        "euclidean_pearson": 0.8307927408782112,
        "euclidean_spearman": 0.8360703731734451,
        "hf_subset": "default",
        "languages": [
          "eng-Latn"
        ],
        "main_score": 0.8360701599480195,
        "manhattan_pearson": 0.832215434631109,
        "manhattan_spearman": 0.8378697003913586,
        "pearson": 0.8355240450842275,
        "spearman": 0.8360701599480195
      }
    ]
  },
  "task_name": "STSBenchmark"
}

Hi @infgrad @mbien ,

The Sentence Transformers evaluation is now functioning correctly. We have added "add_eos_token": true to tokenizer_config.json. You can now obtain accurate results using the ST evaluation.
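
A quick way to confirm the fix (a sketch; the model id is assumed to be Salesforce/SFR-Embedding-2_R):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Salesforce/SFR-Embedding-2_R")
ids = tokenizer("hello world")["input_ids"]
print(ids[-1] == tokenizer.eos_token_id)  # expected: True now that "add_eos_token" is true

With this change, the ST evaluation gives: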

  "scores": {
    "test": [
      {
        "cosine_pearson": 0.8355526890934296,
        "cosine_spearman": 0.8360173852997346,
        "euclidean_pearson": 0.830706240702224,
        "euclidean_spearman": 0.8365412824235895,
        "hf_subset": "default",
        "languages": [
          "eng-Latn"
        ],
        "main_score": 0.8360173852997346,
        "manhattan_pearson": 0.8318737804127988,
        "manhattan_spearman": 0.8380955443197002,
        "pearson": [
          0.8355526691849025,
          0.0
        ],
        "spearman": [
          0.8360186564578723,
          0.0
        ]
      }
    ]
  },
  "task_name": "STSBenchmark"

Great!
I can now reproduce the results on all MTEB tasks.

infgrad changed discussion status to closed
