Discrepancy in Model Outputs Between Transformers and Sentence Transformers
#29
by vatolinalex
Hello,
I have observed some discrepancies in the output scores when using the NVIDIA models with the transformers library versus the sentence_transformers library. Specifically, this issue occurs with the following models:
- nvidia/NV-Embed-v1
- nvidia/NV-Embed-v2
Here is code to reproduce the issue:
```python
import torch.nn.functional as F
from sentence_transformers import SentenceTransformer
from sklearn.metrics import mean_squared_error
from transformers import AutoModel


def encode_with_sentence_transformer(queries, documents, task, model_name):
    prompt = f"Instruct: {task}\nQuery: "
    model = SentenceTransformer(model_name, trust_remote_code=True)
    model.max_seq_length = 32768
    model.tokenizer.padding_side = "right"

    def add_eos(input_examples):
        input_examples = [
            input_example + model.tokenizer.eos_token for input_example in input_examples
        ]
        return input_examples

    batch_size = 2
    query_embeddings = model.encode(
        add_eos(queries), batch_size=batch_size, prompt=prompt, normalize_embeddings=True
    )
    passage_embeddings = model.encode(
        add_eos(documents), batch_size=batch_size, prompt="", normalize_embeddings=True
    )
    scores = (query_embeddings @ passage_embeddings.T) * 100
    return scores


def encode_with_auto_model(queries, documents, task, model_name):
    prompt = f"Instruct: {task}\nQuery: "
    model = AutoModel.from_pretrained(model_name, trust_remote_code=True)
    max_length = 32768
    query_embeddings = model.encode(queries, instruction=prompt, max_length=max_length)
    passage_embeddings = model.encode(documents, instruction="", max_length=max_length)
    query_embeddings = F.normalize(query_embeddings, p=2, dim=1)
    passage_embeddings = F.normalize(passage_embeddings, p=2, dim=1)
    scores = (query_embeddings @ passage_embeddings.T) * 100
    return scores.detach().numpy()


if __name__ == "__main__":
    task = "Given a question, retrieve passages that answer the question"
    queries = [
        'are judo throws allowed in wrestling?',
        'how to become a radiology technician in michigan?'
    ]
    passages = [
        "Since you're reading this, you are probably someone from a judo background or someone who is just wondering how judo techniques can be applied under wrestling rules. So without further ado, let's get to the question. Are Judo throws allowed in wrestling? Yes, judo throws are allowed in freestyle and folkstyle wrestling. You only need to be careful to follow the slam rules when executing judo throws. In wrestling, a slam is lifting and returning an opponent to the mat with unnecessary force.",
        "Below are the basic steps to becoming a radiologic technologist in Michigan:Earn a high school diploma. As with most careers in health care, a high school education is the first step to finding entry-level employment. Taking classes in math and science, such as anatomy, biology, chemistry, physiology, and physics, can help prepare students for their college studies and future careers.Earn an associate degree. Entry-level radiologic positions typically require at least an Associate of Applied Science. Before enrolling in one of these degree programs, students should make sure it has been properly accredited by the Joint Review Committee on Education in Radiologic Technology (JRCERT).Get licensed or certified in the state of Michigan."
    ]

    scores1 = encode_with_sentence_transformer(queries, passages, task, "nvidia/NV-Embed-v2")
    scores2 = encode_with_auto_model(queries, passages, task, "nvidia/NV-Embed-v2")
    mse1_2 = mean_squared_error(scores1, scores2)

    print("Scores from SentenceTransformer model:")
    print(scores1.tolist())
    print("Scores from AutoModel:")
    print(scores2.tolist())
    print(f"MSE between SentenceTransformer and AutoModel: {mse1_2}")

# Scores from SentenceTransformer model:
# [[87.96675872802734, 0.4764425456523895], [1.0268232822418213, 86.3516616821289]]
# Scores from AutoModel:
# [[87.42693328857422, 0.4628346264362335], [0.9652639031410217, 86.0372314453125]]
# MSE between SentenceTransformer and AutoModel: 0.09856314957141876
```
Could someone provide clarification on why this discrepancy might exist?
What implementation of the model did you use to compute the scores on the MTEB benchmark?
Here is the relevant issue in the MTEB repository: https://github.com/embeddings-benchmark/mteb/issues/1600
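
In case it helps whoever looks into this: below is a rough sketch (my own debugging idea, not something taken from either library's documentation) that compares the query embeddings from the two code paths directly instead of the final scores. The model name, prompt, and `encode()` arguments simply mirror the reproduction above; everything else is illustrative. If the per-query cosine similarity comes out noticeably below 1.0, that would suggest the two wrappers are building different inputs (prompt placement, EOS handling, or truncation) rather than just accumulating floating-point noise.

```python
import numpy as np
import torch.nn.functional as F
from sentence_transformers import SentenceTransformer
from transformers import AutoModel

model_name = "nvidia/NV-Embed-v2"
task = "Given a question, retrieve passages that answer the question"
prompt = f"Instruct: {task}\nQuery: "
queries = ["are judo throws allowed in wrestling?"]

# Path 1: sentence_transformers, mirroring the reproduction above
# (manual EOS token, prompt passed via the prompt= argument).
st_model = SentenceTransformer(model_name, trust_remote_code=True)
st_model.max_seq_length = 32768
st_model.tokenizer.padding_side = "right"
st_emb = st_model.encode(
    [q + st_model.tokenizer.eos_token for q in queries],
    prompt=prompt,
    normalize_embeddings=True,
)

# Path 2: transformers AutoModel with the model's remote-code encode() helper,
# again mirroring the reproduction above.
hf_model = AutoModel.from_pretrained(model_name, trust_remote_code=True)
hf_emb = hf_model.encode(queries, instruction=prompt, max_length=32768)
hf_emb = F.normalize(hf_emb, p=2, dim=1).detach().cpu().numpy()

# Both embeddings are unit-normalized, so the dot product is the cosine similarity
# between the two representations of the same query.
for i, q in enumerate(queries):
    print(q, float(np.dot(st_emb[i], hf_emb[i])))
```

If the similarities are essentially 1.0, the remaining gap in the scores is more plausibly numerical (dtype, batching, attention implementation) than a difference in how the inputs are constructed.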