Discrepancy in Model Outputs Between Transformers and Sentence Transformers
#29
by vatolinalex
Hello,
I have observed some discrepancies in the output scores when using the NVIDIA models with the transformers library versus the sentence_transformers library. Specifically, this issue occurs with the following models:
- nvidia/NV-Embed-v1
- nvidia/NV-Embed-v2
Here is code to reproduce the issue:
```python
import torch.nn.functional as F
from sentence_transformers import SentenceTransformer
from sklearn.metrics import mean_squared_error
from transformers import AutoModel


def encode_with_sentence_transformer(queries, documents, task, model_name):
    prompt = f"Instruct: {task}\nQuery: "
    model = SentenceTransformer(model_name, trust_remote_code=True)
    model.max_seq_length = 32768
    model.tokenizer.padding_side = "right"

    def add_eos(input_examples):
        input_examples = [
            input_example + model.tokenizer.eos_token for input_example in input_examples
        ]
        return input_examples

    batch_size = 2
    query_embeddings = model.encode(
        add_eos(queries), batch_size=batch_size, prompt=prompt, normalize_embeddings=True
    )
    passage_embeddings = model.encode(
        add_eos(documents), batch_size=batch_size, prompt="", normalize_embeddings=True
    )
    scores = (query_embeddings @ passage_embeddings.T) * 100
    return scores


def encode_with_auto_model(queries, documents, task, model_name):
    prompt = f"Instruct: {task}\nQuery: "
    model = AutoModel.from_pretrained(model_name, trust_remote_code=True)
    max_length = 32768
    query_embeddings = model.encode(queries, instruction=prompt, max_length=max_length)
    passage_embeddings = model.encode(documents, instruction="", max_length=max_length)
    query_embeddings = F.normalize(query_embeddings, p=2, dim=1)
    passage_embeddings = F.normalize(passage_embeddings, p=2, dim=1)
    scores = (query_embeddings @ passage_embeddings.T) * 100
    return scores.detach().numpy()


if __name__ == "__main__":
    task = "Given a question, retrieve passages that answer the question"
    queries = [
        'are judo throws allowed in wrestling?',
        'how to become a radiology technician in michigan?'
    ]
    passages = [
        "Since you're reading this, you are probably someone from a judo background or someone who is just wondering how judo techniques can be applied under wrestling rules. So without further ado, let's get to the question. Are Judo throws allowed in wrestling? Yes, judo throws are allowed in freestyle and folkstyle wrestling. You only need to be careful to follow the slam rules when executing judo throws. In wrestling, a slam is lifting and returning an opponent to the mat with unnecessary force.",
        "Below are the basic steps to becoming a radiologic technologist in Michigan:Earn a high school diploma. As with most careers in health care, a high school education is the first step to finding entry-level employment. Taking classes in math and science, such as anatomy, biology, chemistry, physiology, and physics, can help prepare students for their college studies and future careers.Earn an associate degree. Entry-level radiologic positions typically require at least an Associate of Applied Science. Before enrolling in one of these degree programs, students should make sure it has been properly accredited by the Joint Review Committee on Education in Radiologic Technology (JRCERT).Get licensed or certified in the state of Michigan."
    ]

    scores1 = encode_with_sentence_transformer(queries, passages, task, "nvidia/NV-Embed-v2")
    scores2 = encode_with_auto_model(queries, passages, task, "nvidia/NV-Embed-v2")
    mse1_2 = mean_squared_error(scores1, scores2)

    print("Scores from SentenceTransformer model:")
    print(scores1.tolist())
    print("Scores from AutoModel:")
    print(scores2.tolist())
    print(f"MSE between SentenceTransformer and AutoModel: {mse1_2}")

# Scores from SentenceTransformer model:
# [[87.96675872802734, 0.4764425456523895], [1.0268232822418213, 86.3516616821289]]
# Scores from AutoModel:
# [[87.42693328857422, 0.4628346264362335], [0.9652639031410217, 86.0372314453125]]
# MSE between SentenceTransformer and AutoModel: 0.09856314957141876
```
Could someone provide clarification on why this discrepancy might exist?
What implementation of the model did you use to compute the scores on the MTEB benchmark?
Here is the relevant issue in the MTEB repository: https://github.com/embeddings-benchmark/mteb/issues/1600
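
In case it helps whoever looks into this: below is a rough sketch (my own debugging idea, not something taken from either library's documentation) that compares the query embeddings from the two code paths directly instead of the final scores. The model name, prompt, and `encode()` arguments simply mirror the reproduction above; everything else is illustrative. If the per-query cosine similarity comes out noticeably below 1.0, that would suggest the two wrappers are building different inputs (prompt placement, EOS handling, or truncation) rather than just accumulating floating-point noise.

```python
import numpy as np
import torch.nn.functional as F
from sentence_transformers import SentenceTransformer
from transformers import AutoModel

model_name = "nvidia/NV-Embed-v2"
task = "Given a question, retrieve passages that answer the question"
prompt = f"Instruct: {task}\nQuery: "
queries = ["are judo throws allowed in wrestling?"]

# Path 1: sentence_transformers, mirroring the reproduction above
# (manual EOS token, prompt passed via the prompt= argument).
st_model = SentenceTransformer(model_name, trust_remote_code=True)
st_model.max_seq_length = 32768
st_model.tokenizer.padding_side = "right"
st_emb = st_model.encode(
    [q + st_model.tokenizer.eos_token for q in queries],
    prompt=prompt,
    normalize_embeddings=True,
)

# Path 2: transformers AutoModel with the model's remote-code encode() helper,
# again mirroring the reproduction above.
hf_model = AutoModel.from_pretrained(model_name, trust_remote_code=True)
hf_emb = hf_model.encode(queries, instruction=prompt, max_length=32768)
hf_emb = F.normalize(hf_emb, p=2, dim=1).detach().cpu().numpy()

# Both embeddings are unit-normalized, so the dot product is the cosine similarity
# between the two representations of the same query.
for i, q in enumerate(queries):
    print(q, float(np.dot(st_emb[i], hf_emb[i])))
```

If the similarities are essentially 1.0, the remaining gap in the scores is more plausibly numerical (dtype, batching, attention implementation) than a difference in how the inputs are constructed.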