
Onnx model doesn't produce embeddings close enough to SentenceTransformer version

#67
by luciancap001 - opened

I'm attempting to use the ONNX version of the model to speed up inference, but I noticed that the values output by the model after applying the pooling & normalization do not match those produced by the SentenceTransformer version. They also fail to fall within acceptable bounds of the AutoModel version's outputs when I execute each step manually. Specifically, I'm referencing this PyTorch tutorial (https://pytorch.org/tutorials/advanced/super_resolution_with_onnxruntime.html), which states the outputs should agree within atol = 1e-5 & rtol = 1e-3 when checked with torch.isclose(). Is there any explanation for this discrepancy, & how much will it affect results downstream?
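For reference, a minimal sketch of the tolerance check I'm applying (the arrays here are just illustrative stand-ins for the two sets of embeddings):

```python
import numpy as np

# Illustrative stand-ins for the SentenceTransformer and ONNX embeddings
# computed on the same batch of inputs.
a = np.random.rand(8, 384).astype(np.float32)
b = a + np.float32(1e-6)

# Passes iff |a - b| <= atol + rtol * |b| element-wise.
print(np.allclose(a, b, rtol=1e-03, atol=1e-05))
```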

Do you have a script to test the comparison?

It looks like it might be more than just the ONNX model being off; it appears that even using the AutoModel class produces different results compared to the SentenceTransformer model's output. Thoughts?

```python
import onnxruntime

import numpy as np
import torch
import torch.nn.functional as F

from sentence_transformers import SentenceTransformer
from transformers import AutoTokenizer, AutoModel
from sklearn.datasets import fetch_20newsgroups

def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0] #First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

def main():
    
    #Obtain a list of strings to test the model
    docs = fetch_20newsgroups(subset='all',  remove=('headers', 'footers', 'quotes'))['data'][:100]

    #Load the tokenizer
    tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

    ####################################################################################################
    # SentenceTransformer Model
    ####################################################################################################

    #Load the model using SentenceTransformer & embed the docs
    sent_trans_model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
    sent_trans_embeds = sent_trans_model.encode(docs)

    ####################################################################################################
    # Onnx Model
    ####################################################################################################

    # Tokenize sentences
    encoded_input = dict(tokenizer(docs, 
                                padding = True, 
                                truncation = True,
                                return_token_type_ids= False, 
                                return_tensors = 'pt'))

    #Create the Onnxruntime and do the forward pass with it
    ort_session = onnxruntime.InferenceSession("model.onnx", providers = ["CPUExecutionProvider"])
    ort_outs = ort_session.run(None, {k: v.cpu().numpy() for k, v in encoded_input.items()})
    ort_outs = [torch.tensor(i) for i in ort_outs]

    #Perform pooling & normalization on Onnx output
    onnx_embeds = F.normalize(mean_pooling(ort_outs, encoded_input['attention_mask']), p = 2, dim = 1).cpu().numpy()

    print("Onnx model embeddings close to SentenceTransformer embeddings: ", np.allclose(sent_trans_embeds, onnx_embeds, rtol=1e-03, atol=1e-05))

    ####################################################################################################
    # AutoModel
    ####################################################################################################

    #Load the model using the AutoModel class
    model = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

    #Compute token embeddings
    with torch.no_grad():
        model_output = model(**encoded_input)

    #Perform pooling & normalization on AutoModel output
    auto_embeds = F.normalize(mean_pooling(model_output, encoded_input['attention_mask']), p = 2, dim = 1).cpu().numpy()

    print("AutoModel embeddings close to SentenceTransformer embeddings: ", np.allclose(sent_trans_embeds, auto_embeds, rtol=1e-03, atol=1e-05))

if __name__ == "__main__":
    main()
```

So it turns out the issue was apparently with the tokenizer all along. For some reason, the tokenizer loaded with the AutoTokenizer class uses a max_length of 512, but the model expects/was trained with 256. Setting this argument explicitly when calling the tokenizer makes the np.allclose() test pass.

```python
tokenizer(docs,
          padding = True,
          truncation = True,
          return_token_type_ids = False,
          return_tensors = 'pt',
          max_length = 256)
```
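For anyone else hitting this: the mismatch is visible by comparing the tokenizer's default against the sequence length SentenceTransformers actually uses. A quick check (both attributes are standard in their respective libraries):

```python
from sentence_transformers import SentenceTransformer
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
st_model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# The bare tokenizer defaults to the architecture's limit, while
# SentenceTransformer truncates at the length the model was trained with.
print(tokenizer.model_max_length)  # 512
print(st_model.max_seq_length)     # 256
```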

Another update: running the ONNX model & passing "CUDAExecutionProvider" instead of "CPUExecutionProvider" when initializing the InferenceSession fails the test. Not sure if that's a result of an issue with the ONNX model itself or with ONNX Runtime.
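If it helps narrow that down, one way to isolate the provider is to run the same model.onnx under both providers on identical inputs and measure the gap directly. A rough sketch (the looser atol is a guess on my part, since GPU kernels accumulate floats in a different order than CPU):

```python
import numpy as np
import onnxruntime
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
feeds = dict(tokenizer(["a short test sentence"],
                       padding=True, truncation=True, max_length=256,
                       return_token_type_ids=False, return_tensors="np"))

# Same model, two execution providers (CUDA needs the onnxruntime-gpu package).
cpu_sess = onnxruntime.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
gpu_sess = onnxruntime.InferenceSession("model.onnx", providers=["CUDAExecutionProvider"])

cpu_out = cpu_sess.run(None, feeds)[0]
gpu_out = gpu_sess.run(None, feeds)[0]

# Check how large the CPU/GPU deviation actually is before blaming the model.
print(np.abs(cpu_out - gpu_out).max())
print(np.allclose(cpu_out, gpu_out, rtol=1e-03, atol=1e-04))
```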
