Upload ONNX weights

#4
by Xenova - opened
No description provided.
spacemanidol changed pull request status to merged

For me, the embeddings computed with the ONNX version and the transformers/sentence-transformers version are significantly different. For the same sentence, their dot-product similarity is below 0.8. This is more noticeable with long sentences (>6k tokens).

Has the ONNX version been tested? Can someone confirm or deny my results?

Could you provide example code? The fp32 weights passed validation (atol=1e-4) after exporting. Are you perhaps missing the normalization step?
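For reference, here is a minimal sketch of the post-processing the validation assumes: CLS-token pooling followed by L2 normalization (pool_and_normalize is just an illustrative helper name, not part of the model's API):

import torch

def pool_and_normalize(last_hidden_state: torch.Tensor) -> torch.Tensor:
    # CLS pooling: take the embedding of the first token
    cls_embeddings = last_hidden_state[:, 0]
    # L2-normalize so that dot products equal cosine similarities
    return torch.nn.functional.normalize(cls_embeddings, p=2, dim=1)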

Yes! I was running into this issue while running the model in a Triton server, but I could also replicate it locally. Here is the code:

import torch
from transformers import AutoModel, AutoTokenizer
from optimum.onnxruntime import ORTModelForFeatureExtraction  # type: ignore

text = """A grasshopper spent the summer hopping about in the sun and singing to his heart's content. One day, an ant went hurrying by, looking very hot and weary.

"Why are you working on such a lovely day?" said the grasshopper.

"I'm collecting food for the winter," said the ant, "and I suggest you do the same." And off she went, helping the other ants to carry food to their store. The grasshopper carried on hopping and singing. When winter came the ground was covered with snow. The grasshopper had no food and was hungry. So he went to the ants and asked for food.

"What did you do all summer when we were working to collect our food?" said one of the ants.

"I was busy hopping and singing," said the grasshopper.

"Well," said the ant, "if you hop and sing all summer, and do no work, then you must starve in the winter."""
tokenizer = AutoTokenizer.from_pretrained('Snowflake/snowflake-arctic-embed-m-long')
model = AutoModel.from_pretrained('Snowflake/snowflake-arctic-embed-m-long', trust_remote_code=True, add_pooling_layer=False)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
model.eval()

model_onnx = ORTModelForFeatureExtraction.from_pretrained('Snowflake/snowflake-arctic-embed-m-long', file_name="onnx/model.onnx", trust_remote_code=True, add_pooling_layer=False, safe_serialization=True)
model_onnx = model_onnx.to(device)
sentences = [" ".join(text*2000)]
tokenized = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt', max_length=8192)
tokenized = tokenized.to(device)
print(f"Number of tokens: {tokenized.input_ids.shape[1]}")

# Compute token embeddings
with torch.no_grad():
    sentence_embeddings = model(**tokenized)[0]
    sentence_embeddings_ort = model_onnx(**tokenized)[0]
    print(sentence_embeddings.shape, sentence_embeddings_ort.shape) # torch.Size([1, 8192, 768]) torch.Size([1, 8192, 768])
    sentence_embeddings = sentence_embeddings[:, 0]
    sentence_embeddings_ort = sentence_embeddings_ort[:, 0]

# Normalize embeddings
sentence_embeddings = torch.nn.functional.normalize(sentence_embeddings, p=2, dim=1)
sentence_embeddings_ort = torch.nn.functional.normalize(sentence_embeddings_ort, p=2, dim=1)

print(torch.allclose(sentence_embeddings, sentence_embeddings_ort, atol=1e-2)) # False

print(torch.dot(sentence_embeddings[0], sentence_embeddings_ort[0])) # 0.8255

I get similar results exporting the model myself with the Optimum CLI (it also passes validation, but some warnings appear during conversion). In my case I also exported with rotary_scaling_factor=2, since I need the large context window, and with --library sentence-transformers.
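In case it helps reproduce, the export can also be done from Python via Optimum's export=True path. A rough, untested sketch; note that setting rotary_scaling_factor on the loaded config before export is an assumption on my part (it is a model config field, not an official export option):

from transformers import AutoConfig
from optimum.onnxruntime import ORTModelForFeatureExtraction

# Load the remote config and enable long-context rotary scaling.
# NOTE: assumption -- rotary_scaling_factor is a model config field,
# not an Optimum export flag.
config = AutoConfig.from_pretrained(
    'Snowflake/snowflake-arctic-embed-m-long', trust_remote_code=True
)
config.rotary_scaling_factor = 2

# export=True converts the PyTorch checkpoint to ONNX on the fly.
model_onnx = ORTModelForFeatureExtraction.from_pretrained(
    'Snowflake/snowflake-arctic-embed-m-long',
    config=config,
    export=True,
    trust_remote_code=True,
)
model_onnx.save_pretrained('arctic-embed-m-long-onnx')  # hypothetical output dir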

Heya, I experimented with this a bit too. First of all, I don't think [" ".join(text*2000)] is doing what you think it's doing; I'd use e.g. [text * 10] instead to get one long text. I can get exactly the same results from Sentence Transformers and ONNX only up to ~2048 tokens; after that they diverge:

Number of tokens: 2012
True
tensor(1.0000, device='cuda:0')
Number of tokens: 2213
False
tensor(0.9837, device='cuda:0')
Number of tokens: 2414
False
tensor(0.7409, device='cuda:0')
  • Tom Aarsen

Oops, yes I meant to use something more like [" ".join([text]*2000)].
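To spell out why the original line misbehaves: text * 2000 repeats the string itself, and " ".join(...) over a string iterates its characters, so the result is every character separated by a space rather than 2000 space-separated copies of the text. A quick illustration:

text = "abc"
print(" ".join(text * 2))    # "a b c a b c"  -> joins individual characters
print(" ".join([text] * 2))  # "abc abc"      -> joins repeated copies, as intended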

2048 is also the sequence length beyond which the authors say rotary_scaling_factor=2 should be passed at initialization. That said, in my experiments, exporting the ONNX model with that parameter enabled in the config gives the same (divergent) result.

Also, I see that parameter is already set in config.json, so maybe it is enabled by default?
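One way to check is to load the config and inspect the field directly (a small sketch; the getattr default just guards against the attribute being absent):

from transformers import AutoConfig

config = AutoConfig.from_pretrained(
    'Snowflake/snowflake-arctic-embed-m-long', trust_remote_code=True
)
# Prints the value from config.json, or None if the field is absent
print(getattr(config, 'rotary_scaling_factor', None))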
