Correct my model.onnx inference code?

#11
by williambarberjr

The model looks really impressive and I'm eager to test it out, but I could use some help sorting out whether I'm using the model.onnx file correctly in this Python code:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from transformers import AutoTokenizer
import onnxruntime as ort

tokenizer = AutoTokenizer.from_pretrained('Snowflake/snowflake-arctic-embed-l')
onnx_model_path = "my/file/path/snowflake-arctic-embed-large.onnx"
ort_session = ort.InferenceSession(onnx_model_path, providers=['CUDAExecutionProvider'])

def batch_generator(texts, batch_size=32):
    """Yield batches of tokenized texts."""
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i+batch_size]
        inputs = tokenizer(batch, return_tensors="np", padding=True, truncation=True, max_length=512)
        inputs = {k: v.astype(np.int64) for k, v in inputs.items()}
        yield inputs

def encode(texts):
    """Encode texts using the ONNX model, processing in batches."""
    embeddings = []
    for batch_inputs in batch_generator(texts, batch_size=32):
        # Perform inference and collect embeddings
        outputs = ort_session.run(None, batch_inputs)
        # Assuming the first output is the embeddings
        batch_embeddings = outputs[0][:, 0, :]  # I believe this pulls the [CLS] token representation... but I could be mistaken
        embeddings.append(batch_embeddings)
    embeddings = np.concatenate(embeddings, axis=0)
    return embeddings

query_prefix = "Represent this sentence for searching relevant passages: "
queries = ['what is snowflake?', 'Where can I get the best tacos?']
queriesToEmbed = [query_prefix + query for query in queries]
documents = ['The Data Cloud!', 'Mexico City of Course!']

# Encode the documents and queries
doc_embeddings = encode(documents)
query_embeddings = encode(queriesToEmbed)  # Each query is treated individually but passed as batch

# Calculate and print the cosine similarities for each query against each document
similarities = cosine_similarity(query_embeddings, doc_embeddings)

for i, similarity_scores in enumerate(similarities):
    print("Query:", queries[i])
    for j, score in enumerate(similarity_scores):
        print(f"Score: {score} for Document {documents[j]}")

#prints:
# Query: what is snowflake?
# Score: 0.28976789 for Document The Data Cloud!
# Score: 0.19071159 for Document Mexico City of Course!
# Query: Where can I get the best tacos?
# Score: 0.38650593 for Document Mexico City of Course!
# Score: 0.25145516 for Document The Data Cloud!

#vs the expected:
# Query: what is snowflake?
# 0.28976774 The Data Cloud!
# 0.19071159 Mexico City of Course!
# Query: Where can I get the best tacos?
# 0.38650584 Mexico City of Course!
# 0.25145516 The Data Cloud!

Above is my code for using the provided model.onnx file (you'll need to fill in your own path to the downloaded ONNX model to make it run). The embeddings for the first query from the Sentence Transformers library (which I assume uses the safetensors model) look like this:
[ 0.00784198 -0.06414595 -0.04445669 ... 0.00466895 0.01089635
-0.00057733]

While the embeddings produced from my code via the model.onnx file for the first query look like this:
[-0.13012639 -1.2006629 -0.74323857 ... 0.3901592 0.6858847
-0.7785025 ]

I resolved what may have been a CUDA version issue: onnxruntime 1.17 expects CUDA 11.8, and I was running 12.1, I think. (I tried but couldn't make onnxruntime 1.17 work with CUDA 12.2, even when installing via pip install onnxruntime-gpu --extra-index-url https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/onnxruntime-cuda-12/pypi/simple/.) ONNX Runtime still complains that I'm on Windows 11 instead of 10. The cosine similarities are, however, very close, so either it's just something the ONNX model does to handle inference more efficiently, or I've done something wrong here; I just want to be sure before I run all my data through it. I would just use the torch implementation, but when a user sends a query, my backend will be embedding that query on a smallish CPU, and I'm under the impression that, in the absence of batch processing, ONNX yields significantly faster (~2x) inference on a smallish CPU. Please correct me if I'm wrong; I can't seem to find the notebook that worked through an example of this, so I'm not sure how old the PyTorch version was when I saw this benchmarked.
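
For context, the single-query CPU path I have in mind is just the same session created with the CPU provider, reusing the tokenizer, onnx_model_path, and query_prefix from the snippet above; a rough sketch:

cpu_session = ort.InferenceSession(onnx_model_path, providers=['CPUExecutionProvider'])

single_query = query_prefix + 'what is snowflake?'
inputs = tokenizer([single_query], return_tensors="np", padding=True, truncation=True, max_length=512)
inputs = {k: v.astype(np.int64) for k, v in inputs.items()}
outputs = cpu_session.run(None, inputs)
single_query_embedding = outputs[0][:, 0, :]  # same [CLS]-token pooling as in encode() above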

Hello!

The embeddings from Sentence Transformers are automatically normalized, due to the Normalize module in the model. Computing the dot product of normalized embeddings is equivalent to computing cosine similarity, except the dot product is simpler. That is why you're seeing the same similarities but different absolute embeddings. If you normalize your embeddings (torch.nn.functional.normalize(embeddings) iirc), then you should get equivalent results.
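
For example, since your ONNX outputs are NumPy arrays, the NumPy equivalent of that normalize call is just dividing each row by its L2 norm; after that, the dot product and cosine similarity agree. A rough sketch, reusing the encode() and variable names from your snippet:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

doc_embeddings = encode(documents)            # unnormalized, straight from the ONNX model
query_embeddings = encode(queriesToEmbed)

# L2-normalize each row
doc_norm = doc_embeddings / np.linalg.norm(doc_embeddings, axis=1, keepdims=True)
query_norm = query_embeddings / np.linalg.norm(query_embeddings, axis=1, keepdims=True)

# dot product of normalized embeddings == cosine similarity of the unnormalized ones
scores_dot = query_norm @ doc_norm.T
scores_cos = cosine_similarity(query_embeddings, doc_embeddings)
print(np.allclose(scores_dot, scores_cos, atol=1e-5))  # True, up to float error

The rows of query_norm and doc_norm are then what should line up with the Sentence Transformers embeddings.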

Also, I want to warn you that ONNX is not strictly faster. I'm actually very curious about your findings in terms of speed! If you could, please let me know what you think. I'm quite curious about the performance of single queries as well as larger batches.

  • Tom Aarsen

The ONNX models are exported and validated according to an atol of 1e-4. Here's the example output for the medium variant:

Validating ONNX model models/Snowflake/snowflake-arctic-embed-m/model.onnx...
    -[✓] ONNX model output names match reference model (last_hidden_state)
    - Validating ONNX Model output "last_hidden_state":
        -[✓] (2, 16, 768) matches (2, 16, 768)
        -[✓] all values close (atol: 0.0001)

So, the slight differences (if any) in your outputs (0.28976789 vs 0.28976774, 0.19071159 vs 0.19071159, 0.38650593 vs 0.38650584, 0.25145516 vs 0.25145516) are due to floating point arithmetic and implementation differences between PyTorch and ONNX Runtime, and should not concern you.
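
If you want to run a similar check against your own exported file, a minimal sketch would be something like the following (the ONNX path is a placeholder; CPU is fine for a correctness check):

import numpy as np
import torch
import onnxruntime as ort
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('Snowflake/snowflake-arctic-embed-l')
model = AutoModel.from_pretrained('Snowflake/snowflake-arctic-embed-l', add_pooling_layer=False)
model.eval()
ort_session = ort.InferenceSession("path/to/snowflake-arctic-embed-large.onnx", providers=['CPUExecutionProvider'])

texts = ['The Data Cloud!', 'Mexico City of Course!']
tokens = tokenizer(texts, padding=True, truncation=True, return_tensors='pt', max_length=512)

# Reference last_hidden_state from the PyTorch model
with torch.no_grad():
    pt_hidden = model(**tokens)[0].numpy()

# Same inputs through ONNX Runtime
onnx_inputs = {k: v.numpy().astype(np.int64) for k, v in tokens.items()}
onnx_hidden = ort_session.run(None, onnx_inputs)[0]

print(np.abs(pt_hidden - onnx_hidden).max())           # should be around 1e-4 or smaller
print(np.allclose(pt_hidden, onnx_hidden, atol=1e-4))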

As for the confusion between the ONNX output and the PyTorch output, this is because (as pointed out by @tomaarsen) you are comparing normalized vs. unnormalized embeddings. The cosine similarities match because the function itself normalizes the embeddings as part of the formula (and it is equivalent to the dot product if both vectors are already normalized).

Got it, I'll see if I can make the embeddings match to enough significant figures after normalizing them. @tomaarsen, as for the benchmark, it took a while but I eventually found the benchmarking blog post I was referring to in my original post: https://ethen8181.github.io/machine-learning/model_deployment/transformers/text_classification_onnxruntime.html

For future reference, for anyone who comes across this thread and wants the final correct code for normalizing the ONNX inference results with the batch processing setup, I've included it below, along with a quick comparison showing that the first dimension of the first query embedding is roughly the same across the methods. The timing results aren't meant as a real benchmark for a lot of reasons, but the code could be adjusted to benchmark on your own machine with the relevant optimizations applied to each setup.

import torch
from transformers import AutoModel, AutoTokenizer
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
import onnxruntime as ort
import time
from sentence_transformers import SentenceTransformer

tokenizer = AutoTokenizer.from_pretrained('Snowflake/snowflake-arctic-embed-l')
model = AutoModel.from_pretrained('Snowflake/snowflake-arctic-embed-l', add_pooling_layer=False)
model.eval()

query_prefix = 'Represent this sentence for searching relevant passages: '
queries  = ['what is snowflake?', 'Where can I get the best tacos?']
queries_with_prefix = ["{}{}".format(query_prefix, i) for i in queries]
query_tokens = tokenizer(queries_with_prefix, padding=True, truncation=True, return_tensors='pt', max_length=512)

documents = ['The Data Cloud!', 'Mexico City of Course!']
document_tokens =  tokenizer(documents, padding=True, truncation=True, return_tensors='pt', max_length=512)

pt_start = time.time()
# Compute token embeddings
with torch.no_grad():
    query_embeddings_pt = model(**query_tokens)[0][:, 0]
    document_embeddings_pt = model(**document_tokens)[0][:, 0]

# normalize embeddings
query_embeddings_pt = torch.nn.functional.normalize(query_embeddings_pt, p=2, dim=1)
document_embeddings_pt = torch.nn.functional.normalize(document_embeddings_pt, p=2, dim=1)

scores = torch.mm(query_embeddings_pt, document_embeddings_pt.transpose(0, 1))
pt_end = time.time()
print(f"PyTorch time: {pt_end - pt_start}")
for query, query_scores in zip(queries, scores):
    doc_score_pairs = list(zip(documents, query_scores))
    doc_score_pairs = sorted(doc_score_pairs, key=lambda x: x[1], reverse=True)
    #Output passages & scores
    for document, score in doc_score_pairs:
        print(f"PyTorch: Score: {float(score):.8f} with Query {query}: '{document}'")

model = SentenceTransformer("Snowflake/snowflake-arctic-embed-l")

queries = ['what is snowflake?', 'Where can I get the best tacos?']
documents = ['The Data Cloud!', 'Mexico City of Course!']

st_start = time.time()
query_embeddings_st = model.encode(queries, prompt_name="query")
document_embeddings_st = model.encode(documents)

scores = query_embeddings_st @ document_embeddings_st.T
st_end = time.time()
print(f"Sentence Transformers time: {st_end - st_start}")
for query, query_scores in zip(queries, scores):
    doc_score_pairs = list(zip(documents, query_scores))
    doc_score_pairs = sorted(doc_score_pairs, key=lambda x: x[1], reverse=True)
    for document, score in doc_score_pairs:
        print(f"Sentence Transformers: Score: {float(score):.8f} with Query {query}: '{document}'")

tokenizer = AutoTokenizer.from_pretrained('Snowflake/snowflake-arctic-embed-l')

# Path to your ONNX model
onnx_model_path = "path/to/onnx/model/download/snowflake-arctic-embed-large.onnx"

# ONNX Runtime session
ort_session = ort.InferenceSession(onnx_model_path, providers=['CUDAExecutionProvider'])

def batch_generator(texts, batch_size=32):
    """Yield batches of tokenized texts."""
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i+batch_size]
        inputs = tokenizer(batch, return_tensors="np", padding=True, truncation=True, max_length=512)
        inputs = {k: v.astype(np.int64) for k, v in inputs.items()}
        yield inputs

def encode_onnx(texts):
    """Encode texts using the ONNX model, processing in batches and normalize the embeddings."""
    embeddings_onnx = []
    for batch_inputs in batch_generator(texts, batch_size=32):
        # Perform inference and collect embeddings
        outputs = ort_session.run(None, batch_inputs)
        # Assuming the first output is the embeddings
        batch_embeddings = outputs[0][:, 0, :]  # Pulling the [CLS] token representation
        # Normalize the embeddings
        norms = np.linalg.norm(batch_embeddings, axis=1, keepdims=True)
        normalized_embeddings = batch_embeddings / norms
        embeddings_onnx.append(normalized_embeddings)
    embeddings_onnx = np.concatenate(embeddings_onnx, axis=0)
    return embeddings_onnx

query_prefix = "Represent this sentence for searching relevant passages: "
queries = ['what is snowflake?', 'Where can I get the best tacos?']
queriesToEmbed = [query_prefix + query for query in queries]
documents = ['The Data Cloud!', 'Mexico City of Course!']

onnx_start = time.time()
# Encode the documents and queries using ONNX
doc_embeddings_onnx = encode_onnx(documents)
query_embeddings_onnx = encode_onnx(queriesToEmbed)  # Each query is treated individually but passed as batch

# Calculate and print the cosine similarities for each query against each document
similarities = cosine_similarity(query_embeddings_onnx, doc_embeddings_onnx)
onnx_end = time.time()
print(f"ONNX time: {onnx_end - onnx_start}")

for i, similarity_scores in enumerate(similarities):
    for j, score in enumerate(similarity_scores):
        print(f"ONNX: Score: {score:.8f} with Query {queries[i]}: for Document: '{documents[j]}'")

print(f"PyTorch dim 1 of first query embedding: {float(query_embeddings_pt[0][0])}")
print(f"Sentence Transformers dim 1 of first query embedding: {float(query_embeddings_st[0][0])}")
print(f"ONNX dim 1 of first query embedding: {float(query_embeddings_onnx[0][0])}")
print(f"ONNX dim 1 - Sentence Transformers dim 1: {float(query_embeddings_onnx[0][0]) - float(query_embeddings_st[0][0])}")
print(f"ONNX dim 1 - PyTorch dim 1: {float(query_embeddings_onnx[0][0]) - float(query_embeddings_pt[0][0])}")

#NOT MEANT AS A REAL BENCHMARK (not even sure if they're all using my GPU..)
# PyTorch time: 0.6768763065338135
# PyTorch: Score: 0.28976813 with Query what is snowflake?: 'The Data Cloud!'
# PyTorch: Score: 0.19071195 with Query what is snowflake?: 'Mexico City of Course!'
# PyTorch: Score: 0.38650578 with Query Where can I get the best tacos?: 'Mexico City of Course!'
# PyTorch: Score: 0.25145516 with Query Where can I get the best tacos?: 'The Data Cloud!'
# Sentence Transformers time: 0.1926882266998291
# Sentence Transformers: Score: 0.28976810 with Query what is snowflake?: 'The Data Cloud!'
# Sentence Transformers: Score: 0.19071192 with Query what is snowflake?: 'Mexico City of Course!'
# Sentence Transformers: Score: 0.38650587 with Query Where can I get the best tacos?: 'Mexico City of Course!'
# Sentence Transformers: Score: 0.25145522 with Query Where can I get the best tacos?: 'The Data Cloud!'
# ONNX time: 0.0952000617980957
# ONNX: Score: 0.28976789 with Query what is snowflake?: for Document: 'The Data Cloud!'
# ONNX: Score: 0.19071157 with Query what is snowflake?: for Document: 'Mexico City of Course!'
# ONNX: Score: 0.25145528 with Query Where can I get the best tacos?: for Document: 'The Data Cloud!'
# ONNX: Score: 0.38650602 with Query Where can I get the best tacos?: for Document: 'Mexico City of Course!'
# PyTorch dim 1 of first query embedding: -0.0056138490326702595
# Sentence Transformers dim 1 of first query embedding: -0.0056138490326702595
# ONNX dim 1 of first query embedding: -0.005613807123154402
# ONNX dim 1 - Sentence Transformers dim 1: 4.190951585769653e-08
# ONNX dim 1 - PyTorch dim 1: 4.190951585769653e-08

I think the first one is indeed not using your GPU (but that's fine; it should perform identically to Sentence Transformers). It's very interesting to see the speed difference at a batch size of 1, e.g. also in the blog post that you linked; I think this is actually quite a common use case. I'll consider how best to support ONNX via Sentence Transformers in a more native way. Thanks for the inspiration and details!

  • Tom Aarsen
Snowflake org

@williambarberjr can you make a PR with this as a suggestion on how to run inference with ONNX?
