Upload ONNX weights

#3
opened by Xenova

Conversion code:

import os
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.preprocessing import normalize

query_prompt = "Instruct: Given a web search query, retrieve relevant passages that answer the query.\nQuery: "
queries = [
    "What are some ways to reduce stress?",
    "What are the benefits of drinking green tea?",
]
queries = [query_prompt + query for query in queries]
# docs do not need any prompts
docs = [
    "There are many effective ways to reduce stress. Some common techniques include deep breathing, meditation, and physical activity. Engaging in hobbies, spending time in nature, and connecting with loved ones can also help alleviate stress. Additionally, setting boundaries, practicing self-care, and learning to say no can prevent stress from building up.",
    "Green tea has been consumed for centuries and is known for its potential health benefits. It contains antioxidants that may help protect the body against damage caused by free radicals. Regular consumption of green tea has been associated with improved heart health, enhanced cognitive function, and a reduced risk of certain types of cancer. The polyphenols in green tea may also have anti-inflammatory and weight loss properties.",
]

# The path of your model after cloning it
model_dir = "./stella_en_400M_v5"

vector_dim = 1024
vector_linear_directory = f"2_Dense_{vector_dim}"
model = AutoModel.from_pretrained(model_dir, trust_remote_code=True, use_memory_efficient_attention=False, unpad_inputs=False).eval()
tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
# Dense head that projects hidden states to the final 1024-dim embedding
vector_linear = torch.nn.Linear(in_features=model.config.hidden_size, out_features=vector_dim)
# Load the projection weights from 2_Dense_1024, stripping the "linear." key prefix
vector_linear_dict = {
    k.replace("linear.", ""): v for k, v in
    torch.load(os.path.join(model_dir, f"{vector_linear_directory}/pytorch_model.bin"), map_location=torch.device('cpu')).items()
}
vector_linear.load_state_dict(vector_linear_dict)
vector_linear.eval()

# Attach the dense head so it is captured in the exported graph
model.vector_linear = vector_linear
original_forward = model.forward
def patched_forward(input_ids, attention_mask, token_type_ids):
    last_hidden_state = original_forward(input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)[0]
    # Mean-pool over non-padding tokens only
    last_hidden = last_hidden_state.masked_fill(~attention_mask[..., None].bool(), 0.0)
    query_vectors = last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]
    # Project the pooled vector to the final embedding dimension
    return model.vector_linear(query_vectors)
model.forward = patched_forward

# Embed the queries
with torch.no_grad():
    input_data = tokenizer(queries, padding="longest", truncation=True, max_length=512, return_tensors="pt")
    outputs = model(**input_data)
    query_vectors = normalize(outputs.cpu().numpy())

# Embed the documents
with torch.no_grad():
    input_data = tokenizer(docs, padding="longest", truncation=True, max_length=512, return_tensors="pt")
    outputs = model(**input_data)
    docs_vectors = normalize(outputs.cpu().numpy())

print(query_vectors.shape, docs_vectors.shape)
# (2, 1024) (2, 1024)

similarities = query_vectors @ docs_vectors.T
print(similarities)
# [[0.8397531  0.29900077]
#  [0.32818374 0.80954516]]

Followed by:

input_data = tokenizer(queries, padding="longest", truncation=True, max_length=512, return_tensors="pt")

# Export the model
torch.onnx.export(model,               # model being run
                  (input_data['input_ids'], input_data['attention_mask'], input_data['token_type_ids']), # model input (or a tuple for multiple inputs)
                  "model.onnx",   # where to save the model (can be a file or file-like object)
                  export_params=True,        # store the trained parameter weights inside the model file
                  opset_version=14,          # the ONNX version to export the model to
                  do_constant_folding=True,  # whether to execute constant folding for optimization
                  input_names = ['input_ids', 'attention_mask', 'token_type_ids'],   # the model's input names
                  output_names = ['sentence_embedding'], # the model's output names
                  dynamic_axes={
                    "input_ids": {0: "batch_size", 1: "sequence_length"},
                    "attention_mask": {0: "batch_size", 1: "sequence_length"},
                    "token_type_ids": {0: "batch_size", 1: "sequence_length"},
                    "sentence_embedding": {0: "batch_size"},
                  }
)

and then simplified with ONNXSlim.
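For completeness, a minimal sketch of that simplification step, assuming the onnxslim Python package (the output filename model_slim.onnx is illustrative):

# Sketch of the ONNXSlim step; onnxslim also ships a CLI
# (onnxslim model.onnx model_slim.onnx). Filenames are illustrative.
import onnx
from onnxslim import slim

slimmed = slim("model.onnx")  # returns a simplified onnx.ModelProto
onnx.save(slimmed, "model_slim.onnx")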

@Xenova Would it be possible to publish this to your account so that we users could use it without waiting for the PR to be merged? Or if you could kindly tell me how to pull a PR to my local machine, that'd be greatly appreciated! (I have tried pulling the refs/pr/3 branch but it didn't work.)
Thank you very much in advance!

@netw0rkf10w in your code, you should be able to specify revision='refs/pr/3'. Which library are you running the model with? If Transformers.js, you can specify { revision: 'refs/pr/3' } as an option.
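For example, with the transformers Python library (a sketch; the repo id below is assumed from context):

from transformers import AutoModel

# Load the model directly from the PR branch; repo id assumed from context
model = AutoModel.from_pretrained(
    "dunzhang/stella_en_400M_v5",
    revision="refs/pr/3",
    trust_remote_code=True,
)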

@Xenova Thanks for the prompt reply!
I'm using https://github.com/huggingface/text-embeddings-inference and I would like to download the model to a specific local folder before loading it for inference. I'm downloading your ONNX files manually for now, but it would be great if HF could provide a way to check out a PR branch.
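A PR ref can also be fetched into a local folder with huggingface_hub's snapshot_download (a minimal sketch; the repo id is assumed from context):

from huggingface_hub import snapshot_download

# Download the files from the PR branch into a local directory
snapshot_download(
    repo_id="dunzhang/stella_en_400M_v5",
    revision="refs/pr/3",
    local_dir="./stella_en_400M_v5",
)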

@Xenova Unfortunately I got an error when loading your ONNX files:

2024-07-25T12:40:25.316131Z  INFO text_embeddings_router: router/src/lib.rs:241: Starting model backend
Error: Could not create backend

Caused by:
    Could not start backend: Failed to create ONNX Runtime session: Load model from /data/stella_en_400M_v5/onnx/model.onnx failed:/home/runner/work/onnxruntime-build/onnxruntime-build/onnxruntime/onnxruntime/core/graph/model.cc:179 onnxruntime::Model::Model(onnx::ModelProto&&, const PathString&, const IOnnxRuntimeOpSchemaRegistryList*, const onnxruntime::logging::Logger&, const onnxruntime::ModelOptions&) Unsupported model IR version: 10, max supported IR version: 9

I guess this is because you used IR version 10 to create the ONNX file, is that correct?

Right, you just need to upgrade your version of onnxruntime/onnx :)
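A quick way to check which IR version a file declares, as a minimal sketch with the onnx Python package (the exact minimum onnxruntime version needed for IR version 10 is not stated in this thread):

import onnx

# Inspect the IR version and opsets of the exported file
m = onnx.load("model.onnx")
print("IR version:", m.ir_version)
print("Opsets:", [(o.domain, o.version) for o in m.opset_import])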

@Xenova Thanks a lot for your help!

@netw0rkf10w I got around the unsupported model IR version error in text-embeddings-inference by updating ort from version 2.0.0-rc.2 to 2.0.0-rc.4 in backends/ort/Cargo.toml, then rebuilding the Docker container using the project's Dockerfile.

However, now I've got the following error after it tries to load the model:

Unknown output keys: [Output { name: "sentence_embedding", output_type: Tensor { ty: Float32, dimensions: [-1, 1024] } }]

Does anyone know how to get around this, or has anyone gotten this ONNX version to run under TEI?

@randai2 Yes, upgrading the ort package to 2.0.0-rc.4 seems to be the way to go. I also posted this suggestion in the TEI repo: https://github.com/huggingface/text-embeddings-inference/issues/355

Unfortunately I'm tied up with some other urgent work, so I haven't tried that yet, but I would suggest asking the question in the TEI repo.

@randai2 Let's vote for the support of this model in TEI: https://github.com/huggingface/text-embeddings-inference/issues/359

@netw0rkf10w @randai2 I made PR #361 that adds support for model IR version 10!

In the meantime, TEI reads the output from the layer named last_hidden_state (or token_embeddings) for embedding models; you can check the code. So, to run the ONNX model with TEI, the output_names should be last_hidden_state (i.e. the output of original_forward in the code above) instead of sentence_embedding, I guess.
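For instance, a hedged sketch of re-exporting with a TEI-compatible output name, reusing original_forward and input_data from the conversion code above (the filename model_tei.onnx is illustrative; note this exports the raw transformer output, so TEI performs pooling itself and the 2_Dense_1024 projection is not applied):

# Export the raw transformer output under the name TEI expects.
# Caveat: this bypasses the mean pooling and the dense projection head.
def tei_forward(input_ids, attention_mask, token_type_ids):
    return original_forward(input_ids=input_ids, attention_mask=attention_mask,
                            token_type_ids=token_type_ids)[0]
model.forward = tei_forward

torch.onnx.export(
    model,
    (input_data["input_ids"], input_data["attention_mask"], input_data["token_type_ids"]),
    "model_tei.onnx",  # illustrative filename
    export_params=True,
    opset_version=14,
    do_constant_folding=True,
    input_names=["input_ids", "attention_mask", "token_type_ids"],
    output_names=["last_hidden_state"],  # the output name TEI looks for
    dynamic_axes={
        "input_ids": {0: "batch_size", 1: "sequence_length"},
        "attention_mask": {0: "batch_size", 1: "sequence_length"},
        "token_type_ids": {0: "batch_size", 1: "sequence_length"},
        "last_hidden_state": {0: "batch_size", 1: "sequence_length"},
    },
)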
