Exporting model with optimum, but optimum does not take pooling and dense layers into account.

#4
by canavar - opened

Running optimum-cli export onnx -m sentence-transformers/clip-ViT-B-32-multilingual-v1 --task feature-extraction models/clip_vit_multilingual_onnx

with the following output:

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
    - Avoid using `tokenizers` before the fork if possible
    - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Framework not specified. Using pt to export to ONNX.
Using the export variant default. Available variants are:
    - default: The default ONNX variant.
Using framework PyTorch: 2.1.0
[...transformers/models/distilbert/modeling_distilbert.py:223): TracerWarning: torch.tensor results are registered as constants in the trace. You can safely ignore this warning if you use this function to create tensors out of constant variables that would be the same every time you call this function. In any other case, this might cause the trace to be incorrect.
  mask, torch.tensor(torch.finfo(scores.dtype).min)
Post-processing the exported models...
Deduplicating shared (tied) weights...
Validating ONNX model models/clip_vit_multilingual_onnx/model.onnx...
    -[βœ“] ONNX model output names match reference model (last_hidden_state)
    - Validating ONNX Model output "last_hidden_state":
        -[βœ“] (2, 16, 768) matches (2, 16, 768)
        -[βœ“] all values close (atol: 0.0001)
The ONNX export succeeded and the exported model was saved at: models/clip_vit_multilingual_onnx

Importing the exported model and running it:

from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForFeatureExtraction

tokenizer = AutoTokenizer.from_pretrained("models/clip_vit_multilingual_onnx")
model = ORTModelForFeatureExtraction.from_pretrained("models/clip_vit_multilingual_onnx")

inputs = tokenizer("What am I using?", "Using DistilBERT with ONNX Runtime!", return_tensors="pt")
outputs = model(**inputs)
last_hidden_state = outputs.last_hidden_state
list(last_hidden_state.shape)

gives an output shape that is consistent with the export log (the validation there used a different input and showed (2, 16, 768); here it is [1, 21, 768]), confirming that the exported model returns only the raw 768-dimensional last_hidden_state:

[1, 21, 768]
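
For reference, the missing Pooling step is essentially a mean over the token axis weighted by the attention mask. A minimal sketch of reproducing it on top of the raw ONNX output (reusing inputs and last_hidden_state from the snippet above); note the 768 -> 512 Dense projection would still be missing:

import torch

# mean pooling over tokens, weighted by the attention mask
# (roughly what sentence-transformers' Pooling module with pooling_mode_mean_tokens=True does)
mask = inputs["attention_mask"].unsqueeze(-1).float()        # [1, 21, 1]
summed = (last_hidden_state * mask).sum(dim=1)               # [1, 768]
counts = mask.sum(dim=1).clamp(min=1e-9)                     # [1, 1]
sentence_embedding = summed / counts                         # [1, 768], still 768-dim, not 512
print(sentence_embedding.shape)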

If I use the same model from sentence-transformers:

from sentence_transformers import SentenceTransformer, util
import torch

text_model = SentenceTransformer('sentence-transformers/clip-ViT-B-32-multilingual-v1')
texts = [
    "What am I using?", "Using DistilBERT with ONNX Runtime!", # Spanish: a beach with palm trees
]
text_embeddings = text_model.encode(texts)
print(text_embeddings.shape)

output:

(2, 512)

as documented at https://huggingface.co/sentence-transformers/clip-ViT-B-32-multilingual-v1:

SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: DistilBertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
  (2): Dense({'in_features': 768, 'out_features': 512, 'bias': False, 'activation_function': 'torch.nn.modules.linear.Identity'})
)

According to the model card and modules.json, the sentence-transformers/clip-ViT-B-32-multilingual-v1 PyTorch model has Pooling and Dense modules on top of the transformer, and the Dense layer projects the output from dim 768 down to 512.
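
As a quick check of that 768 -> 512 projection, the weights shipped in the repo's 2_Dense folder can be inspected directly (a minimal sketch; the path below is an assumption for wherever the repo was downloaded locally, like the 2_Dense/pytorch_model.bin used further down):

import torch

# inspect the Dense module weights that ship next to pytorch_model.bin
# (adjust the path to wherever the repo was downloaded)
state_dict = torch.load("clip-ViT-B-32-multilingual-v1/2_Dense/pytorch_model.bin", map_location="cpu")
print({name: tuple(tensor.shape) for name, tensor in state_dict.items()})
# should show a single bias-free linear weight of shape (512, 768)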

When exporting with optimum-cli, the Pooling and Dense parts are apparently ignored, resulting in an output shape of [1, 21, 768].

Looking at the optimum library, I can't find any information about including pooling and dense layers. optimum-cli obviously does not check the 1_Pooling or 2_Dense folders at https://huggingface.co/sentence-transformers/clip-ViT-B-32-multilingual-v1/tree/main next to pytorch_model.bin.

I am looking for a way to combine the 1_Pooling and 2_Dense layers into the exported ONNX model. I tried to create a new model including all the layers, with the Transformer layer coming from the pretrained model and the pooling and dense layers newly created:

from transformers import AutoModel, AutoTokenizer
from sentence_transformers import SentenceTransformer, models
import torch.nn as nn

word_embedding_model = AutoModel.from_pretrained("sentence-transformers/clip-ViT-B-32-multilingual-v1")
pooling_model = models.Pooling(768, 
                               pooling_mode_mean_tokens=True, 
                               pooling_mode_cls_token=False, 
                               pooling_mode_max_tokens=False,
                               pooling_mode_mean_sqrt_len_tokens=False)
dense_model = models.Dense(in_features=768, 
                           out_features=512,
                           bias=False, 
                           activation_function=nn.Identity())

model = SentenceTransformer(modules=[word_embedding_model, pooling_model, dense_model])

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/clip-ViT-B-32-multilingual-v1")

inputs = tokenizer("What am I using?", "Using DistilBERT with ONNX Runtime!", return_tensors="pt")

input_ids = inputs['input_ids']
input_names = list(inputs.keys())

print(input_names)
# ['input_ids', 'attention_mask']

# forward 
outputs = model(input_ids)

When outputs = model(input_ids) runs, I get the following error:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
download_optimize_model.ipynb Cell 30 line 3
     28 print(input_names)
     30 # Export the model
---> 31 outputs = model(input_ids)
     32 last_hidden_state = outputs.last_hidden_state
     34 # Export the model

File /opt/conda/envs/embed/lib/python3.11/site-packages/torch/nn/modules/module.py:1518, in Module._wrapped_call_impl(self, *args, **kwargs)
   1516     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1517 else:
-> 1518     return self._call_impl(*args, **kwargs)

File /opt/conda/envs/embed/lib/python3.11/site-packages/torch/nn/modules/module.py:1527, in Module._call_impl(self, *args, **kwargs)
   1522 # If we don't have any hooks, we want to skip the rest of the logic in
   1523 # this function, and just call forward.
   1524 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1525         or _global_backward_pre_hooks or _global_backward_hooks
   1526         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1527     return forward_call(*args, **kwargs)
   1529 try:
   1530     result = None

File /opt/conda/envs/embed/lib/python3.11/site-packages/torch/nn/modules/container.py:215, in Sequential.forward(self, input)
...
--> 401     return inner_dict[k]
    402 else:
    403     return self.to_tuple()[k] 


KeyError: 'token_embeddings'

The KeyError presumably comes from the fact that a plain AutoModel returns a standard transformers output object, without the 'token_embeddings' key that sentence-transformers' Pooling module expects. The question is: can I combine the three layers into one and export that single model, consisting of the transformer, pooling and dense layers?

OK, I finally got it working. The HF optimum ONNX export can export this model with (0) Transformer and (1) Pooling, but it cannot extend it with the provided dense layer. What I did was create a model that combines the three layers as follows:

CombinedModel

from sentence_transformers import SentenceTransformer
from sentence_transformers import models
import torch
import torch.nn as nn
import onnx
import numpy as np

class CombinedModel(nn.Module):
    def __init__(self, transformer_model, dense_model):
        super(CombinedModel, self).__init__()
        self.transformer = transformer_model
        self.dense = dense_model

    def forward(self, input_ids, attention_mask):
        # run the full SentenceTransformer forward and take the per-token embeddings
        outputs = self.transformer({'input_ids': input_ids, 'attention_mask': attention_mask})
        token_embeddings = outputs['token_embeddings']
        # the Dense module reads its input from the 'sentence_embedding' key
        dense_output = self.dense({'sentence_embedding': token_embeddings})
        dense_output_tensor = dense_output['sentence_embedding']

        ### this was important for me: it took me a while to figure out that the
        ### original model's output corresponds to the mean of the dense output
        mean_output = torch.mean(dense_output_tensor, dim=1)
        flattened_output = mean_output.squeeze(0)
        return flattened_output
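
A note on that mean: the original model pools first (mean over tokens) and then applies Dense, while CombinedModel applies Dense to the token embeddings and then takes the mean. Because this Dense module is a bias-free linear layer with an Identity activation, the two orders give the same result (a linear map commutes with averaging), and with a single unpadded sequence the plain mean also matches the masked mean pooling. A small standalone sketch of that property (not part of the original code):

import torch
import torch.nn as nn

torch.manual_seed(0)
dense = nn.Linear(768, 512, bias=False)      # same shape as the 2_Dense module
tokens = torch.randn(1, 21, 768)             # dummy token embeddings

pool_then_project = dense(tokens.mean(dim=1))   # original order: Pooling -> Dense
project_then_pool = dense(tokens).mean(dim=1)   # CombinedModel order: Dense -> mean

print(torch.allclose(pool_then_project, project_then_pool, atol=1e-5))  # True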

Combine dense with original model

transformer_model = SentenceTransformer('clip-ViT-B-32-multilingual-v1', cache_folder='model_pytorch')
tokenizer = transformer_model.tokenizer

### this is from dense model configuration
dense_model = models.Dense(
    in_features=768,
    out_features=512,
    bias=False,
    activation_function= nn.Identity()
)

### load the weights from dense model binary
state_dict = torch.load('model_pytorch/sentence-transformers_clip-ViT-B-32-multilingual-v1/2_Dense/pytorch_model.bin')
dense_model.load_state_dict(state_dict)

model = CombinedModel(transformer_model, dense_model)

Export combined model to onnx

model.eval()

input_text = "This is a multi-lingual version of the OpenAI CLIP-ViT-B32 model. You can map text (in 50+ languages) and images to a common dense vector space such that images and the matching texts are close."

inputs = tokenizer(input_text, padding='longest', truncation=True, max_length=128, return_tensors='pt')
input_ids = inputs['input_ids']
attention_mask = inputs['attention_mask']

# Export the model
torch.onnx.export(model,               # model being run
                  (input_ids, attention_mask), # model input (or a tuple for multiple inputs)
                  "combined_model.onnx", # where to save the model (can be a file or file-like object)
                  export_params=True,        # store the trained parameter weights inside the model file
                  opset_version=17,          # the ONNX version to export the model to
                  do_constant_folding=True,  # whether to execute constant folding for optimization
                  input_names = ['input_ids', 'attention_mask'],   # the model's input names
                  output_names = ['output'], # the model's output names
                  dynamic_axes={'input_ids': {0 : 'batch_size', 1: 'seq_length'},    # variable length axes
                                'attention_mask': {0 : 'batch_size', 1: 'seq_length'},
                                'output' : {0 : 'batch_size'}})

onnx.checker.check_model("combined_model.onnx")
combined_model = onnx.load("combined_model.onnx")
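
Optionally, the exported graph's input and output names can be double-checked with onnxruntime before running the comparison below:

import onnxruntime as ort

sess = ort.InferenceSession("combined_model.onnx")
print([i.name for i in sess.get_inputs()])    # expected: ['input_ids', 'attention_mask']
print([o.name for o in sess.get_outputs()])   # expected: ['output']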

Compare the original and ONNX model outputs:

import torch
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('sentence-transformers/clip-ViT-B-32-multilingual-v1')
tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/clip-ViT-B-32-multilingual-v1')

# Prepare the input
text = "This is an example sentence."
inputs = tokenizer(text, padding='longest', truncation=True, max_length=128, return_tensors='pt')

# Run the PyTorch model
pytorch_output =  model.encode(text, convert_to_tensor=True, device='cpu')

# Convert the inputs to numpy arrays for the ONNX model
inputs_onnx = {name: tensor.numpy() for name, tensor in inputs.items()}

# Run the ONNX model
sess = ort.InferenceSession("combined_model.onnx")
onnx_output = sess.run(None, inputs_onnx)

# Compare the outputs
print("Are the outputs close?", np.allclose(pytorch_output.detach().numpy(), onnx_output[0], atol=1e-6))

# Calculate the differences between the outputs
differences = pytorch_output.detach().numpy() - onnx_output[0]

# Print the standard deviation of the differences
print("Standard deviation of the differences:", np.std(differences))

print("pytorch_output size:", pytorch_output.size())
print("onnx_output size:", onnx_output[0].shape)

Output:

Are the outputs close? True
Standard deviation of the differences: 1.6167593e-07
pytorch_output size: torch.Size([512])
onnx_output size: (512,)

I would really like to contribute the ONNX model so that novices like me can use the ONNX version easily.

Anyone in the void?
