The ONNX files in this PR were exported with the following command:

$ optimum-cli export onnx \
  --model grammarly/coedit-large \
  --task text2text-generation-with-past \
  --optimize O3 \
  coedit-large-onnx/
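
For reference, roughly the same export can be done from Python with optimum's export=True flag. This is only a sketch: unlike the CLI run above, it does not apply the O3 graph optimizations from --optimize O3.

from optimum.onnxruntime import ORTModelForSeq2SeqLM

# Convert the PyTorch checkpoint to ONNX on the fly and save it locally.
# use_cache=True keeps the with-past (KV cache) decoder variant.
model = ORTModelForSeq2SeqLM.from_pretrained(
    "grammarly/coedit-large",
    export=True,
    use_cache=True,
)
model.save_pretrained("coedit-large-onnx")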

There were a few warnings, but the diffs seem small enough:

Validation for the model coedit-large-onnx/decoder_model_merged.onnx raised: The maximum absolute difference between the output of the reference model and the ONNX exported model is not within the set tolerance 1e-05:
- present.0.encoder.key: max diff = 3.0517578125e-05
- present.2.decoder.key: max diff = 1.1920928955078125e-05
- present.2.decoder.value: max diff = 1.2740492820739746e-05
...
The ONNX export succeeded with the warning: The maximum absolute difference between the output of the reference model and the ONNX exported model is not within the set tolerance 1e-05:
- logits: max diff = 0.0004119873046875
- present.23.decoder.value: max diff = 6.103515625e-05.
The exported model was saved at: coedit-large-onnx
...
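
To double-check that the reported diffs are negligible, here is a minimal sketch that compares the logits of a single forward pass between the exported model and the PyTorch model. The prompt and the local directory name coedit-large-onnx/ are just example values, and this check is mine, not part of the export tooling.

import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration
from optimum.onnxruntime import ORTModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("grammarly/coedit-large")
torch_model = T5ForConditionalGeneration.from_pretrained("grammarly/coedit-large")
onnx_model = ORTModelForSeq2SeqLM.from_pretrained("coedit-large-onnx")

enc = tokenizer("Fix the grammar: I goes to school.", return_tensors="pt")
# T5 primes the decoder with its decoder_start_token_id (the pad token).
decoder_input_ids = torch.full((1, 1), torch_model.config.decoder_start_token_id, dtype=torch.long)

with torch.no_grad():
    ref_logits = torch_model(
        input_ids=enc.input_ids,
        attention_mask=enc.attention_mask,
        decoder_input_ids=decoder_input_ids,
    ).logits
onnx_logits = onnx_model(
    input_ids=enc.input_ids,
    attention_mask=enc.attention_mask,
    decoder_input_ids=decoder_input_ids,
).logits

print("max abs logits diff:", (ref_logits - onnx_logits).abs().max().item())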

I tested it with the code below. The ONNX model is about 1.8x faster on CPU than the transformers implementation (399 ms vs. 723 ms wall time).

In [1]: from transformers import AutoTokenizer, T5ForConditionalGeneration

In [2]: from optimum.onnxruntime import ORTModelForSeq2SeqLM

In [4]: model = ORTModelForSeq2SeqLM.from_pretrained('./onnx', device="auto")

In [6]: torch_model = T5ForConditionalGeneration.from_pretrained("Grammarly/coedit-large")

In [7]: text = "Rewrite to make this easier to understand: A storm surge is what forecasters consider a hurricane's most treacherous aspect."

In [9]: tokenizer = AutoTokenizer.from_pretrained("grammarly/coedit-large")

In [10]: input_ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)

In [11]:  %time outputs = model.generate(input_ids=input_ids)
/opt/homebrew/Caskroom/miniconda/base/lib/python3.11/site-packages/transformers/generation/utils.py:1273: UserWarning: Using the model-agnostic default `max_length` (=20) to control the generation length. We recommend setting `max_new_tokens` to control the maximum length of the generation.
  warnings.warn(
CPU times: user 2.2 s, sys: 178 ms, total: 2.38 s
Wall time: 399 ms

In [12]: %time torch_outputs = torch_model.generate(input_ids=input_ids)
/opt/homebrew/Caskroom/miniconda/base/lib/python3.11/site-packages/transformers/generation/utils.py:1273: UserWarning: Using the model-agnostic default `max_length` (=20) to control the generation length. We recommend setting `max_new_tokens` to control the maximum length of the generation.
  warnings.warn(
CPU times: user 721 ms, sys: 28.7 ms, total: 750 ms
Wall time: 723 ms

In [13]: torch_outputs == outputs
Out[13]:
tensor([[True, True, True, True, True, True, True, True, True, True, True, True,
         True, True, True, True, True, True]])

In [14]: tokenizer.decode(outputs[0])
Out[14]: "<pad> It is what they consider to be a hurricane's most dangerous aspect.</s>"
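
The max_length warning above is harmless; passing max_new_tokens explicitly silences it. A quick sketch (the value 64 is just an example):

outputs = model.generate(input_ids=input_ids, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))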