
Files were created with this command:

$ optimum-cli export onnx \
  --model grammarly/coedit-large \
  --task text2text-generation-with-past \
  --optimize O3 \
  coedit-large-onnx/
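
The same export can also be run from Python via optimum's main_export. A rough sketch, assuming main_export accepts the same task and optimize options as the CLI flags above:

from optimum.exporters.onnx import main_export

# Programmatic equivalent of the optimum-cli invocation above.
# Assumption: the `task` and `optimize` keyword arguments mirror the
# --task and --optimize CLI flags.
main_export(
    model_name_or_path="grammarly/coedit-large",
    output="coedit-large-onnx/",
    task="text2text-generation-with-past",
    optimize="O3",
)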

There were a few warnings, but the diffs seem small enough:

Validation for the model coedit-large-onnx/decoder_model_merged.onnx raised: The maximum absolute difference between the output of the reference model and the ONNX exported model is not within the set tolerance 1e-05:
- present.0.encoder.key: max diff = 3.0517578125e-05
- present.2.decoder.key: max diff = 1.1920928955078125e-05
- present.2.decoder.value: max diff = 1.2740492820739746e-05
...
The ONNX export succeeded with the warning: The maximum absolute difference between the output of the reference model and the ONNX exported model is not within the set tolerance 1e-05:
- logits: max diff = 0.0004119873046875
- present.23.decoder.value: max diff = 6.103515625e-05.
The exported model was saved at: coedit-large-onnx
...
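
To sanity-check that the diffs stay around the ~1e-4 reported above on a real input, a comparison like the following can be run (a rough sketch; the prompt and tolerance are arbitrary):

import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration
from optimum.onnxruntime import ORTModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("grammarly/coedit-large")
torch_model = T5ForConditionalGeneration.from_pretrained("grammarly/coedit-large")
onnx_model = ORTModelForSeq2SeqLM.from_pretrained("coedit-large-onnx/")

enc = tokenizer("Fix the grammar: I goes to school everyday.", return_tensors="pt")
# T5 starts decoding from the pad token (decoder_start_token_id).
decoder_input_ids = torch.tensor([[torch_model.config.decoder_start_token_id]])

with torch.no_grad():
    ref_logits = torch_model(**enc, decoder_input_ids=decoder_input_ids).logits
onnx_logits = onnx_model(**enc, decoder_input_ids=decoder_input_ids).logits

print("max abs diff:", (ref_logits - onnx_logits).abs().max().item())
print("allclose at 1e-3:", torch.allclose(ref_logits, onnx_logits, atol=1e-3))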

I tested it with the code below. On CPU, the ONNX model is about 1.8x faster than the transformers implementation (399 ms vs. 723 ms wall time for the same generation).

In [1]: from transformers import AutoTokenizer, T5ForConditionalGeneration

In [2]: from optimum.onnxruntime import ORTModelForSeq2SeqLM

In [4]: model = ORTModelForSeq2SeqLM.from_pretrained('./onnx',  device="auto")

In [6]: torch_model = T5ForConditionalGeneration.from_pretrained("Grammarly/coedit-large")

In [7]: text = "Rewrite to make this easier to understand: A storm surge is what forecasters consider a hurricane's most treacherous aspect."

In [9]: tokenizer = AutoTokenizer.from_pretrained("grammarly/coedit-large")

In [10]: input_ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)

In [11]:  %time outputs = model.generate(input_ids=input_ids)
/opt/homebrew/Caskroom/miniconda/base/lib/python3.11/site-packages/transformers/generation/utils.py:1273: UserWarning: Using the model-agnostic default `max_length` (=20) to control the generation length. We recommend setting `max_new_tokens` to control the maximum length of the generation.
  warnings.warn(
CPU times: user 2.2 s, sys: 178 ms, total: 2.38 s
Wall time: 399 ms

In [12]: %time torch_outputs = torch_model.generate(input_ids=input_ids)
/opt/homebrew/Caskroom/miniconda/base/lib/python3.11/site-packages/transformers/generation/utils.py:1273: UserWarning: Using the model-agnostic default `max_length` (=20) to control the generation length. We recommend setting `max_new_tokens` to control the maximum length of the generation.
  warnings.warn(
CPU times: user 721 ms, sys: 28.7 ms, total: 750 ms
Wall time: 723 ms

In [13]: torch_outputs == outputs
Out[13]:
tensor([[True, True, True, True, True, True, True, True, True, True, True, True,
         True, True, True, True, True, True]])

In [14]: tokenizer.decode(outputs[0])
Out[14]: "<pad> It is what they consider to be a hurricane's most dangerous aspect.</s>"
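
As an aside, the UserWarning above can be avoided by setting an explicit generation budget, as the warning itself suggests (the exact value below is arbitrary):

# Pass max_new_tokens instead of relying on the model-agnostic default max_length.
outputs = model.generate(input_ids=input_ids, max_new_tokens=64)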
jbochi changed pull request status to open