ORTModel inference latency

#1

by Lvxue

Hi, I'm trying to reduce M2M100 model inference latency. Following the guide and the discussion, my code is as follows:

from transformers import M2M100Tokenizer, M2M100ForConditionalGeneration
from optimum.onnxruntime import ORTModelForSeq2SeqLM
import time
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")

# PyTorch baseline model
model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M").to(device)
# ONNX Runtime model, exported from the transformers checkpoint on the fly
model_ort = ORTModelForSeq2SeqLM.from_pretrained("facebook/m2m100_418M", from_transformers=True).to(device)

tokenizer.src_lang = "en"
inputs = tokenizer("Good Morning!", return_tensors="pt").to(device)

for i in range(10):
    # time the PyTorch model
    st = time.perf_counter()
    gen_tokens = model.generate(**inputs, forced_bos_token_id=tokenizer.get_lang_id("zh"))
    et = time.perf_counter()

    # time the ONNX Runtime model
    st2 = time.perf_counter()
    gen_tokens_ort = model_ort.generate(**inputs, forced_bos_token_id=tokenizer.get_lang_id("zh"))
    et2 = time.perf_counter()
    print(f"model latency = {et - st}, ORT latency = {et2 - st2}")

outputs = tokenizer.batch_decode(gen_tokens_ort)

But the result is surprising: the ORT model's average latency is about 0.45s, while the original model's average latency is about 0.1s. I'm confused because the ORT model should be faster than the original model. I would really appreciate it if someone could help me out!
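
For reference, a more careful comparison would discard a few warm-up runs and average over many iterations; here is a minimal sketch that reuses the model, model_ort, inputs and tokenizer objects defined above:

import time

def avg_latency(generate_fn, n_runs=20, n_warmup=3):
    # discard warm-up runs (CUDA initialization, caches), then average the rest
    for _ in range(n_warmup):
        generate_fn()
    times = []
    for _ in range(n_runs):
        st = time.perf_counter()
        generate_fn()
        times.append(time.perf_counter() - st)
    return sum(times) / len(times)

pt_latency = avg_latency(lambda: model.generate(**inputs, forced_bos_token_id=tokenizer.get_lang_id("zh")))
ort_latency = avg_latency(lambda: model_ort.generate(**inputs, forced_bos_token_id=tokenizer.get_lang_id("zh")))
print(f"PyTorch avg latency = {pt_latency:.3f}s, ORT avg latency = {ort_latency:.3f}s")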

Hugging Face Optimum org

Hey @Lvxue, are you using a GPU? We've indeed had reports of slowdowns with ONNX Runtime on GPU. The issue was communication overhead between the CPU and GPU.

This PR, which fixes the issue, was merged, and the next Optimum release will include the fix!

In the meantime, if you want to try it, you can use the dev branch with pip uninstall optimum && pip install git+https://github.com/huggingface/optimum.git#egg=optimum[onnxruntime-gpu].
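
Once the dev branch is installed, you can rerun the same comparison. As a minimal sketch, reusing the inputs and tokenizer from your snippet (the provider keyword argument is an assumption that depends on the Optimum version you end up with; otherwise the .to(device) call from your snippet works the same way):

from optimum.onnxruntime import ORTModelForSeq2SeqLM

# export the checkpoint to ONNX and run it on GPU via the CUDA execution provider
model_ort = ORTModelForSeq2SeqLM.from_pretrained(
    "facebook/m2m100_418M",
    from_transformers=True,
    provider="CUDAExecutionProvider",  # assumption: available in the installed version
)

gen_tokens = model_ort.generate(**inputs, forced_bos_token_id=tokenizer.get_lang_id("zh"))
print(tokenizer.batch_decode(gen_tokens, skip_special_tokens=True))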

There have been reports of great speedups (for example here and here), and we'll soon put up more documentation on what speedups to expect.
