ORTModel inference latency

#1

by Lvxue

Hi, I'm trying to reduce M2M100 model inference latency. Following the guide and the discussion, my code is as follows:

from transformers import M2M100Tokenizer, M2M100ForConditionalGeneration
from optimum.onnxruntime import ORTModelForSeq2SeqLM
import time
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")

# PyTorch baseline model
model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M").to(device)
# ONNX Runtime model, exported from the transformers checkpoint on the fly
model_ort = ORTModelForSeq2SeqLM.from_pretrained("facebook/m2m100_418M", from_transformers=True).to(device)

tokenizer.src_lang = "en"
inputs = tokenizer("Good Morning!", return_tensors="pt").to(device)

for i in range(10):
    # time the PyTorch model
    st = time.perf_counter()
    gen_tokens = model.generate(**inputs, forced_bos_token_id=tokenizer.get_lang_id("zh"))
    et = time.perf_counter()

    # time the ONNX Runtime model
    st2 = time.perf_counter()
    gen_tokens_ort = model_ort.generate(**inputs, forced_bos_token_id=tokenizer.get_lang_id("zh"))
    et2 = time.perf_counter()
    print(f"model latency = {et - st}, ORT latency = {et2 - st2}")

outputs = tokenizer.batch_decode(gen_tokens_ort)

But the result is surprising: the ORT model's average latency is about 0.45s, while the original model's average latency is about 0.1s. I'm confused because the ORT model should be faster than the original model. I would really appreciate it if someone could help me out!
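
For reference, a more careful comparison would discard a few warm-up runs and average over many iterations; here is a minimal sketch that reuses the model, model_ort, inputs and tokenizer objects defined above:

import time

def avg_latency(generate_fn, n_runs=20, n_warmup=3):
    # discard warm-up runs (CUDA initialization, caches), then average the rest
    for _ in range(n_warmup):
        generate_fn()
    times = []
    for _ in range(n_runs):
        st = time.perf_counter()
        generate_fn()
        times.append(time.perf_counter() - st)
    return sum(times) / len(times)

pt_latency = avg_latency(lambda: model.generate(**inputs, forced_bos_token_id=tokenizer.get_lang_id("zh")))
ort_latency = avg_latency(lambda: model_ort.generate(**inputs, forced_bos_token_id=tokenizer.get_lang_id("zh")))
print(f"PyTorch avg latency = {pt_latency:.3f}s, ORT avg latency = {ort_latency:.3f}s")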

Hugging Face Optimum org

Hey @Lvxue, are you using a GPU? We've indeed had reports of slowdowns with ONNX Runtime on GPU. The issue was communication overhead between the CPU and GPU.

This PR, which fixes the issue, was merged, and the next Optimum release will include the fix!

In the meantime, if you want to try it, you can use the dev branch with pip uninstall optimum && pip install git+https://github.com/huggingface/optimum.git#egg=optimum[onnxruntime-gpu].
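
Once the dev branch is installed, you can rerun the same comparison. As a minimal sketch, reusing the inputs and tokenizer from your snippet (the provider keyword argument is an assumption that depends on the Optimum version you end up with; otherwise the .to(device) call from your snippet works the same way):

from optimum.onnxruntime import ORTModelForSeq2SeqLM

# export the checkpoint to ONNX and run it on GPU via the CUDA execution provider
model_ort = ORTModelForSeq2SeqLM.from_pretrained(
    "facebook/m2m100_418M",
    from_transformers=True,
    provider="CUDAExecutionProvider",  # assumption: available in the installed version
)

gen_tokens = model_ort.generate(**inputs, forced_bos_token_id=tokenizer.get_lang_id("zh"))
print(tokenizer.batch_decode(gen_tokens, skip_special_tokens=True))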

There have been reports of great speedups (for example here and here), and we'll soon put up more documentation on what speedups to expect.
