Request for a complete MWE to run ONNX models with the current `optimum` package in Python

#4
by CLRafaelR - opened

Several users have reported difficulties loading microsoft/Phi-3-mini-<4k/128k>-instruct-onnx when attempting to run it with the Optimum package. I've written a script that loads the model with Optimum, but I'm still struggling to perform inference with the loaded model.

Therefore, I'm seeking a complete Minimal Working Example (MWE) in Python for loading and running the microsoft/Phi-3-mini-128k-instruct-onnx model with the current version of Hugging Face's optimum package (<= 1.19.1), before Hugging Face releases the new version of optimum that officially supports Phi-3's ONNX models.

As mentioned in this post, I successfully loaded and performed inference on this ONNX model using microsoft/onnxruntime-genai instead of optimum. However, when generating text in languages that use non-ASCII characters (e.g., Japanese), the tokenizer in onnxruntime-genai introduces numerous encoding issues (i.e. mojibake), rendering the text unreadable. Therefore, I prefer to use optimum over microsoft/onnxruntime-genai this time.

A post by a Microsoft member here suggests modifying parts of optimum/optimum/utils/normalized_config.py and optimum/optimum/modeling_base.py as a solution. However, this approach compromises reproducibility across different developers and environments, and reinstalling the package would require reapplying the modifications. Thus, I'm looking for a solution that doesn't involve rewriting any existing Python packages.

I have written the following mwe.py, which successfully loads the model. However, I still get the error Error in execution: Non-zero status code returned while running GroupQueryAttention node. Name:'/model/layers.0/attn/GroupQueryAttention' Status Message: cos_cache dimension 0 must be of present_sequence_length (the full error message is provided below). How can this error be resolved?

mwe.py

Thanks to this post, I was able to make the TasksManager and NormalizedConfigManager compatible with Phi-3.

Additionally, following this post, I installed a pre-release version of ORT that matches my environment's CUDA version.

import torch
from huggingface_hub import (
    snapshot_download,
)
from transformers import (
    AutoTokenizer,
    AutoConfig,
)
import json
import time
import gc
from optimum.pipelines import pipeline
import onnxruntime
from optimum.onnxruntime import (
    ORTModelForQuestionAnswering,
    ORTModelForCausalLM,
)
from optimum.exporters import TasksManager
from optimum.utils import NormalizedConfigManager
import os

torch.random.manual_seed(0)

model_name = "microsoft/Phi-3-mini-128k-instruct-onnx"
provider_name = model_name.split("/", 1)[0]
model_name_short = model_name.split("/", 1)[1]

file_name = "cuda/cuda-fp16/phi3-mini-128k-instruct-cuda-fp16.onnx"

"""
# Run this line if you want to download the whole repository of the model
# to the default local cache for Hugging Face package
snapshot_download(
    repo_id=model_name,
)
"""

session_options = onnxruntime.SessionOptions()
# session_options.log_severity_level = 0 # The model won't be loaded if this line is executed
session_options.graph_optimization_level = (
    onnxruntime.GraphOptimizationLevel.ORT_ENABLE_BASIC
)

config = AutoConfig.from_pretrained(
    f"{os.path.expanduser('~')}/.cache/huggingface/hub/models--{provider_name}--{model_name_short}/snapshots/e85d1b352b6f6f2a30d188f35c478af323af2449/cuda/cuda-fp16",
    force_download=False,
    trust_remote_code=True,
)

# The two lines below must be executed to load the model successfully!
# Copy the settings for `phi` and add them as the settings for `phi3`
# https://github.com/huggingface/optimum/issues/1826#issuecomment-2075070853
TasksManager._SUPPORTED_MODEL_TYPE["phi3"] = TasksManager._SUPPORTED_MODEL_TYPE["phi"]
NormalizedConfigManager._conf["phi3"] = NormalizedConfigManager._conf["phi"]


model = ORTModelForCausalLM.from_pretrained(
    f"{os.path.expanduser('~')}/.cache/huggingface/hub/models--{provider_name}--{model_name_short}/snapshots/e85d1b352b6f6f2a30d188f35c478af323af2449/cuda/cuda-fp16",
    # provider="CPUExecutionProvider",
    provider="CUDAExecutionProvider",
    trust_remote_code=True,
    local_files_only=True,
    config=config,
    force_download=False,
    session_options=session_options,
    use_io_binding=False,
)

tokenizer = AutoTokenizer.from_pretrained(
    f"{os.path.expanduser('~')}/.cache/huggingface/hub/models--{provider_name}--{model_name_short}/snapshots/e85d1b352b6f6f2a30d188f35c478af323af2449/cuda/cuda-fp16"
)

user_prompt = "Hi!"

pipe = pipeline(
    task="text-generation",
    model=model,
    tokenizer=tokenizer,
    # accelerator="ort",
    device="cuda:0",
    torch_dtype=torch.float16,
)

generation_args = {
    "max_new_tokens": 512,
    "return_full_text": False,
    "temperature": 0.01,
    "do_sample": True,
}

start_time = time.time()

output = pipe(
    user_prompt,
    **generation_args,
)

gc.collect()
torch.cuda.empty_cache()
torch.cuda.reset_peak_memory_stats()

end_time = time.time()

exec_time = end_time - start_time

print(
    output[0]["generated_text"],
    f"###\n\n{exec_time} sec elapsed.\n\n###",
    sep="\n\n",
)

Full Error Message

The argument `trust_remote_code` is to be used along with export=True. It will be ignored.
The ONNX file phi3-mini-128k-instruct-cuda-fp16.onnx is not a regular name used in optimum.onnxruntime that are ['model.onnx', 'model_quantized.onnx', 'model_optimized.onnx', 'decoder_with_past_model.onnx', 'decoder_with_past_model_quantized.onnx', 'decoder_with_past_model_optimized.onnx'], the ORTModelForCausalLM might not behave as expected.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
2024-05-05 15:41:12.882026696 [W:onnxruntime:, session_state.cc:1166 VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf.
2024-05-05 15:41:12.882065769 [W:onnxruntime:, session_state.cc:1168 VerifyEachNodeIsAssignedToAnEp] Rerunning with verbose output on a non-minimal build will show node assignments.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
use_io_binding was set to False, setting it to True because it can provide a huge speedup on GPUs. It is possible to disable this feature manually by setting the use_io_binding attribute back to False.
2024-05-05 15:41:21.304207958 [W:onnxruntime:, session_state.cc:1166 VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf.
2024-05-05 15:41:21.304248608 [W:onnxruntime:, session_state.cc:1168 VerifyEachNodeIsAssignedToAnEp] Rerunning with verbose output on a non-minimal build will show node assignments.
2024-05-05 15:41:26.906309614 [E:onnxruntime:, sequential_executor.cc:514 ExecuteKernel] Non-zero status code returned while running GroupQueryAttention node. Name:'/model/layers.0/attn/GroupQueryAttention' Status Message: cos_cache dimension 0 must be of present_sequence_length.

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
/home/MY_DIRECTORY/MY_PROJECT/test/phi-3/analyze_data_onnx.py in line 89
     81 generation_args = {
     82     "max_new_tokens": 512,
     83     "return_full_text": False,
     84     "temperature": 0.01,
     85     "do_sample": True,
     86 }
     88 start_time = time.time()
---> 89 output = pipe(
     90     user_prompt,
     91     **generation_args,
     92 )
     94 gc.collect()
     95 torch.cuda.empty_cache()

File ~/MY_PROJECT/.venv/lib/python3.11/site-packages/transformers/pipelines/text_generation.py:240, in TextGenerationPipeline.__call__(self, text_inputs, **kwargs)
    238         return super().__call__(chats, **kwargs)
    239 else:
--> 240     return super().__call__(text_inputs, **kwargs)

File ~/MY_PROJECT/.venv/lib/python3.11/site-packages/transformers/pipelines/base.py:1242, in Pipeline.__call__(self, inputs, num_workers, batch_size, *args, **kwargs)
   1234     return next(
   1235         iter(
   1236             self.get_iterator(
   (...)
   1239         )
   1240     )
   1241 else:
-> 1242     return self.run_single(inputs, preprocess_params, forward_params, postprocess_params)

File ~/MY_PROJECT/.venv/lib/python3.11/site-packages/transformers/pipelines/base.py:1249, in Pipeline.run_single(self, inputs, preprocess_params, forward_params, postprocess_params)
   1247 def run_single(self, inputs, preprocess_params, forward_params, postprocess_params):
   1248     model_inputs = self.preprocess(inputs, **preprocess_params)
-> 1249     model_outputs = self.forward(model_inputs, **forward_params)
   1250     outputs = self.postprocess(model_outputs, **postprocess_params)
   1251     return outputs

File ~/MY_PROJECT/.venv/lib/python3.11/site-packages/transformers/pipelines/base.py:1149, in Pipeline.forward(self, model_inputs, **forward_params)
   1147     with inference_context():
   1148         model_inputs = self._ensure_tensor_on_device(model_inputs, device=self.device)
-> 1149         model_outputs = self._forward(model_inputs, **forward_params)
   1150         model_outputs = self._ensure_tensor_on_device(model_outputs, device=torch.device("cpu"))
   1151 else:

File ~/MY_PROJECT/.venv/lib/python3.11/site-packages/transformers/pipelines/text_generation.py:327, in TextGenerationPipeline._forward(self, model_inputs, **generate_kwargs)
    324         generate_kwargs["min_length"] += prefix_length
    326 # BS x SL
--> 327 generated_sequence = self.model.generate(input_ids=input_ids, attention_mask=attention_mask, **generate_kwargs)
    328 out_b = generated_sequence.shape[0]
    329 if self.framework == "pt":

File ~/MY_PROJECT/.venv/lib/python3.11/site-packages/torch/utils/_contextlib.py:115, in context_decorator.<locals>.decorate_context(*args, **kwargs)
    112 @functools.wraps(func)
    113 def decorate_context(*args, **kwargs):
    114     with ctx_factory():
--> 115         return func(*args, **kwargs)

File ~/MY_PROJECT/.venv/lib/python3.11/site-packages/transformers/generation/utils.py:1622, in GenerationMixin.generate(self, inputs, generation_config, logits_processor, stopping_criteria, prefix_allowed_tokens_fn, synced_gpus, assistant_model, streamer, negative_prompt_ids, negative_prompt_attention_mask, **kwargs)
   1614     input_ids, model_kwargs = self._expand_inputs_for_generation(
   1615         input_ids=input_ids,
   1616         expand_size=generation_config.num_return_sequences,
   1617         is_encoder_decoder=self.config.is_encoder_decoder,
   1618         **model_kwargs,
   1619     )
   1621     # 13. run sample
-> 1622     result = self._sample(
   1623         input_ids,
   1624         logits_processor=prepared_logits_processor,
   1625         logits_warper=logits_warper,
   1626         stopping_criteria=prepared_stopping_criteria,
   1627         pad_token_id=generation_config.pad_token_id,
   1628         output_scores=generation_config.output_scores,
   1629         output_logits=generation_config.output_logits,
   1630         return_dict_in_generate=generation_config.return_dict_in_generate,
   1631         synced_gpus=synced_gpus,
   1632         streamer=streamer,
   1633         **model_kwargs,
   1634     )
   1636 elif generation_mode == GenerationMode.BEAM_SEARCH:
   1637     # 11. prepare beam search scorer
   1638     beam_scorer = BeamSearchScorer(
   1639         batch_size=batch_size,
   1640         num_beams=generation_config.num_beams,
   (...)
   1645         max_length=generation_config.max_length,
   1646     )

File ~/MY_PROJECT/.venv/lib/python3.11/site-packages/transformers/generation/utils.py:2791, in GenerationMixin._sample(self, input_ids, logits_processor, stopping_criteria, logits_warper, max_length, pad_token_id, eos_token_id, output_attentions, output_hidden_states, output_scores, output_logits, return_dict_in_generate, synced_gpus, streamer, **model_kwargs)
   2788 model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs)
   2790 # forward pass to get next token
-> 2791 outputs = self(
   2792     **model_inputs,
   2793     return_dict=True,
   2794     output_attentions=output_attentions,
   2795     output_hidden_states=output_hidden_states,
   2796 )
   2798 if synced_gpus and this_peer_finished:
   2799     continue  # don't waste resources running the code we don't need

File ~/MY_PROJECT/.venv/lib/python3.11/site-packages/optimum/modeling_base.py:92, in OptimizedModel.__call__(self, *args, **kwargs)
     91 def __call__(self, *args, **kwargs):
---> 92     return self.forward(*args, **kwargs)

File ~/MY_PROJECT/.venv/lib/python3.11/site-packages/optimum/onnxruntime/modeling_decoder.py:258, in ORTModelForCausalLM.forward(self, input_ids, attention_mask, position_ids, past_key_values, labels, use_cache_branch, **kwargs)
    256 else:
    257     io_binding.synchronize_inputs()
--> 258     self.model.run_with_iobinding(io_binding)
    259     io_binding.synchronize_outputs()
    261 if self.use_cache:
    262     # Tuple of length equal to : number of layer * number of past_key_value per decoder layer(2)

File ~/MY_PROJECT/.venv/lib/python3.11/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py:331, in Session.run_with_iobinding(self, iobinding, run_options)
    324 def run_with_iobinding(self, iobinding, run_options=None):
    325     """
    326     Compute the predictions.
    327
    328     :param iobinding: the iobinding object that has graph inputs/outputs bind.
    329     :param run_options: See :class:`onnxruntime.RunOptions`.
    330     """
--> 331     self._sess.run_with_iobinding(iobinding._iobinding, run_options)

RuntimeError: Error in execution: Non-zero status code returned while running GroupQueryAttention node. Name:'/model/layers.0/attn/GroupQueryAttention' Status Message: cos_cache dimension 0 must be of present_sequence_length.

My Environment Info

I ran my MWE on Windows Subsystem for Linux 2 (WSL2; Host machine is Windows 10) with Python 3.11.8 and CUDA 12.4 installed. My GPU is an NVIDIA GeForce RTX 3060.

pyproject.toml

I'm managing packages with Poetry; a requirements.txt can be provided on request (it is quite lengthy because it was exported from Poetry).

[tool.poetry]
name = "MY_PROJECT"
version = "0.1.0"
description = ""
authors = ["Masataka Ogawa"]
license = "Apache-2.0"
readme = "README.md"

[tool.poetry.dependencies]
python = ">=3.11.7"
torch = {version = "^2.3.0+cu121", source = "torch_cu121"}
torchvision = {version = "^0.18.0+cu121", source = "torch_cu121"}
torchaudio = {version = "^2.3.0+cu121", source = "torch_cu121"}
accelerate = "^0.29.3"
numpy = "^1.26.4"
onnx = "^1.16.0"
ort-nightly-gpu = {version = "^1.17.0.dev20240118002", source = "ort_nightly_gpu"}
joblib = "^1.4.0"
bitsandbytes = "^0.43.1"
llama-cpp-python = {version = "^0.2.68", source = "llama_cpp_python_cu124"}
packaging = "^24.0"
ninja = "^1.11.1.1"
wheel = "^0.43.0"
setuptools = "^69.5.1"
onnxruntime-genai-cuda = {version = "^0.2.0rc4", allow-prereleases = true, source = "onnxruntime_genai_cuda"}
ort-nightly = {version = "^1.19.0.dev20240502004", source = "ort_nightly"}
onnxruntime-gpu = {version = "^1.17.1", allow-prereleases = true}
optimum = {extras = ["onnxruntime-gpu"], version = "^1.19.1", allow-prereleases = true}

[tool.poetry.group.dev.dependencies]
black = "^24.4.2"
ipykernel = "^6.29.4"


[[tool.poetry.source]]
name = "torch_cu121"
url = "https://download.pytorch.org/whl/cu121"
priority = "explicit"


[[tool.poetry.source]]
name = "onnxruntime_genai_cuda"
url = "https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/onnxruntime-genai/pypi/simple/"
priority = "explicit"


[[tool.poetry.source]]
name = "ort_nightly_gpu"
url = "https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/ort-cuda-12-nightly/pypi/simple/"
priority = "explicit"


[[tool.poetry.source]]
name = "llama_cpp_python_cu124"
url = "https://abetlen.github.io/llama-cpp-python/whl/cu124"
priority = "primary"


[[tool.poetry.source]]
name = "ort_nightly"
url = "https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/ORT-Nightly/pypi/simple/"
priority = "explicit"

[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"
Microsoft org

Hi again @CLRafaelR ,

As mentioned in the GitHub issue, you can swap tokenizers in and out with the Python API of onnxruntime-genai.
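For illustration, here is a minimal sketch of that idea, assuming the 0.2.0rc-style Python API used in model-qa.py (method names may differ in other versions) and assuming the cuda/cuda-fp16 files are already downloaded locally: token IDs are generated with onnxruntime-genai, but the output is decoded with transformers' AutoTokenizer so that multi-byte (e.g., Japanese) text is not split mid-character during streaming.

import onnxruntime_genai as og
from transformers import AutoTokenizer

# Assumption: local directory containing the ONNX model and tokenizer files.
model_dir = "./Phi-3-mini-128k-instruct-onnx/cuda/cuda-fp16"

model = og.Model(model_dir)
og_tokenizer = og.Tokenizer(model)                       # used only for encoding
hf_tokenizer = AutoTokenizer.from_pretrained(model_dir)  # used for decoding

prompt = "<|user|>\nこんにちは!<|end|>\n<|assistant|>\n"
input_ids = og_tokenizer.encode(prompt)

params = og.GeneratorParams(model)
params.set_search_options(max_length=512, temperature=0.01)
params.input_ids = input_ids

generator = og.Generator(model, params)
new_tokens = []
while not generator.is_done():
    generator.compute_logits()
    generator.generate_next_token()
    new_tokens.append(generator.get_next_tokens()[0])

# Decode the whole generated sequence at once with the Hugging Face tokenizer
# instead of streaming token-by-token, which is where the mojibake appears.
print(hf_tokenizer.decode(new_tokens, skip_special_tokens=True))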

Also, a caution that Phi-3 performs best in English:

Phi-3-mini was predominantly trained and optimized for English. Its capabilities in other
languages are limited, meaning it could understand but will not be as fluent as English.
Customers are encouraged to use Microsoft Translator service in tandem to translate prompt
and responses for best results.

Microsoft org

The Phi-3 mini changes recommended here have now been merged into Optimum. The changes are also included in Optimum's v1.19.2 patch release. You can now install Optimum with pip install optimum --upgrade.
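With the patched release, the TasksManager/NormalizedConfigManager workaround in mwe.py should no longer be needed. A minimal sketch of the updated flow, assuming optimum[onnxruntime-gpu] >= 1.19.2 and the cuda/cuda-fp16 files already present in a local directory (the path below is a placeholder):

from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForCausalLM
from optimum.pipelines import pipeline

# Placeholder path to the locally downloaded cuda/cuda-fp16 subfolder.
model_dir = "path/to/Phi-3-mini-128k-instruct-onnx/cuda/cuda-fp16"

# No phi3 monkey-patching required with optimum >= 1.19.2.
model = ORTModelForCausalLM.from_pretrained(
    model_dir,
    provider="CUDAExecutionProvider",
)
tokenizer = AutoTokenizer.from_pretrained(model_dir)

pipe = pipeline(
    task="text-generation",
    model=model,
    tokenizer=tokenizer,
    device="cuda:0",
)
print(pipe("Hi!", max_new_tokens=64, return_full_text=False)[0]["generated_text"])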

Error in execution: Non-zero status code returned while running GroupQueryAttention node. Name:'/model/layers.0/attn/GroupQueryAttention' Status Message: cos_cache dimension 0 must be of present_sequence_length.

This error is happening because you have multiple ONNX Runtime and ONNX Runtime GenAI versions installed simultaneously. From the information above, you currently have the following installed:

ort-nightly-gpu = {version = "^1.17.0.dev20240118002", source = "ort_nightly_gpu"}
onnxruntime-genai-cuda = {version = "^0.2.0rc4", allow-prereleases = true, source = "onnxruntime_genai_cuda"}
ort-nightly = {version = "^1.19.0.dev20240502004", source = "ort_nightly"}
onnxruntime-gpu = {version = "^1.17.1", allow-prereleases = true}

To install the right package to run Phi-3 mini with Optimum, you can follow the instructions here. If you are using Optimum and not ONNX Runtime GenAI, then you should also add onnxruntime-genai-cuda to the list of packages to uninstall in the linked instructions.
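As a quick sanity check after cleaning up the environment, the following sketch (standard onnxruntime calls, nothing specific to this model) confirms which ONNX Runtime build Python actually resolves and whether the CUDA execution provider is available:

# Environment sanity check: a single ONNX Runtime build should be picked up.
import onnxruntime as ort

print(ort.__version__)                # should report exactly one expected version
print(ort.get_available_providers())  # should include "CUDAExecutionProvider"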

Setup Guide for Phi-3-mini-128k int4 with CUDA

This guide will show you how to set up the Phi-3-mini-128k int4 model from Microsoft with CUDA support in a Conda environment.

Prerequisites

  • Windows 10/11
  • Miniconda or Anaconda installed
  • Access to Visual Studio's Package Index and NVIDIA's PyPi Repository
  • CUDA installed on your system

Installation Steps

  1. Create and activate a Conda environment

    Open your Command Prompt and execute the following commands:

    conda create -n onnx python=3.11
    conda activate onnx
    
  2. Create a directory for the project

    Within the active Conda environment:

    mkdir onnx-new
    cd onnx-new
    
  3. Install dependencies

    Install numpy and onnxruntime-genai-cuda:

    pip install numpy
    pip install onnxruntime-genai-cuda --pre --index-url=https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/onnxruntime-genai/pypi/simple/
    
  4. Clone the Phi-3-mini-128k-instruct-onnx model

    Use git to clone the required model:

    git clone https://huggingface.co/microsoft/Phi-3-mini-128k-instruct-onnx
    
  5. Download the example script

    Download the model interaction example script:

    curl https://raw.githubusercontent.com/microsoft/onnxruntime-genai/main/examples/python/model-qa.py -o model-qa.py
    
  6. Run the model

    Execute the script with the specified model:

    python model-qa.py -m ./Phi-3-mini-128k-instruct-onnx/cuda/cuda-int4-rtn-block-32 -l 2048
    

    If an error occurs that points to a missing Python installation, make sure Python is correctly installed and available in the system path.

Installed packages to make it run

(onnx) C:\Users\Username\onnx-new>pip list
Package                Version
---------------------- --------
numpy                  1.26.4
onnxruntime-genai-cuda 0.2.0rc6
pip                    24.0
setuptools             69.5.1
wheel                  0.43.0
kvaishnavi changed discussion status to closed
