Is NVIDIA GeForce RTX 4060 Laptop GPU supported as well?

#3
by don412 - opened

GPU test Code:
import onnxruntime as ort

if "CUDAExecutionProvider" in ort.get_available_providers():
    print("CUDA is available in standard ONNX Runtime.")  # yes, it shows supported
else:
    print("CUDA is NOT available in standard ONNX Runtime.")

import onnxruntime_genai as ort_genai

# Set the GPU device ID to use (e.g., 0 for the first GPU)
ort_genai.set_current_gpu_device_id(0)

# Get and print the current GPU device ID
current_gpu_id = ort_genai.get_current_gpu_device_id()
print("Current GPU Device ID:", current_gpu_id)

OUTPUT:

  1. onnxruntime seems fine
  2. onnxruntime_genai failed (the line "current_gpu_id = ort_genai.get_current_gpu_device_id()" resulted in an error)

err msg:

current_gpu_id = ort_genai.get_current_gpu_device_id()
onnxruntime_genai.onnxruntime_genai.OrtException: D:\a\_work\1\s\include\onnxruntime\core/common/logging/logging.h:320 onnxruntime::logging::LoggingManager::DefaultLogger Attempt to use DefaultLogger but none has been registered.

Microsoft org

Can you upgrade to the latest published version (pip uninstall -y onnxruntime-genai onnxruntime-genai-cuda onnxruntime-genai-directml, then reinstall the latest version following the README instructions), remove the following lines, and try again?

import onnxruntime as ort

if "CUDAExecutionProvider" in ort.get_available_providers():
    print("CUDA is available in standard ONNX Runtime.") # yes, it shows supported
else:
    print("CUDA is NOT available in standard ONNX Runtime.")

The attempt failed; please see below. FYI, I'm able to load the model and use transformers for text generation (with device="cpu"); however, that's too slow and comes with some annoying noise, so I definitely want to leverage the GPU for speed if possible.

pip uninstall -y onnxruntime-genai onnxruntime-genai-cuda onnxruntime-genai-directml
Found existing installation: onnxruntime-genai 0.2.0rc1
Uninstalling onnxruntime-genai-0.2.0rc1:
Successfully uninstalled onnxruntime-genai-0.2.0rc1
Found existing installation: onnxruntime-genai-cuda 0.2.0rc4
Uninstalling onnxruntime-genai-cuda-0.2.0rc4:
Successfully uninstalled onnxruntime-genai-cuda-0.2.0rc4
Found existing installation: onnxruntime-genai-directml 0.2.0rc1
Uninstalling onnxruntime-genai-directml-0.2.0rc1:
Successfully uninstalled onnxruntime-genai-directml-0.2.0rc1

pip install -y onnxruntime-genai onnxruntime-genai-cuda onnxruntime-genai-directml

Usage:
pip install [options] [package-index-options] ...
pip install [options] -r [package-index-options] ...
pip install [options] [-e] ...
pip install [options] [-e] ...
pip install [options] <archive url/path> ...

no such option: -y

pip install onnxruntime-genai onnxruntime-genai-cuda onnxruntime-genai-directml
Collecting onnxruntime-genai
Downloading onnxruntime_genai-0.1.0-cp39-cp39-win_amd64.whl.metadata (152 bytes)
ERROR: Could not find a version that satisfies the requirement onnxruntime-genai-cuda (from versions: none)
ERROR: No matching distribution found for onnxruntime-genai-cuda

pip install --pre onnxruntime-genai-cuda --index-url=https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/onnxruntime-genai/pypi/simple/
Looking in indexes: https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/onnxruntime-genai/pypi/simple/
Collecting onnxruntime-genai-cuda
Downloading https://aiinfra.pkgs.visualstudio.com/2692857e-05ef-43b4-ba9c-ccf1c22c437c/_packaging/c73ac8fe-be60-4e50-a443-db22e457b281/pypi/download/onnxruntime-genai-cuda/0.2rc4/onnxruntime_genai_cuda-0.2.0rc4-cp39-cp39-win_amd64.whl (138.5 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 138.5/138.5 MB 6.1 MB/s eta 0:00:00
Installing collected packages: onnxruntime-genai-cuda
Successfully installed onnxruntime-genai-cuda-0.2.0rc4

python model-qa.py -m Phi-3-mini-128k-instruct-onnx/directml/directml-int4-awq-block-128 -l 2048
Traceback (most recent call last):
File "c:\Users\myuser\AppData\Local\Programs\Python\Python39\lib\site-packages\onnxruntime_genai_init_.py", line 11, in
from onnxruntime_genai.onnxruntime_genai import *
ImportError: DLL load failed while importing onnxruntime_genai: The specified module could not be found.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "c:\mymodel\model-qa.py", line 37, in
import onnxruntime_genai as og
File "c:\Users\myuser\AppData\Local\Programs\Python\Python39\lib\site-packages\onnxruntime_genai_init_.py", line 14, in
from onnxruntime_genai.onnxruntime_genai import *
ImportError: DLL load failed while importing onnxruntime_genai: The specified module could not be found.

Microsoft org

python model-qa.py -m Phi-3-mini-128k-instruct-onnx/directml/directml-int4-awq-block-128 -l 2048

From the above command, it appears that you want to try the DirectML model. Can you uninstall all three onnxruntime-genai packages, only install the onnxruntime-genai-directml package, and try again?

Only one of the three packages should be installed at any time. Import errors will occur when multiple onnxruntime-genai packages are simultaneously installed. By uninstalling all packages and re-installing only one package, the correct module will be found.

If you want to try models in the cuda folder afterwards, you can uninstall onnxruntime-genai-directml and install onnxruntime-genai-cuda instead. Note that both the onnxruntime-genai-directml and onnxruntime-genai-cuda packages will work for the CPU models if you want to compare performance.
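A quick way to confirm that exactly one of the three variants is present is a small check with the standard library (a sketch using importlib.metadata; the package names are the same ones used in the pip commands):

import importlib.metadata

# Exactly one of these three should report a version; the other two
# should show "not installed" before you run the sample script.
for name in ("onnxruntime-genai", "onnxruntime-genai-cuda", "onnxruntime-genai-directml"):
    try:
        print(name, importlib.metadata.version(name))
    except importlib.metadata.PackageNotFoundError:
        print(name, "not installed")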

Since the model-qa.py sample code uses "onnxruntime-genai", I've uninstalled the other two; however, I get the same outcome as reported earlier. It does not seem productive to keep trying the ONNX route.
Instead, I want to improve my working code with the "Phi-3-mini-4k-instruct" model. I've included the core code, the output, and how I'd like it to be improved; I'd appreciate your help with this.

CODE:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
...

model = AutoModelForCausalLM.from_pretrained(
    "C:\mymodel\Phi-3-mini-4k-instruct",
    device_map="cpu",
    torch_dtype="auto",
    trust_remote_code=True,
)

...

OUTPUT:
WARNING:tensorflow:From C:\Users\myuser\AppData\Local\Programs\Python\Python39\lib\site-packages\keras\src\losses.py:2976: The name tf.losses.sparse_softmax_cross_entropy is deprecated. Please use tf.compat.v1.losses.sparse_softmax_cross_entropy instead.

flash-attention package not found, consider installing for better performance: No module named 'flash_attn'.
Current flash-attenton does not support window_size. Either upgrade or use attn_implementation='eager'.
Loading checkpoint shards: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 2/2 [00:00<00:00, 4.77it/s]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.

Ask me: why blue berry is the king of fruits?

The model 'Phi3ForCausalLM' is not supported for text-generation. Supported models are ['BartForCausalLM', 'BertLMHeadModel', 'BertGenerationDecoder', 'BigBirdForCausalLM',
'BigBirdPegasusForCausalLM', 'BioGptForCausalLM', 'BlenderbotForCausalLM', 'BlenderbotSmallForCausalLM', 'BloomForCausalLM', 'CamembertForCausalLM', 'LlamaForCausalLM',
'CodeGenForCausalLM', 'CohereForCausalLM', 'CpmAntForCausalLM', 'CTRLLMHeadModel', 'Data2VecTextForCausalLM', 'ElectraForCausalLM', 'ErnieForCausalLM', 'FalconForCausalLM',
'FuyuForCausalLM', 'GemmaForCausalLM', 'GitForCausalLM', 'GPT2LMHeadModel', 'GPT2LMHeadModel', 'GPTBigCodeForCausalLM', 'GPTNeoForCausalLM', 'GPTNeoXForCausalLM', 'GPTNeoXJapaneseForCausalLM', 'GPTJForCausalLM', 'LlamaForCausalLM', 'MambaForCausalLM', 'MarianForCausalLM', 'MBartForCausalLM', 'MegaForCausalLM', 'MegatronBertForCausalLM', 'MistralForCausalLM', 'MixtralForCausalLM', 'MptForCausalLM', 'MusicgenForCausalLM', 'MusicgenMelodyForCausalLM', 'MvpForCausalLM', 'OpenLlamaForCausalLM', 'OpenAIGPTLMHeadModel', 'OPTForCausalLM', 'PegasusForCausalLM', 'PersimmonForCausalLM', 'PhiForCausalLM', 'PLBartForCausalLM', 'ProphetNetForCausalLM', 'QDQBertLMHeadModel', 'Qwen2ForCausalLM', 'ReformerModelWithLMHead', 'RemBertForCausalLM', 'RobertaForCausalLM', 'RobertaPreLayerNormForCausalLM', 'RoCBertForCausalLM', 'RoFormerForCausalLM', 'RwkvForCausalLM', 'Speech2Text2ForCausalLM', 'StableLmForCausalLM', 'Starcoder2ForCausalLM', 'TransfoXLLMHeadModel', 'TrOCRForCausalLM', 'WhisperForCausalLM', 'XGLMForCausalLM', 'XLMWithLMHeadModel', 'XLMProphetNetForCausalLM', 'XLMRobertaForCausalLM', 'XLMRobertaXLForCausalLM', 'XLNetLMHeadModel', 'XmodForCausalLM'].
You are not running the flash-attention implementation, expect numerical differences.

Blueberries are often referred to as the "king of fruits" due to their numerous health benefits, delicious taste, and versatility in culinary applications. Here are some reasons why blueberries are highly regarded:

  1. Nutritional value: Blueberries are rich in essential nutrients, including vitamins C and K, manganese, fiber, and antioxidants. They are low in calories and have a high nutrient density, making them a healthy choice for snacking or incorporating into meals.

  2. Antioxidant content: Blueberries are packed with antioxidants, particularly anthocyanins, which give them their vibrant blue-purple color. These antioxidants help protect cells from damage caused by free radicals, which can contribute to aging and various diseases.
    ...

Additional info:
"flash_attn" seems supports "gpu" / "cuda" only and since my computer fails to support them, hence, unable to install this module.
installation attempt resulted in "... torch.version = 2.2.2+cpu ..."
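For reference, a quick check like the following (a minimal sketch, assuming torch is importable) shows that the installed PyTorch build is CPU-only, which is why flash-attn can't be used here:

import torch

# A CPU-only wheel such as 2.2.2+cpu reports no CUDA support.
print(torch.__version__)          # e.g. 2.2.2+cpu
print(torch.cuda.is_available())  # False on a CPU-only build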

Question:
Given such constraints on what I'm able to do with my current computer and its settings, is it possible to remove the following noise, and if yes, how?
"
The model 'Phi3ForCausalLM' is not supported for text-generation. Supported models are ['BartForCausalLM', 'BertLMHeadModel', 'BertGenerationDecoder', 'BigBirdForCausalLM',
'BigBirdPegasusForCausalLM', 'BioGptForCausalLM', 'BlenderbotForCausalLM', 'BlenderbotSmallForCausalLM', 'BloomForCausalLM', 'CamembertForCausalLM', 'LlamaForCausalLM',
'CodeGenForCausalLM', 'CohereForCausalLM', 'CpmAntForCausalLM', 'CTRLLMHeadModel', 'Data2VecTextForCausalLM', 'ElectraForCausalLM', 'ErnieForCausalLM', 'FalconForCausalLM',
'FuyuForCausalLM', 'GemmaForCausalLM', 'GitForCausalLM', 'GPT2LMHeadModel', 'GPT2LMHeadModel', 'GPTBigCodeForCausalLM', 'GPTNeoForCausalLM', 'GPTNeoXForCausalLM', 'GPTNeoXJapaneseForCausalLM', 'GPTJForCausalLM', 'LlamaForCausalLM', 'MambaForCausalLM', 'MarianForCausalLM', 'MBartForCausalLM', 'MegaForCausalLM', 'MegatronBertForCausalLM', 'MistralForCausalLM', 'MixtralForCausalLM', 'MptForCausalLM', 'MusicgenForCausalLM', 'MusicgenMelodyForCausalLM', 'MvpForCausalLM', 'OpenLlamaForCausalLM', 'OpenAIGPTLMHeadModel', 'OPTForCausalLM', 'PegasusForCausalLM', 'PersimmonForCausalLM', 'PhiForCausalLM', 'PLBartForCausalLM', 'ProphetNetForCausalLM', 'QDQBertLMHeadModel', 'Qwen2ForCausalLM', 'ReformerModelWithLMHead', 'RemBertForCausalLM', 'RobertaForCausalLM', 'RobertaPreLayerNormForCausalLM', 'RoCBertForCausalLM', 'RoFormerForCausalLM', 'RwkvForCausalLM', 'Speech2Text2ForCausalLM', 'StableLmForCausalLM', 'Starcoder2ForCausalLM', 'TransfoXLLMHeadModel', 'TrOCRForCausalLM', 'WhisperForCausalLM', 'XGLMForCausalLM', 'XLMWithLMHeadModel', 'XLMProphetNetForCausalLM', 'XLMRobertaForCausalLM', 'XLMRobertaXLForCausalLM', 'XLNetLMHeadModel', 'XmodForCausalLM'].
You are not running the flash-attention implementation, expect numerical differences.
"

Thanks.

Microsoft org
β€’
edited Apr 30

PyTorch

flash-attention package not found, consider installing for better performance: No module named 'flash_attn'.

flash_attn seems to support GPU/CUDA only, and since my computer does not support them, I'm unable to install this module.
The installation attempt reported ... torch.__version__ = 2.2.2+cpu ...

You are not running the flash-attention implementation, expect numerical differences.

As you mentioned, flash attention has specific requirements, and it appears your Windows machine does not satisfy them. Here are the requirements for flash attention. If you have a machine that satisfies the requirements, you can install flash-attn as follows.

# Install flash attention
$ pip install flash-attn --no-build-isolation

If you do not have a machine that satisfies the requirements, you can load the PyTorch model with a different attention implementation by adding attn_implementation="attention_implementation_name", where attention_implementation_name is one of the following options: eager, sdpa, or flash_attention_2.

For example, you can load the PyTorch model with scaled dot-product attention (SDPA) as follows.

model = AutoModelForCausalLM.from_pretrained(
    "C:\mymodel\Phi-3-mini-4k-instruct",
    device_map="cpu",
    torch_dtype="auto",
    trust_remote_code=True,
    attn_implementation="sdpa",
)

The model 'Phi3ForCausalLM' is not supported for text-generation. Supported models are ['BartForCausalLM', 'BertLMHeadModel', 'BertGenerationDecoder', 'BigBirdForCausalLM',
'BigBirdPegasusForCausalLM', 'BioGptForCausalLM', 'BlenderbotForCausalLM', 'BlenderbotSmallForCausalLM', 'BloomForCausalLM', 'CamembertForCausalLM', 'LlamaForCausalLM',
'CodeGenForCausalLM', 'CohereForCausalLM', 'CpmAntForCausalLM', 'CTRLLMHeadModel', 'Data2VecTextForCausalLM', 'ElectraForCausalLM', 'ErnieForCausalLM', 'FalconForCausalLM',
'FuyuForCausalLM', 'GemmaForCausalLM', 'GitForCausalLM', 'GPT2LMHeadModel', 'GPT2LMHeadModel', 'GPTBigCodeForCausalLM', 'GPTNeoForCausalLM', 'GPTNeoXForCausalLM', 'GPTNeoXJapaneseForCausalLM', 'GPTJForCausalLM', 'LlamaForCausalLM', 'MambaForCausalLM', 'MarianForCausalLM', 'MBartForCausalLM', 'MegaForCausalLM', 'MegatronBertForCausalLM', 'MistralForCausalLM', 'MixtralForCausalLM', 'MptForCausalLM', 'MusicgenForCausalLM', 'MusicgenMelodyForCausalLM', 'MvpForCausalLM', 'OpenLlamaForCausalLM', 'OpenAIGPTLMHeadModel', 'OPTForCausalLM', 'PegasusForCausalLM', 'PersimmonForCausalLM', 'PhiForCausalLM', 'PLBartForCausalLM', 'ProphetNetForCausalLM', 'QDQBertLMHeadModel', 'Qwen2ForCausalLM', 'ReformerModelWithLMHead', 'RemBertForCausalLM', 'RobertaForCausalLM', 'RobertaPreLayerNormForCausalLM', 'RoCBertForCausalLM', 'RoFormerForCausalLM', 'RwkvForCausalLM', 'Speech2Text2ForCausalLM', 'StableLmForCausalLM', 'Starcoder2ForCausalLM', 'TransfoXLLMHeadModel', 'TrOCRForCausalLM', 'WhisperForCausalLM', 'XGLMForCausalLM', 'XLMWithLMHeadModel', 'XLMProphetNetForCausalLM', 'XLMRobertaForCausalLM', 'XLMRobertaXLForCausalLM', 'XLNetLMHeadModel', 'XmodForCausalLM'].

You can instantiate a pipeline by creating the config, tokenizer, and model objects and using them to create the pipeline object. Here is an example.

from transformers import AutoConfig, AutoTokenizer, AutoModelForCausalLM, pipeline

config = AutoConfig.from_pretrained("microsoft/Phi-3-mini-4k-instruct", cache_dir="/path/to/your/cache/dir/", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct", cache_dir="/path/to/your/cache/dir/", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("microsoft/Phi-3-mini-4k-instruct", cache_dir="/path/to/your/cache/dir/", trust_remote_code=True)

pipe = pipeline("text-generation", model=model, config=config, tokenizer=tokenizer)
result = pipe(["<|user|>What is the tallest building in the world?<|end|><|assistant|>"])
print(result)
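Continuing from the pipe object above, generation arguments can be passed directly to the call (a hedged example; max_new_tokens, do_sample, and return_full_text are standard transformers generation/pipeline arguments, and your local folder path can be used in place of the hub ID in the from_pretrained calls):

# Pass generation arguments directly to the pipeline call.
result = pipe(
    ["<|user|>What is the tallest building in the world?<|end|><|assistant|>"],
    max_new_tokens=200,
    do_sample=False,
    return_full_text=False,
)
print(result[0][0]["generated_text"])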

ONNX

Since the model-qa.py sample code uses "onnxruntime-genai", I've uninstalled the other two, however, same outcome as reported earlier.

All three packages (onnxruntime-genai, onnxruntime-genai-cuda, and onnxruntime-genai-directml) map to the same module name, onnxruntime_genai. Thus, the import statement (import onnxruntime_genai as og) and the sample code (model-qa.py) are the same for all three packages, which is why only one of the packages should be installed at a time.
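For example, after installing a single package, the same import should succeed regardless of which variant it is (a minimal check; the comment about DLL failures reflects the behavior described above):

# The Python module name is always onnxruntime_genai, whichever pip package is installed.
try:
    import onnxruntime_genai as og
    print("onnxruntime_genai imported successfully")
except ImportError as err:
    # A DLL load failure here usually means more than one variant (or a stale
    # install) is present; uninstall all three packages and reinstall only one.
    print("import failed:", err)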

Here are the steps you can follow per package.

  1. First, follow the initial instructions to get the repo and sample script. These instructions are shared for CPU, CUDA, and DirectML.

Initial Instructions

# Clone the ONNX repo
$ git clone https://huggingface.co/microsoft/Phi-3-mini-128k-instruct-onnx

# Get sample script
$ curl https://raw.githubusercontent.com/microsoft/onnxruntime-genai/main/examples/python/model-qa.py -o model-qa.py

  2. Then, follow the instructions for the specific execution provider you want to run with.

CPU-specific instructions

# Verify that all packages are uninstalled
$ pip uninstall -y onnxruntime-genai onnxruntime-genai-cuda onnxruntime-genai-directml

# Only install the CPU package
$ pip install --pre onnxruntime-genai

# Run script with sample CPU model
$ python3 model-qa.py -m ./Phi-3-mini-128k-instruct-onnx/cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4 -l 2048

CUDA-specific instructions

# Verify that all packages are uninstalled
$ pip uninstall -y onnxruntime-genai onnxruntime-genai-cuda onnxruntime-genai-directml

# Only install the CUDA package
$ pip install --pre onnxruntime-genai-cuda --index-url=https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/onnxruntime-genai/pypi/simple/

# Run script with sample CUDA model
$ python3 model-qa.py -m ./Phi-3-mini-128k-instruct-onnx/cuda/cuda-int4-rtn-block-32 -l 2048

DirectML-specific instructions

# Verify that all packages are uninstalled
$ pip uninstall -y onnxruntime-genai onnxruntime-genai-cuda onnxruntime-genai-directml

# Only install the DirectML package
$ pip install --pre onnxruntime-genai-directml

# Run script with sample DirectML model
$ python3 model-qa.py -m ./Phi-3-mini-128k-instruct-onnx/directml/directml-int4-awq-block-128 -l 2048
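
For reference, model-qa.py boils down to a loop over the onnxruntime_genai API; here is a rough, non-streaming sketch of the same flow (using the 0.2-era API names, which may differ in newer releases; the model path is the DirectML folder from the command above):

import onnxruntime_genai as og

# Load the ONNX model folder (DirectML variant shown; the CPU and CUDA folders work the same way).
model = og.Model("./Phi-3-mini-128k-instruct-onnx/directml/directml-int4-awq-block-128")
tokenizer = og.Tokenizer(model)

prompt = "<|user|>Why is the blueberry called the king of fruits?<|end|><|assistant|>"
params = og.GeneratorParams(model)
params.set_search_options(max_length=2048)
params.input_ids = tokenizer.encode(prompt)

# Generate the whole sequence in one call and decode it.
output_tokens = model.generate(params)
print(tokenizer.decode(output_tokens[0]))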
kvaishnavi changed discussion status to closed
