C++ onnxruntime+cuda behaves weirdly with cuda/cuda-int4-rtn-block-32 and cuda/cuda-fp16 models
#3 opened by idruker
Dear model developers,
I've observed weird behavior when running C++ onnxruntime compiled with CUDA on your cuda/cuda-int4-rtn-block-32 and cuda/cuda-fp16 ONNX models.
Steps to reproduce:
- compile my test application with onnxruntime 1.18.0 + CUDA
- load Llama-3.2-3B-Instruct-ONNX/cuda/cuda-int4-rtn-block-32/model.onnx or Llama-3.2-3B-Instruct-ONNX/cuda/cuda-fp16/model.onnx
- manage a conversation in which prompts and responses are kept in a chat history, using the chat template:
- create the prompt (P1) 'You are a pirate chatbot who always responds in pirate speak!'
- wait for a response (R11)
- create the prompt (P2) 'user Who are you?'
- wait for a response (R12)
- delete the entire chat history and reset the kv-cache, inputs and outputs, then repeat the exact same conversation:
- create the prompt (P1) 'You are a pirate chatbot who always responds in pirate speak!'
- wait for a response (R21)
- create the prompt (P2) 'user Who are you?'
- wait for a response (R22)
The expectation is that R21 == R11 and R22 == R12. In fact, response R22 is very different from R12 and looks like garbage!
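To make the flow concrete, below is a stripped-down sketch of my test harness. `generate_response` and `reset_state` are placeholders for my real decode loop and kv-cache/tensor reset (the real code builds the chat history with the Llama 3.2 chat template and runs the session autoregressively with past_key_values); only the session setup uses the actual onnxruntime C++ API:

```cpp
#include <onnxruntime_cxx_api.h>
#include <iostream>
#include <string>
#include <vector>

// Placeholder for the real decode loop: tokenize the chat history, run the
// session autoregressively with past_key_values, detokenize and append the
// model's reply to the history.
std::string generate_response(Ort::Session& /*session*/,
                              std::vector<std::string>& history,
                              const std::string& prompt) {
    history.push_back(prompt);
    return "<model reply>";
}

// Placeholder for the reset between the two conversations: clear the chat
// history and re-create the kv-cache, input and output tensors.
void reset_state(std::vector<std::string>& history) {
    history.clear();
}

int main() {
    Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "llama-repro");
    Ort::SessionOptions options;
    OrtCUDAProviderOptions cuda_options{};
    options.AppendExecutionProvider_CUDA(cuda_options);

    // Path is a narrow string on Linux; on Windows it must be a wide string.
    Ort::Session session(env,
        "Llama-3.2-3B-Instruct-ONNX/cuda/cuda-fp16/model.onnx", options);

    std::vector<std::string> history;

    // First conversation.
    auto r11 = generate_response(session, history,
        "You are a pirate chatbot who always responds in pirate speak!");
    auto r12 = generate_response(session, history, "Who are you?");

    // Reset everything and repeat the exact same conversation.
    reset_state(history);
    auto r21 = generate_response(session, history,
        "You are a pirate chatbot who always responds in pirate speak!");
    auto r22 = generate_response(session, history, "Who are you?");

    // Expected: r21 == r11 and r22 == r12. With the CUDA provider, r22 differs.
    std::cout << (r22 == r12 ? "match" : "MISMATCH") << std::endl;
    return 0;
}
```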
Observations:
- The issue does not occur if the same code is compiled and run without CUDA.
- The issue does not occur if the CPU execution provider is used instead of CUDA, even though onnxruntime is compiled with CUDA (see the sketch after this list)
- The issue does not occur if the model is cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4
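For the CPU-provider comparison above, the only change is that `AppendExecutionProvider_CUDA` is never called, so the session falls back to onnxruntime's default CPU execution provider; a minimal sketch of that configuration (session creation only, same model file):

```cpp
#include <onnxruntime_cxx_api.h>

int main() {
    Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "llama-cpu-check");

    // No AppendExecutionProvider_CUDA(...) call: the session runs on the
    // default CPU execution provider, even though this onnxruntime build
    // was compiled with CUDA support. In this configuration R22 == R12.
    Ort::SessionOptions cpu_options;
    Ort::Session cpu_session(env,
        "Llama-3.2-3B-Instruct-ONNX/cuda/cuda-int4-rtn-block-32/model.onnx",
        cpu_options);
    return 0;
}
```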
Questions:
- is there a bug in the way the CUDA flavors of these models are prepared?
- is there a bug in onnxruntime+CUDA?