C++ onnxruntime+cuda behaves weirdly with cuda/cuda-int4-rtn-block-32 and cuda/cuda-fp16 models
#3 opened by idruker
Dear model developers,
I've observed weird behavior when running C++ onnxruntime compiled with CUDA on your cuda/cuda-int4-rtn-block-32 and cuda/cuda-fp16 ONNX models.
Steps to reproduce:
- compile my test application with onnxruntime 1.18.0 + CUDA
- load Llama-3.2-3B-Instruct-ONNX/cuda/cuda-int4-rtn-block-32/model.onnx or Llama-3.2-3B-Instruct-ONNX/cuda/cuda-fp16/model.onnx
- manage a conversation in which prompts and responses are kept in a chat history, using the chat template:
- create the prompt (P1) 'You are a pirate chatbot who always responds in pirate speak!'
- wait for a response (R11)
- create the prompt (P2) 'user Who are you?'
- wait for a response (R12)
- delete the entire chat history and reset the kv-cache, inputs and outputs, then repeat the exact same conversation:
- create the prompt (P1) 'You are a pirate chatbot who always responds in pirate speak!'
- wait for a response (R21)
- create the prompt (P2) 'user Who are you?'
- wait for a response (R22)
The expectation is that R21 == R11 and R22 == R12. In fact, response R22 is very different from R12 and looks like garbage!
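To make the flow concrete, below is a stripped-down sketch of my test harness. `generate_response` and `reset_state` are placeholders for my real decode loop and kv-cache/tensor reset (the real code builds the chat history with the Llama 3.2 chat template and runs the session autoregressively with past_key_values); only the session setup uses the actual onnxruntime C++ API:

```cpp
#include <onnxruntime_cxx_api.h>
#include <iostream>
#include <string>
#include <vector>

// Placeholder for the real decode loop: tokenize the chat history, run the
// session autoregressively with past_key_values, detokenize and append the
// model's reply to the history.
std::string generate_response(Ort::Session& /*session*/,
                              std::vector<std::string>& history,
                              const std::string& prompt) {
    history.push_back(prompt);
    return "<model reply>";
}

// Placeholder for the reset between the two conversations: clear the chat
// history and re-create the kv-cache, input and output tensors.
void reset_state(std::vector<std::string>& history) {
    history.clear();
}

int main() {
    Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "llama-repro");
    Ort::SessionOptions options;
    OrtCUDAProviderOptions cuda_options{};
    options.AppendExecutionProvider_CUDA(cuda_options);

    // Path is a narrow string on Linux; on Windows it must be a wide string.
    Ort::Session session(env,
        "Llama-3.2-3B-Instruct-ONNX/cuda/cuda-fp16/model.onnx", options);

    std::vector<std::string> history;

    // First conversation.
    auto r11 = generate_response(session, history,
        "You are a pirate chatbot who always responds in pirate speak!");
    auto r12 = generate_response(session, history, "Who are you?");

    // Reset everything and repeat the exact same conversation.
    reset_state(history);
    auto r21 = generate_response(session, history,
        "You are a pirate chatbot who always responds in pirate speak!");
    auto r22 = generate_response(session, history, "Who are you?");

    // Expected: r21 == r11 and r22 == r12. With the CUDA provider, r22 differs.
    std::cout << (r22 == r12 ? "match" : "MISMATCH") << std::endl;
    return 0;
}
```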
Observations:
- The issue does not occur if the same code is compiled and run without CUDA.
- The issue does not occur if the CPU execution provider is used instead of CUDA, even though onnxruntime is compiled with CUDA (see the sketch after this list)
- The issue does not occur if the model is cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4
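For the CPU-provider comparison above, the only change is that `AppendExecutionProvider_CUDA` is never called, so the session falls back to onnxruntime's default CPU execution provider; a minimal sketch of that configuration (session creation only, same model file):

```cpp
#include <onnxruntime_cxx_api.h>

int main() {
    Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "llama-cpu-check");

    // No AppendExecutionProvider_CUDA(...) call: the session runs on the
    // default CPU execution provider, even though this onnxruntime build
    // was compiled with CUDA support. In this configuration R22 == R12.
    Ort::SessionOptions cpu_options;
    Ort::Session cpu_session(env,
        "Llama-3.2-3B-Instruct-ONNX/cuda/cuda-int4-rtn-block-32/model.onnx",
        cpu_options);
    return 0;
}
```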
Questions:
- is there a bug in the way the CUDA flavors of these models are prepared?
- is there a bug in onnxruntime+CUDA?