How to convert to ONNX?

#2
by HoangHa - opened

I want to convert a variant of the Llama-3 model to ONNX. I tried this example but had no luck. How did you convert Llama-3 successfully?

https://github.com/microsoft/onnxruntime-inference-examples/blob/8fcc97e1e035d57ffdfd19b76732e3fc79d8c2a6/python/models/llama/LLaMA-2%20E2E%20Notebook.ipynb

At Aladeen University we strive for excellence.
Can you tell me the hardware you’re using?

I'm using a machine with around 128GB of RAM and no GPU. Is a GPU needed? I only have a small GPU.

What export format are you trying to produce? For example, what quantization and data format?

My end goal is to export as AWQ, but for now I'm trying float16 first to understand the process.

There are some more instructions you can try here.
https://github.com/microsoft/onnxruntime/tree/main/onnxruntime/python/tools/transformers/models/llama
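For reference, the export in that folder is driven by its convert_to_onnx script. A CPU-only export would be invoked roughly like this (the model id and output path are placeholders, and the exact flags can differ between onnxruntime versions, so check that README; if your installed wheel doesn't ship the script, run it from a repo checkout with `python -m models.llama.convert_to_onnx` instead):

```
# Rough sketch; verify flag names against the README for your onnxruntime version.
python -m onnxruntime.transformers.models.llama.convert_to_onnx \
    -m meta-llama/Meta-Llama-3-8B \
    --output llama3-8b-onnx \
    --precision fp32 \
    --execution_provider cpu
```

I believe the fp16 examples in that README pair `--precision fp16` with the CUDA execution provider, so on a CPU-only box fp32 first (then INT8 quantization) may be the safer path.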

Do tell me the exact issue you are facing, so our super experts can look into it.

I am not quite sure whether specific quantization methods like AWQ or GPTQ are available in ORT; I might be wrong.
Do look at the docs here: https://onnxruntime.ai/docs/performance/model-optimizations/quantization.html
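What ORT itself ships there is its own INT8 quantization (dynamic/static), not AWQ or GPTQ. As a rough sketch, dynamic quantization of an already-exported ONNX file looks like this (paths are placeholders; note that a multi-gigabyte Llama export is saved with external data, so expect extra handling there as described in the quantization docs):

```python
from onnxruntime.quantization import quantize_dynamic, QuantType

# Placeholder paths: an existing fp32 ONNX export in, an int8 model out.
quantize_dynamic(
    model_input="llama3-8b-fp32.onnx",
    model_output="llama3-8b-int8.onnx",
    weight_type=QuantType.QInt8,
)
```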

I tried the onnxruntime example to convert Llama-3 to fp16 using CPU only, but I got this error. Maybe GQA isn't supported on CPU?

onnxruntime.capi.onnxruntime_pybind11_state.Fail: [ONNXRuntimeError] : 1 : FAIL : Node (/model/layers.0/self_attn/o_proj/MatMul) Op (MatMul) [ShapeInferenceError] Incompatible dimensions for matrix multiplication

There might be some bugs and issues. As Llama-3 is a newer model, use either nightly or the latest builds of torch and onnxruntime.
This is an exporter bug, I guess.
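Upgrading is a cheap way to rule out an already-fixed exporter bug (the onnxruntime nightly wheels come from a separate package index described in the ORT docs, so only the stable upgrade and the torch nightly index are shown here):

```
pip install --upgrade torch onnxruntime transformers
# torch nightly (CPU wheels), if the stable exporter still fails:
pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cpu
```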

Group Query Attention is GPU-specific, as stated here:

https://github.com/microsoft/Olive/blob/main/examples/llama2/README.md

Group Query Attention might not have significant performance benefits in CPU-specific workloads.
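If you want to double-check whether your variant actually uses grouped-query attention, the head counts in its config tell you (the model id below is just an example; substitute your own checkpoint):

```python
from transformers import AutoConfig

# Example model id; substitute your Llama-3 variant.
cfg = AutoConfig.from_pretrained("meta-llama/Meta-Llama-3-8B")

# GQA means fewer key/value heads than query (attention) heads.
print("attention heads:", cfg.num_attention_heads)
print("key/value heads:", cfg.num_key_value_heads)
print("uses GQA:", cfg.num_key_value_heads < cfg.num_attention_heads)
```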

I see. Thanks for your help.

HoangHa changed discussion status to closed
