Works great, much faster inference. Quantization possible?

#1
by jharianto - opened

Thank you for this! As per the title, I'm getting good results (the same rank output as the original model) while running much faster. Memory usage is about the same as the original model's. Is it possible to quantize these models to reduce the file size and memory footprint and speed up inference further?

Actually, I just tried it myself, and it turns out to be fairly easy to do. It seems quantization only works up to optimization level O3; I'm not sure whether quantizing O4 is possible at all or just needs more tweaking. Not bad: the file size dropped further, to about 1/4 of the original model, the memory footprint is lower than ONNX-O4's, inference is even faster, and so far the output is the same!
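In case it helps anyone, here is a minimal sketch of what this can look like with Optimum's ONNX Runtime integration; the paths and the avx512 target are my own placeholders, not details from this thread:

```python
# Minimal sketch: dynamic int8 quantization of an exported ONNX model
# using Optimum's ONNX Runtime integration. Paths and the avx512 target
# are placeholders, not details from this thread.
from optimum.onnxruntime import ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig

# Point the quantizer at the directory containing the exported model.onnx
quantizer = ORTQuantizer.from_pretrained("path/to/onnx-model")

# Dynamic quantization (no calibration data needed); pick the preset
# matching your CPU: arm64, avx2, avx512, or avx512_vnni
qconfig = AutoQuantizationConfig.avx512(is_static=False, per_channel=False)

# Writes model_quantized.onnx plus config files to save_dir
quantizer.quantize(save_dir="path/to/onnx-model-quantized",
                   quantization_config=qconfig)
```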

Owner

According to the documentation, O4 and O3 should represent the same level of optimization; the distinction is that O4 is exclusively for GPU usage. I will look into whether O3's performance is indeed superior.

https://huggingface.co/docs/optimum/onnxruntime/usage_guides/optimization
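For reference, a given optimization level can be applied with ORTOptimizer. A minimal sketch, assuming a sequence-classification model; the model class, model id, and paths are placeholders for illustration:

```python
# Minimal sketch: applying O3 graph optimization with Optimum.
# The model class, model id, and save_dir are assumptions.
from optimum.onnxruntime import ORTModelForSequenceClassification, ORTOptimizer
from optimum.onnxruntime.configuration import AutoOptimizationConfig

# Export the PyTorch model to ONNX on the fly
model = ORTModelForSequenceClassification.from_pretrained("model-id", export=True)

optimizer = ORTOptimizer.from_pretrained(model)

# The O1-O4 presets mirror the levels in the guide linked above;
# O4 additionally enables fp16 and is GPU-only
optimization_config = AutoOptimizationConfig.O3()
optimizer.optimize(save_dir="path/to/onnx-model-O3",
                   optimization_config=optimization_config)
```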

Furthermore, quantization can improve performance.
