quants
Hi, thanks for the ONNX version!
I'm a newbie to ONNX (never used it before), and my question is: can this model be quantized?
I read something about ONNX and quantization in the runtime docs, but the truth is I couldn't figure out which method or tool I should use to quantize this model for CPU and AVX512 use. Could you enlighten me on how to achieve quantization? I would greatly appreciate it 🙏🙏🙏
@prudant
Hi, yes, this model can be quantized, and I recommend trying the HF Optimum library for easy quantization. You can check their documentation here: https://huggingface.co/docs/optimum/onnxruntime/usage_guides/quantization
It would also be good to have a validation dataset you can use to verify that quantization didn't hurt the embeddings' accuracy too much on your downstream task. I'll probably try quantizing this model myself sometime later and can report back the results.
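If you prefer doing it from Python instead of the CLI, here is a minimal sketch of what dynamic quantization with Optimum looks like. The directory names `bge-m3-onnx` and `bge-m3-onnx-quantized` are just placeholders for wherever your export lives:

```python
from optimum.onnxruntime import ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig

# Load the exported ONNX model (a directory containing model.onnx)
quantizer = ORTQuantizer.from_pretrained("bge-m3-onnx", file_name="model.onnx")

# Dynamic int8 quantization targeting AVX512-VNNI capable CPUs
qconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)

# Writes the quantized model alongside the original config files
quantizer.quantize(save_dir="bge-m3-onnx-quantized", quantization_config=qconfig)
```

Dynamic quantization needs no calibration dataset, which is why it's the easiest starting point for CPU inference.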
thanks, will try it right now
Didn't work:
I followed the quantization guide (pretty simple steps):
```
[CONTAINER] ~/src/onnx $ optimum-cli onnxruntime quantize --onnx_model bge-m3-onnx/ --avx512_vnni -o quantized_model/
/home/dario/.local/lib/python3.10/site-packages/transformers/utils/hub.py:124: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
  warnings.warn(
Creating dynamic quantizer: QOperator (mode: IntegerOps, schema: u8/s8, channel-wise: False)
Quantizing model...
Traceback (most recent call last):
  File "/home/dario/.local/bin/optimum-cli", line 8, in <module>
    sys.exit(main())
  File "/home/dario/.local/lib/python3.10/site-packages/optimum/commands/optimum_cli.py", line 163, in main
    service.run()
  File "/home/dario/.local/lib/python3.10/site-packages/optimum/commands/onnxruntime/quantize.py", line 102, in run
    q.quantize(save_dir=save_dir, quantization_config=qconfig)
  File "/home/dario/.local/lib/python3.10/site-packages/optimum/onnxruntime/quantization.py", line 417, in quantize
    quantizer.quantize_model()
  File "/home/dario/.local/lib/python3.10/site-packages/onnxruntime/quantization/onnx_quantizer.py", line 403, in quantize_model
    op_quantizer.quantize()
  File "/home/dario/.local/lib/python3.10/site-packages/onnxruntime/quantization/operators/matmul.py", line 78, in quantize
    otype = self.quantizer.get_tensor_type(node.output[0], mandatory=True)
  File "/home/dario/.local/lib/python3.10/site-packages/onnxruntime/quantization/onnx_quantizer.py", line 461, in get_tensor_type
    raise RuntimeError(f"Unable to find data type for weight_name={tensor_name!r}")
RuntimeError: Unable to find data type for weight_name='/model/encoder/layer.0/attention/output/dense/MatMul_output_0'
```
I got that stack trace and couldn't find any information about the error on the net (I tried with your ONNX version of bge-m3).
No luck at this time :(
Interesting, I'll have to try the quantization myself sooner or later and debug this!
Hi @prudant
I finally tried the quantization myself too. I got the same error, but found this discussion about it: https://discuss.huggingface.co/t/optimum-library-optimization-and-quantization-fails/72629
I tried the suggestion there to downgrade onnxruntime to version 1.16, and that solved the issue for me. I was able to quantize the ONNX version of the BGE-M3 model, and on my information retrieval test set the accuracy was only about 1% worse than with the non-quantized version.
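In case it helps, after pinning onnxruntime (e.g. `pip install "onnxruntime==1.16.3"`), here is a rough sketch of how you could load the quantized model for CPU inference and sanity-check its embeddings. The directory and file names are placeholders, and I'm assuming the export behaves like a standard feature-extraction model with CLS pooling; this particular BGE-M3 export may expose its outputs differently:

```python
from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForFeatureExtraction

# Tokenizer from the original export, quantized weights from the new directory
tokenizer = AutoTokenizer.from_pretrained("bge-m3-onnx")
model = ORTModelForFeatureExtraction.from_pretrained(
    "bge-m3-onnx-quantized", file_name="model_quantized.onnx"
)  # uses the CPUExecutionProvider by default

inputs = tokenizer("a test sentence", return_tensors="pt")
out = model(**inputs)

# CLS pooling + L2 normalization, as commonly used for BGE dense embeddings
emb = out.last_hidden_state[:, 0]
emb = emb / emb.norm(dim=-1, keepdim=True)
print(emb.shape)
```

Running the same text through the original and quantized models and comparing cosine similarities is a quick way to spot a quantization regression before re-running a full evaluation.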
wow, and did you measure the inference time? I'm interested in running the model on CPU (not GPU for now). I have tried many settings but didn't get lucky. Could you please upload the quantized model to the hub? It would be very helpful.