quants
Hi, thanks for the ONNX version!
I'm a newbie to ONNX (never used it before), and my question is: can this model be quantized?
I read something about ONNX and quantization in the runtime docs, but the truth is I couldn't figure out which method or tool I should use to quantize this model for CPU and AVX512 use. Could you enlighten me on how to achieve quantization? I would greatly appreciate it 🙏🙏🙏
@prudant
Hi, yes, this model can be quantized, and I recommend trying the HF Optimum library for easy quantization. You can check their documentation here: https://huggingface.co/docs/optimum/onnxruntime/usage_guides/quantization
It would also be good to have a validation dataset you can use to verify that quantization didn't hurt the embeddings' accuracy too much on your downstream task. I'll probably try quantizing this model myself sometime later and can report back the results.
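If you prefer doing it from Python instead of the CLI, here is a minimal sketch of what dynamic quantization with Optimum looks like. The directory names `bge-m3-onnx` and `bge-m3-onnx-quantized` are just placeholders for wherever your export lives:

```python
from optimum.onnxruntime import ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig

# Load the exported ONNX model (a directory containing model.onnx)
quantizer = ORTQuantizer.from_pretrained("bge-m3-onnx", file_name="model.onnx")

# Dynamic int8 quantization targeting AVX512-VNNI capable CPUs
qconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)

# Writes the quantized model alongside the original config files
quantizer.quantize(save_dir="bge-m3-onnx-quantized", quantization_config=qconfig)
```

Dynamic quantization needs no calibration dataset, which is why it's the easiest starting point for CPU inference.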
thanks, will try it right now
Didn't work:
I followed the quantization guide (pretty simple steps):
```
[CONTAINER] ~/src/onnx $ optimum-cli onnxruntime quantize --onnx_model bge-m3-onnx/ --avx512_vnni -o quantized_model/
/home/dario/.local/lib/python3.10/site-packages/transformers/utils/hub.py:124: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
  warnings.warn(
Creating dynamic quantizer: QOperator (mode: IntegerOps, schema: u8/s8, channel-wise: False)
Quantizing model...
Traceback (most recent call last):
  File "/home/dario/.local/bin/optimum-cli", line 8, in <module>
    sys.exit(main())
  File "/home/dario/.local/lib/python3.10/site-packages/optimum/commands/optimum_cli.py", line 163, in main
    service.run()
  File "/home/dario/.local/lib/python3.10/site-packages/optimum/commands/onnxruntime/quantize.py", line 102, in run
    q.quantize(save_dir=save_dir, quantization_config=qconfig)
  File "/home/dario/.local/lib/python3.10/site-packages/optimum/onnxruntime/quantization.py", line 417, in quantize
    quantizer.quantize_model()
  File "/home/dario/.local/lib/python3.10/site-packages/onnxruntime/quantization/onnx_quantizer.py", line 403, in quantize_model
    op_quantizer.quantize()
  File "/home/dario/.local/lib/python3.10/site-packages/onnxruntime/quantization/operators/matmul.py", line 78, in quantize
    otype = self.quantizer.get_tensor_type(node.output[0], mandatory=True)
  File "/home/dario/.local/lib/python3.10/site-packages/onnxruntime/quantization/onnx_quantizer.py", line 461, in get_tensor_type
    raise RuntimeError(f"Unable to find data type for weight_name={tensor_name!r}")
RuntimeError: Unable to find data type for weight_name='/model/encoder/layer.0/attention/output/dense/MatMul_output_0'
```
I got that stack trace and couldn't find any information about the error on the net (I tried with your ONNX version of bge-m3).
No luck at this time :(
Interesting, I'll have to try the quantization myself sooner or later and debug this!
Hi @prudant
I finally tried the quantization myself too. I got the same error, but found this discussion about it: https://discuss.huggingface.co/t/optimum-library-optimization-and-quantization-fails/72629
I tried the suggestion there to downgrade onnxruntime to version 1.16, and that solved the issue for me. I was able to quantize the ONNX version of the BGE-M3 model, and on my information retrieval test set the accuracy was only about 1% worse than with the non-quantized version.
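In case it helps, after pinning onnxruntime (e.g. `pip install "onnxruntime==1.16.3"`), here is a rough sketch of how you could load the quantized model for CPU inference and sanity-check its embeddings. The directory and file names are placeholders, and I'm assuming the export behaves like a standard feature-extraction model with CLS pooling; this particular BGE-M3 export may expose its outputs differently:

```python
from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForFeatureExtraction

# Tokenizer from the original export, quantized weights from the new directory
tokenizer = AutoTokenizer.from_pretrained("bge-m3-onnx")
model = ORTModelForFeatureExtraction.from_pretrained(
    "bge-m3-onnx-quantized", file_name="model_quantized.onnx"
)  # uses the CPUExecutionProvider by default

inputs = tokenizer("a test sentence", return_tensors="pt")
out = model(**inputs)

# CLS pooling + L2 normalization, as commonly used for BGE dense embeddings
emb = out.last_hidden_state[:, 0]
emb = emb / emb.norm(dim=-1, keepdim=True)
print(emb.shape)
```

Running the same text through the original and quantized models and comparing cosine similarities is a quick way to spot a quantization regression before re-running a full evaluation.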
wow, and did you measure the inference time? I'm interested in running the model on CPU (not GPU for now). I have tried many settings but didn't get lucky. Could you please upload the quantized model to the hub? It would be very helpful.