Optimize inference speed

#9
by CoolWP - opened

Can ONNX optimization be applied to improve inference speed?

Beijing Academy of Artificial Intelligence org

Yes. There are some open-source ONNX models on Hugging Face, like https://huggingface.co/aapot/bge-m3-onnx
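
For readers who want to try that route, here is a minimal sketch of running such an export with ONNX Runtime on CPU. The local file name model.onnx and the output layout are assumptions; check the export's own model card for the exact inputs and outputs.

```python
# Minimal sketch: dense-embedding inference from an ONNX export of bge-m3.
# Assumes the export has been downloaded locally as model.onnx and that its
# first output is the token-level hidden states (verify against the export's docs).
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-m3")
session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])

encoded = tokenizer(["What is BGE-M3?"], padding=True, truncation=True, return_tensors="np")
expected = {i.name for i in session.get_inputs()}
outputs = session.run(None, {k: v for k, v in encoded.items() if k in expected})

# CLS pooling over the hidden states gives the dense retrieval embedding.
dense = outputs[0][:, 0]
dense = dense / np.linalg.norm(dense, axis=-1, keepdims=True)
print(dense.shape)
```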

@michaelfeil hi!, nice project, I have 2 questions:
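
For context, once an infinity server is running with BAAI/bge-m3, it can be queried over its OpenAI-compatible REST API. A minimal sketch, assuming infinity's default port 7997 and the standard OpenAI embeddings response schema:

```python
# Minimal sketch: request dense embeddings from a running infinity server.
# The port (7997) and response layout are assumptions based on infinity's
# defaults; adjust them to your deployment.
import requests

resp = requests.post(
    "http://localhost:7997/embeddings",
    json={"model": "BAAI/bge-m3", "input": ["What is BGE-M3?"]},
    timeout=30,
)
resp.raise_for_status()
embedding = resp.json()["data"][0]["embedding"]
print(len(embedding))  # bge-m3 dense embeddings are 1024-dimensional
```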

@michaelfeil Hi! Nice project. I have two questions:

  1. Will it accelerate CPU inference?
  2. On GPU, will it reduce VRAM usage, or are only performance optimizations supported?

I'm running low on VRAM.

It will reduce VRAM usage by about half by using fp16 precision, and it can dispatch e.g. memory-efficient attention. If you go for the full sequence length, I would suggest limiting the batch size in infinity to 8.
You can also run ONNX inference (there is no ONNX version of this model at this point in time), which will give you best-in-class acceleration on Intel/AMD CPUs.
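
Applying that advice, here is a sketch of the corresponding configuration with infinity's Python engine API. Parameter and method names follow the infinity README at the time of writing and may have changed, so verify against the project docs.

```python
# Minimal sketch: bge-m3 in infinity with fp16 weights and a small batch size
# to keep VRAM in check at full sequence length (argument names are assumptions
# based on the infinity README; check the current docs).
import asyncio
from infinity_emb import AsyncEmbeddingEngine, EngineArgs

engine = AsyncEmbeddingEngine.from_args(
    EngineArgs(
        model_name_or_path="BAAI/bge-m3",
        engine="torch",
        dtype="float16",  # roughly halves VRAM vs. float32 weights
        batch_size=8,     # cap the dynamic batch to limit peak activation memory
    )
)

async def main() -> None:
    async with engine:  # starts the asynchronous batching loop
        embeddings, usage = await engine.embed(sentences=["What is BGE-M3?"])
    print(len(embeddings[0]), usage)

asyncio.run(main())
```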

@CoolWP Hi!

I'm trying infinity with BAAI/bge-m3, but I'm only getting the embedding results, and I suspect the rerank endpoint will not work for getting the scores. Is there any way to get the model scores?

For example:

```
{
    'colbert': [0.7796499729156494, 0.4621465802192688, 0.4523794651031494, 0.7898575067520142],
    'sparse': [0.195556640625, 0.00879669189453125, 0.0, 0.1802978515625],
    'dense': [0.6259765625, 0.347412109375, 0.349853515625, 0.67822265625],
    'sparse+dense': [0.482503205537796, 0.23454029858112335, 0.2332356721162796, 0.5122477412223816],
    'colbert+sparse+dense': [0.6013619303703308, 0.3255828022956848, 0.32089319825172424, 0.6232916116714478]
}
```

It would be very useful, because in my opinion this feature is the most relevant one for this great multilingual model, maybe through the rerank endpoint.

regards
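
For reference, the combined scores shown above can also be computed directly with the FlagEmbedding library, following the usage on the BAAI/bge-m3 model card; a minimal sketch:

```python
# Minimal sketch: hybrid scoring (dense + sparse + ColBERT) with FlagEmbedding,
# which produces the same score dictionary format as shown above.
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)

sentence_pairs = [
    ["What is BGE M3?", "BGE M3 is an embedding model supporting dense, sparse and multi-vector retrieval."],
    ["What is BGE M3?", "Paris is the capital of France."],
]

scores = model.compute_score(
    sentence_pairs,
    weights_for_different_modes=[0.4, 0.2, 0.4],  # weights for dense, sparse, colbert
)
print(scores["colbert+sparse+dense"])
```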
