Optimize inference speed

#9
by CoolWP - opened

Can ONNX optimization be applied to improve inference speed?

Beijing Academy of Artificial Intelligence org

Yes. There are some open-source ONNX models on Hugging Face, like https://huggingface.co/aapot/bge-m3-onnx
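
For readers who want to try that route, here is a minimal sketch of running such an export with ONNX Runtime on CPU. The local file name model.onnx and the output layout are assumptions; check the export's own model card for the exact inputs and outputs.

```python
# Minimal sketch: dense-embedding inference from an ONNX export of bge-m3.
# Assumes the export has been downloaded locally as model.onnx and that its
# first output is the token-level hidden states (verify against the export's docs).
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-m3")
session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])

encoded = tokenizer(["What is BGE-M3?"], padding=True, truncation=True, return_tensors="np")
expected = {i.name for i in session.get_inputs()}
outputs = session.run(None, {k: v for k, v in encoded.items() if k in expected})

# CLS pooling over the hidden states gives the dense retrieval embedding.
dense = outputs[0][:, 0]
dense = dense / np.linalg.norm(dense, axis=-1, keepdims=True)
print(dense.shape)
```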

@michaelfeil hi!, nice project, I have 2 questions:
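
For context, once an infinity server is running with BAAI/bge-m3, it can be queried over its OpenAI-compatible REST API. A minimal sketch, assuming infinity's default port 7997 and the standard OpenAI embeddings response schema:

```python
# Minimal sketch: request dense embeddings from a running infinity server.
# The port (7997) and response layout are assumptions based on infinity's
# defaults; adjust them to your deployment.
import requests

resp = requests.post(
    "http://localhost:7997/embeddings",
    json={"model": "BAAI/bge-m3", "input": ["What is BGE-M3?"]},
    timeout=30,
)
resp.raise_for_status()
embedding = resp.json()["data"][0]["embedding"]
print(len(embedding))  # bge-m3 dense embeddings are 1024-dimensional
```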

@michaelfeil Hi! Nice project. I have two questions:

  1. Will it accelerate CPU inference?
  2. On GPU, will it reduce VRAM usage, or are only performance optimizations supported?

I'm running low on VRAM.

It will reduce VRAM usage by about half by using fp16 precision, and it can dispatch e.g. memory-efficient attention. If you go for the full sequence length, I would suggest limiting the batch size in infinity to 8.
You can also run ONNX inference (there is no ONNX version of this model at this point in time), which will give you best-in-class acceleration on Intel/AMD CPUs.
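
Applying that advice, here is a sketch of the corresponding configuration with infinity's Python engine API. Parameter and method names follow the infinity README at the time of writing and may have changed, so verify against the project docs.

```python
# Minimal sketch: bge-m3 in infinity with fp16 weights and a small batch size
# to keep VRAM in check at full sequence length (argument names are assumptions
# based on the infinity README; check the current docs).
import asyncio
from infinity_emb import AsyncEmbeddingEngine, EngineArgs

engine = AsyncEmbeddingEngine.from_args(
    EngineArgs(
        model_name_or_path="BAAI/bge-m3",
        engine="torch",
        dtype="float16",  # roughly halves VRAM vs. float32 weights
        batch_size=8,     # cap the dynamic batch to limit peak activation memory
    )
)

async def main() -> None:
    async with engine:  # starts the asynchronous batching loop
        embeddings, usage = await engine.embed(sentences=["What is BGE-M3?"])
    print(len(embeddings[0]), usage)

asyncio.run(main())
```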

@CoolWP Hi!

I'm trying infinity with BAAI/bge-m3, but I'm only getting the embedding results, and I suspect the rerank endpoint will not work for getting the scores. Is there any way to get the model scores?

For example:

```
{
    'colbert': [0.7796499729156494, 0.4621465802192688, 0.4523794651031494, 0.7898575067520142],
    'sparse': [0.195556640625, 0.00879669189453125, 0.0, 0.1802978515625],
    'dense': [0.6259765625, 0.347412109375, 0.349853515625, 0.67822265625],
    'sparse+dense': [0.482503205537796, 0.23454029858112335, 0.2332356721162796, 0.5122477412223816],
    'colbert+sparse+dense': [0.6013619303703308, 0.3255828022956848, 0.32089319825172424, 0.6232916116714478]
}
```

It would be very useful, because in my opinion this feature is the most relevant one for this great multilingual model, maybe through the rerank endpoint.

regards
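
For reference, the combined scores shown above can also be computed directly with the FlagEmbedding library, following the usage on the BAAI/bge-m3 model card; a minimal sketch:

```python
# Minimal sketch: hybrid scoring (dense + sparse + ColBERT) with FlagEmbedding,
# which produces the same score dictionary format as shown above.
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)

sentence_pairs = [
    ["What is BGE M3?", "BGE M3 is an embedding model supporting dense, sparse and multi-vector retrieval."],
    ["What is BGE M3?", "Paris is the capital of France."],
]

scores = model.compute_score(
    sentence_pairs,
    weights_for_different_modes=[0.4, 0.2, 0.4],  # weights for dense, sparse, colbert
)
print(scores["colbert+sparse+dense"])
```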
