inference and generation runtime - how to reduce latency

#38
by wamozart - opened

Hi all,

This is a great model, but I was wondering how I can speed up inference. My app accepts two images plus a text prompt and generates a comparison of them.
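For context, the inference call follows roughly the standard multi-image pattern from the Phi-3-vision model card (a minimal sketch; the file paths, prompt, and generation settings below are just placeholders, not my exact code):

```python
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Phi-3-vision-128k-instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="cuda", torch_dtype="auto", trust_remote_code=True
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# Two images referenced by <|image_1|> / <|image_2|> placeholders in the prompt
images = [Image.open("image_a.jpg"), Image.open("image_b.jpg")]
messages = [
    {"role": "user",
     "content": "<|image_1|>\n<|image_2|>\nCompare these two images."}
]
prompt = processor.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = processor(prompt, images, return_tensors="pt").to("cuda")

generate_ids = model.generate(**inputs, max_new_tokens=256, do_sample=False)
# Strip the prompt tokens before decoding
generate_ids = generate_ids[:, inputs["input_ids"].shape[1]:]
response = processor.batch_decode(generate_ids, skip_special_tokens=True)[0]
print(response)
```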

Running on an EC2 g4.2xl instance, inference and response time is about 5-6 seconds. I've tried a newer-generation GPU (H family) but didn't see any improvement, which is kind of weird. I also tried to load the model in 4-bit but ran into some issues. The only improvement I saw was with onnxruntime-genai (https://huggingface.co/microsoft/Phi-3-vision-128k-instruct-onnx-cuda), but unfortunately that implementation doesn't support multiple images as input.
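In case it helps pinpoint the 4-bit issues, the loading path I had in mind is the standard bitsandbytes route via `BitsAndBytesConfig` (a sketch under those assumptions, not verified on my setup):

```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor, BitsAndBytesConfig

model_id = "microsoft/Phi-3-vision-128k-instruct"

# NF4 quantization; float16 compute since T4-class GPUs (g4 instances) lack bfloat16 support
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="cuda",
    trust_remote_code=True,
    # flash_attention_2 needs Ampere or newer; T4 (Turing) has to fall back to eager
    _attn_implementation="eager",
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
```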

I'd be happy to hear your suggestions.

Thanks
