extremely variable response time for inference?

#21
opened by silvacarl

This model is awesome. But for some reason we are getting extremely variable response times for inference, anywhere between 0.40 seconds and 15 seconds on an A40.

Could this be caused by the prompt format or other inference parameters?

Berkeley-Nest org

That's interesting... I've actually never experienced this before. What inference package are you using: TGI, vLLM, or something else?

It's just something we've run into while testing other models for quite a while, so we are checking whether something weird is happening on our end. Just thought it would be good to post here and check.

BUT: I CAN TELL YOU SO FAR IT'S INSANELY AWESOME. 8-)

Like crazy accurate.

Berkeley-Nest org

Haha, thank you! I'm glad you like it! Could it also be due to the Mistral architecture itself? I'm not sure whether Mistral base / instruct have the same issue.

Yes, it could be that as well. We will check that; running additional tests now.

It's kind of interesting: it flies, then for some reason it will sit on one inference for about 15 seconds, then it flies again.

Just in case, what is the prompt format? Can you post an example?

Berkeley-Nest org

The prompt format is listed in the model card. FYI, it is:

GPT4 Correct User: Hello<|end_of_turn|>GPT4 Correct Assistant: Hi<|end_of_turn|>GPT4 Correct User: How are you today?<|end_of_turn|>GPT4 Correct Assistant:
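For reference, here is a minimal sketch of feeding that format to the model with plain transformers; the checkpoint name (berkeley-nest/Starling-LM-7B-alpha) and the generation settings are assumptions for illustration, not something taken from this thread.

```python
# Minimal sketch: running the prompt format above with plain transformers.
# Checkpoint name and generation settings are assumptions, not from this thread.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "berkeley-nest/Starling-LM-7B-alpha"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Build the multi-turn prompt exactly as shown in the model card format.
prompt = (
    "GPT4 Correct User: Hello<|end_of_turn|>"
    "GPT4 Correct Assistant: Hi<|end_of_turn|>"
    "GPT4 Correct User: How are you today?<|end_of_turn|>"
    "GPT4 Correct Assistant:"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=256)

# Decode only the newly generated tokens, not the echoed prompt.
new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))
```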

But I don't think the prompt format will change inference speed. Is it only happening for Starling and not for other Mistral-based models? That is very mysterious...

Yeah, tracing the code to see what's up; it's really weird.

Is it slow for the same prompt consistently?
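One rough way to check, as a sketch (again assuming the berkeley-nest/Starling-LM-7B-alpha checkpoint and plain transformers generation), is to time the same prompt several times and also log how many tokens each run produced:

```python
# Rough timing sketch: run the same prompt repeatedly and log latency plus
# the number of generated tokens, to see whether slow runs are consistent.
# Checkpoint name and parameters are assumptions for illustration only.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "berkeley-nest/Starling-LM-7B-alpha"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

prompt = "GPT4 Correct User: How are you today?<|end_of_turn|>GPT4 Correct Assistant:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

for i in range(10):
    start = time.perf_counter()
    output_ids = model.generate(**inputs, max_new_tokens=256, do_sample=False)
    elapsed = time.perf_counter() - start
    n_new = output_ids.shape[1] - inputs["input_ids"].shape[1]
    print(f"run {i}: {elapsed:.2f} s, {n_new} new tokens")
```

If the slow runs line up with runs that generate far more tokens, the variance may be in output length rather than per-token speed.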

The model is just too slow, actually. I have been testing it on an AWS instance with an NVIDIA Tesla T4 GPU, and it takes 2-3 minutes for each response. Once, it took about 9 minutes to generate a simple response. IDK what is going on, and my internet connection is good too.
