How to parallelize StarCoder inference?

#93
by Cubby9059 - opened

Hello,
I am trying to deploy StarCoder as an internal coding assistant for a team of 100 people. However, the model takes too long to generate predictions, especially when parallel requests are made. I am using an NVIDIA A100 40GB. Any suggestions on how to speed up inference?
Thank you.

BigCode org

You can try deploying the model with the Text Generation Inference (TGI) library, which is what we use for the inference endpoints. Its continuous batching is designed to handle many concurrent requests on a single GPU.
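A minimal sketch of what querying a TGI server could look like, assuming you have launched it in Docker (e.g. the `ghcr.io/huggingface/text-generation-inference` image with `--model-id bigcode/starcoder`) and it is reachable locally on port 8080; the URL, prompts, and generation parameters below are placeholders for illustration, not a definitive setup:

```python
# Assumes a TGI server is already running locally, e.g. roughly:
#   docker run --gpus all -p 8080:80 ghcr.io/huggingface/text-generation-inference:latest \
#       --model-id bigcode/starcoder
# Adjust the image tag, port, and flags to your environment.
import concurrent.futures
import requests

TGI_URL = "http://127.0.0.1:8080/generate"  # assumed local TGI endpoint


def complete(prompt: str) -> str:
    """Send one generation request to the TGI /generate endpoint."""
    resp = requests.post(
        TGI_URL,
        json={"inputs": prompt, "parameters": {"max_new_tokens": 64}},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["generated_text"]


# Fire several requests concurrently; the server batches them on the GPU,
# so throughput scales much better than serving requests one at a time.
prompts = [f"def fibonacci_{i}(n):" for i in range(8)]
with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
    for prompt, completion in zip(prompts, pool.map(complete, prompts)):
        print(prompt, "->", completion[:60].replace("\n", " "))
```

In practice each user's IDE or chat client would hit the same endpoint, and TGI takes care of batching the concurrent requests on the A100.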
