Can the quantized model be loaded on a GPU for faster inference?

#1
opened by MohamedRashad

Is there a way to load this model onto a GPU and benefit from the acceleration?

How?
And can I use it with something like NVIDIA Triton?

Looks like the inference binary needs to be compiled with CUDA support for this - https://github.com/ggerganov/llama.cpp#blas-build
But maybe it's better to use a version of this model quantized for NVIDIA GPUs - something like starchat-alpha-GPTQ. I don't have an NVIDIA GPU, so I don't know whether such a version exists or how to create it.

You can run this ggml model in llama.cpp with a GPU.

I don't think this model can be run by llama.cpp just yet - https://github.com/ggerganov/llama.cpp/issues/1441

For now, there is only the example code here - https://github.com/ggerganov/ggml/tree/master/examples/starcoder

The code works, but it's not very useful: it loads the model, generates a reply to a single prompt, and shuts down. I keep experimenting with this code to get a conversation loop, but I'm having trouble with it - it looks like I haven't figured out how to correctly manage memory. It breaks after a single iteration of the loop with "not enough memory in context". I'll see if I can do better.
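
For reference, the loop I'm experimenting with looks roughly like the sketch below. It reuses the example's helpers (starcoder_model_load, gpt_tokenize, starcoder_eval, gpt_sample_top_k_top_p), so it has to be compiled together with the example's model code; the signatures and field names are written from memory of the example's main.cpp and common.h, and the model filename is just a placeholder. My current guess is that the error comes from letting n_past plus the new tokens grow past n_ctx, so the sketch guards against that (crudely, by resetting the history) and does a warm-up eval so mem_per_token is estimated before the real prompts - I haven't verified that this is actually the cause.

```cpp
// Sketch of a multi-turn loop around the ggml starcoder example.
// Assumes this file is compiled together with the example's model code, so
// starcoder_model, starcoder_model_load, starcoder_eval, gpt_vocab,
// gpt_tokenize and gpt_sample_top_k_top_p are available; the signatures
// below follow ggml/examples/starcoder as far as I remember them.
#include "ggml.h"

#include <cstdio>
#include <iostream>
#include <random>
#include <string>
#include <vector>

int main() {
    gpt_vocab vocab;
    starcoder_model model;

    // placeholder filename - use whatever ggml file you downloaded
    if (!starcoder_model_load("starchat-alpha-ggml-q4_0.bin", model, vocab)) {
        fprintf(stderr, "failed to load model\n");
        return 1;
    }

    std::mt19937 rng(1234);
    std::vector<float> logits;
    size_t mem_per_token = 0;

    // warm-up eval so the example code can estimate its per-token scratch memory
    starcoder_eval(model, /*n_threads=*/4, 0, { 0, 1, 2, 3 }, logits, mem_per_token);

    int n_past = 0; // tokens already in the KV cache, carried across turns

    std::string line;
    while (std::getline(std::cin, line)) {
        std::vector<gpt_vocab::id> embd = gpt_tokenize(vocab, line + "\n");

        for (int i = 0; i < 128; ++i) { // cap the reply length per turn
            // keep the total context below n_ctx, otherwise ggml runs out of
            // space in its memory pool ("not enough memory in context")
            if (n_past + (int) embd.size() > model.hparams.n_ctx) {
                n_past = 0; // crude reset: drops the conversation history
            }

            if (!starcoder_eval(model, /*n_threads=*/4, n_past, embd, logits, mem_per_token)) {
                fprintf(stderr, "eval failed\n");
                break;
            }
            n_past += embd.size();

            const int n_vocab = model.hparams.n_vocab;
            gpt_vocab::id id = gpt_sample_top_k_top_p(
                vocab, logits.data() + logits.size() - n_vocab,
                /*top_k=*/40, /*top_p=*/0.9, /*temp=*/0.8, rng);

            if (id == 0) break; // 0 is the end-of-text token for StarCoder, I believe

            printf("%s", vocab.id_to_token[id].c_str());
            fflush(stdout);

            embd = { id }; // feed only the new token on the next iteration
        }
        printf("\n");
    }

    ggml_free(model.ctx);
    return 0;
}
```

If something like that still runs out of memory, the next thing I'd look at is the scratch buffer inside starcoder_eval itself - as far as I can tell it is resized from the mem_per_token estimate, so a bad estimate could also explain it.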
