Awesome... but slow (at least on my system)

#1
by monkmartinez - opened

Thank you for posting this, as I have been struggling to convert it myself.

My only problem is not with the model itself, but with inference performance. The 13GB unquantized models are just sooooo slow with text-generation-webui on my system. For reference, I have a Dell Precision with 32GB of RAM and a Quadro P6000 with 24GB of VRAM.

13B: 1.13 to 1.58 token/s with this model.
7B: 6.85 token/s

So you can see it is painful to use the 13B models at that kind of speed... do you have any ideas to speed it up?
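
For anyone who wants to reproduce that kind of token/s measurement outside the webui, a minimal sketch would look something like the following (this assumes a Hugging Face transformers causal LM; the model path and prompt are just placeholders):

```python
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder path -- point this at whichever local copy of the model you are testing
model_path = "models/koala-13B"

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype=torch.float16, device_map="auto"
)

prompt = "Explain the difference between RAM and VRAM."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Time a fixed-length generation and report tokens per second
start = time.time()
output = model.generate(**inputs, max_new_tokens=200)
elapsed = time.time() - start

new_tokens = output.shape[-1] - inputs["input_ids"].shape[-1]
print(f"{new_tokens} tokens in {elapsed:.1f}s = {new_tokens / elapsed:.2f} token/s")
```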

Yeah, quantisation should speed it up. You commented in my other repo, where I was describing all the problems I had loading Koala myself. That repo contains the GPTQ quantised versions of the 7B model.

The quantised files take up a lot less space than the full model, require less RAM or VRAM, and should infer faster.
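
As a rough back-of-the-envelope illustration of why (approximate weight-only numbers, ignoring activations, the KV cache and GPTQ's group-size metadata):

```python
# Rough weight-only size estimates for a 13B-parameter model,
# ignoring activations, KV cache and quantisation metadata.
params = 13e9

for label, bytes_per_weight in [("fp16", 2), ("int8", 1), ("4-bit GPTQ", 0.5)]:
    gb = params * bytes_per_weight / 1024**3
    print(f"{label:>11}: ~{gb:.0f} GB")

# fp16:        ~24 GB
# int8:        ~12 GB
# 4-bit GPTQ:  ~6 GB
```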

I've not run quantisation on the 13B model yet, but I'll do that shortly and upload the results soon.
