running this on a single RTX 4090 - optimizations / CUDA

#11 opened by ArbitrationCity

Very impressive model and optimization! I am currently running it locally (using text-generation-webui) on a single RTX 4090 and getting well over 20 tokens per second, though it varies quite a bit depending on context length. Streaming feels on par with ChatGPT, and the concise responses are good (you can get pretty decent long ones too using the "continue" button).
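
For anyone wanting to compare numbers outside the webui, here's a rough way I'd measure tokens/second with plain transformers (the model path below is just a placeholder, swap in your local weights):

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_PATH = "path/to/local-model"  # placeholder: point this at your local weights

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForCausalLM.from_pretrained(MODEL_PATH, torch_dtype=torch.float16).to("cuda")

prompt = "Explain the Ada Lovelace architecture in one paragraph."
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

torch.cuda.synchronize()
start = time.time()
output = model.generate(**inputs, max_new_tokens=256)
torch.cuda.synchronize()
elapsed = time.time() - start

new_tokens = output.shape[-1] - inputs["input_ids"].shape[-1]
print(f"{new_tokens / elapsed:.1f} tokens/s")
```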

I do get CUDA out-of-memory errors (on the server side) if the context exceeds a certain length, but it does not crash, and the next query processes just fine.
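
Not a webui-specific fix, but in plain PyTorch scripts one thing you can do is catch the OOM, clear the cached allocator memory, and retry with a trimmed prompt; a rough sketch:

```python
import torch

def generate_with_retry(model, input_ids, max_new_tokens=256):
    """Generate; on CUDA OOM, free cached memory and retry with a trimmed context."""
    try:
        return model.generate(input_ids=input_ids, max_new_tokens=max_new_tokens)
    except torch.cuda.OutOfMemoryError:
        torch.cuda.empty_cache()
        # Crude fallback: drop the oldest half of the prompt and try once more.
        trimmed = input_ids[:, input_ids.shape[-1] // 2 :]
        return model.generate(input_ids=trimmed, max_new_tokens=max_new_tokens)
```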

Does anyone have suggestions on what further optimizations might be possible? text-generation-webui uses CUDA 11.7, which I understand does not take full advantage of the Ada Lovelace architecture, so I'd be curious what kind of speedup might be possible with CUDA 11.8 or even 12.1 (which I understand works with the latest PyTorch builds).
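
If anyone wants to verify which CUDA build their PyTorch install is actually using (and flip on TF32, which Ada supports), the standard torch calls are:

```python
import torch

print(torch.__version__)               # PyTorch build
print(torch.version.cuda)              # CUDA version PyTorch was compiled against (e.g. 11.7, 11.8, 12.1)
print(torch.cuda.get_device_name(0))   # should report the RTX 4090

# TF32 matmuls are supported on Ampere/Ada and can speed up fp32 paths with little accuracy loss.
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True
```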

ArbitrationCity changed discussion title from "running this on a single RTX 4090" to "running this on a single RTX 4090 - optimizations / CUDA"
