Hardware recommendations

#3 by Romanserk - opened

First of all, I want to say thank you for all your efforts. In addition, could you provide some advice on the hardware specifications recommended to run models of this size effectively?

According to the model card, 4_0 should fit all layers into a 3090.
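For a rough sense of why that fits: q4_0 stores about 4.5 bits per weight (4-bit quantized weights plus per-block fp16 scales). A back-of-envelope sketch, assuming a ~13B-parameter model (an assumption consistent with the 40 layers mentioned below, not stated in this thread):

```python
# Back-of-envelope VRAM estimate for a q4_0 GGML model.
# The 13B parameter count is an assumption, not from the thread.
params = 13e9           # assumed ~13B parameters
bits_per_weight = 4.5   # q4_0: 4-bit weights + per-block fp16 scales
weights_gb = params * bits_per_weight / 8 / 1e9
print(f"~{weights_gb:.1f} GB of weights")  # ~7.3 GB, well under a 3090's 24 GB
# Leave a few GB of headroom for the KV cache and scratch buffers.
```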

I was going to download 5_1 and try that, offloading some layers to RAM. I haven't tried it yet, though, so maybe I'm wrong in my thinking.

Thank you for your reply. When you say it "should fit all layers", what does that mean? Can I expect response times faster than 100 tokens per second? And is it possible to run on multiple graphics cards?

Most of these models we use have 40 layers in them, so loading the entire model into VRAM greatly speeds up inference. When you load fewer than 40 layers (less than the entire model), a portion of the model is loaded into RAM instead, and a slowdown is inevitable. Here's a link that explains layers, contexts, inference, etc.: https://kipp.ly/blog/transformer-inference-arithmetic/
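In practice the layer split is just a knob you set at load time. A minimal sketch with llama-cpp-python (the model path and layer count here are placeholders, not from this thread):

```python
from llama_cpp import Llama

# Offload 20 of the ~40 layers to the GPU; the rest stay in system RAM
# and run on the CPU. model_path is a hypothetical placeholder.
llm = Llama(
    model_path="./model.ggmlv3.q5_1.bin",
    n_gpu_layers=20,   # layers kept in VRAM
    n_ctx=2048,        # context window
)
out = llm("Q: What does a transformer layer do? A:", max_tokens=64)
print(out["choices"][0]["text"])
```

The llama.cpp CLI equivalent is the `-ngl` / `--n-gpu-layers` flag; setting it to 40 or more keeps every layer in VRAM.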

I do not know of any consumer setup (graphics cards) that can produce 100 t/s.

Here I've noticed a nice improvement when combining CPU+GPU: I get ~10 t/s. Prior to this, on pure CPU, I was getting 5-6 t/s.

You can load the model across multiple graphics cards, but inference will only happen on one as far as I understand it.
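If you want to try it anyway, llama-cpp-python exposes a `tensor_split` option that spreads the weights across devices by fraction. A hedged sketch (the path and split ratios are assumptions):

```python
from llama_cpp import Llama

# Split the weights roughly evenly across two GPUs. This controls where
# the tensors live; whether compute also parallelizes depends on the build.
llm = Llama(
    model_path="./model.ggmlv3.q4_0.bin",  # hypothetical path
    n_gpu_layers=40,                       # keep all layers on the GPUs
    tensor_split=[0.5, 0.5],               # fraction of weights per card
)
```

The llama.cpp CLI equivalent is `--tensor-split 0.5,0.5`.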
