Inference VRAM Size

#4
by tjohnson - opened

Hello,

Thank you for such a tremendous contribution! I tried running inference on my RTX 4090 (24 GB VRAM) to no avail, so I used TheBloke's GGML and GPTQ versions, which work but are very slow. That is in direct contrast to your starchat playground, which is lightning fast...

I would like to try inference with this repo's (native) weights on a GPU to get somewhere in the ballpark of the speed of your playground, but how many GB do I need? Do I need to rent something like an A100 80GB?

Ditto. I have the same question.

I'm running it on an A100 80GB, and most of the time it uses about 30 GB of VRAM, peaking at 48 GB.

If you want to save money, you can load it in 4-bit mode; then you only need about 10 GB of GPU RAM.

More info: Making LLMs even more accessible with bitsandbytes, 4-bit quantization and QLoRA

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/starchat-beta")
# 4-bit loading requires bitsandbytes and accelerate to be installed
model = AutoModelForCausalLM.from_pretrained(
    "HuggingFaceH4/starchat-beta",
    load_in_4bit=True,
    device_map="auto",
)
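
Once it loads, generation works like any other causal LM. Here is a minimal sketch; the <|system|>/<|user|>/<|assistant|> prompt format and the <|end|> stop token are as I recall them from the starchat-beta model card, so double-check there:

# Minimal generation sketch (prompt template per the starchat-beta model card)
prompt = "<|system|>\n<|end|>\n<|user|>\nHow do I sort a list in Python?<|end|>\n<|assistant|>"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.2,
    top_p=0.95,
    # stop at the chat turn delimiter instead of running to max length
    eos_token_id=tokenizer.convert_tokens_to_ids("<|end|>"),
)
print(tokenizer.decode(outputs[0], skip_special_tokens=False))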

@Maxrubino which versions of the quantization-related dependencies are you running? I get this exception on the last line:

TypeError: GPTBigCodeForCausalLM.__init__() got an unexpected keyword argument 'load_in_4bit'

transformers==4.30.2
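
In case it helps, here is a sketch of the more explicit quantization_config route available in recent transformers releases (same idea as load_in_4bit, just spelled out). It assumes bitsandbytes >= 0.39.0 and accelerate are installed, per the blog post linked above:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# NF4 4-bit config as described in the bitsandbytes/QLoRA blog post
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/starchat-beta")
model = AutoModelForCausalLM.from_pretrained(
    "HuggingFaceH4/starchat-beta",
    quantization_config=bnb_config,
    device_map="auto",
)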
