VRAM requirements? #8
by yahma - opened
What are the VRAM requirements to run this model? Is it possible to run it 8-bit or 4-bit quantized on a single 24GB GPU?
I was able to load it in 8-bit with some offloading to CPU memory and disk via accelerate, but for some reason the generate method kept running indefinitely.
Hi! You can try 8-bit quantization, which is integrated into HF transformers and should reduce the memory footprint to ~20GB (plus you'll need several additional GB for inference).
After installing accelerate and bitsandbytes, load the model in 8-bit:
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('togethercomputer/GPT-NeoXT-Chat-Base-20B')
model = AutoModelForCausalLM.from_pretrained('togethercomputer/GPT-NeoXT-Chat-Base-20B', device_map="auto", load_in_8bit=True)
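As for the 4-bit part of the question: I haven't tried it on this model, but recent transformers and bitsandbytes releases support 4-bit loading via BitsAndBytesConfig. A minimal sketch (the version requirement, roughly transformers >= 4.30, is my assumption); at 4 bits the 20B weights come to roughly 10GB, which should fit on a single 24GB GPU:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# NF4 4-bit quantization, with activations computed in fp16
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
model = AutoModelForCausalLM.from_pretrained(
    'togethercomputer/GPT-NeoXT-Chat-Base-20B',
    device_map="auto",
    quantization_config=quant_config,
)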
Can you share the generation code as well? Like the complete code for taking a question and generating the output.
Sure, here is an example:
inputs = tokenizer("<human>: Where is Zurich?\n<bot>:", return_tensors='pt').to(model.device)
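# Sample a completion; top_p/top_k/temperature control randomness
# (repetition_penalty=1.0 means no penalty is applied)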
outputs = model.generate(
**inputs,
do_sample=True,
top_p=0.6,
top_k=40,
repetition_penalty=1.0,
temperature=0.8,
max_new_tokens=10,
)
print(tokenizer.decode(outputs[0]))
So human inputs should be prefixed with "<human>:" and bot responses should be prefixed with "<bot>:".
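To make that concrete, here is a hypothetical helper (the function name and the stop-at-next-turn trimming are my own, not from the model card) that wraps a question in this format and returns only the bot's reply, assuming the model and tokenizer loaded above:

def ask(question, max_new_tokens=64):
    # Build the prompt in the <human>/<bot> turn format shown above
    prompt = f"<human>: {question}\n<bot>:"
    inputs = tokenizer(prompt, return_tensors='pt').to(model.device)
    outputs = model.generate(
        **inputs,
        do_sample=True,
        top_p=0.6,
        top_k=40,
        temperature=0.8,
        max_new_tokens=max_new_tokens,
    )
    # Decode only the newly generated tokens, then cut at the next human turn
    completion = tokenizer.decode(
        outputs[0][inputs['input_ids'].shape[1]:],
        skip_special_tokens=True,
    )
    return completion.split("<human>:")[0].strip()

print(ask("Where is Zurich?"))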