What are the VRAM requirements to run this model? Is it possible to run it 8-bit or 4-bit quantized on a single 24GB GPU?
I was able to load it in 8-bit with some offloading to CPU memory and disk via accelerate, but for some reason the generate method kept running indefinitely.
Hi! You can try 8-bit quantization, which is integrated into HF Transformers and should reduce the memory footprint to ~20GB (plus several additional GB for inference).
With bitsandbytes installed, load the model in 8-bit:
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained('togethercomputer/GPT-NeoXT-Chat-Base-20B', device_map="auto", load_in_8bit=True)
Can you share the generate code as well? Like the complete code for taking a question and generating the output.
Sure, here is an example:
inputs = tokenizer("<human>: Where is Zurich?\n<bot>:", return_tensors='pt').to(model.device)
outputs = model.generate(
    **inputs,
    do_sample=True,
    top_p=0.6,
    top_k=40,
    repetition_penalty=1.0,
    temperature=0.8,
    max_new_tokens=10,
)
print(tokenizer.decode(outputs[0]))
So human inputs should be prefixed with "<human>:" and bot responses should be prefixed with "<bot>:".
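The turn format above can be wrapped in a small helper so every question is formatted consistently before being passed to the tokenizer. This is a minimal sketch; the helper name build_prompt is my own, not from the model's documentation:

def build_prompt(question: str) -> str:
    # Wrap a user question in the <human>/<bot> turn format the model expects;
    # the trailing "<bot>:" cues the model to produce the assistant turn.
    return f"<human>: {question}\n<bot>:"

print(build_prompt("Where is Zurich?"))
# <human>: Where is Zurich?
# <bot>:

The output of build_prompt can then be fed directly to tokenizer(...) in the generation example above.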