VRAM requirements?

What are the VRAM requirements to run this model? Is it possible to run it 8-bit or 4-bit quantized on a single 24GB GPU?

I was able to load it in 8-bit with some offloading to CPU memory and disk via accelerate, but for some reason the generate method kept running indefinitely.

Hi! You can try 8-bit quantization, which is integrated into HF Transformers and should reduce the memory footprint to ~20GB (inference will need several additional GB on top of that).
After installing accelerate and bitsandbytes, load the model in 8-bit:

from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained('togethercomputer/GPT-NeoXT-Chat-Base-20B', device_map="auto", load_in_8bit=True)
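
As a rough check on the ~20GB figure, here is a back-of-the-envelope estimate of the memory taken by the weights alone at different precisions. The parameter count of 20 billion is an assumption read off the model name, and the helper `weight_memory_gb` is made up for this illustration. Activations, the KV cache, and any layers kept in higher precision are not counted, so real usage will be higher.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Assumption: ~20 billion parameters, taken from the "20B" in the model name.
NUM_PARAMS = 20e9

for bits in (16, 8, 4):
    print(f"{bits}-bit: ~{weight_memory_gb(NUM_PARAMS, bits):.0f} GB")
# prints:
# 16-bit: ~40 GB
# 8-bit: ~20 GB
# 4-bit: ~10 GB
```

By this estimate, 8-bit weights leave only a few GB of headroom on a 24GB card, which may be why some layers ended up offloaded, while 4-bit weights would leave considerably more room.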

Can you share the generate code as well? I mean the complete code for taking a question and generating output.

Sure, here is an example:

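from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('togethercomputer/GPT-NeoXT-Chat-Base-20B')
inputs = tokenizer("<human>: Where is Zurich?\n<bot>:", return_tensors='pt').to(model.device)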

outputs = model.generate(


So human inputs should be prefixed with "<human>:" and bot responses should be prefixed with "<bot>:".
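
To make the prefix convention concrete, here is a minimal sketch of how a prompt could be assembled and a reply generated. The helpers `build_prompt`, `trim_reply`, and `ask` are hypothetical names, not part of transformers or of the model's own code, and `max_new_tokens=64` with greedy decoding is an arbitrary choice rather than a recommended setting. `model` and `tokenizer` are assumed to be already loaded as shown earlier in the thread.

```python
def build_prompt(turns):
    """Format (speaker, text) pairs and end with an open <bot>: turn for the model."""
    lines = [f"<{speaker}>: {text}" for speaker, text in turns]
    lines.append("<bot>:")
    return "\n".join(lines)


def trim_reply(continuation):
    """Keep only the bot's answer, dropping anything after a new <human>: turn."""
    return continuation.split("<human>:")[0].strip()


def ask(model, tokenizer, question, max_new_tokens=64):
    """Hypothetical helper: generate one reply with a hard cap on new tokens."""
    prompt = build_prompt([("human", question)])
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    # max_new_tokens bounds how many tokens generate() can produce.
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    # Decode only the newly generated tokens, not the prompt itself.
    new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
    return trim_reply(tokenizer.decode(new_tokens, skip_special_tokens=True))
```

Usage would look like `print(ask(model, tokenizer, "Where is Zurich?"))`. Capping the number of new tokens at least bounds how long generate can run, although with layers offloaded to disk each token may still be slow.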

