VRAM requirements?

#8
by yahma - opened

What are the VRAM requirements to run this model? Is it possible to run it 8-bit or 4-bit quantized on a single 24GB GPU?

I was able to load it in 8-bit with some offloading to CPU memory and disk via accelerate, but for some reason the generate method kept running indefinitely.
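
For context, a minimal sketch of that kind of setup (a recent transformers / accelerate / bitsandbytes is assumed; the memory split and offload folder below are illustrative, not the exact settings used):

from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 8-bit weights on the GPU; layers that don't fit are kept in fp32 on the CPU,
# and anything accelerate maps to "disk" is written to offload_folder.
quant_config = BitsAndBytesConfig(load_in_8bit=True, llm_int8_enable_fp32_cpu_offload=True)

model = AutoModelForCausalLM.from_pretrained(
    'togethercomputer/GPT-NeoXT-Chat-Base-20B',
    device_map="auto",
    quantization_config=quant_config,
    max_memory={0: "22GiB", "cpu": "30GiB"},  # illustrative limits for a 24GB GPU
    offload_folder="./offload",
)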

Together org

Hi! You can try 8-bit quantization, which is integrated into Hugging Face Transformers and should reduce the memory footprint to roughly 20GB (plus a few additional GB for inference).
After installing accelerate and bitsandbytes, load the model in 8-bit:

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained('togethercomputer/GPT-NeoXT-Chat-Base-20B', device_map="auto", load_in_8bit=True)
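
For the 4-bit part of the original question, here is a minimal sketch, assuming a transformers / bitsandbytes version with 4-bit (NF4) support; at 4-bit the 20B weights take roughly 10-11GB, so a single 24GB GPU should be enough. The compute dtype is an illustrative choice:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# NF4 4-bit quantization; compute is done in fp16 (illustrative choice).
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    'togethercomputer/GPT-NeoXT-Chat-Base-20B',
    device_map="auto",
    quantization_config=quant_config,
)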

Can you share the generate code as well? Like the complete code for taking a question and generating the output.

Together org

Sure, here is an example:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('togethercomputer/GPT-NeoXT-Chat-Base-20B')

inputs = tokenizer("<human>: Where is Zurich?\n<bot>:", return_tensors='pt').to(model.device)

# Sample up to 10 new tokens using top-p / top-k sampling.
outputs = model.generate(
    **inputs,
    do_sample=True,
    top_p=0.6,
    top_k=40,
    repetition_penalty=1.0,
    temperature=0.8,
    max_new_tokens=10,
)

print(tokenizer.decode(outputs[0]))

So human inputs should be prefixed with "<human>:" and bot responses should be prefixed with "<bot>:".
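
If you want just the bot's answer rather than the full decoded text, a small post-processing sketch (assuming the "<human>:" / "<bot>:" format above):

text = tokenizer.decode(outputs[0])

# Keep what follows the last "<bot>:" marker and cut off any new "<human>:" turn.
reply = text.split("<bot>:")[-1].split("<human>:")[0].strip()
print(reply)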
