Spaces: togethercomputer/OpenChatKit

VRAM requirements?

by yahma - opened Mar 13, 2023

Mar 13, 2023

What are the VRAM requirements to run this model? Is it possible to run it 8-bit or 4-bit quantized on a single 24GB GPU?

smjain

Mar 13, 2023

I was able to load it in 8-bit with some offloading to memory and disk via accelerate, but for some reason the generate method kept running indefinitely.

juewang

Together org Mar 17, 2023

Hi! You can try 8-bit quantization, which is integrated into HF and should reduce the memory footprint to ~20GB (inference will need several additional GB on top of that).
After installing accelerate and bitsandbytes, load the model in 8-bit:

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained('togethercomputer/GPT-NeoXT-Chat-Base-20B', device_map="auto", load_in_8bit=True)
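
For a rough sense of where the ~20GB figure comes from, here is a back-of-the-envelope sketch. It counts the weights only, and `NUM_PARAMS = 20e9` is an assumption (a round number taken from the "20B" in the model name, not an exact parameter count). The helper name `estimate_weight_memory_gb` is made up for illustration. The 4-bit row is arithmetic only, not a claim that 4-bit loading works with this setup.

```python
def estimate_weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate size of the model weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Assumption: round figure taken from the "20B" in the model name.
NUM_PARAMS = 20e9

for label, bits in [("fp16", 16), ("8-bit", 8), ("4-bit", 4)]:
    size_gb = estimate_weight_memory_gb(NUM_PARAMS, bits)
    print(f"{label}: ~{size_gb:.0f} GB for weights")
```

Activations and the attention cache come on top of this, which is why a 24GB card is tight at 8-bit even though the weights themselves fit.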

smjain

Mar 19, 2023

Can you share the generation code as well? I mean the complete code for taking a question and generating output.

juewang

Together org Mar 23, 2023

Sure, here is an example:

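from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('togethercomputer/GPT-NeoXT-Chat-Base-20B')

# Tokenize the prompt and move it to the same device as the model
inputs = tokenizer("<human>: Where is Zurich?\n<bot>:", return_tensors='pt').to(model.device)

# Sample a continuation; raise max_new_tokens for longer answers
outputs = model.generate(
    **inputs,
    do_sample=True,
    top_p=0.6,
    top_k=40,
    repetition_penalty=1.0,
    temperature=0.8,
    max_new_tokens=10,
)

# Decode the full sequence (prompt plus generated tokens)
print(tokenizer.decode(outputs[0]))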

So human inputs should be prefixed with "<human>:" and bot responses should be prefixed with "<bot>:".
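
If it helps, here is a small sketch of building that prompt format for multi-turn conversations. The helper names `build_prompt` and `extract_reply` are made up for illustration and are not part of OpenChatKit. Cutting the reply at the next "<human>:" is an assumption about how the model may keep generating past its own turn.

```python
def build_prompt(turns):
    """Format (speaker, text) pairs, speaker being "human" or "bot",
    and end with an open "<bot>:" tag for the model to complete."""
    lines = [f"<{speaker}>: {text}" for speaker, text in turns]
    lines.append("<bot>:")
    return "\n".join(lines)


def extract_reply(decoded, prompt):
    """Strip the prompt from the decoded text and keep only the bot's turn.

    Assumption: anything after a following "<human>:" tag is discarded.
    """
    reply = decoded[len(prompt):] if decoded.startswith(prompt) else decoded
    return reply.split("<human>:")[0].strip()


prompt = build_prompt([("human", "Where is Zurich?")])
print(repr(prompt))
```

The string returned by `build_prompt` can be passed to the tokenizer in place of the hard-coded prompt, and `extract_reply` can be applied to the decoded output.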
