One GPU version

#22
by joorei - opened

Hello, thanks for the model! Are there any plans to make a version of this that is usable on a single GPU with 24GB of VRAM?

Google org

hi @joorei
You can try running the model in 8-bit and see if it works.
First, install the main branch of transformers:

pip install git+https://github.com/huggingface/transformers@main

Then install accelerate and bitsandbytes:

pip install accelerate bitsandbytes

and load the model with the flag load_in_8bit=True when calling .from_pretrained:

from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-xxl")
# device_map="auto" spreads the weights across the available devices;
# load_in_8bit=True quantizes them with bitsandbytes at load time
model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-xxl", device_map="auto", load_in_8bit=True)

input_text = "translate English to German: How old are you?"
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda")

outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0]))
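
As a quick sanity check that the quantized model actually fits in 24GB, you can print its weight footprint after loading. (get_memory_footprint is a generic helper on transformers models; this check is just illustrative, not part of the original recipe.)

# weight memory only; generation activations need a bit of extra headroom
print(f"Model weights: {model.get_memory_footprint() / 1e9:.1f} GB")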

Did it work?

What are the general system recommendations for the different versions of flan-t5?

Thank you!

Google org

Hi @Mayuresh86, you can also run it in 4-bit now.
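As a rough rule of thumb (flan-t5-xxl has ~11B parameters): expect around 22GB of VRAM in fp16, ~11GB in 8-bit, and ~6GB in 4-bit, plus some headroom for activations; the smaller checkpoints scale down proportionally. To try 4-bit, first update the libraries: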

pip install -U transformers bitsandbytes accelerate

then run:

from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-xxl")
model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-xxl", device_map="auto", load_in_4bit=True)

input_text = "translate English to German: How old are you?"
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda")

outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0]))
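
Note that newer transformers releases deprecate the load_in_8bit / load_in_4bit shortcuts on .from_pretrained in favour of an explicit quantization config. A minimal sketch of the same 4-bit load using that API:

from transformers import T5ForConditionalGeneration, BitsAndBytesConfig

# same 4-bit quantized load, expressed via BitsAndBytesConfig
quant_config = BitsAndBytesConfig(load_in_4bit=True)
model = T5ForConditionalGeneration.from_pretrained(
    "google/flan-t5-xxl", device_map="auto", quantization_config=quant_config
)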
