Splitting the model over multiple GPUs

#4
by dnhkng - opened

Is there any documentation on splitting such models up for inference over multiple GPUs?
Second-hand 3090 Tis are getting quite affordable now, and I believe the model would fit in the VRAM of two such cards, at least purely by size.
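For a rough back-of-the-envelope check (assuming the ~20B parameters of the google/ul2 checkpoint; these numbers are my own estimate, not from any documentation):

# Weights alone, ignoring activations and the KV cache
num_params = 20e9          # assumed parameter count for google/ul2
bytes_per_param = 2        # bfloat16 / float16
print(num_params * bytes_per_param / 1e9)   # ~40 GB, vs. 2 x 24 GB = 48 GB of VRAM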

This comment has been hidden

Hi! I think I managed to run UL2 on 2 RTX 3090 GPUs. I deleted my last comment to avoid confusion.
I configured Hugging Face Accelerate and ran the code snippet below. It worked! It seems that the assignment of layers across GPUs is handled by the auto device map. Hope it helps.

import logging
import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration

logging.info('build tokenizer')
tokenizer = AutoTokenizer.from_pretrained("google/ul2")

logging.info('build model')
# device_map='auto' lets Accelerate split the layers across the available GPUs
model = T5ForConditionalGeneration.from_pretrained(
    "google/ul2",
    low_cpu_mem_usage=True,
    torch_dtype=torch.bfloat16,
    device_map='auto',
)

input_string = "[S2S] Mr. Dursley was the director of a firm called Grunnings, which made drills. He was a big, solid man with a bald head. Mrs. Dursley was thin and blonde and more than the usual amount of neck, which came in very useful as she spent so much of her time craning over garden fences, spying on the neighbours. The Dursleys had a small son called Dudley and in their opinion there was no finer boy anywhere <extra_id_0>"

inputs = tokenizer(input_string, return_tensors="pt").input_ids.to("cuda")

logging.info('generate output')
outputs = model.generate(inputs, max_length=200)

logging.info(tokenizer.decode(outputs[0]))
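If you want to see how the layers ended up being assigned, the resolved device map can be printed after loading (a minimal check, run right after the from_pretrained call above):

# Shows which GPU each module was placed on by device_map='auto'
print(model.hf_device_map)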

I have it running on the RTX Titans, but the output looks very weird.
I first had to load the model and then save it as FP16, since my cards do not support bfloat16.
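For reference, that conversion step looked roughly like this (a minimal sketch, assuming the bfloat16 weights are first loaded on CPU; the output directory name is just an example):

import torch
from transformers import T5ForConditionalGeneration

# Load the original bfloat16 checkpoint on CPU, cast to float16, and save it locally
model = T5ForConditionalGeneration.from_pretrained(
    "google/ul2", low_cpu_mem_usage=True, torch_dtype=torch.bfloat16
)
model = model.half()
model.save_pretrained("ul2-fp16")   # hypothetical output directory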

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, AutoConfig
from accelerate import init_empty_weights, infer_auto_device_map, load_checkpoint_and_dispatch
import torch

model_name = "google/ul2"
config = AutoConfig.from_pretrained(model_name)

# Build the model skeleton without allocating any weights
with init_empty_weights():
    model = AutoModelForSeq2SeqLM.from_config(config)

tokenizer = AutoTokenizer.from_pretrained(model_name)

# Work out which module goes on which GPU for FP16 weights
device_map = infer_auto_device_map(model, dtype=torch.float16)

# Load the FP16 checkpoint saved earlier and dispatch it across the GPUs
weights_path = '.'
model = load_checkpoint_and_dispatch(
    model,
    weights_path,
    device_map=device_map,
    offload_folder=None,
    offload_state_dict=False,
    dtype="float16",
)

prompt = 'Machine learning is the '
input_tokenized = tokenizer(prompt, return_tensors="pt")
output = model.generate(
    input_tokenized["input_ids"].to(1),   # inputs placed on cuda:1 here
    do_sample=True,
    max_length=100,
    temperature=0.9,
    top_k=50,
    top_p=0.9,
)
output_text = tokenizer.decode(output[0].tolist())
print(output_text)

Generation with this script is quite fast once the model is loaded (a few tens of seconds), but the output reads as:

'<pad><extra_id_0> uimitmplă luminăn<pad><pad><pad><extra_id_0> uimit uimit<pad><extra_id_0> 
for the incendiu Project uimit uimitmplă machine learning uimit<extra_id_10> is<pad><pad><extra_id_0> 
qEinen incendiu în incendiu incendiu<pad><pad><extra_id_0>recommending<pad><extra_id_0><pad>
<extra_id_0> uimit lumină combining data and acţiune acţiune code uimit learning lumină presiune presiune 
acţiune; incendiu<extra_id_7> uimitency lumină lumină Artificial<extra_id_7> lumină incendiu incendiu 
incendiu<pad><extra_id_0><extra_id_74><extra_id_7><extra_id_9>a lumină lumină<extra_id_9>
<extra_id_7> uimit gradini treptat lumină presiune deep incendiu lumină lumină knowing that works 
is târziu<pad><extra_id_0><pad><pad>'

This might be due to overflows in FP16 vs. BF16, maybe?
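One quick way to confirm the hardware side of this (a small check; it only tests compute-capability support, not the model itself):

import torch

# Titan RTX (Turing) should report False here; Ampere cards such as the RTX 3090 report True
print(torch.cuda.is_bf16_supported())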

I am not an expert in LMs, so I'm sorry I can't reason about this from your code and output.
My suggestion is to first reproduce the output of the given example; I could, with the code I provided.

Ouch, thanks for the link!

Looks like I either need 2 more cards to run in FP32, or have to replace my cards with bfloat16-capable ones, damn...
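Another option that might be worth trying before buying hardware (a sketch, assuming enough CPU RAM or disk space for the offloaded FP32 weights; the 'offload' folder name is just an example):

import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("google/ul2")

# Let Accelerate fill both GPUs in FP32 and spill the remaining weights to CPU/disk
model = T5ForConditionalGeneration.from_pretrained(
    "google/ul2",
    torch_dtype=torch.float32,
    device_map="auto",
    low_cpu_mem_usage=True,
    offload_folder="offload",   # hypothetical scratch directory for offloaded weights
)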

dnhkng changed discussion status to closed
