Question about fine-tuning flan-ul2 with LoRA

Hi there,

I'm also trying to fine-tune flan-ul2 (google/flan-ul2) using LoRA, and I have a few questions, if you don't mind:

I'm trying to do this on a p3dn.24xlarge instance (8 GPUs with 32 GB of GPU memory each). I'm following this blog post (https://www.philschmid.de/fine-tune-flan-t5-peft), which is written for fine-tuning flan-t5-xxl with LoRA. When I use more than one GPU, I get this error:

RuntimeError: module must have its parameters and buffers on device cuda:0 (device_ids[0]) but found one of them on device: cuda:2

Therefore, I tried to fine-tune flan-ul2 using only one of the 8 GPUs, but that doesn't help either, because this time I get:

RuntimeError: No executable batch size found, reached zero.

which doesn't make sense at all because I didn't change anything related to the data processing.
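For reference, the single-GPU attempt looks roughly like this (a sketch, not necessarily the exact code I ran):

import os

# hide the other devices so only one GPU is visible to the process
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

from transformers import AutoModelForSeq2SeqLM

# place every layer on GPU 0 instead of device_map="auto",
# which would otherwise shard the model across all visible GPUs
model = AutoModelForSeq2SeqLM.from_pretrained(
    "google/flan-ul2",
    device_map={"": 0},
    load_in_8bit=True,
)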

So, my questions are:

  1. Did you load the model in 8-bit while fine-tuning (i.e., set load_in_8bit=True)?
  2. Were you using multi-GPU for fine-tuning?
  3. Did you encounter any of these errors while you were fine-tuning? If so, could you please share how you fixed them?

In case you're wondering what I did to fine-tune flan-ul2 using LoRA: I didn't change much from the blog post. All I actually did was change the model name:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from transformers import DataCollatorForSeq2Seq

#model_id="google/flan-t5-xxl"
model_id="google/flan-ul2"
tokenizer = AutoTokenizer.from_pretrained(model_id)

#model_id = "philschmid/flan-t5-xxl-sharded-fp16"
model_id = "google/flan-ul2"
model = AutoModelForSeq2SeqLM.from_pretrained(model_id, device_map="auto", load_in_8bit=True)
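The LoRA/PEFT setup and the data collator are also unchanged from the blog post; roughly (a sketch, with the values as the blog gives them):

from peft import LoraConfig, get_peft_model, prepare_model_for_int8_training, TaskType

# prepare the 8-bit model for training and attach the LoRA adapters
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q", "v"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.SEQ_2_SEQ_LM,
)
model = prepare_model_for_int8_training(model)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# pad labels with -100 so padded tokens are ignored by the loss
data_collator = DataCollatorForSeq2Seq(
    tokenizer,
    model=model,
    label_pad_token_id=-100,
    pad_to_multiple_of=8,
)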

Then I run the trainer as shown below:

from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

# Define training args
training_args = Seq2SeqTrainingArguments(
    output_dir=output_dir,
    auto_find_batch_size=True,
    learning_rate=1e-3, # higher learning rate
    num_train_epochs=5,
    logging_dir=f"{output_dir}/logs",
    logging_strategy="steps",
    logging_steps=500,
    save_strategy="no",
    report_to="tensorboard",
)

# Create Trainer instance
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=tokenized_dataset["train"],
)
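After that it's just the training call and saving the LoRA adapter, essentially as in the blog:

model.config.use_cache = False  # silence warnings; re-enable for inference

trainer.train()

# save only the LoRA adapter weights
trainer.model.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)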

Hi, I didn't use that code, but I think it might be running out of memory. A 32 GB GPU isn't large enough to fit a single UL2 model even in bf16 or fp16 precision.
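As a rough back-of-the-envelope check (flan-ul2 has about 20B parameters):

# ~20B parameters at 2 bytes each in fp16/bf16 is ~40 GB for the weights alone,
# which already exceeds a single 32 GB V100; fp32 roughly doubles that.
params = 20e9
print(f"fp16/bf16: {params * 2 / 1e9:.0f} GB, fp32: {params * 4 / 1e9:.0f} GB")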

We trained it in full precision because V100s don't support bf16. We had to use DeepSpeed's CPU offloading to train it. It takes quite a lot of time to train.
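Roughly, the offloading side looks like this (just a sketch, not our exact config; the real files are in the blog below). The HF Trainer accepts a DeepSpeed config either as a dict or as a path to a JSON file:

# sketch of a ZeRO config with CPU offloading (not the exact one we used);
# pass it via Seq2SeqTrainingArguments(deepspeed=ds_config) or point the
# launcher config at an equivalent JSON file
ds_config = {
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "offload_param": {"device": "cpu", "pin_memory": True},
    },
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "train_batch_size": "auto",
}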

https://medium.com/vmware-data-ml-blog/lora-finetunning-of-ul-2-and-t5-models-35a08863593d
You can find the code and the details we used for fine-tuning in the above blog.

Many thanks for your reply, and also for sharing the blog post! I'll have a look at it and see if I can manage to fine-tune it as well.

Hi @Teja-Gollapudi,

I went through the Medium article, thanks again for sharing it! I have just three follow-up questions, if possible:

  1. The blog post says the model was trained on 3 GPUs, but in the screenshot in step 2, I see that the answer to "How many GPU(s) should be used for distributed training?" is set to 2. Are these two things different? Also, would it be possible to share the launcher_config.yaml file as well?

  2. Once fine-tuning is done, would it be possible to load the model in 8-bit (i.e., setting load_in_8bit=True in the from_pretrained() method) and do inference?

  3. I have 8 V100 GPUs. Which parameters in the YAML files do you recommend changing in this case? For example, should I set "How many GPU(s) should be used for distributed training?" to 8 - 1 = 7? (Because in your case training was done on 3 GPUs and this parameter was set to 2.)

Hi,

  1. Thanks for pointing it out. I hadn't noticed the screenshot; I'll fix it. It should be set to 3, not 2, to use 3 GPUs.
  2. It should be doable, but I think you would have to tweak the transformers / CUDA toolkit library versions to make it work. It might not yield a significant speedup, though (https://huggingface.co/blog/hf-bitsandbytes-integration#is-it-faster-than-native-models).
  3. I would recommend changing the gradient accumulation steps based on the formula int(desired_batch_size / (num_gpus * per_device_batch_size)), and setting the number of GPUs to 8, not 7 (see the small example at the end of this reply).

I also recommend playing around with the learning rate (something between 1e-5 and 1e-4 to stick with the UL2 paper's learning rates), per_device_batch_size, and the source/target length parameters for training. (If you increase the target length beyond 256, you might have to use a batch size of 1, etc.)
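For example, with 8 GPUs (hypothetical numbers, just to illustrate the formula from point 3):

desired_batch_size = 32        # effective batch size you are aiming for
num_gpus = 8
per_device_batch_size = 1      # whatever fits in memory per GPU

gradient_accumulation_steps = int(desired_batch_size / (num_gpus * per_device_batch_size))
print(gradient_accumulation_steps)  # 4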

Hey @Teja-Gollapudi, thanks for the tips! There is one more thing I'm wondering: did you compare the performance of the base and fine-tuned models on a dataset? I'm just curious how much the fine-tuning improved the results.

We haven't performed any benchmarking, mainly because we lack benchmarks for what we were trying to achieve.

We were trying to get flan-ul2 to follow more free-form instructions, not just the Flan template instructions. When we compared its outputs to the plain flan-ul2 model on a few free-form instructions, it seemed to do better, but we don't have any quantifiable evaluation metric to compare them. We didn't do any hyperparameter search either.

The Alpaca dataset has its own limitations, which were later addressed by the alpaca-cleaned dataset. Using the newer instruction datasets might yield better results.

Got it, thanks @Teja-Gollapudi! One last thing: did you try doing inference by setting load_in_8bit to True after fine-tuning the model? It seems like I can perform inference in float16 on GPU or on CPU, but when I try to do 8-bit inference, I get a "probability tensor contains either inf, nan or element < 0" error. Here is what I'm doing for inference:

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# directory where the merged model is stored (output of the merge_weights.py file)
merged_model = 'directory where merged model is stored'
model = AutoModelForSeq2SeqLM.from_pretrained(merged_model, load_in_8bit=True, device_map='auto')
tokenizer = AutoTokenizer.from_pretrained('google/flan-ul2')

prompt_template = "Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\n{instruction}\n\n### Response:"
prompt = "Draft me an introduction section of a medium article on the topic 'Efficient Fine-tuning of UL-2 and T5 Models Using LoRA on Limited Compute'"

input_text = prompt_template.format(instruction=prompt)
input_ids = tokenizer(input_text, return_tensors="pt", truncation=True).input_ids.cuda()
outputs = model.generate(input_ids=input_ids, max_new_tokens=128)

Full error log:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ubuntu/miniconda3/envs/lora_training/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/ubuntu/miniconda3/envs/lora_training/lib/python3.8/site-packages/transformers/generation/utils.py", line 1452, in generate
    return self.sample(
  File "/home/ubuntu/miniconda3/envs/lora_training/lib/python3.8/site-packages/transformers/generation/utils.py", line 2504, in sample
    next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0

Sorry, I never tried 8-bit inference with bitsandbytes.
