LoRA Training OOM with 2x NVIDIA RTX A6000 (2x 48 GB)

#71
by ayyylemao - opened

I have two RTX A6000s, which totals 96 GB of VRAM, but when I try to fine-tune the model with LoRA I immediately get torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU.
Even with batch size = 1 it OOMs the first GPU instantly while the second one still has free memory.
Is this simply not enough memory, or is something wrong with my code?

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A    442552      C   ...s/idefics-finetune/.venv/bin/python      48500MiB |
|    1   N/A  N/A    442552      C   ...s/idefics-finetune/.venv/bin/python      19816MiB |
+-----------------------------------------------------------------------------------------+

Here is my training script. I hope there's something wrong with my code:

import torch
from peft import LoraConfig
from transformers import AutoProcessor, BitsAndBytesConfig, Idefics2ForConditionalGeneration
from datasets import load_dataset

USE_LORA = True
USE_QLORA = False

processor = AutoProcessor.from_pretrained(
    "HuggingFaceM4/idefics2-8b",
    do_image_splitting=False
)


# Three options for training, from lowest to highest precision:
# - QLoRA
# - Standard LoRA
# - Full fine-tuning
if USE_QLORA or USE_LORA:
    lora_config = LoraConfig(
        r=8,
        lora_alpha=8,
        lora_dropout=0.1,
        target_modules='.*(text_model|modality_projection|perceiver_resampler).*(down_proj|gate_proj|up_proj|k_proj|q_proj|v_proj|o_proj).*$',
        use_dora=False,  # if USE_QLORA else True
        init_lora_weights="gaussian"
    )
    if USE_QLORA:
        bnb_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_compute_dtype=torch.float16
        )
    model = Idefics2ForConditionalGeneration.from_pretrained(
        "HuggingFaceM4/idefics2-8b",
        torch_dtype=torch.float16,
        quantization_config=bnb_config if USE_QLORA else None,
        #attn_implementation='flash_attention_2'
        #device_map='auto'
    )
    model.add_adapter(lora_config)
    model.enable_adapters()
else:
    model = Idefics2ForConditionalGeneration.from_pretrained(
        "HuggingFaceM4/idefics2-8b",
        torch_dtype=torch.float16,
        _attn_implementation="flash_attention_2", # Only available on A100 or H100
    )

dataset = load_dataset("dataset/malicious", split="train")
split = dataset.train_test_split(test_size=0.5)
train_dataset = split['train']
test_dataset = split['test']
p_brand = '''What brands can you see on the image?'''

class MyDataCollator:
    def __init__(self, processor):
        self.processor = processor
        self.image_token_id = processor.tokenizer.additional_special_tokens_ids[
            processor.tokenizer.additional_special_tokens.index("<image>")
        ]

    def __call__(self, examples):
        texts = []
        images = []
        for example in examples:
            image = example["image"]
            question = p_brand
            answer = example["brand"]
            messages = [
                {
                    "role": "user",
                    "content": [
                        {"type": "image"},
                        {"type": "text", "text": question},
                    ]
                },
                {
                    "role": "assistant",
                    "content": [
                        {"type": "text", "text": answer}
                    ]
                }
            ]
            text = self.processor.apply_chat_template(messages, add_generation_prompt=False)
            texts.append(text.strip())
            images.append([image])

        batch = processor(text=texts, images=images, return_tensors="pt", padding=True)

        labels = batch["input_ids"].clone()
        labels[labels == processor.tokenizer.pad_token_id] = self.image_token_id
        batch["labels"] = labels
        return batch

data_collator = MyDataCollator(processor)

from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    num_train_epochs=2,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=8,
    warmup_steps=0,
    learning_rate=1e-4,
    weight_decay=0.01,
    logging_steps=1,
    output_dir="output/test-brand-001",
    save_strategy="steps",
    save_steps=10,
    save_total_limit=1,
    evaluation_strategy="epoch",
    fp16=True,
    remove_unused_columns=False,
    report_to="none",
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
    eval_dataset=test_dataset, # You can also evaluate (loss) on the eval set, note that it will incur some additional GPU memory
)

trainer.train()

Any help would be greatly appreciated.

Try adding a MAX_LENGTH to your processor call, i.e. batch = processor(text=texts, images=images, padding=True, truncation=True, max_length=MAX_LENGTH, return_tensors="pt"); I'm using MAX_LENGTH = 768 in my case.
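In the collator above, the change amounts to roughly this (MAX_LENGTH = 768 is just the value that worked for my data; pick whatever covers your longest prompt + answer):

MAX_LENGTH = 768  # upper bound on the tokenized sequence length; tune to your data

batch = processor(
    text=texts,
    images=images,
    padding=True,
    truncation=True,          # cut sequences that exceed max_length instead of letting them grow
    max_length=MAX_LENGTH,
    return_tensors="pt",
)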

I also commented out eval_dataset and evaluation_strategy for my run.

Try adding a MAX_LENGTH to your processor call, i.e. batch = processor(text=texts, images=images, padding=True, truncation=True, max_length=MAX_LENGTH, return_tensors="pt"); I'm using MAX_LENGTH = 768 in my case.

Thank you for those tips, but even when I set max_length=200 or lower it still OOMs the first GPU instantly when training starts.
Are you training with LoRA on just two A6000 GPUs and it works for you?

HuggingFaceM4 org

Hi! Can you push some dataset samples to the Hub? I could re-run your code on 2x H100 and report how the memory is distributed.
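Something along these lines would be enough (the repo id and sample count below are just placeholders):

from datasets import load_dataset

# Assumption: a small slice of your local dataset is enough to reproduce the OOM.
subset = load_dataset("dataset/malicious", split="train").select(range(50))
subset.push_to_hub("your-username/idefics2-oom-repro")  # placeholder repo id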

Hi, I'm having a similar issue using the same GPUs (2xA6000).
I'm trying to reproduce this tutorial:
https://github.com/NielsRogge/Transformers-Tutorials/blob/master/Idefics2/Fine_tune_Idefics2_for_multi_page_PDF_question_answering_on_DUDE.ipynb

My only modification is to use devices=2 in the Lightning Trainer.
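That is, roughly this in the notebook's Trainer setup (the other arguments are placeholders; the tutorial's exact values and import name may differ):

import lightning as L  # assumption: Lightning 2.x as used in the notebook

# Assumption: everything else stays as in the notebook; only `devices` is changed to 2.
trainer = L.Trainer(
    accelerator="gpu",
    devices=2,              # use both A6000s
    max_epochs=1,           # placeholder
    precision="16-mixed",   # placeholder
)
# trainer.fit(model_module)  # model_module: the LightningModule defined in the notebook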

Using QLoRA I get this issue: https://github.com/TimDettmers/bitsandbytes/issues/89#issuecomment-2094943374

Using LoRA it goes OOM.

Hi! Can you push some dataset samples to the Hub? I could re-run your code on 2x H100 and report how the memory is distributed.

Thanks for your interest in this issue.
I've uploaded a small subset of the dataset under: "ayyylemao/idefics2-test"
Regards
