
CUDA OOM by simply doing a forward pass on an A6000 (48GB VRAM)

#11 opened by starzmustdie

Hi,

I seem to be getting a CUDA OOM from merely doing a forward pass with a 4k-token text sequence and no images as input.

Here is an example script that does a single forward pass with a 4-bit quantized version of the model:

import torch
from datasets import load_from_disk
from peft import LoraConfig
from torch.utils.data import DataLoader
from transformers import (
    AutoProcessor,
    BitsAndBytesConfig,
    Idefics2ForConditionalGeneration,
)

DEVICE = "cuda:0"
USE_LORA = False
USE_QLORA = True

processor = AutoProcessor.from_pretrained(
    "HuggingFaceM4/idefics2-8b",
    do_image_splitting=False
)

if USE_QLORA or USE_LORA:
    lora_config = LoraConfig(
        r=32,
        lora_alpha=16,
        lora_dropout=0.1,
        target_modules='.*(text_model|modality_projection|perceiver_resampler).*(down_proj|gate_proj|up_proj|k_proj|q_proj|v_proj|o_proj).*$',
        use_dora=False if USE_QLORA else True,
        init_lora_weights="gaussian"
    )
    if USE_QLORA:
        bnb_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_compute_dtype=torch.bfloat16
        )
    model = Idefics2ForConditionalGeneration.from_pretrained(
        "HuggingFaceM4/idefics2-8b",
        torch_dtype=torch.bfloat16,
        quantization_config=bnb_config if USE_QLORA else None,
    )
    model.add_adapter(lora_config)
    model.enable_adapters()
else:
    model = Idefics2ForConditionalGeneration.from_pretrained(
        "HuggingFaceM4/idefics2-8b",
        torch_dtype=torch.bfloat16,
        _attn_implementation="flash_attention_2", 
    ).to(DEVICE)


train_dataset = load_from_disk("./custom_datasets/idefics/train.hf")
test_dataset = load_from_disk("./custom_datasets/idefics/test.hf")


class MyDataCollator:
    def __init__(self, processor):
        self.processor = processor
        self.image_token_id = processor.tokenizer.additional_special_tokens_ids[
            processor.tokenizer.additional_special_tokens.index("<image>")
        ]

    def __call__(self, examples):
        texts = []
        for example in examples:
            messages = example["messages"]
            text = processor.apply_chat_template(messages, add_generation_prompt=False)
            texts.append(text.strip())

        batch = processor(text=texts, return_tensors="pt", padding=True)
        labels = batch["input_ids"].clone()
        labels[labels == processor.tokenizer.pad_token_id] = self.image_token_id
        batch["labels"] = labels
        return batch

data_collator = MyDataCollator(processor)
data_loader = DataLoader(train_dataset, batch_size=1, collate_fn=data_collator)

for batch in data_loader:
    for k,v in batch.items():
        print(k, "->", v.shape)
    out = model(**batch)
    break

Running it produces the following error:

output = lora_B(lora_A(dropout(x))) * scaling
             ~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 234.00 MiB. GPU 0 has a total capacity of 47.30 GiB of which 11.44 MiB is free. Including non-PyTorch memory, this process has 47.27 GiB memory in use. Of the allocated memory 46.30 GiB is allocated by PyTorch, and 484.16 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

Here are the shapes of the inputs that get printed:

input_ids -> torch.Size([1, 4273])
attention_mask -> torch.Size([1, 4273])
labels -> torch.Size([1, 4273])

Why do I get an OOM at a sequence length of ~4k, without even using images as input?

I have fine-tuned Mistral-7b with QLoRA before, and it was fine.

In this case, how come the text backbone is not capable of even doing a forward pass at half the context length I used to fine-tune before?
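
One experiment that should narrow this down is a grad-free forward pass: if the same batch fits under torch.no_grad(), the extra memory is coming from tensors kept around for autograd rather than from the weights themselves. A minimal sketch (reusing model, data_loader and DEVICE from the script above; peak_forward_memory is just my own helper name):

import torch

def peak_forward_memory(model, batch, grad_enabled):
    # Reset the peak-memory counter, run a single forward pass, and report the peak.
    torch.cuda.reset_peak_memory_stats(DEVICE)
    with torch.set_grad_enabled(grad_enabled):
        model(**batch)
    peak_gib = torch.cuda.max_memory_allocated(DEVICE) / 1024**3
    print(f"grad_enabled={grad_enabled}: peak allocated = {peak_gib:.1f} GiB")

batch = next(iter(data_loader))
batch = {k: v.to(DEVICE) for k, v in batch.items()}
peak_forward_memory(model, batch, grad_enabled=False)
peak_forward_memory(model, batch, grad_enabled=True)  # the default case, which is the one that OOMs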

Weirdly, if I try to run it without LoRA/QLoRA (both flags set to False), I don't get an OOM. Here is the code I am running:

DEVICE = "cuda:0"
USE_LORA = False
USE_QLORA = False

processor = AutoProcessor.from_pretrained(
    "HuggingFaceM4/idefics2-8b",
    do_image_splitting=False
)

if USE_QLORA or USE_LORA:
    lora_config = LoraConfig(
        r=32,
        lora_alpha=16,
        lora_dropout=0.1,
        target_modules='.*(text_model|modality_projection|perceiver_resampler).*(down_proj|gate_proj|up_proj|k_proj|q_proj|v_proj|o_proj).*$',
        use_dora=False if USE_QLORA else True,
        init_lora_weights="gaussian"
    )
    if USE_QLORA:
        bnb_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_compute_dtype=torch.bfloat16
        )
    model = AutoModelForVision2Seq.from_pretrained(
        "HuggingFaceM4/idefics2-8b",
        torch_dtype=torch.bfloat16,
        quantization_config=bnb_config if USE_QLORA else None,
    )
    model.add_adapter(lora_config)
    model.enable_adapters()
else:
    model = AutoModelForVision2Seq.from_pretrained(
        "HuggingFaceM4/idefics2-8b",
        torch_dtype=torch.bfloat16,
        _attn_implementation="flash_attention_2", 
    ).to(DEVICE)


train_dataset = load_from_disk("./custom_datasets/idefics/train.hf")
test_dataset = load_from_disk("./custom_datasets/idefics/test.hf")


class MyDataCollator:
    def __init__(self, processor):
        self.processor = processor
        self.image_token_id = processor.tokenizer.additional_special_tokens_ids[
            processor.tokenizer.additional_special_tokens.index("<image>")
        ]

    def __call__(self, examples):
        texts = []
        for example in examples:
            messages = example["messages"]
            text = processor.apply_chat_template(messages, add_generation_prompt=False)
            texts.append(text.strip())

        batch = processor(text=texts, return_tensors="pt", padding=True)
        labels = batch["input_ids"].clone()
        labels[labels == processor.tokenizer.pad_token_id] = self.image_token_id
        batch["labels"] = labels
        return batch

data_collator = MyDataCollator(processor)
data_loader = DataLoader(train_dataset, batch_size=1, collate_fn=data_collator)

for batch in data_loader:
    batch = {k: v.to(DEVICE) for k, v in batch.items()}
    for k, v in batch.items():
        print(k, '->', v.shape)
    out = model(**batch)
    break

Note: the printed input shapes are the same:

input_ids -> torch.Size([1, 4273])
attention_mask -> torch.Size([1, 4273])
labels -> torch.Size([1, 4273])

While we are at it, I should note that I cannot even load the model with LoRA adapters alone (no quantization). It gets stuck at model.add_adapter(lora_config) for several minutes (and counting...). Here is the code for repro:

DEVICE = "cuda:0"
USE_LORA = True
USE_QLORA = False

processor = AutoProcessor.from_pretrained(
    "HuggingFaceM4/idefics2-8b",
    do_image_splitting=False
)

if USE_QLORA or USE_LORA:
    lora_config = LoraConfig(
        r=32,
        lora_alpha=16,
        lora_dropout=0.1,
        target_modules='.*(text_model|modality_projection|perceiver_resampler).*(down_proj|gate_proj|up_proj|k_proj|q_proj|v_proj|o_proj).*$',
        use_dora=False if USE_QLORA else True,
        init_lora_weights="gaussian"
    )
    if USE_QLORA:
        bnb_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_compute_dtype=torch.bfloat16
        )
    model = AutoModelForVision2Seq.from_pretrained(
        "HuggingFaceM4/idefics2-8b",
        torch_dtype=torch.bfloat16,
        quantization_config=bnb_config if USE_QLORA else None,
    )
    print('---> loaded model')
    model.add_adapter(lora_config)
    print('---> added adapter')
    model.enable_adapters()
    print('---> enabled adapters')
else:
    model = AutoModelForVision2Seq.from_pretrained(
        "HuggingFaceM4/idefics2-8b",
        torch_dtype=torch.bfloat16,
        _attn_implementation="flash_attention_2", 
    ).to(DEVICE)


train_dataset = load_from_disk("./custom_datasets/idefics/train.hf")
test_dataset = load_from_disk("./custom_datasets/idefics/test.hf")


class MyDataCollator:
    def __init__(self, processor):
        self.processor = processor
        self.image_token_id = processor.tokenizer.additional_special_tokens_ids[
            processor.tokenizer.additional_special_tokens.index("<image>")
        ]

    def __call__(self, examples):
        texts = []
        for example in examples:
            messages = example["messages"]
            text = processor.apply_chat_template(messages, add_generation_prompt=False)
            texts.append(text.strip())

        batch = processor(text=texts, return_tensors="pt", padding=True)
        labels = batch["input_ids"].clone()
        labels[labels == processor.tokenizer.pad_token_id] = self.image_token_id
        batch["labels"] = labels
        return batch

data_collator = MyDataCollator(processor)
data_loader = DataLoader(train_dataset, batch_size=1, collate_fn=data_collator)

for batch in data_loader:
    batch = {k: v.to(DEVICE) for k, v in batch.items()}
    for k, v in batch.items():
        print(k, '->', v.shape)
    out = model(**batch)
    break

The console only prints:

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Loading checkpoint shards: 100%|███████████████████████████████████████████████████| 7/7 [00:02<00:00,  2.93it/s]
---> loaded model
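
One way to see where it is stuck is to have the process dump a traceback of all threads periodically while add_adapter runs; a minimal sketch using the standard-library faulthandler module:

import faulthandler

faulthandler.dump_traceback_later(30, repeat=True)  # print all thread tracebacks every 30 s
model.add_adapter(lora_config)
faulthandler.cancel_dump_traceback_later()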
HuggingFaceM4 org

Hi @starzmustdie,
Can you say more about your setup?
I just did an inference on a 4k text sequence (no image) with 4-bit quantization and torch.float16 on a 16GB V100.

OS:

Distributor ID: Ubuntu
Description:    Ubuntu 22.04.3 LTS
Release:        22.04
Codename:       jammy

Installed packages:

Package                  Version
------------------------ -----------
accelerate               0.29.2
aiohttp                  3.9.4
aiosignal                1.3.1
annotated-types          0.6.0
anyio                    4.3.0
appdirs                  1.4.4
asttokens                2.4.1
attrs                    23.2.0
beautifulsoup4           4.12.2
bitsandbytes             0.43.1
certifi                  2024.2.2
cfgv                     3.4.0
charset-normalizer       3.3.2
click                    8.1.7
comm                     0.2.2
contourpy                1.2.1
cycler                   0.12.1
datasets                 2.18.0
debugpy                  1.8.1
decorator                5.1.1
dill                     0.3.8
distlib                  0.3.8
distro                   1.9.0
docker-pycreds           0.4.0
einops                   0.7.0
executing                2.0.1
filelock                 3.13.4
flash-attn               2.5.7
fonttools                4.51.0
frozenlist               1.4.1
fsspec                   2024.2.0
gitdb                    4.0.11
GitPython                3.1.43
h11                      0.14.0
httpcore                 1.0.5
httpx                    0.27.0
huggingface-hub          0.22.2
identify                 2.5.35
idna                     3.7
iniconfig                2.0.0
ipykernel                6.29.4
ipython                  8.23.0
jedi                     0.19.1
Jinja2                   3.1.3
joblib                   1.4.0
jupyter_client           8.6.1
jupyter_core             5.7.2
kiwisolver               1.4.5
lxml                     5.1.0
MarkupSafe               2.1.5
matplotlib               3.8.4
matplotlib-inline        0.1.7
more-itertools           10.2.0
mpmath                   1.3.0
multidict                6.0.5
multiprocess             0.70.16
nest-asyncio             1.6.0
networkx                 3.3
ninja                    1.11.1.1
nltk                     3.8.1
nodeenv                  1.8.0
numpy                    1.26.4
nvidia-cublas-cu12       12.1.3.1
nvidia-cuda-cupti-cu12   12.1.105
nvidia-cuda-nvrtc-cu12   12.1.105
nvidia-cuda-runtime-cu12 12.1.105
nvidia-cudnn-cu12        8.9.2.26
nvidia-cufft-cu12        11.0.2.54
nvidia-curand-cu12       10.3.2.106
nvidia-cusolver-cu12     11.4.5.107
nvidia-cusparse-cu12     12.1.0.106
nvidia-nccl-cu12         2.19.3
nvidia-nvjitlink-cu12    12.4.127
nvidia-nvtx-cu12         12.1.105
openai                   1.7.2
packaging                24.0
pandas                   2.2.2
parso                    0.8.4
peft                     0.10.0
pexpect                  4.9.0
pillow                   10.3.0
pip                      24.0
platformdirs             4.2.0
pluggy                   1.4.0
pre-commit               3.4.0
prompt-toolkit           3.0.43
protobuf                 4.25.3
psutil                   5.9.8
ptyprocess               0.7.0
pure-eval                0.2.2
pyarrow                  15.0.2
pyarrow-hotfix           0.6
pydantic                 2.7.0
pydantic_core            2.18.1
Pygments                 2.17.2
pyparsing                3.1.2
pytest                   7.4.4
python-dateutil          2.9.0.post0
pytz                     2024.1
PyYAML                   6.0.1
pyzmq                    26.0.0
regex                    2023.12.25
requests                 2.31.0
safetensors              0.4.3
sentry-sdk               1.45.0
setproctitle             1.3.3
setuptools               65.5.0
six                      1.16.0
smmap                    5.0.1
sniffio                  1.3.1
soupsieve                2.5
stack-data               0.6.3
sympy                    1.12
tiktoken                 0.5.2
tokenizers               0.15.2
torch                    2.2.2
tornado                  6.4
tqdm                     4.66.2
traitlets                5.14.2
transformers             4.40.0.dev0
triton                   2.2.0
typing_extensions        4.11.0
tzdata                   2024.1
urllib3                  2.2.1
virtualenv               20.25.1
wandb                    0.16.6
wcwidth                  0.2.13
wheel                    0.43.0
xxhash                   3.4.1
yarl                     1.9.4
zss                      1.1.4

Nvidia-smi:

Wed Apr 17 18:54:00 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08              Driver Version: 545.23.08    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA RTX 6000 Ada Gene...    On  | 00000000:01:00.0 Off |                  Off |
| 30%   33C    P8              29W / 300W |      5MiB / 49140MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX 6000 Ada Gene...    On  | 00000000:2D:00.0 Off |                  Off |
| 30%   49C    P8              34W / 300W |      5MiB / 49140MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA RTX 6000 Ada Gene...    On  | 00000000:41:00.0 Off |                  Off |
| 30%   41C    P8              26W / 300W |      5MiB / 49140MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA RTX 6000 Ada Gene...    On  | 00000000:61:00.0 Off |                  Off |
| 30%   45C    P8              25W / 300W |      5MiB / 49140MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

Also, here are flame graphs of running the forward pass:

  1. Using full parameters (no LoRA/QLoRA):
    full.png

  2. Using QLoRA (which OOMs):
    qlora.png

(I can upload the pickles to a drive if they are of any use to you.)

Also, if I can provide any more details, please let me know.

Thank you

HuggingFaceM4 org

Thanks for the details @starzmustdie. I will circle back if I need more info; it's on my to-do list for today to dig in.

In the meantime, perhaps the Colab where I did an inference in less than 10 GB of GPU memory will be useful: https://colab.research.google.com/drive/1P8goWEyrScceBEMp4dD3eh2aCLMud_Su?usp=sharing
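
The gist of that setup is roughly the following (a sketch, not the exact notebook code; the prompt text is a placeholder):

import torch
from transformers import AutoModelForVision2Seq, AutoProcessor, BitsAndBytesConfig

processor = AutoProcessor.from_pretrained("HuggingFaceM4/idefics2-8b")
model = AutoModelForVision2Seq.from_pretrained(
    "HuggingFaceM4/idefics2-8b",
    torch_dtype=torch.float16,
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.float16,
    ),
)

# text-only chat message (~4k tokens of text in practice)
messages = [{"role": "user", "content": [{"type": "text", "text": "a very long text prompt ..."}]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    generated_ids = model.generate(**inputs, max_new_tokens=64)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])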

Regarding the issue of the extremely slow instantiation of the LoRA version of the model (but not the QLoRA one), I did some benchmarking on the adapter-injection code in the peft library. Here is what I found:

  1. Loading Idefics2 with QLoRA is almost instantaneous. Here is the output of measuring the time:
(truncated output...)
Time for injecting adapter to model.text_model.layers.28.mlp.up_proj: 0.004160881042480469
Time for injecting adapter to model.text_model.layers.28.mlp.down_proj: 0.005509138107299805
Time for injecting adapter to model.text_model.layers.29.self_attn.q_proj: 0.002628326416015625
Time for injecting adapter to model.text_model.layers.29.self_attn.k_proj: 0.0020885467529296875
Time for injecting adapter to model.text_model.layers.29.self_attn.v_proj: 0.0025033950805664062
Time for injecting adapter to model.text_model.layers.29.self_attn.o_proj: 0.0025649070739746094
Time for injecting adapter to model.text_model.layers.29.mlp.gate_proj: 0.003892183303833008
Time for injecting adapter to model.text_model.layers.29.mlp.up_proj: 0.0038487911224365234
Time for injecting adapter to model.text_model.layers.29.mlp.down_proj: 0.0054149627685546875
Time for injecting adapter to model.text_model.layers.30.self_attn.q_proj: 0.0026171207427978516
Time for injecting adapter to model.text_model.layers.30.self_attn.k_proj: 0.002309083938598633
Time for injecting adapter to model.text_model.layers.30.self_attn.v_proj: 0.002630949020385742
Time for injecting adapter to model.text_model.layers.30.self_attn.o_proj: 0.002721071243286133
Time for injecting adapter to model.text_model.layers.30.mlp.gate_proj: 0.0038025379180908203
Time for injecting adapter to model.text_model.layers.30.mlp.up_proj: 0.0038831233978271484
Time for injecting adapter to model.text_model.layers.30.mlp.down_proj: 0.005049228668212891
Time for injecting adapter to model.text_model.layers.31.self_attn.q_proj: 0.002608776092529297
Time for injecting adapter to model.text_model.layers.31.self_attn.k_proj: 0.002328157424926758
Time for injecting adapter to model.text_model.layers.31.self_attn.v_proj: 0.0026826858520507812
Time for injecting adapter to model.text_model.layers.31.self_attn.o_proj: 0.0026001930236816406
Time for injecting adapter to model.text_model.layers.31.mlp.gate_proj: 0.003985166549682617
Time for injecting adapter to model.text_model.layers.31.mlp.up_proj: 0.0039288997650146484
Time for injecting adapter to model.text_model.layers.31.mlp.down_proj: 0.0049343109130859375
Average time for injecting adapter:  0.0027877162142497737
  2. Loading Idefics2 with LoRA is extremely slow. More precisely, it takes ~5 seconds to inject the adapter into some of the layers, which means loading the model can take many minutes:
Time for injecting adapter to model.vision_model.encoder.layers.23.self_attn.k_proj: 0.025675296783447266
Time for injecting adapter to model.vision_model.encoder.layers.23.self_attn.v_proj: 0.0254061222076416
Time for injecting adapter to model.vision_model.encoder.layers.23.self_attn.q_proj: 0.024999380111694336
Time for injecting adapter to model.vision_model.encoder.layers.23.self_attn.out_proj: 0.025428056716918945
Time for injecting adapter to model.vision_model.encoder.layers.23.mlp.fc1: 0.08384990692138672
Time for injecting adapter to model.vision_model.encoder.layers.23.mlp.fc2: 0.0839688777923584
Time for injecting adapter to model.vision_model.encoder.layers.24.self_attn.k_proj: 0.025103330612182617
Time for injecting adapter to model.vision_model.encoder.layers.24.self_attn.v_proj: 0.025132179260253906
Time for injecting adapter to model.vision_model.encoder.layers.24.self_attn.q_proj: 0.025215625762939453
Time for injecting adapter to model.vision_model.encoder.layers.24.self_attn.out_proj: 0.02485203742980957
Time for injecting adapter to model.vision_model.encoder.layers.24.mlp.fc1: 0.08424067497253418
Time for injecting adapter to model.vision_model.encoder.layers.24.mlp.fc2: 0.08370113372802734
Time for injecting adapter to model.vision_model.encoder.layers.25.self_attn.k_proj: 0.02559804916381836
Time for injecting adapter to model.vision_model.encoder.layers.25.self_attn.v_proj: 0.025536537170410156
Time for injecting adapter to model.vision_model.encoder.layers.25.self_attn.q_proj: 0.02513265609741211
Time for injecting adapter to model.vision_model.encoder.layers.25.self_attn.out_proj: 0.024949312210083008
Time for injecting adapter to model.vision_model.encoder.layers.25.mlp.fc1: 0.08578777313232422
Time for injecting adapter to model.vision_model.encoder.layers.25.mlp.fc2: 0.08488917350769043
Time for injecting adapter to model.vision_model.encoder.layers.26.self_attn.k_proj: 0.025366783142089844
Time for injecting adapter to model.vision_model.encoder.layers.26.self_attn.v_proj: 0.025299787521362305
Time for injecting adapter to model.vision_model.encoder.layers.26.self_attn.q_proj: 0.025112152099609375
Time for injecting adapter to model.vision_model.encoder.layers.26.self_attn.out_proj: 0.024931669235229492
Time for injecting adapter to model.vision_model.encoder.layers.26.mlp.fc1: 0.08368563652038574
Time for injecting adapter to model.vision_model.encoder.layers.26.mlp.fc2: 0.08527112007141113
Time for injecting adapter to model.connector.modality_projection.gate_proj: 0.2910897731781006
Time for injecting adapter to model.connector.modality_projection.up_proj: 0.28179073333740234
Time for injecting adapter to model.connector.modality_projection.down_proj: 4.413707256317139
Time for injecting adapter to model.connector.perceiver_resampler.layers.0.self_attn.q_proj: 0.4621694087982178
Time for injecting adapter to model.connector.perceiver_resampler.layers.0.self_attn.k_proj: 0.11468839645385742
Time for injecting adapter to model.connector.perceiver_resampler.layers.0.self_attn.v_proj: 0.11388301849365234
Time for injecting adapter to model.connector.perceiver_resampler.layers.0.self_attn.o_proj: 0.1406879425048828
Time for injecting adapter to model.connector.perceiver_resampler.layers.0.mlp.gate_proj: 4.9388508796691895  <---------
Time for injecting adapter to model.connector.perceiver_resampler.layers.0.mlp.up_proj: 4.939562559127808  <---------
Time for injecting adapter to model.connector.perceiver_resampler.layers.0.mlp.down_proj: 4.912921190261841  <---------
Time for injecting adapter to model.connector.perceiver_resampler.layers.1.self_attn.q_proj: 0.4630849361419678 
Time for injecting adapter to model.connector.perceiver_resampler.layers.1.self_attn.k_proj: 0.11741971969604492
Time for injecting adapter to model.connector.perceiver_resampler.layers.1.self_attn.v_proj: 0.11777377128601074
Time for injecting adapter to model.connector.perceiver_resampler.layers.1.self_attn.o_proj: 0.14248085021972656
Time for injecting adapter to model.connector.perceiver_resampler.layers.1.mlp.gate_proj: 4.919850587844849  <---------
Time for injecting adapter to model.connector.perceiver_resampler.layers.1.mlp.up_proj: 4.9399824142456055  <---------
Time for injecting adapter to model.connector.perceiver_resampler.layers.1.mlp.down_proj: 4.886034965515137  <---------
Time for injecting adapter to model.connector.perceiver_resampler.layers.2.self_attn.q_proj: 0.45731663703918457
Time for injecting adapter to model.connector.perceiver_resampler.layers.2.self_attn.k_proj: 0.11633014678955078
Time for injecting adapter to model.connector.perceiver_resampler.layers.2.self_attn.v_proj: 0.11535906791687012
Time for injecting adapter to model.connector.perceiver_resampler.layers.2.self_attn.o_proj: 0.13594913482666016
Time for injecting adapter to model.connector.perceiver_resampler.layers.2.mlp.gate_proj: 4.919170618057251   <---------
Time for injecting adapter to model.connector.perceiver_resampler.layers.2.mlp.up_proj: 4.907344579696655  <---------
Time for injecting adapter to model.connector.perceiver_resampler.layers.2.mlp.down_proj: 5.401084661483765  <---------
(continues, but truncated)
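
For reference, a coarse way to reproduce the comparison without instrumenting peft internals is to time the add_adapter call itself (a sketch, assuming the model and lora_config from the scripts above are already constructed):

import time

start = time.perf_counter()
model.add_adapter(lora_config)
print(f"add_adapter took {time.perf_counter() - start:.1f} s")
# near-instant in the QLoRA configuration, many minutes in the LoRA one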

I have tried the same benchmarking when adding LoRA/QLoRA adapters to the base mistralai/Mistral-7B-v0.1 model, and the results are consistent with the above.

Why is this the case? When I did LoRA fine-tuning of Mistral in axolotl it definitely didn't take this long. Has some new bug been introduced?

HuggingFaceM4 org

With respect to the last comment on LoRA loading, do you have an idea of what could be happening @smangrul? 🙏

@VictorSanh any luck reproducing the OOM issue? :)

@VictorSanh @smangrul Update: the reason loading the LoRA model took so long compared to QLoRA is the flag use_dora=True in LoraConfig. 🤦‍♂️
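
In other words, keeping DoRA disabled makes add_adapter fast again for both setups; a sketch of the corrected config:

from peft import LoraConfig

lora_config = LoraConfig(
    r=32,
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules='.*(text_model|modality_projection|perceiver_resampler).*(down_proj|gate_proj|up_proj|k_proj|q_proj|v_proj|o_proj).*$',
    use_dora=False,  # use_dora=True was what made add_adapter take minutes
    init_lora_weights="gaussian",
)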

HuggingFaceM4 org

@VictorSanh any luck reproducing the OOM issue? :)

No luck so far. I see memory usage that is significantly lower than what you are observing (these numbers are computed with the default example in the model card for idefics2-8b).

Screenshot 2024-04-19 at 11.55.02 AM.png
