
CUDA out of memory on A100 with 40GB

#8
by SkalskiP - opened

Hi, I tried to run the code example from this blog post: https://huggingface.co/blog/idefics2 on an A100 in Colab, but it failed at generated_ids = model.generate(**inputs, max_new_tokens=500). Is there any way to optimize the inference?

HuggingFaceM4 org

hi @sagnak , you could deactivate image splitting to save some memory -> processor = AutoProcessor.from_pretrained("HuggingFaceM4/idefics2-8b", do_image_splitting=False)

HuggingFaceM4 org

And since you are using an A100, you should definitely load the model in bf16:

import torch
from transformers import AutoModelForVision2Seq

DEVICE = "cuda:0"
model = AutoModelForVision2Seq.from_pretrained(
    "HuggingFaceM4/idefics2-8b",
    torch_dtype=torch.bfloat16,
).to(DEVICE)

I assume you can also load the weights in 4-bit:

from transformers import AutoModelForVision2Seq, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(load_in_4bit=True)

model = AutoModelForVision2Seq.from_pretrained(
    "HuggingFaceM4/idefics2-8b",
    quantization_config=quantization_config,
    device_map="auto",
)

See the blog post for more details: https://huggingface.co/blog/4bit-transformers-bitsandbytes. Btw @SkalskiP , there's a chat version coming soon. This model has only undergone supervised fine-tuning (SFT), so it can't be compared directly to LLaVA, for instance (unless you want to compute metrics on multimodal benchmarks).
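
As a side note (an addition, not part of the reply above): BitsAndBytesConfig also accepts a few knobs that are commonly set for 4-bit inference, such as the quantization type and compute dtype. A minimal sketch of that variant:

import torch
from transformers import AutoModelForVision2Seq, BitsAndBytesConfig

# Variant of the 4-bit config above: NF4 quantization with bf16 compute and nested quantization.
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForVision2Seq.from_pretrained(
    "HuggingFaceM4/idefics2-8b",
    quantization_config=quantization_config,
    device_map="auto",
)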

I implemented the above optimizations, but I still get OOM on an A6000 GPU with 48 GB of VRAM:

import requests
import torch
from PIL import Image

from transformers import AutoProcessor, AutoModelForVision2Seq
from transformers.image_utils import load_image
from transformers import BitsAndBytesConfig

DEVICE = "cuda:0"

# Note that passing the image urls (instead of the actual pil images) to the processor is also possible
image1 = load_image("https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg")
image2 = load_image("https://cdn.britannica.com/59/94459-050-DBA42467/Skyline-Chicago.jpg")
image3 = load_image("https://cdn.britannica.com/68/170868-050-8DDE8263/Golden-Gate-Bridge-San-Francisco.jpg")


quantization_config = BitsAndBytesConfig(load_in_4bit=True)
processor = AutoProcessor.from_pretrained("HuggingFaceM4/idefics2-8b", do_image_splitting=False)
model = AutoModelForVision2Seq.from_pretrained(
    "HuggingFaceM4/idefics2-8b",
    torch_dtype=torch.bfloat16,
    quantization_config=quantization_config,
)


# Create inputs
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What do we see in this image?"},
        ]
    },
    {
        "role": "assistant",
        "content": [
            {"type": "text", "text": "In this image, we can see the city of New York, and more specifically the Statue of Liberty."},
        ]
    },
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "And how about this image?"},
        ]
    },
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image1, image2], return_tensors="pt")
inputs = {k: v.to(DEVICE) for k, v in inputs.items()}


# Generate
generated_ids = model.generate(**inputs, max_new_tokens=500)
generated_texts = processor.batch_decode(generated_ids, skip_special_tokens=True)

print(generated_texts)

Error:

OutOfMemoryError: CUDA out of memory. Tried to allocate 874.00 MiB. GPU 0 has a total capacity of 47.30 GiB of which 861.44 MiB is free. Including non-PyTorch memory, this process has 46.44 GiB memory in use. Of the allocated memory 45.59 GiB is allocated by PyTorch, and 356.39 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
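
Side note (not from the thread): the PYTORCH_CUDA_ALLOC_CONF setting suggested in the error message has to be in the environment before torch initializes CUDA; it mitigates fragmentation rather than shrinking the model's actual footprint. A minimal sketch:

import os

# Must be set before importing torch (or exported in the shell before launching the script).
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch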

hi @starzmustdie , I just used your exact code snippet (modulo the pip installs) on a 16GB V100 and it's spiking at 9.5GB of GPU memory... can you say more about your setup?
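
For anyone comparing numbers like the 9.5GB figure above, torch exposes counters for peak memory; a small sketch, assuming generation has already been run on cuda:0:

import torch

# Peak memory allocated by PyTorch on the device since the last reset, in GiB.
peak_gib = torch.cuda.max_memory_allocated("cuda:0") / 1024**3
print(f"peak allocated: {peak_gib:.2f} GiB")

# Reset the counter before the next measurement.
torch.cuda.reset_peak_memory_stats("cuda:0")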

HuggingFaceM4 org

@starzmustdie et al.,
I updated the section https://huggingface.co/HuggingFaceM4/idefics2-8b#model-optimizations to include some benchmarks on how to run idefics2 with very little GPU memory.
TL;DR: there are plenty of low-lift setups that require less than 16GB of GPU memory to run inference.
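
For example, one optimization transformers exposes (assuming the flash-attn package is installed and the GPU supports it; see the linked section for the setups that were actually benchmarked) is switching the attention implementation, which trims activation memory and can be combined with the quantization discussed above:

import torch
from transformers import AutoModelForVision2Seq

# Assumption: requires the flash-attn package and a compatible (Ampere or newer) GPU.
model = AutoModelForVision2Seq.from_pretrained(
    "HuggingFaceM4/idefics2-8b",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
).to("cuda:0")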
