CUDA error in modeling_llava.py

#6
by gullalc - opened

python3.8/site-packages/transformers/models/llava/modeling_llava.py", line 428, in forward
extended_attention_mask[batch_index, non_attended_tokens] = 0
RuntimeError: CUDA error: device-side assert triggered
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Error reproduced with CUDA_LAUNCH_BLOCKING=1.

Is this specific to the new LLaVA model on Hugging Face?

Llava Hugging Face org

Hi @gullal1491
Can you share a reproducible snippet?

The model is loaded as:

import torch
from transformers import AutoProcessor, LlavaForConditionalGeneration

## Load model
model = LlavaForConditionalGeneration.from_pretrained(
    args.model_name,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    use_flash_attention_2=True,
).to(device)

processor = AutoProcessor.from_pretrained(args.model_name)

prompt = "USER: <image>\n%s\nASSISTANT:"%(config["prompts"][args.prompt])

Images are processed in a batch:

image_descriptions = {}

for i, batch in enumerate(loader):
    print(i + 1)

    # the original snippet unpacked image_ids twice; assuming the loader
    # yields (image_ids, prompts, image_paths)
    image_ids, prompts, image_paths = batch

    images = [Image.open(img_path).convert("RGB") for img_path in image_paths]

    inputs = processor(prompts, images, return_tensors='pt').to(device, torch.float16)

    outputs = model.generate(**inputs, max_new_tokens=100, do_sample=False)

    generated_text = processor.batch_decode(outputs, skip_special_tokens=True)

    for img_id, text in zip(image_ids, generated_text):
        image_descriptions[img_id] = text.split("ASSISTANT:")[-1]

It works perfectly with one prompt over a set of 1000 images, but with a different prompt it fails on the same batch of images.

Prompts used:
one: "Write a short description for the image."
two: "Write a detailed description for the image."
three: "Write a descriptive caption for the image by focusing on entities and relations present in it."

It gives the same error for prompts two and three. Somehow it has to do with the length of the text produced by the model during generation.

Llava Hugging Face org

@gullal1491
Thanks! Can you try to run the problematic generation on CPU instead of GPU? Then the issue would be clearer - meanwhile I will try to reproduce based on your script

Interesting that there is no error when running on CPU.

## Load model

model = LlavaForConditionalGeneration.from_pretrained(
    args.model_name,
    low_cpu_mem_usage=True
).to(device)

Could this be because of other libraries like accelerate, or because of running the model in half precision on GPU?
Any help in getting closer to fixing this issue is appreciated.

Could batched inference be an issue? I did try to run with batch size 1 and still encountered the same error.

LLava 1.5 batched inference issue: https://github.com/haotian-liu/LLaVA/issues/709

Llava Hugging Face org

We explicitly test batched generation here: https://github.com/huggingface/transformers/blob/main/tests/models/llava/test_modeling_llava.py#L245, so it should work out of the box. Could you try running the code snippet present there?
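For reference, a minimal batched-generation sketch along the lines of that test (the model id, image, and prompts below are placeholders, not copied verbatim from the test):

import requests
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"
model = LlavaForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")
processor = AutoProcessor.from_pretrained(model_id)

prompts = [
    "USER: <image>\nWrite a short description for the image.\nASSISTANT:",
    "USER: <image>\nWrite a detailed description for the image.\nASSISTANT:",
]
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
images = [Image.open(requests.get(url, stream=True).raw)] * 2

# padding=True is needed because the two prompts tokenize to different lengths
inputs = processor(text=prompts, images=images, return_tensors="pt", padding=True).to("cuda", torch.float16)
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=False)
print(processor.batch_decode(outputs, skip_special_tokens=True))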

Thanks @nielsr for the response.

I did try following it line by line, but unfortunately it fails for a specific sample or two. I am trying to debug based on the error trace and was able to identify this out-of-bounds error in modeling_llava.py.

On this line where error occurs:
extended_attention_mask[batch_index, non_attended_tokens] = 0

The error is:

`/opt/conda/conda-bld/pytorch_1695392020195/work/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [0,0,0], thread: [3,0,0] Assertion `-sizes[i] <= index && index < sizes[i] && "index out of bounds"` failed.`

As it is an indexing error, there is a value in non_attended_tokens which is greater than the sequence length (the second dimension) of extended_attention_mask.
In this particular batch, where it fails:
non_attended_tokens looks like: tensor([ 84, 128, 216, 612, 549, 571, 571, 238, 505, 81])
whereas the shape of extended_attention_mask is torch.Size([128, 575]), so the index 612 is out of bounds.

I don't know why exactly this is happening.
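For reference, the same assert can be reproduced in isolation with a minimal sketch (not the model code):

import torch

# Indexing a [128, 575] mask with an index >= 575 triggers the same
# device-side assert on GPU (and a plain IndexError on CPU).
extended_attention_mask = torch.ones(128, 575)
batch_index = torch.tensor([0, 1, 2])
non_attended_tokens = torch.tensor([84, 612, 505])  # 612 is out of bounds for a dim of size 575

extended_attention_mask[batch_index, non_attended_tokens] = 0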

Llava Hugging Face org

Hmm, it might be a bug then. Can you help us reproduce it by sending us the problematic images and prompts here?

Sure @ybelkada . Where should I send them? I cannot share the images via public links. But I can share them privately.

Another thing I noticed with token ids is this:

print(model.config.text_config.pad_token_id)  -> None
print(processor.tokenizer.pad_token_id)  -> 32001

Is this intended? Setting the text_config pad_token_id to the tokenizer's value (as suggested in some discussions on InstructBLIP batched inference errors) does not resolve the above error, though.
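For reference, the alignment I tried looks roughly like this:

# align the text config's pad_token_id with the tokenizer's pad token
model.config.text_config.pad_token_id = processor.tokenizer.pad_token_id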

I avoided this out-of-bounds error for now by ensuring the indices are within bounds. It does not change any result in generation.
I will dig deeper later into why this index ended up in non_attended_tokens in the first place.

For now, I am dealing with this by avoiding that index:

# Ensure the indices are within bounds before zeroing out the mask
valid_indices = non_attended_tokens < extended_attention_mask.shape[1]
new_batch_index = batch_index[valid_indices]
new_non_attended_tokens = non_attended_tokens[valid_indices]

# then use the filtered indices in place of the original ones
extended_attention_mask[new_batch_index, new_non_attended_tokens] = 0

Llava Hugging Face org

Hi @gullal1491
This seems like a valid fix, and we've seen similar issues with BakLLaVA as well. Would you mind opening a PR to introduce this fix?

Llava Hugging Face org

Hi @gullal1491
Thanks again for the investigation and the fix. As this might affect other users in the future, I quickly made https://github.com/huggingface/transformers/pull/28032 and made sure to add you as a co-author of the fix! Thanks again for all your help.

Llava Hugging Face org

This is now fixed on transformers main! Closing this issue - thanks a lot @gullal1491 for everything!

ybelkada changed discussion status to closed
