LongT5 fails in FP16 mode

#1
by ArthurCamara - opened

Basically what the title says.
If I instantiate a LongT5 model and run it on CPU or CUDA, it works as intended:

from transformers import AutoTokenizer, LongT5EncoderModel
model = LongT5EncoderModel.from_pretrained("google/long-t5-tglobal-base")
tokenizer = AutoTokenizer.from_pretrained("google/long-t5-tglobal-base")
inputs = tokenizer("<extra_id_0>  Hello, my dog is cute", return_tensors="pt")
model(**inputs)
Out[19]:
BaseModelOutputWithPastAndCrossAttentions(last_hidden_state=tensor([[[-0.1624, -0.1439,  0.0011,  ..., -0.2340,  0.2113,  0.1893],
         [ 0.0917,  0.0566,  0.0013,  ...,  0.0890,  0.2563, -0.1880],
         [ 0.0172, -0.0204,  0.0013,  ...,  0.0048, -0.0575, -0.0638],
         ...,
         [-0.0219, -0.0702,  0.0009,  ..., -0.0568, -0.0474,  0.0188],
         [-0.0876,  0.0266,  0.0008,  ...,  0.0385,  0.0675,  0.2390],
         [-0.0128, -0.0052, -0.0009,  ..., -0.0212,  0.0151, -0.0093]]],
       grad_fn=<MulBackward0>), past_key_values=None, hidden_states=None, attentions=None, cross_attentions=None)

That's as expected. The same behaviour happens on a GPU:

## From GPU
model = model.to("cuda")
for k, v in inputs.items():
    inputs[k] = v.to("cuda")
model(**inputs)
Out[22]:
BaseModelOutputWithPastAndCrossAttentions(last_hidden_state=tensor([[[-0.1624, -0.1439,  0.0011,  ..., -0.2340,  0.2113,  0.1893],
         [ 0.0917,  0.0566,  0.0013,  ...,  0.0890,  0.2563, -0.1880],
         [ 0.0172, -0.0204,  0.0013,  ...,  0.0048, -0.0575, -0.0638],
         ...,
         [-0.0219, -0.0702,  0.0009,  ..., -0.0568, -0.0474,  0.0188],
         [-0.0876,  0.0266,  0.0008,  ...,  0.0385,  0.0675,  0.2390],
         [-0.0128, -0.0052, -0.0009,  ..., -0.0212,  0.0151, -0.0093]]],
       device='cuda:0', grad_fn=<MulBackward0>), past_key_values=None, hidden_states=None, attentions=None, cross_attentions=None)

But if I call half() on the model, it only returns NaNs:

model = model.half()
model(**inputs)
Out[24]:
BaseModelOutputWithPastAndCrossAttentions(last_hidden_state=tensor([[[nan, nan, nan,  ..., nan, nan, nan],
         [nan, nan, nan,  ..., nan, nan, nan],
         [nan, nan, nan,  ..., nan, nan, nan],
         ...,
         [nan, nan, nan,  ..., nan, nan, nan],
         [nan, nan, nan,  ..., nan, nan, nan],
         [nan, nan, nan,  ..., nan, nan, nan]]], device='cuda:0',
       dtype=torch.float16, grad_fn=<MulBackward0>), past_key_values=None, hidden_states=None, attentions=None, cross_attentions=None)

Removing the <extra_id_0> token doesn't help either.
Any ideas on what is causing this?

Never mind, switching to main instead of the release version solves this.
EDIT: No, it didn't =(
Using BF16 solved it.
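
For reference, a minimal sketch of the BF16 workaround (loading via torch_dtype and running on a GPU with bf16 support, e.g. Ampere or newer, are my assumptions, not something verified in this thread):

import torch
from transformers import AutoTokenizer, LongT5EncoderModel

# Load the encoder directly in bfloat16 instead of casting to float16 with .half()
model = LongT5EncoderModel.from_pretrained(
    "google/long-t5-tglobal-base", torch_dtype=torch.bfloat16
).to("cuda")
tokenizer = AutoTokenizer.from_pretrained("google/long-t5-tglobal-base")

inputs = tokenizer("<extra_id_0>  Hello, my dog is cute", return_tensors="pt").to("cuda")
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.last_hidden_state.dtype)  # torch.bfloat16, finite values rather than NaNs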

ArthurCamara changed discussion status to closed

Yup, BF16 solved it.
