Unscale FP16 Gradients Help

#5
by CalebRBP - opened

Hello,

I'm trying to train a model using the Trainer from the Transformers library. I am using a quantized model with FP16 training enabled, but during training I encounter the error ValueError: Attempting to unscale FP16 gradients.

Here is my code:

import transformers
from torch.nn import CrossEntropyLoss
from transformers import AutoTokenizer
from datasets import load_dataset

# Define your model and tokenizer (these should already be defined in your code)

MODEL_NAME = "vilsonrodrigues/falcon-7b-instruct-sharded"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token

# Load your data

data = load_dataset('csv', data_files='/content/Sumoquote Training Database.csv')

# Define your tokenizer function

def tokenize_and_format(examples):
    # Here, I'm assuming that the 'User' and 'Prompt' fields in your CSV contain the text you want to model.
    text = [f"{x} {y}" for x, y in zip(examples['User'], examples['Prompt'])]
    tokenized = tokenizer(text, truncation=True, padding='max_length')

    # Format the data for causal language modeling
    tokenized['labels'] = tokenized['input_ids'].copy()
    tokenized['input_ids'] = [ids[:-1] for ids in tokenized['input_ids']]
    tokenized['labels'] = [ids[1:] for ids in tokenized['labels']]

    return tokenized

# Apply the tokenizer function to your data

data = data.map(tokenize_and_format, batched=True)
data.set_format(type='torch', columns=['input_ids', 'labels'])

# Define the training arguments

training_args = transformers.TrainingArguments(
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    num_train_epochs=1,
    learning_rate=2e-4,
    fp16=True,
    save_total_limit=3,
    logging_steps=1,
    output_dir="experiments",
    optim="adamw_8bit",
    lr_scheduler_type="cosine",
    warmup_ratio=0.05,
)

# Define the callback

class EnsureGradsAreFP32(transformers.TrainerCallback):
    def on_backward_end(self, args, state, control, **kwargs):
        if args.fp16:
            for param in model.parameters():
                if param.grad is not None:
                    param.grad.data = param.grad.data.float()

# Create the Trainer

trainer = transformers.Trainer(
    model=model,
    train_dataset=data['train'],  # Here, I've used the Dataset
    args=training_args,
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
    callbacks=[EnsureGradsAreFP32()],
)

# Disable caching

model.config.use_cache = False

# Train the model

trainer.train()

Things I've tried:

- Disabling gradient accumulation.
- Changing the optimizer to "adamw_8bit".
- Making sure all gradients are in FP32 before calling optimizer.step().
- Disabling caching.

Despite these efforts, the problem persists. Any guidance would be greatly appreciated.
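For reference, torch.cuda.amp.GradScaler raises this error when it is asked to unscale gradients that are stored in FP16, and gradients take the dtype of their parameters. A quick way to check whether that is happening here is to list the trainable parameters kept in FP16 (a minimal diagnostic sketch; it assumes the model object is already defined as in my script):

import torch

# List trainable parameters stored in FP16 -- these are the ones whose
# gradients the scaler would refuse to unscale.
fp16_trainable = [
    name
    for name, param in model.named_parameters()
    if param.requires_grad and param.dtype == torch.float16
]
print(f"{len(fp16_trainable)} trainable FP16 parameters")
for name in fp16_trainable[:10]:
    print(name)

If any names show up, those parameters produce FP16 gradients, which is exactly what the scaler complains about when fp16=True.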

how do you define your model?
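For example, is it loaded quantized with bitsandbytes, roughly like this? (Just a sketch; the 4-bit settings below are assumptions, not taken from your post.)

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Illustrative loading code only -- 4-bit NF4 quantization with FP16 compute;
# adjust to match how the checkpoint is actually loaded.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    "vilsonrodrigues/falcon-7b-instruct-sharded",
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)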

what is your Transformers version?
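You can print it with:

import transformers

print(transformers.__version__)  # the installed Transformers version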

Apparently your code is correct and should work. I recommend opening an issue on GitHub showing the complete code.
