Loss = 0 and Gradient = NaN in ModernBERT Fine-Tuning for Regression
#63, opened by saran1999
I am facing an issue while fine-tuning ModernBERT for a regression task. I get a loss of 0 and NaN gradients, but this problem does not occur when using BERT. I have pre-trained this model from scratch on my domain dataset.
- Flash attention is disabled.
- Tried both fp16=True and fp16=False; the problem still occurs.
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 1.0000000000000002e-06, 'epoch': 0.0}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 2.0000000000000003e-06, 'epoch': 0.0}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 3e-06, 'epoch': 0.0}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 4.000000000000001e-06, 'epoch': 0.0}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 5e-06, 'epoch': 0.0}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 6e-06, 'epoch': 0.0}
Model Architecture:
class ModernBertForRegression(ModernBertPreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        self.model = ModernBertModel(config)
        self.regressor = nn.Linear(config.hidden_size, 1)
        self.init_weights()

    def forward(self, input_ids=None, attention_mask=None, labels=None):
        outputs = self.model(input_ids=input_ids, attention_mask=attention_mask)
        pooled_output = outputs[0][:, 0, :]  # CLS token embedding
        predictions = self.regressor(pooled_output)
        loss = None
        if labels is not None:
            loss_fct = nn.MSELoss()
            loss = loss_fct(predictions.squeeze(), labels)
        return loss, predictions
Training Arguments:
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=2,
    learning_rate=1e-4,
    fp16=True,
    logging_dir="./logs",
)
Would appreciate any suggestions on why this could be happening...
This should probably fix your issue:
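import torch.nn as nn
from transformers import ModernBertModel, ModernBertPreTrainedModel
from transformers.models.modernbert.modeling_modernbert import ModernBertPredictionHead

class ModernBertForRegression(ModernBertPreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        self.model = ModernBertModel(config)
        # Prediction head applied to the CLS embedding before the regressor
        self.head = ModernBertPredictionHead(config)
        self.regressor = nn.Linear(config.hidden_size, 1)
        self.init_weights()

    def forward(self, input_ids=None, attention_mask=None, labels=None):
        outputs = self.model(input_ids=input_ids, attention_mask=attention_mask)
        last_hidden_state_cls = outputs[0][:, 0]  # CLS token embedding
        pooled_output = self.head(last_hidden_state_cls)
        predictions = self.regressor(pooled_output)
        loss = None
        if labels is not None:
            loss_fct = nn.MSELoss()
            # squeeze(-1) drops only the output dimension, so a batch of one keeps its shape
            loss = loss_fct(predictions.squeeze(-1), labels)
        return loss, predictions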
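To narrow down where the NaNs first appear, it can help to run a single forward and backward pass outside of Trainer and check the loss, the parameters, and the gradients for non-finite values. Worth knowing here: `logging_nan_inf_filter` in `TrainingArguments` defaults to `True`, and with it NaN or inf step losses are filtered out of the logged value, so a logged loss of 0.0 next to `grad_norm: nan` may be hiding a non-finite loss. The sketch below is only an illustration under assumptions: `report_nonfinite` is a hypothetical helper, and `toy_model` is a small stand-in network, not ModernBERT.

```python
import torch
from torch import nn


def report_nonfinite(model: nn.Module, loss: torch.Tensor) -> dict:
    """Summarize non-finite values; call this after loss.backward()."""
    bad_params = [
        name for name, p in model.named_parameters()
        if not torch.isfinite(p).all()
    ]
    bad_grads = [
        name for name, p in model.named_parameters()
        if p.grad is not None and not torch.isfinite(p.grad).all()
    ]
    return {
        "loss_is_finite": bool(torch.isfinite(loss).all()),
        "nonfinite_params": bad_params,
        "nonfinite_grads": bad_grads,
    }


# Stand-in for the real model (assumption: any nn.Module works the same way).
torch.manual_seed(0)
toy_model = nn.Sequential(nn.Linear(8, 4), nn.Tanh(), nn.Linear(4, 1))
features = torch.randn(2, 8)
targets = torch.randn(2)

# Healthy step: everything should be finite.
loss = nn.functional.mse_loss(toy_model(features).squeeze(-1), targets)
loss.backward()
healthy_report = report_nonfinite(toy_model, loss)

# Simulate a badly initialized output layer by writing NaNs into its weight.
with torch.no_grad():
    toy_model[2].weight.fill_(float("nan"))
toy_model.zero_grad()
loss = nn.functional.mse_loss(toy_model(features).squeeze(-1), targets)
loss.backward()
broken_report = report_nonfinite(toy_model, loss)
```

With the real model, the same helper could be called on one batch from the training set. If `nonfinite_params` is already non-empty before any optimizer step, the problem is in the loaded or initialized weights rather than in the training loop.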
It seems the issue was with the regression head: its weights were not initialized properly for my task. So nothing was wrong with ModernBERT.
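For anyone hitting the same symptom, one way to rule out a badly initialized head is to overwrite its weights explicitly after the model is built or loaded. This is a minimal sketch, not the exact fix used in this thread: `init_regression_head` is a hypothetical helper, `std=0.02` is an assumed value, and `hidden_size = 16` is a placeholder for `config.hidden_size`.

```python
import torch
from torch import nn


def init_regression_head(head: nn.Linear, std: float = 0.02) -> None:
    """Overwrite the head with small normal weights and a zero bias."""
    nn.init.normal_(head.weight, mean=0.0, std=std)
    if head.bias is not None:
        nn.init.zeros_(head.bias)


hidden_size = 16  # placeholder for config.hidden_size (assumption)
regressor = nn.Linear(hidden_size, 1)

# Simulate a head whose weights were left in a bad state.
with torch.no_grad():
    regressor.weight.fill_(float("nan"))
    regressor.bias.fill_(float("nan"))

# Re-initialize and confirm every parameter is finite again.
init_regression_head(regressor)
head_is_finite = all(
    torch.isfinite(p).all().item() for p in regressor.parameters()
)
```

In the model from this thread, the equivalent call would be `init_regression_head(model.regressor)` once the checkpoint has been loaded, assuming the head keeps the attribute name `regressor`.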
saran1999 changed discussion status to closed