Loss = 0 and Gradient = NaN in ModernBERT Fine-Tuning for Regression

#63
by saran1999 - opened

I am facing an issue while fine-tuning ModernBERT for a regression task. I get a loss of 0 and NaN gradients, but this problem does not occur when using BERT. I have pre-trained this model from scratch on my domain dataset.

  • Flash attention is disabled.
  • Tried both fp16=True and fp16=False; the problem occurs either way.
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 1.0000000000000002e-06, 'epoch': 0.0}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 2.0000000000000003e-06, 'epoch': 0.0}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 3e-06, 'epoch': 0.0}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 4.000000000000001e-06, 'epoch': 0.0}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 5e-06, 'epoch': 0.0}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 6e-06, 'epoch': 0.0}

Model Architecture:

import torch.nn as nn
from transformers import ModernBertModel, ModernBertPreTrainedModel


class ModernBertForRegression(ModernBertPreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        self.model = ModernBertModel(config)
        self.regressor = nn.Linear(config.hidden_size, 1)  # single scalar output
        self.init_weights()

    def forward(self, input_ids=None, attention_mask=None, labels=None):
        outputs = self.model(input_ids=input_ids, attention_mask=attention_mask)
        pooled_output = outputs[0][:, 0, :]  # CLS token embedding
        predictions = self.regressor(pooled_output)

        loss = None
        if labels is not None:
            loss_fct = nn.MSELoss()
            # squeeze(-1) drops only the output dimension, so a batch of size 1 keeps its shape
            loss = loss_fct(predictions.squeeze(-1), labels)

        return loss, predictions

Training Arguments:

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=2,
    learning_rate=1e-4,
    fp16=True,
    logging_dir="./logs",
)

Would appreciate any suggestions on why this could be happening...

This should probably fix your issue:

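import torch.nn as nn
from transformers import ModernBertModel, ModernBertPreTrainedModel
from transformers.models.modernbert.modeling_modernbert import ModernBertPredictionHead


class ModernBertForRegression(ModernBertPreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        self.model = ModernBertModel(config)
        self.head = ModernBertPredictionHead(config)  # dense + activation + norm before the regressor
        self.regressor = nn.Linear(config.hidden_size, 1)
        self.init_weights()

    def forward(self, input_ids=None, attention_mask=None, labels=None):
        outputs = self.model(input_ids=input_ids, attention_mask=attention_mask)
        last_hidden_state_cls = outputs[0][:, 0]  # CLS token embedding
        pooled_output = self.head(last_hidden_state_cls)
        predictions = self.regressor(pooled_output)

        loss = None
        if labels is not None:
            loss_fct = nn.MSELoss()
            # squeeze(-1) drops only the output dimension, so a batch of size 1 keeps its shape
            loss = loss_fct(predictions.squeeze(-1), labels)

        return loss, predictions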
saran1999 changed discussion status to closed
saran1999 changed discussion status to open

Hi @Tarok6, thanks for your suggestion. I tried it out, but I still get the same result.

I also tried updating torch from 2.5.1 to 2.6.0, but I still get the same result.
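One way to narrow down a NaN grad_norm like this is to check whether any parameter is already non-finite before the first training step. The sketch below is only an illustration: `find_non_finite_parameters` is a hypothetical helper, and the toy module stands in for the real encoder and head rather than reproducing ModernBERT.

```python
import torch
from torch import nn


def find_non_finite_parameters(model: nn.Module) -> list[str]:
    """Return the names of parameters that contain NaN or Inf values."""
    return [
        name
        for name, param in model.named_parameters()
        if not torch.isfinite(param).all()
    ]


# Toy stand-in for an encoder plus regression head (hypothetical, not ModernBERT).
toy = nn.ModuleDict({"encoder": nn.Linear(8, 8), "regressor": nn.Linear(8, 1)})

# Simulate a head whose weights ended up in a bad state.
with torch.no_grad():
    toy["regressor"].weight.fill_(float("nan"))

bad = find_non_finite_parameters(toy)
print(bad)  # names of the offending parameters
```

Running the same check on the actual model right after loading it would show whether the problem sits in the encoder or in the custom head.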

It seems the issue was with the regression head: its weights were not initialized properly for my task. So nothing was wrong with ModernBERT itself.
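The exact fix is not shown in this thread, so the following is only a minimal sketch of one way to initialize a linear regression head explicitly. The helper name `init_regression_head`, the `std=0.02` value, and the placeholder `hidden_size` are assumptions for illustration, not the settings actually used.

```python
import torch
from torch import nn


def init_regression_head(head: nn.Linear, std: float = 0.02) -> None:
    """Overwrite the head's parameters with small, finite values."""
    nn.init.normal_(head.weight, mean=0.0, std=std)
    if head.bias is not None:
        nn.init.zeros_(head.bias)


hidden_size = 16  # placeholder for config.hidden_size
regressor = nn.Linear(hidden_size, 1)

# Simulate a head whose weights ended up in a bad state.
with torch.no_grad():
    regressor.weight.fill_(float("nan"))

# Re-initialize explicitly, then confirm every value is finite.
init_regression_head(regressor)
assert torch.isfinite(regressor.weight).all()

# Random tensors standing in for CLS embeddings and regression targets.
cls_embeddings = torch.randn(4, hidden_size)
labels = torch.randn(4)

loss = nn.MSELoss()(regressor(cls_embeddings).squeeze(-1), labels)
loss.backward()
print(loss.item(), regressor.weight.grad.norm().item())
```

In the model class above, the equivalent call would go in `__init__` after the head is created, or after loading pre-trained weights if the head is not part of the checkpoint.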

saran1999 changed discussion status to closed