Loss = 0 and Gradient = NaN in ModernBERT Fine-Tuning for Regression

#63

by saran1999 - opened Feb 1

saran1999

Feb 1

I am facing an issue while fine-tuning ModernBERT for a regression task. I get a loss of 0 and NaN gradients, but this problem does not occur when using BERT. I have pre-trained this model from scratch on my domain dataset.

Flash attention is disabled.
I tried both fp16=True and fp16=False; the problem occurs either way.

{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 1.0000000000000002e-06, 'epoch': 0.0}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 2.0000000000000003e-06, 'epoch': 0.0}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 3e-06, 'epoch': 0.0}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 4.000000000000001e-06, 'epoch': 0.0}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 5e-06, 'epoch': 0.0}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 6e-06, 'epoch': 0.0}

Model Architecture:

import torch.nn as nn
from transformers import ModernBertModel, ModernBertPreTrainedModel

class ModernBertForRegression(ModernBertPreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        self.model = ModernBertModel(config)
        self.regressor = nn.Linear(config.hidden_size, 1)  # single scalar output
        self.init_weights()

    def forward(self, input_ids=None, attention_mask=None, labels=None):
        outputs = self.model(input_ids=input_ids, attention_mask=attention_mask)
        pooled_output = outputs[0][:, 0, :]  # CLS token embedding
        predictions = self.regressor(pooled_output)  # shape: (batch, 1)

        loss = None
        if labels is not None:
            loss_fct = nn.MSELoss()
            # squeeze(-1) keeps the batch dimension even when the batch size is 1
            loss = loss_fct(predictions.squeeze(-1), labels)

        return loss, predictions

Training Arguments:

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=2,
    learning_rate=1e-4,
    fp16=True,
    logging_dir="./logs",
)

I would appreciate any suggestions on why this could be happening.
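
A loss of 0 together with a NaN gradient norm is often easier to track down by first checking whether any weights are already non-finite before training starts. The following is a minimal sketch of such a check, not something taken from this thread: `find_nonfinite_parameters` is a hypothetical helper name, and the small `nn.Sequential` model is only a stand-in for the real regression model.

```python
import torch
from torch import nn


def find_nonfinite_parameters(model: nn.Module) -> list[str]:
    """Return the names of parameters that contain NaN or Inf values."""
    bad = []
    for name, param in model.named_parameters():
        if not torch.isfinite(param).all():
            bad.append(name)
    return bad


# Toy stand-in for the real model (assumption: any nn.Module works the same way).
toy = nn.Sequential(nn.Linear(4, 4), nn.Linear(4, 1))

# Simulate a layer whose weights were left in an invalid state.
with torch.no_grad():
    toy[1].weight.fill_(float("nan"))

print(find_nonfinite_parameters(toy))  # ['1.weight']
```

Running the same helper on the loaded fine-tuning model, before calling the trainer, would show whether the backbone or the newly added head is the source of the NaN values.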

Tarok6

Feb 1

This should probably fix your issue:

import torch.nn as nn
from transformers import ModernBertModel, ModernBertPreTrainedModel
from transformers.models.modernbert.modeling_modernbert import ModernBertPredictionHead

class ModernBertForRegression(ModernBertPreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        self.model = ModernBertModel(config)
        self.head = ModernBertPredictionHead(config)  # dense + activation + norm
        self.regressor = nn.Linear(config.hidden_size, 1)
        self.init_weights()

    def forward(self, input_ids=None, attention_mask=None, labels=None):
        outputs = self.model(input_ids=input_ids, attention_mask=attention_mask)
        last_hidden_state_cls = outputs[0][:, 0]  # CLS token embedding
        pooled_output = self.head(last_hidden_state_cls)
        predictions = self.regressor(pooled_output)  # shape: (batch, 1)

        loss = None
        if labels is not None:
            loss_fct = nn.MSELoss()
            # squeeze(-1) keeps the batch dimension even when the batch size is 1
            loss = loss_fct(predictions.squeeze(-1), labels)

        return loss, predictions

saran1999 changed discussion status to closed Feb 1

saran1999 changed discussion status to open Feb 1

saran1999

Feb 1

Hi @Tarok6, thanks for your suggestion. I tried it out but I still get the same result.

I also tried updating torch from 2.5.1 to 2.6.0, and I am still getting the same result.

saran1999

Feb 7

It seems the issue was with the regression head: its weights were not initialized properly for my task. So nothing was wrong with ModernBERT itself.
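
The thread does not show the exact fix, so the following is only a sketch of one way to initialize a linear regression head explicitly. `reinit_regression_head` is a hypothetical helper, and both `std=0.02` and the hidden size of 768 are illustrative assumptions rather than values confirmed here.

```python
import torch
from torch import nn


def reinit_regression_head(head: nn.Linear, std: float = 0.02) -> None:
    """Overwrite the head's weights with small Gaussian values and zero its bias."""
    nn.init.normal_(head.weight, mean=0.0, std=std)
    if head.bias is not None:
        nn.init.zeros_(head.bias)


# Stand-in for the `regressor` layer from the model above (hidden size assumed).
head = nn.Linear(768, 1)

# Simulate a head that was left with invalid values after loading.
with torch.no_grad():
    head.weight.fill_(float("nan"))

reinit_regression_head(head)
print(torch.isfinite(head.weight).all().item())  # True
```

With the model classes in this thread, the call would be `reinit_regression_head(model.regressor)` after loading the pre-trained checkpoint and before training.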

saran1999 changed discussion status to closed Feb 7
