ModernBert in Encoder-Decoder -> "got an unexpected keyword argument 'inputs_embeds'"
Hello,
I am trying to train an encoder-decoder model that uses ModernBERT as the encoder and GPT2 as the decoder. I had hoped this would be straightforward with the Seq2Seq classes and trainers that Hugging Face provides, but I have run into an error I have not been able to fix. This is my current setup:
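from transformers import (
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    EncoderDecoderModel,
    GPT2Tokenizer,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

tokenizer_MBert = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base", device_map='cuda:0')
model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "answerdotai/ModernBERT-base",
    "gpt2",
    pad_token_id=tokenizer_MBert.eos_token_id,
    device_map='cuda:0',
)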
model.decoder.config.use_cache = False
model.gradient_checkpointing_enable()
tokenizer_MBert.bos_token = tokenizer_MBert.cls_token
tokenizer_MBert.eos_token = tokenizer_MBert.sep_token
tokenizer_MBert.pad_token = tokenizer_MBert.unk_token
def build_inputs_with_special_tokens(self, token_ids_0, token_ids_1=None):
    outputs = [self.bos_token_id] + token_ids_0 + [self.eos_token_id]
    return outputs
GPT2Tokenizer.build_inputs_with_special_tokens = build_inputs_with_special_tokens
gpt2_tokenizer = GPT2Tokenizer.from_pretrained("gpt2", device_map = 'cuda:0')
gpt2_tokenizer.pad_token = gpt2_tokenizer.unk_token
model.config.decoder_start_token_id = gpt2_tokenizer.bos_token_id
model.config.pad_token_id = tokenizer_MBert.pad_token_id
model.config.eos_token_id = gpt2_tokenizer.eos_token_id
model.config.no_repeat_ngram_size = 3
model.early_stopping = True
model.length_penalty = 3.0
model.num_beams = 2
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer_MBert, model=model)
optimizer = 'adamw_torch'
lr_scheduler = 'linear'
training_args = Seq2SeqTrainingArguments(
    output_dir="./MBert_GPT2",
    eval_strategy="steps",
    eval_steps=2000,
    save_strategy="steps",
    save_steps=2000,
    logging_steps=100,
    max_steps=10000,
    do_eval=True,
    optim=optimizer,
    gradient_checkpointing=True,
    gradient_checkpointing_kwargs={'use_reentrant': False},
    learning_rate=2e-5,
    log_level="debug",
    per_device_train_batch_size=20,
    per_device_eval_batch_size=20,
    lr_scheduler_type=lr_scheduler,
    bf16=True,
    report_to="wandb",
    run_name="MBert_GPT2",
    seed=42,
    predict_with_generate=True,
    generation_max_length=300,
)
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset['train'],
    eval_dataset=tokenized_dataset['test'],
    tokenizer=tokenizer_MBert,
    data_collator=data_collator,
)
trainer.train()
The tokenized_dataset contains the input_ids and labels.
I am using the latest version of transformers, installed directly from the GitHub repository.
Training is started with notebook_launcher from accelerate, and it fails with this error:
TypeError: ModernBertModel.forward() got an unexpected keyword argument 'inputs_embeds'
I have looked at the ModernBERT forward code and can see that it indeed does not accept inputs_embeds as an argument. However, I was under the impression that since I am providing input_ids, no inputs_embeds should be passed during training. I am not sure whether ModernBERT is simply not meant to be used in an encoder-decoder setup or whether I have implemented it incorrectly.
I believe the issue occurs when EncoderDecoderModel computes the loss: I can generate with the EncoderDecoderModel from input_ids, but I get the error as soon as the loss is calculated.
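For anyone hitting the same message, here is a minimal sketch of how the error can arise even when only input_ids are supplied. It assumes the wrapper forwards inputs_embeds to the encoder by keyword even when its value is None; I have not verified that against every transformers version. ToyEncoder, ToyWrapper, accepts_kwarg and reproduce are hypothetical names used only for illustration and are not part of transformers.

```python
import inspect


def accepts_kwarg(fn, name):
    """Return True if `fn` can be called with the keyword argument `name`."""
    params = inspect.signature(fn).parameters
    if name in params:
        return params[name].kind is not inspect.Parameter.POSITIONAL_ONLY
    # A **kwargs parameter accepts any keyword.
    return any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values())


class ToyEncoder:
    """Hypothetical stand-in for an encoder whose forward() lacks inputs_embeds."""

    def forward(self, input_ids=None, attention_mask=None):
        return input_ids


class ToyWrapper:
    """Hypothetical stand-in for a wrapper that always forwards inputs_embeds."""

    def __init__(self, encoder):
        self.encoder = encoder

    def forward(self, input_ids=None, attention_mask=None, inputs_embeds=None):
        # inputs_embeds is None here, but it is still passed by keyword,
        # which is enough to trigger the TypeError in the encoder.
        return self.encoder.forward(
            input_ids=input_ids,
            attention_mask=attention_mask,
            inputs_embeds=inputs_embeds,
        )


def reproduce():
    """Call the wrapper with input_ids only and return the error text, if any."""
    try:
        ToyWrapper(ToyEncoder()).forward(input_ids=[1, 2, 3])
    except TypeError as err:
        return str(err)
    return None


if __name__ == "__main__":
    print(accepts_kwarg(ToyEncoder.forward, "inputs_embeds"))
    print(reproduce())
```

With the real model, the same check would presumably be accepts_kwarg(model.encoder.forward, "inputs_embeds"), which can tell you whether your installed version of the encoder accepts the argument before you start training.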
Any help would be appreciated.
Did you solve it?
@khusrav13 Yes, it is solved. There was a GitHub pull request addressing the same issue (https://github.com/huggingface/transformers/pulls?q=inputs_embeds). It was resolved earlier, and training has worked since then.
@KaranShishoo, I'm exploring whether it's possible to build a sequence-to-sequence model in the Transformers library using ModernBERT as the encoder and Llama 3.1 (8B) as the decoder. I attempted this setup a few days ago with the EncoderDecoderModel class in Hugging Face Transformers but ran into issues. Do you know of any working examples or tutorials that demonstrate how to configure this type of seq2seq pipeline? Thank you!