Performing MLM pretraining on a pretrained BERT model to use it with Sentence Transformers for semantic similarity


I have an NLP use case: computing semantic similarity between sentences that are very specific to my domain.
I want to use the Sentence Transformers library for this, since it provides state-of-the-art results for this task.

I have a BERT model specifically trained for the sBERT (Sentence-BERT) task, and I know I can fine-tune it with pairs of sentences as inputs and similarity scores as labels.
However, I would also like to continue the BERT pretraining of this model with the Masked Language Modeling (MLM) task.
Does it make sense to instantiate a BertForMaskedLM object from this model already trained for the sentence-transformer task in order to continue its pretraining, and then load it as a SentenceTransformer model to fine-tune it on sentence pairs?

I would proceed as follows, using the CamemBERT French model from Hugging Face as an example:

For the MLM part:

from transformers import CamembertTokenizer, CamembertForMaskedLM, LineByLineTextDataset, DataCollatorForLanguageModeling, Trainer, TrainingArguments


# Load the sentence-camembert checkpoint with a masked-LM head on top
tokenizer = CamembertTokenizer.from_pretrained("dangvantuan/sentence-camembert-large")
model = CamembertForMaskedLM.from_pretrained("dangvantuan/sentence-camembert-large")

# One training example per line of the raw text file
dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path=LOCAL_DATASET_PATH,
    block_size=512
)

# Dynamic masking: 15% of tokens are masked at each step
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

training_args = TrainingArguments(
    output_dir=LOCAL_MODEL_PATH,
    overwrite_output_dir=True,
    num_train_epochs=25,
    save_steps=500,
    save_total_limit=2,
    seed=1,
    auto_find_batch_size=True
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
)

trainer.train()

trainer.save_model(LOCAL_MODEL_PATH + "/my_model")
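I would probably also save the tokenizer next to the model, so both can be loaded from the same directory in the next step (assuming I reuse LOCAL_MODEL_PATH + "/my_model" as tokenizer_path below):

# Assumption: reuse the same directory for the tokenizer so that
# tokenizer_path in the SentenceTransformer step can point at it as well
tokenizer.save_pretrained(LOCAL_MODEL_PATH + "/my_model")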

To load it as a SentenceTransformer model:

from sentence_transformers import SentenceTransformer, models

word_embedding_model = models.Transformer(
    LOCAL_MODEL_PATH + "/my_model",
    tokenizer_name_or_path=tokenizer_path,
    max_seq_length=max_seq_length
)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])
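Then, to fine-tune it on my sentence pairs with similarity scores as labels, I would do something like this (rough sketch with dummy pairs; the real data would come from my own dataset):

from torch.utils.data import DataLoader
from sentence_transformers import InputExample, losses, util

# Dummy pairs for illustration only; labels are similarity scores in [0, 1]
train_examples = [
    InputExample(texts=["sentence A", "sentence B"], label=0.8),
    InputExample(texts=["sentence C", "sentence D"], label=0.2),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.CosineSimilarityLoss(model)

# Fine-tune the SentenceTransformer on (sentence pair, score) supervision
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=100
)

# Usage after fine-tuning: embed sentences and compare with cosine similarity
embeddings = model.encode(["query sentence", "candidate sentence"], convert_to_tensor=True)
score = util.cos_sim(embeddings[0], embeddings[1])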

Thanks!
