Continuing MLM pretraining on a pretrained BERT model to use it in Sentence Transformers for semantic similarity
I have an NLP use case: computing semantic similarity between sentences that are very specific to my domain.
I want to use the Sentence Transformers library for this, since it provides state-of-the-art results for this task.
I have a BERT model specifically trained for the sBERT task, and I know I can finetune it with pairs of sentences as inputs and similarity scores as labels.
However, I would also like to continue BERT pretraining on this model with the Masked Language Modeling (MLM) task.
Does it make sense to instantiate a BertForMaskedLM object from this model already trained for the sentence-transformer task in order to continue its pretraining, and then load it back as a SentenceTransformer model to finetune it on sentence pairs?
Here is how I would do it, with an example based on the CamemBERT French model from Hugging Face:
For the MLM part:
from transformers import (
    CamembertTokenizer,
    CamembertForMaskedLM,
    LineByLineTextDataset,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Start from the sentence-transformer checkpoint and continue MLM pretraining on it
tokenizer = CamembertTokenizer.from_pretrained("dangvantuan/sentence-camembert-large")
model = CamembertForMaskedLM.from_pretrained("dangvantuan/sentence-camembert-large")

dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path=LOCAL_DATASET_PATH,
    block_size=512,
)

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

training_args = TrainingArguments(
    output_dir=LOCAL_MODEL_PATH,
    overwrite_output_dir=True,
    num_train_epochs=25,
    save_steps=500,
    save_total_limit=2,
    seed=1,
    auto_find_batch_size=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
)
trainer.train()
trainer.save_model(LOCAL_MODEL_PATH + "/my_model")
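As far as I know, trainer.save_model does not save the tokenizer unless it was passed to the Trainer, so I would also save it next to the MLM weights, continuing the script above, so that the SentenceTransformer step below can point at the same folder:
# Continuation of the script above: keep the tokenizer alongside the saved MLM weights
tokenizer.save_pretrained(LOCAL_MODEL_PATH + "/my_model")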
To load it as a SentenceTransformer model:
from sentence_transformers import SentenceTransformer, models

# tokenizer_path and max_seq_length are placeholders to fill in
# (e.g. the folder where the tokenizer was saved, and 512)
word_embedding_model = models.Transformer(
    LOCAL_MODEL_PATH + "/my_model",
    tokenizer_name_or_path=tokenizer_path,
    max_seq_length=max_seq_length,
)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])
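And then, to finetune on my sentence pairs, I would do something like the following (a minimal sketch assuming float similarity labels in [0, 1] and a cosine-similarity loss; the example pairs are made up):
from torch.utils.data import DataLoader
from sentence_transformers import InputExample, losses

# Hypothetical training pairs: two sentences plus a similarity score in [0, 1]
train_examples = [
    InputExample(texts=["Le chat dort sur le canapé", "Un chat fait la sieste"], label=0.9),
    InputExample(texts=["Le chat dort sur le canapé", "Il pleut à Paris"], label=0.1),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.CosineSimilarityLoss(model)

# Finetune the freshly built SentenceTransformer on the pairs
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=100,
)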
Thanks!