---
license: mit
datasets:
- wikipedia
language:
- en
metrics:
- glue
---

# Model Card for SzegedAI/bert-medium-mlsm

This medium-sized BERT model was created using the Masked Latent Semantic Modeling (MLSM) pre-training objective, a sample-efficient alternative to classic Masked Language Modeling (MLM). During MLSM, the objective is to recover the latent semantic profile of the masked tokens rather than their exact identity. The contextualized latent semantic profile used during pre-training is determined by performing sparse coding of the hidden representations of an already pre-trained model (a base-sized BERT model in this case).

## Model Details

### Model Description

- **Developed by:** SzegedAI
- **Model type:** transformer encoder
- **Language:** English
- **License:** MIT

### Model Sources

- **Repository:** [https://github.com/szegedai/MLSM](https://github.com/szegedai/MLSM)
- **Paper:** [Masked Latent Semantic Modeling: an Efficient Pre-training Alternative to Masked Language Modeling](https://underline.io/events/395/posters/15279/poster/78046-masked-latent-semantic-modeling-an-efficient-pre-training-alternative-to-masked-language-modeling?tab=abstract+%26+voting)

## How to Get Started with the Model

The pre-trained model can be used in the usual manner. For example, to fine-tune it on a sequence classification task, load it as follows:

```
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load the tokenizer and the pre-trained encoder; the classification head
# on top of the encoder is newly initialized and must be fine-tuned.
tokenizer = AutoTokenizer.from_pretrained('SzegedAI/bert-medium-mlsm')
model = AutoModelForSequenceClassification.from_pretrained('SzegedAI/bert-medium-mlsm')
```

## Training Details

### Training Data

The model was pre-trained on a 2022 English Wikipedia dump pre-processed with [wiki-bert-pipeline](https://github.com/spyysalo/wiki-bert-pipeline).
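
For intuition about the MLSM objective described at the top of this card, the following is a minimal sketch of how a latent semantic profile could be derived from a teacher's hidden vector and used as a training target. Everything in it is an illustrative assumption rather than the authors' implementation: the function names `sparse_code`, `semantic_profile`, and `mlsm_loss`, the random dictionary `D`, the projected ISTA solver, the normalization of the sparse code into a distribution, and the cross-entropy loss. The actual method is in the linked repository.

```python
import numpy as np


def sparse_code(h, D, lam=0.1, n_iter=200):
    """Non-negative sparse code of vector h over dictionary D (rows = atoms).

    Approximately minimizes 0.5 * ||h - D.T @ a||^2 + lam * sum(a), a >= 0,
    using projected ISTA. The solver choice is an assumption for illustration.
    """
    step = 1.0 / np.linalg.norm(D @ D.T, 2)  # inverse Lipschitz constant
    a = np.zeros(D.shape[0])
    for _ in range(n_iter):
        grad = D @ (D.T @ a - h)             # gradient of the quadratic term
        a = np.maximum(a - step * (grad + lam), 0.0)
    return a


def semantic_profile(a, eps=1e-12):
    """Normalize a non-negative sparse code into a distribution (assumption)."""
    return a / (a.sum() + eps)


def mlsm_loss(student_logits, target_profile):
    """Cross-entropy between the target profile and the student's softmax."""
    z = student_logits - student_logits.max()   # numerically stable softmax
    log_q = z - np.log(np.exp(z).sum())
    return -(target_profile * log_q).sum()


# Toy example with a hypothetical dictionary of 16 unit-norm atoms in 8 dimensions.
rng = np.random.default_rng(0)
D = rng.normal(size=(16, 8))
D /= np.linalg.norm(D, axis=1, keepdims=True)

# Stand-in for the teacher's hidden vector at a masked position.
teacher_hidden = 0.7 * D[3] + 0.3 * D[10]
target = semantic_profile(sparse_code(teacher_hidden, D))

# Stand-in for the student's prediction over the 16 latent dimensions.
student_logits = rng.normal(size=16)
loss = mlsm_loss(student_logits, target)
print(target.round(3), loss)
```

In this sketch the student is trained to match a distribution over latent dictionary atoms instead of a one-hot distribution over vocabulary items, which is the difference from MLM that the description above points to.
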

### Training Procedure

#### Training Hyperparameters

Pre-training was conducted with a batch size of 32 sequences and gradient accumulation over 32 batches, resulting in an effective batch size of 1024. A total of 300,000 update steps were performed using the AdamW optimizer with a linear learning rate schedule and a peak learning rate of 1e-04. A maximum sequence length of 128 tokens was used for the first 90% of pre-training; for the final 10%, the maximum sequence length was increased to 512 tokens.

- **Training regime:** fp32

## Evaluation

#### Metrics

The model was evaluated on GLUE tasks and on CoNLL2003 for named entity recognition.

### Results

The table below reports results after fine-tuning the model on a range of tasks. For each task, 10 fine-tuning runs were performed, which differed only in the random initialization of the task-specific classification head. The average and the standard deviation over these runs are shown for each task.

| Dataset | Metric | Avg. | Std. |
|---|---|---|---|
| CoLA | Matthews correlation | 0.403 | 0.012 |
| CoNLL2003 | F1 | 0.926 | 0.003 |
| MNLI (matched) | Accuracy | 0.798 | 0.001 |
| MNLI (mismatched) | Accuracy | 0.808 | 0.002 |
| MRPC | Accuracy | 0.786 | 0.020 |
| MRPC | F1 | 0.851 | 0.013 |
| QNLI | Accuracy | 0.870 | 0.004 |
| QQP | Accuracy | 0.892 | 0.001 |
| QQP | F1 | 0.855 | 0.001 |
| RTE | Accuracy | 0.571 | 0.011 |
| SST2 | Accuracy | 0.905 | 0.004 |
| STSB | Pearson correlation | 0.818 | 0.024 |
| STSB | Spearman correlation | 0.820 | 0.021 |
| WiC | Accuracy | 0.639 | 0.007 |
| Average | --- | 0.7815 | --- |

#### Summary

This model was more sample efficient than, and reached practically the same average performance as, a base-sized language model with 2.5 times as many parameters that was pre-trained using the classical MLM objective.
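
To put the pre-training budget behind this comparison in concrete terms, the figures from the Training Hyperparameters section can be worked through as follows. This is only an arithmetic sketch: it assumes the 90%/10% split is measured in update steps and that every sequence is filled to the maximum length, so the token count is an upper bound rather than a reported number.

```python
# Values stated in the Training Hyperparameters section of this card.
BATCH_SIZE = 32            # sequences per batch
GRAD_ACCUM_BATCHES = 32    # batches accumulated per update
TOTAL_UPDATES = 300_000
SHORT_LEN, LONG_LEN = 128, 512

effective_batch = BATCH_SIZE * GRAD_ACCUM_BATCHES

# Assumption: "the first 90% of pre-training" refers to update steps.
short_steps = TOTAL_UPDATES * 9 // 10
long_steps = TOTAL_UPDATES - short_steps

sequences_seen = TOTAL_UPDATES * effective_batch

# Upper bound: assumes every sequence reaches the maximum length.
max_tokens = effective_batch * (short_steps * SHORT_LEN + long_steps * LONG_LEN)

print(f"effective batch size: {effective_batch}")
print(f"updates at length {SHORT_LEN}: {short_steps}")
print(f"updates at length {LONG_LEN}: {long_steps}")
print(f"sequences processed: {sequences_seen:,}")
print(f"token positions (upper bound): {max_tokens:,}")
```

Under these assumptions, 270,000 updates use the 128-token limit and 30,000 use the 512-token limit, for at most roughly 51 billion token positions in total.
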

## Environmental Impact

Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).

- **Hardware Type:** RTX A6000
- **Hours used:** 300
- **Carbon Emitted:** 42 kg CO2 eq.

## Citation

The pre-training objective is introduced in the ACL Findings paper _Masked Latent Semantic Modeling: an Efficient Pre-training Alternative to Masked Language Modeling_.

**BibTeX:**