mBERT Swedish distilled base model (cased)
This model is a distilled version of mBERT. It was distilled on Swedish data: the 2010-2015 portion of the Swedish Culturomics Gigaword Corpus. The code for the distillation process can be found here. The work was done as part of my Master's thesis, "Task-agnostic knowledge distillation of mBERT to Swedish".
Model description
This is a 6-layer version of mBERT. It was distilled with the LightMBERT distillation method, but without freezing the embedding layer.
Intended uses & limitations
You can use the raw model for either masked language modeling or next sentence prediction, but it's mostly intended to be fine-tuned on a downstream task.
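As a minimal sketch of the masked language modeling use, the helpers below wrap the `fill-mask` pipeline from the `transformers` library. The function names `build_fill_mask` and `top_tokens` are hypothetical, and the model ID is left as a parameter because this card does not state it. The sketch also assumes the model keeps mBERT's `[MASK]` token.

```python
from typing import Callable, List, Tuple


def build_fill_mask(model_id: str):
    """Create a fill-mask pipeline for a Hub model ID or local path.

    `model_id` is a placeholder: pass the ID under which this model is published.
    """
    # Imported lazily so that `top_tokens` can be used without transformers installed.
    from transformers import pipeline

    return pipeline("fill-mask", model=model_id)


def top_tokens(fill_mask: Callable, text: str, k: int = 3) -> List[Tuple[str, float]]:
    """Return the k most likely fillers for the single [MASK] token in `text`."""
    # For one masked position the pipeline returns a list of dicts,
    # each with (among others) the keys "token_str" and "score".
    predictions = fill_mask(text, top_k=k)
    return [(p["token_str"], round(p["score"], 3)) for p in predictions]
```

With the model ID filled in, a call such as `top_tokens(build_fill_mask(model_id), "Stockholm är Sveriges [MASK].")` would list candidate tokens with their scores. For downstream tasks, the same ID can be passed to the usual `transformers` fine-tuning classes instead.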
Training data
The data used for distillation was the 2010-2015 portion of the Swedish Culturomics Gigaword Corpus. The tokenized data had a file size of approximately 9 GB.
Evaluation results
When evaluated on the SUCX 3.0 dataset, the model achieved an average F1 score of 0.859, which is competitive with the 0.866 obtained by mBERT.
When evaluated on the English WikiANN dataset, it achieved an average F1 score of 0.826, which is competitive with the 0.849 obtained by mBERT.
Additional results and comparisons are presented in my Master's thesis.
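To make "competitive" concrete, the following small sketch computes the absolute F1 gap and the share of mBERT's score that the distilled model retains. It uses only the four scores reported above; the dictionary layout and the function name `retention` are illustrative choices, not part of the original evaluation code.

```python
# Average F1 scores as reported in this card: (distilled model, mBERT).
REPORTED_F1 = {
    "SUCX 3.0": (0.859, 0.866),
    "WikiANN (English)": (0.826, 0.849),
}


def retention(student_f1: float, teacher_f1: float) -> float:
    """Fraction of the teacher's F1 score that the student retains."""
    return student_f1 / teacher_f1


for dataset, (student, teacher) in REPORTED_F1.items():
    gap = teacher - student
    print(f"{dataset}: gap = {gap:.3f}, retained = {retention(student, teacher):.1%}")
```

On these figures the distilled model stays within a few percent of mBERT on both datasets, with the larger gap on English WikiANN.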