
mBERT Swedish distilled base model (cased)

This model is a distilled version of mBERT. It was distilled on Swedish data, specifically the 2010-2015 portion of the Swedish Culturomics Gigaword Corpus. The code for the distillation process can be found here. The work was done as part of my master's thesis, "Task-agnostic knowledge distillation of mBERT to Swedish".

Model description

This is a 6-layer version of mBERT, distilled using the LightMBERT distillation method but without freezing the embedding layer.

Intended uses & limitations

You can use the raw model for either masked language modeling or next sentence prediction, but it's mostly intended to be fine-tuned on a downstream task.
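As a rough illustration of the masked language modeling use mentioned above, here is a minimal sketch using the Hugging Face `transformers` fill-mask pipeline. It assumes the repository name `Addedk/mbert-swedish-distilled-cased` and that the published checkpoint includes masked language modeling head weights. The example sentence and the `format_predictions` and `demo` helpers are made up for illustration.

```python
MODEL_ID = "Addedk/mbert-swedish-distilled-cased"  # repository name of this model card


def format_predictions(predictions, top_k=3):
    """Reduce fill-mask pipeline output to (token, rounded score) pairs."""
    return [(p["token_str"], round(p["score"], 3)) for p in predictions[:top_k]]


def demo():
    """Download the model and print the top candidates for one masked Swedish sentence."""
    # Imported here so the helper above can be used without transformers installed.
    from transformers import pipeline

    fill_mask = pipeline("fill-mask", model=MODEL_ID)
    # Use the tokenizer's own mask token instead of hard-coding it.
    mask = fill_mask.tokenizer.mask_token
    # Hypothetical example sentence: "Stockholm is Sweden's <mask>."
    predictions = fill_mask(f"Stockholm är Sveriges {mask}.")
    for token, score in format_predictions(predictions):
        print(token, score)


# demo()  # uncomment to download the model and run the example
```

For fine-tuning, the same model ID can be passed to a task-specific class such as `AutoModelForTokenClassification.from_pretrained`, which adds a freshly initialized head on top of the distilled encoder.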

Training data

The data used for distillation was the 2010-2015 portion of the Swedish Culturomics Gigaword Corpus. The tokenized data had a file size of approximately 9 GB.

Evaluation results

When evaluated on the SUCX 3.0 dataset, the distilled model achieved an average F1 score of 0.859, which is competitive with the 0.866 obtained by mBERT.

When evaluated on the English WikiANN dataset, the distilled model achieved an average F1 score of 0.826, which is competitive with the 0.849 obtained by mBERT.
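To put "competitive" into numbers, the short sketch below computes the absolute F1 gap and the share of mBERT's score that the distilled model retains, using only the figures reported above. The `retention` helper is introduced here for illustration and is not part of the thesis code.

```python
# Average F1 scores reported in this model card: (distilled model, mBERT).
reported_f1 = {
    "SUCX 3.0": (0.859, 0.866),
    "English WikiANN": (0.826, 0.849),
}


def retention(student_f1, teacher_f1):
    """Return the share of the teacher's F1 that the student keeps, in percent."""
    return 100.0 * student_f1 / teacher_f1


for dataset, (student, teacher) in reported_f1.items():
    gap = teacher - student
    print(f"{dataset}: gap {gap:.3f} F1, retention {retention(student, teacher):.1f}%")
```

On these numbers the distilled model keeps about 99.2% of mBERT's F1 on SUCX 3.0 and about 97.3% on English WikiANN.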

Additional results and comparisons are presented in my master's thesis.

Model size: 135M params (Safetensors; tensor types I64 and F32)

Dataset used to train Addedk/mbert-swedish-distilled-cased