KB-BERT distilled base model (cased)

This model is a distilled version of KB-BERT. It was distilled on Swedish data: the 2010-2015 portion of the Swedish Culturomics Gigaword Corpus. The code for the distillation process can be found here. The work was done as part of my Master's thesis, "Task-agnostic knowledge distillation of mBERT to Swedish".

Model description

This is a 6-layer version of KB-BERT, distilled using the LightMBERT distillation method but without freezing the embedding layer.

Intended uses & limitations

You can use the raw model for either masked language modeling or next sentence prediction, but it's mostly intended to be fine-tuned on a downstream task.
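
As a rough illustration of masked language modeling with the raw model, the sketch below wraps the `fill-mask` pipeline from the `transformers` library. The names `load_fill_mask`, `top_predictions`, and `MODEL_ID` are hypothetical helpers and placeholders, not part of this repository. It also assumes the tokenizer uses the BERT-style `[MASK]` token. Substitute the model's actual hub ID or a local path.

```python
from typing import Callable, List, Tuple


def load_fill_mask(model_id: str):
    """Build a fill-mask pipeline; needs `transformers` plus a backend such as PyTorch."""
    # Imported lazily so that top_predictions below has no heavy dependencies.
    from transformers import pipeline
    return pipeline("fill-mask", model=model_id)


def top_predictions(fill_mask: Callable, text: str, k: int = 3) -> List[Tuple[str, float]]:
    """Return the k highest-scoring (token, score) pairs for a text with one mask token."""
    results = fill_mask(text)  # list of dicts with "token_str" and "score" keys
    ranked = sorted(results, key=lambda r: r["score"], reverse=True)
    return [(r["token_str"], round(r["score"], 4)) for r in ranked[:k]]


# Usage (downloads weights, so it is left commented out).
# MODEL_ID is a placeholder, not the real identifier of this model:
# MODEL_ID = "<hub-id-or-local-path-of-this-model>"
# fill_mask = load_fill_mask(MODEL_ID)
# print(top_predictions(fill_mask, "Huvudstaden i Sverige är [MASK]."))
```

For fine-tuning on a downstream task, the same model ID would typically be passed to a task-specific class instead of the pipeline.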

Training data

The data used for distillation was the 2010-2015 portion of the Swedish Culturomics Gigaword Corpus. The tokenized data had a file size of approximately 7.4 GB.

Evaluation results

When evaluated on the SUCX 3.0 dataset, the model achieved an average F1 score of 0.887, which is competitive with the 0.894 obtained by KB-BERT (a gap of 0.007).

Additional results and comparisons are presented in my Master's thesis.