SloBERTa-SlEng

SloBERTa-SlEng is a masked language model, based on the SloBERTa Slovene model.

SloBERTa-SlEng replaces the tokenizer, vocabulary, and embedding layer of the SloBERTa model. The new tokenizer and vocabulary are bilingual (Slovene-English) and are based on the conversational, non-standard, and slang language the model was trained on. They are the same as those of the SlEng-bert model. The new embedding weights were initialized from the SloBERTa embeddings.
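
This card does not say exactly how the new embedding weights were derived from the SloBERTa embeddings. The sketch below shows one common approach, purely as an assumption: tokens shared by both vocabularies copy their old vector, and tokens that exist only in the new vocabulary start from the mean of all old vectors. The function name `init_embeddings` and the toy vocabularies are hypothetical and use plain Python lists rather than real model weights.

```python
from statistics import fmean


def init_embeddings(old_vocab, old_emb, new_vocab):
    """Build an embedding table for new_vocab from an existing table.

    old_vocab: dict mapping token -> row index in old_emb
    old_emb:   list of rows, each a list of floats of equal length
    new_vocab: list of tokens; position in the list is the new row index

    ASSUMPTION: copy-if-shared, mean-otherwise. The actual procedure used
    for SloBERTa-SlEng is not described in this card.
    """
    dim = len(old_emb[0])
    # Fallback vector: per-dimension mean over all old embedding rows.
    mean_row = [fmean(row[d] for row in old_emb) for d in range(dim)]

    new_emb = []
    for token in new_vocab:
        if token in old_vocab:
            # Token exists in the old vocabulary: reuse its trained vector.
            new_emb.append(list(old_emb[old_vocab[token]]))
        else:
            # Token is new: start from the mean vector.
            new_emb.append(list(mean_row))
    return new_emb


# Toy example with a 2-dimensional embedding space (hypothetical values).
old_vocab = {"je": 0, "in": 1}
old_emb = [[1.0, 2.0], [3.0, 4.0]]
new_vocab = ["in", "the", "je"]
new_emb = init_embeddings(old_vocab, old_emb, new_vocab)
```

With real models the same idea would be applied to the embedding matrix tensor instead of nested lists; the copy-or-mean rule is the only part this sketch is meant to illustrate.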

SloBERTa-SlEng is the SloBERTa model further pre-trained for two epochs on conversational English and Slovene corpora, the same corpora used for the SlEng-bert model.
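
Since the model is a masked language model, it can be queried with the `fill-mask` pipeline from the Transformers library. The following is a minimal sketch, not an official usage recipe. The Hub identifier in the commented usage lines is an assumption (this card does not state it), and the helper names `load_fill_mask` and `top_tokens` are hypothetical.

```python
def load_fill_mask(model_id):
    """Create a fill-mask pipeline for the given model id or local path."""
    # Imported lazily so the helpers can be defined without transformers installed.
    from transformers import pipeline

    return pipeline("fill-mask", model=model_id)


def top_tokens(fill_mask, text, k=5):
    """Return up to k (token, score) pairs for the single masked position in text."""
    predictions = fill_mask(text, top_k=k)
    # Each prediction is a dict with "token_str" and "score" keys, among others.
    return [(p["token_str"].strip(), p["score"]) for p in predictions]


# Usage (downloads the model; the id below is ASSUMED, adjust as needed):
#
#   fill_mask = load_fill_mask("cjvt/sloberta-sleng")
#   mask = fill_mask.tokenizer.mask_token  # avoid hard-coding the mask string
#   for token, score in top_tokens(fill_mask, f"Danes je lep {mask}."):
#       print(token, round(score, 3))
```

Reading the mask token from the tokenizer avoids guessing whether the replaced tokenizer uses `<mask>`, `[MASK]`, or another string.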

Training corpora

The model was trained on English and Slovene tweets, the Slovene corpora MaCoCu and Frenk, and a small subset of the English Oscar corpus. We tried to keep the sizes of the English and Slovene corpora as equal as possible. In total, the training corpora contained about 2.7 billion words.
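
How the two languages were balanced is not described here. As an illustration only, the sketch below balances two document collections by word count: it takes the smaller collection's total as a budget and randomly subsamples each side until that budget is reached. The function names and the whitespace-based word count are assumptions, not the authors' procedure.

```python
import random


def count_words(documents):
    """Total number of whitespace-separated words in a list of documents."""
    return sum(len(doc.split()) for doc in documents)


def subsample_to_budget(documents, word_budget, seed=0):
    """Randomly pick documents until their total word count reaches word_budget."""
    rng = random.Random(seed)
    shuffled = list(documents)
    rng.shuffle(shuffled)

    picked, total = [], 0
    for doc in shuffled:
        if total >= word_budget:
            break
        picked.append(doc)
        total += len(doc.split())
    return picked


def balance(slovene_docs, english_docs, seed=0):
    """Subsample both sides to roughly the word count of the smaller side."""
    budget = min(count_words(slovene_docs), count_words(english_docs))
    return (
        subsample_to_budget(slovene_docs, budget, seed),
        subsample_to_budget(english_docs, budget, seed),
    )
```

Because whole documents are kept, the larger side can overshoot the budget by at most one document, which is why the result is only approximately balanced.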

Framework versions

Transformers 4.22.0.dev0
Pytorch 1.13.0a0+d321be6
Datasets 2.4.0
Tokenizers 0.12.1