---
language:
- sl
- en
- multilingual
license: cc-by-sa-4.0
---

# SlEng-bert

SlEng-bert is a bilingual, Slovene-English masked language model.

SlEng-bert was trained from scratch on Slovene and English conversational, non-standard, and slang language.
The model has 12 transformer layers and is roughly equal in size to the BERT and RoBERTa base models. The only pre-training task was masked language modeling; no additional objectives (such as next sentence prediction) were used.
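
A short usage sketch with the 🤗 Transformers `fill-mask` pipeline. The hub ID `cjvt/sleng-bert` is assumed here (inferred from the linked companion model), and the example sentence is illustrative only.

```python
from transformers import pipeline

# NOTE: "cjvt/sleng-bert" is an assumed hub ID, inferred from the companion
# cjvt/sloberta-sleng repository; adjust it if the model is published under a different name.
fill_mask = pipeline("fill-mask", model="cjvt/sleng-bert")

# Query the mask token from the tokenizer rather than hard-coding it,
# since it differs between BERT-style ([MASK]) and RoBERTa-style (<mask>) vocabularies.
mask = fill_mask.tokenizer.mask_token

for prediction in fill_mask(f"Danes je lepo {mask}."):
    print(prediction["token_str"], round(prediction["score"], 3))
```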

The tokenizer and corpora used to train SlEng-bert were also used for training the [SloBERTa-SlEng](https://huggingface.co/cjvt/sloberta-sleng) model.
The difference between the two is that SlEng-bert was trained from scratch for 40 epochs, while SloBERTa-SlEng is SloBERTa further pre-trained for 2 epochs on the new corpora.

## Training corpora

The model was trained on English and Slovene tweets, the Slovene corpora [MaCoCu](http://hdl.handle.net/11356/1517) and [Frenk](http://hdl.handle.net/11356/1201),
and a small subset of the English [Oscar](https://huggingface.co/datasets/oscar) corpus. We tried to keep the sizes of the English and Slovene corpora as equal as possible.
The training corpora contained about 2.7 billion words in total.