Pretrained model on Dhivehi language using masked language modeling (MLM).
The WordPiece tokenizer uses several components:
- Normalization: lowercase and then NFKD unicode normalization.
- Pretokenization: splits by whitespace and punctuation.
- Postprocessing: single sentences are output in format
[CLS] sentence A [SEP]and pair sentences in format
[CLS] sentence A [SEP] sentence B [SEP].
Training was performed over 16M+ Dhivehi sentences/paragraphs put together by @ashraq. An Adam optimizer with weighted decay was used with following parameters:
- Learning rate: 1e-5
- Weight decay: 0.1
- Warmup steps: 10% of data
- Downloads last month