BERT base for Dhivehi

Pretrained model on the Dhivehi language using a masked language modeling (MLM) objective.
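
A masked language model like this can be queried through the transformers fill-mask pipeline. A minimal sketch, assuming the model is hosted on the Hugging Face Hub; the repo id and the input sentence below are placeholders, not values from this card:

```python
from transformers import pipeline

# Hypothetical Hub repo id -- substitute the actual id of this model.
fill_mask = pipeline("fill-mask", model="your-org/bert-base-dv")

# Placeholder input: replace with a real Dhivehi sentence containing [MASK].
for prediction in fill_mask("... [MASK] ..."):
    print(prediction["token_str"], prediction["score"])
```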

Tokenizer

The WordPiece tokenizer uses several components (sketched in code after this list):

  • Normalization: lowercasing, followed by NFKD Unicode normalization.
  • Pretokenization: splits on whitespace and punctuation.
  • Postprocessing: single sentences are output in the format [CLS] sentence A [SEP], and sentence pairs in the format [CLS] sentence A [SEP] sentence B [SEP].
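
The same pipeline can be expressed with the tokenizers library. A minimal sketch of how these components fit together; the unknown token and the special-token ids are assumptions, not values taken from this card:

```python
from tokenizers import Tokenizer, normalizers, pre_tokenizers
from tokenizers.models import WordPiece
from tokenizers.normalizers import Lowercase, NFKD
from tokenizers.pre_tokenizers import Punctuation, Whitespace
from tokenizers.processors import TemplateProcessing

# WordPiece model; the unknown token is an assumed convention.
tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))

# Normalization: lowercase, then NFKD unicode normalization.
tokenizer.normalizer = normalizers.Sequence([Lowercase(), NFKD()])

# Pretokenization: split on whitespace and punctuation.
tokenizer.pre_tokenizer = pre_tokenizers.Sequence([Whitespace(), Punctuation()])

# Postprocessing: wrap single sentences and sentence pairs with [CLS]/[SEP].
tokenizer.post_processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B [SEP]",
    special_tokens=[("[CLS]", 1), ("[SEP]", 2)],  # assumed token ids
)
```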

Training

Training was performed on 16M+ Dhivehi sentences/paragraphs put together by @ashraq. An Adam optimizer with weight decay (AdamW) was used with the following parameters (see the sketch after this list):

  • Learning rate: 1e-5
  • Weight decay: 0.1
  • Warmup steps: 10% of total training steps
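
With the transformers Trainer, whose default optimizer is AdamW, these hyperparameters map roughly onto the following TrainingArguments. A sketch only: the batch size, epoch count, and output path are not stated in this card and are placeholders:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="bert-base-dv",       # placeholder path
    learning_rate=1e-5,              # from this card
    weight_decay=0.1,                # from this card
    warmup_ratio=0.1,                # warmup over 10% of training steps
    per_device_train_batch_size=32,  # assumption: not stated in the card
    num_train_epochs=1,              # assumption: not stated in the card
)
```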