jamescalam committed on
Commit 1abd917
1 Parent(s): a3fe765

updated readme

Files changed (1)
  1. README.md +25 -1
README.md CHANGED
@@ -1 +1,25 @@
- "Dhivehi BERT"
+ ---
+ language:
+ - dv
+ license: apache-2.0
+ ---
+
+ # BERT base for Dhivehi
+
+ A BERT base model pretrained on the Dhivehi language using masked language modeling (MLM).
+
+ ## Tokenizer
+
+ The *WordPiece* tokenizer has three components:
+
+ * **Normalization**: lowercasing followed by NFKD Unicode normalization.
+ * **Pretokenization**: splitting on whitespace and punctuation.
+ * **Postprocessing**: single sentences are output in the format `[CLS] sentence A [SEP]` and sentence pairs in the format `[CLS] sentence A [SEP] sentence B [SEP]`.
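
The three tokenizer stages can be illustrated with a minimal standard-library sketch. This is not the model's actual tokenizer: it omits the WordPiece subword step entirely, and the function names `normalize`, `pre_tokenize`, and `post_process` are hypothetical names chosen for this example. It also assumes that "punctuation" means any character in a Unicode `P*` category, which the README does not specify.

```python
import unicodedata
from typing import List, Optional


def normalize(text: str) -> str:
    """Lowercase first, then apply NFKD Unicode normalization."""
    return unicodedata.normalize("NFKD", text.lower())


def pre_tokenize(text: str) -> List[str]:
    """Split on whitespace and isolate each punctuation character.

    Assumption: punctuation = any character whose Unicode category starts with "P".
    """
    pieces: List[str] = []
    for chunk in text.split():
        current = ""
        for ch in chunk:
            if unicodedata.category(ch).startswith("P"):
                if current:
                    pieces.append(current)
                    current = ""
                pieces.append(ch)
            else:
                current += ch
        if current:
            pieces.append(current)
    return pieces


def post_process(a: List[str], b: Optional[List[str]] = None) -> List[str]:
    """Wrap one sequence, or a pair of sequences, in [CLS] / [SEP] markers."""
    out = ["[CLS]", *a, "[SEP]"]
    if b is not None:
        out += [*b, "[SEP]"]
    return out


words = pre_tokenize(normalize("Hello, World!"))
print(words)                         # ['hello', ',', 'world', '!']
print(post_process(words))           # ['[CLS]', 'hello', ',', 'world', '!', '[SEP]']
print(post_process(["a"], ["b"]))    # ['[CLS]', 'a', '[SEP]', 'b', '[SEP]']
```

Thaana, the script used to write Dhivehi, has no letter case, so the lowercasing step only affects any Latin-script text mixed into the input.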
+
+ ## Training
+
+ Training was performed on 16M+ Dhivehi sentences/paragraphs. An Adam optimizer with weight decay was used with the following parameters:
+
+ * Learning rate: 1e-5
+ * Weight decay: 0.1
+ * Warmup steps: 10% of the data
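
To make the warmup setting concrete, here is a rough sketch of how the listed parameters could translate into a step count and a learning-rate schedule. Several things here are assumptions rather than facts from the README: the batch size and epoch count are hypothetical, "10% of data" is read as 10% of all optimizer steps, the warmup is assumed to be linear from zero, and the learning rate is simply held constant afterwards because the README does not say what happens after warmup.

```python
import math

# Values stated in the README.
OPTIMIZER_CONFIG = {
    "optimizer": "Adam with weight decay",
    "learning_rate": 1e-5,
    "weight_decay": 0.1,
}
WARMUP_PERCENT = 10          # "10% of data"
NUM_SAMPLES = 16_000_000     # "16M+" sentences/paragraphs, taken as a lower bound

# Hypothetical values, not given in the README.
BATCH_SIZE = 32
NUM_EPOCHS = 1


def total_steps(num_samples: int, batch_size: int, num_epochs: int) -> int:
    """Number of optimizer steps, assuming one step per batch."""
    return math.ceil(num_samples / batch_size) * num_epochs


def warmup_steps(total: int, percent: int = WARMUP_PERCENT) -> int:
    """Warmup length as a whole-number percentage of all steps."""
    return total * percent // 100


def learning_rate(step: int, warmup: int, peak_lr: float) -> float:
    """Linear warmup from 0 to peak_lr, then constant (an assumption)."""
    if warmup > 0 and step < warmup:
        return peak_lr * step / warmup
    return peak_lr


total = total_steps(NUM_SAMPLES, BATCH_SIZE, NUM_EPOCHS)
warmup = warmup_steps(total)
print(total, warmup)  # 500000 50000
print(learning_rate(warmup, warmup, OPTIMIZER_CONFIG["learning_rate"]))  # 1e-05
```

With a different batch size or number of epochs the absolute step counts change, but the warmup stays at one tenth of the run.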