---
language: dv
---

# dv-labse

This is an experiment in cross-lingual transfer learning: it inserts Dhivehi word and
word-piece tokens into Google's LaBSE model.

- Original model weights: https://huggingface.co/setu4993/LaBSE
- Original model announcement: https://ai.googleblog.com/2020/08/language-agnostic-bert-sentence.html

This currently outperforms dv-wave and dv-MuRIL (a similar transfer learning model) on
the Maldivian News Classification task (https://github.com/Sofwath/DhivehiDatasets):

- mBERT: 52%
- dv-wave (ELECTRA): 89%
- dv-muril: 90.7%
- dv-labse: 91.3-91.5% (may continue training)

## Training

- Start with LaBSE (similar to mBERT), which has no Thaana vocabulary
- Based on PanLex dictionaries, attach 1,100 Dhivehi words to Sinhalese or English embeddings
- Add remaining words and word-pieces from dv-wave's vocabulary to vocab.txt
- Continue BERT pretraining on Dhivehi text
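
The vocabulary steps above (attaching dictionary-matched words to existing embeddings, then appending the remaining tokens) might look roughly like the sketch below. It is a toy, dependency-free illustration and not the notebook's actual code: the function `extend_embeddings`, its parameters, the tiny example vocabulary, and the choice to give unmatched tokens small random vectors (standard deviation 0.02, BERT's default initializer range) are all assumptions made for illustration.

```python
import random


def extend_embeddings(vocab, embeddings, aligned, extra_tokens, seed=0):
    """Append new tokens to a vocabulary and its embedding table.

    vocab        -- existing tokens; row i of `embeddings` belongs to vocab[i]
    aligned      -- dict: new token -> existing token whose vector it copies
                    (hypothetical stand-in for the PanLex word pairs)
    extra_tokens -- new tokens without a dictionary match; they get small
                    random vectors (an assumption, not stated in the source)
    """
    rng = random.Random(seed)
    index = {tok: i for i, tok in enumerate(vocab)}
    dim = len(embeddings[0])
    new_vocab = list(vocab)
    new_embeddings = [list(row) for row in embeddings]  # copy, leave input intact

    # Dictionary-matched words start from the vector of their translation.
    for new_tok, source_tok in aligned.items():
        if new_tok in index or source_tok not in index:
            continue  # already present, or nothing to copy from
        index[new_tok] = len(new_vocab)
        new_vocab.append(new_tok)
        new_embeddings.append(list(new_embeddings[index[source_tok]]))

    # Remaining words and word-pieces are appended with random vectors.
    for tok in extra_tokens:
        if tok in index:
            continue
        index[tok] = len(new_vocab)
        new_vocab.append(tok)
        new_embeddings.append([rng.gauss(0.0, 0.02) for _ in range(dim)])

    return new_vocab, new_embeddings


# Toy example: a 3-token vocabulary with 2-dimensional embeddings.
vocab = ["[PAD]", "water", "island"]
embeddings = [[0.0, 0.0], [1.0, 2.0], [3.0, 4.0]]
aligned = {"ފެން": "water"}   # illustrative Dhivehi/English pair
extra = ["##ން"]              # illustrative word-piece with no match

new_vocab, new_embeddings = extend_embeddings(vocab, embeddings, aligned, extra)
```

In a real model, the extended vocabulary would be written back to vocab.txt and the model's embedding matrix resized to match before pretraining continues on Dhivehi text.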

Colab notebook: https://colab.research.google.com/drive/1CUn44M2fb4Qbat2pAvjYqsPvWLt1Novi