system HF staff committed on
Commit
00d9f1a
1 Parent(s): bc74c0e

Update README.md

Files changed (1)
  1. README.md +10 -5
README.md CHANGED
@@ -3,18 +3,23 @@
  This is a second attempt at a Dhivehi language model trained with
  Google Research's [ELECTRA](https://github.com/google-research/electra).

- Tokenization and training CoLab: https://colab.research.google.com/drive/1ZJ3tU9MwyWj6UtQ-8G7QJKTn-hG1uQ9v?usp=sharing
+ Tokenization and pre-training CoLab: https://colab.research.google.com/drive/1ZJ3tU9MwyWj6UtQ-8G7QJKTn-hG1uQ9v?usp=sharing

- V1: similar performance to mBERT after 3 epochs
+ Using SimpleTransformers to classify news https://colab.research.google.com/drive/1KnyQxRNWG_yVwms_x9MUAqFQVeMecTV7?usp=sharing
+
+ V1: similar performance to mBERT on news classification task after finetuning for 3 epochs (52%)

  V2: fixed tokenizers do_lower_case=False and strip_accents=False to preserve vowel signs of Dhivehi
+ dv-wave: 89% to mBERT: 52%

  ## Corpus

- Trained on @Sofwath's 307MB corpus of Dhivehi news: https://github.com/Sofwath/DhivehiDatasets
+ Trained on @Sofwath's 307MB corpus of Dhivehi text: https://github.com/Sofwath/DhivehiDatasets
+
+ This repo also contains the news classification task CSV

- [OSCAR](https://oscar-corpus.com/) was considered; as of this writing their web crawl has 126MB
- of Dhivehi text (79MB deduped).
+ [OSCAR](https://oscar-corpus.com/) was considered but has not been added to pretraining; as of
+ this writing their web crawl has 126MB of Dhivehi text (79MB deduped).

  ## Vocabulary
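
The V2 note in the README says `strip_accents=False` was needed to preserve Dhivehi vowel signs. A minimal sketch of why that matters, assuming the tokenizer strips accents the way BERT-style tokenizers commonly do (NFD normalization, then dropping every character in Unicode category `Mn`). The helper `strip_accents` and the sample word are illustrative only and are not taken from the linked CoLab notebooks.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Hypothetical helper: NFD-normalize, then drop nonspacing marks (category Mn)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


# The word "Dhivehi" in Thaana script: three consonants, each followed by a vowel sign.
# U+078B DHAALU, U+07A8 IBIFILI, U+0788 VAAVU, U+07AC EBEFILI, U+0780 HAA, U+07A8 IBIFILI
word = "\u078b\u07a8\u0788\u07ac\u0780\u07a8"

stripped = strip_accents(word)

# Thaana vowel signs are nonspacing marks, so accent stripping removes all of them
# and leaves only the bare consonants.
print(len(word), len(stripped))
print([unicodedata.name(ch) for ch in stripped])
```

Under that assumption, accent stripping leaves only the consonants, so distinct words can collapse into the same token sequence. This is consistent with the README's report that disabling the option was the V2 fix.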