Commit 371e8c3 (1 parent: 2c47d42), committed by MoseliMotsoehli
Update README.md
README.md CHANGED
@@ -9,7 +9,7 @@ Pretrained model on the Tswana language using a masked language modeling (MLM) o
9 |   TswanaBERT is a transformer model pre-trained on a corpus of Setswana in a self-supervised fashion by masking part of the input words and training to predict the masks by using byte-level tokens.
10 |
11 |   ## Intended uses & limitations
12 | - The model can be used for either masked language modeling or next
13 |
14 |   #### How to use
15 |
@@ -45,15 +45,15 @@ The model can be used for either masked language modeling or next word predicti
45 |   ```
46 |
47 |   #### Limitations and bias
48 | - The model is trained on a relatively small collection of
49 |
50 |   ## Training data
51 |
52 |   1. The largest portion of this dataset (10k sentences of text) comes from the [Leipzig Corpora Collection](https://wortschatz.uni-leipzig.de/en/download)
53 |
54 | - 2.
55 |
56 | - 3.
57 |
58 |   * http://setswana.blogspot.com/
59 |   * https://omniglot.com/writing/tswana.php

9 |   TswanaBERT is a transformer model pre-trained on a corpus of Setswana in a self-supervised fashion by masking part of the input words and training to predict the masks by using byte-level tokens.
10 |
11 |   ## Intended uses & limitations
12 | + The model can be used for either masked language modeling or next-word prediction. It can also be fine-tuned on a specific downstream NLP application.
13 |
14 |   #### How to use
15 |

45 |   ```
46 |
47 |   #### Limitations and bias
48 | + The model is trained on a relatively small collection of Setswana text, mostly from news articles and creative writing, so it is not yet representative of the language.
49 |
50 |   ## Training data
51 |
52 |   1. The largest portion of this dataset (10k sentences of text) comes from the [Leipzig Corpora Collection](https://wortschatz.uni-leipzig.de/en/download)
53 |
54 | + 2. We then added SABC news headlines collected by Marivate Vukosi and Sefara Tshephisho (2020), which are generously made available on [Zenodo](http://doi.org/10.5281/zenodo.3668495). This added 185 Tswana sentences to our corpus.
55 |
56 | + 3. We went on to add 300 more sentences by scraping the following news sites and blogs, which mostly originate in Botswana. We actively continue to expand the dataset.
57 |
58 |   * http://setswana.blogspot.com/
59 |   * https://omniglot.com/writing/tswana.php
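
The README describes the training objective as masking part of the input and training the model to predict the masks. The masking step might look roughly like the sketch below. It is an illustration only, not TswanaBERT's actual preprocessing. The whitespace tokenization, the 15% masking rate, the `<mask>` string, the `mask_tokens` helper, and the example sentence are all assumptions made for this example. The README itself says the model works on byte-level tokens.

```python
import random

MASK = "<mask>"  # assumed mask string; a real tokenizer defines its own


def mask_tokens(tokens, rate=0.15, seed=None):
    """Return (masked_tokens, labels) for one masked language modeling example.

    labels[i] holds the original token where a mask was placed and None
    elsewhere, so a model would only be scored on the masked positions.
    """
    rng = random.Random(seed)
    if not tokens:
        return [], []
    # Mask at least one token so every non-empty example carries a training signal.
    n_to_mask = max(1, round(len(tokens) * rate))
    positions = set(rng.sample(range(len(tokens)), n_to_mask))
    masked = [MASK if i in positions else tok for i, tok in enumerate(tokens)]
    labels = [tok if i in positions else None for i, tok in enumerate(tokens)]
    return masked, labels


# Hypothetical Setswana sentence, split on whitespace purely for illustration.
sentence = "Ke rata go bala dibuka".split()
masked, labels = mask_tokens(sentence, rate=0.15, seed=0)
print(masked)
print(labels)
```

The labels keep the original token only at masked positions, which mirrors the usual practice of computing the loss on the masked tokens alone.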