Ndamulelo Nemakhavhani committed on
Commit
667902f
1 Parent(s): 804387b

Update README.md

Files changed (1)
  1. README.md +17 -1
README.md CHANGED
@@ -1,3 +1,19 @@
+ ---
+ license: cc
+ language:
+ - ve
+ - ts
+ - zu
+ - xh
+ - nso
+ - tn
+ library_name: transformers
+ tags:
+ - tshivenda
+ - low-resource
+ - masked-language-model
+ - south africa
+ ---
  # Zabantu - Exploring Multilingual Language Model training for South African Bantu Languages
  
  Zabantu ("Za" for South Africa, "bantu" for Bantu languages) is a collection of masked language models trained from scratch on a compact dataset covering several Bantu languages spoken in South Africa. The models are inspired by AfriBERTa, which demonstrated the effectiveness of training the XLM-R architecture on a smaller dataset. The focus of this work was to use language models to advance NLP applications in Tshivenda, and to serve as a benchmark for future work covering Bantu languages.
@@ -56,4 +72,4 @@ As with any language model, the generated output should be carefully reviewed an
  
  ## Training Data
  
- The models have been trained on a large corpus of text data collected from various sources, including [SADiLaR](https://repo.sadilar.org/handle/20.500.12185/7), [Leipzig Corpora Collection](https://wortschatz.uni-leipzig.de/en/download/Venda#ven_community_2017), [Flores](https://github.com/facebookresearch/flores), [CC-100](https://data.statmt.org/cc-100/), [Opus](https://opus.nlpl.eu/opus-100.php) and various South African government websites. The training data
+ The models have been trained on a large corpus of text data collected from various sources, including [SADiLaR](https://repo.sadilar.org/handle/20.500.12185/7), [Leipzig Corpora Collection](https://wortschatz.uni-leipzig.de/en/download/Venda#ven_community_2017), [Flores](https://github.com/facebookresearch/flores), [CC-100](https://data.statmt.org/cc-100/), [Opus](https://opus.nlpl.eu/opus-100.php) and various South African government websites. The training data
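
The YAML front matter added in this commit is the metadata the Hugging Face Hub indexes for search and filtering. As a minimal sketch of how that metadata is consumed (assuming standard `huggingface_hub` behaviour; the `tshivenda` filter value is taken from the front matter above, not from any Zabantu-specific documentation), models carrying these tags can be listed programmatically:

```python
# Discovery sketch: list Hub models carrying the "tshivenda" tag declared
# in the front matter above. Assumes standard huggingface_hub behaviour.
from huggingface_hub import HfApi

api = HfApi()
for model in api.list_models(filter="tshivenda"):
    print(model.id)
```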
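Since the front matter declares `library_name: transformers` and the README describes masked language models, a checkpoint would typically be queried through the fill-mask pipeline. The model id and input sentence below are placeholders for illustration, not confirmed names from this repository:

```python
# Hypothetical usage sketch for a Zabantu masked-language-model checkpoint.
# "user/zabantu-checkpoint" is a placeholder model id, not a real repo name.
from transformers import pipeline

unmasker = pipeline("fill-mask", model="user/zabantu-checkpoint")

# XLM-R-style tokenizers use <mask> as the mask token; substitute a real
# Tshivenda sentence here.
print(unmasker("A Tshivenda sentence with a <mask> token."))
```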
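The README describes training XLM-R-architecture models from scratch on this corpus. Below is a minimal sketch of what such a setup can look like with `transformers`; the tokenizer reuse, model size, and masking rate are illustrative assumptions, not the authors' actual recipe:

```python
# Illustrative from-scratch MLM setup; hyperparameters are assumptions.
from transformers import (
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    XLMRobertaConfig,
    XLMRobertaForMaskedLM,
)

# Placeholder: a real run would train a tokenizer on the Bantu corpus instead.
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

config = XLMRobertaConfig(
    vocab_size=tokenizer.vocab_size,
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
)
model = XLMRobertaForMaskedLM(config)  # randomly initialised, i.e. "from scratch"

# Dynamic masking of 15% of tokens at batch time, as in RoBERTa/XLM-R.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
```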