Ndamulelo Nemakhavhani committed on
Commit
667902f
1 Parent(s): 804387b

Update README.md

Files changed (1)
  1. README.md +17 -1
README.md CHANGED
@@ -1,3 +1,19 @@
+ ---
+ license: cc
+ language:
+ - ve
+ - ts
+ - zu
+ - xh
+ - nso
+ - tn
+ library_name: transformers
+ tags:
+ - tshivenda
+ - low-resource
+ - masked-language-model
+ - south africa
+ ---
  # Zabantu - Exploring Multilingual Language Model training for South African Bantu Languages
  
  Zabantu ("Za" for South Africa, "bantu" for Bantu languages) is a collection of masked language models trained from scratch on a compact dataset covering several Bantu languages spoken in South Africa. The models are inspired by AfriBERTa, which demonstrated the effectiveness of training the XLM-R architecture on a smaller dataset. The focus of this work was to use language models to advance NLP applications in Tshivenda, and to serve as a benchmark for future work covering Bantu languages.
@@ -56,4 +72,4 @@ As with any language model, the generated output should be carefully reviewed an
  
  ## Training Data
  
- The models have been trained on a large corpus of text data collected from various sources, including [SADiLaR](https://repo.sadilar.org/handle/20.500.12185/7), [Leipzig Corpora Collection](https://wortschatz.uni-leipzig.de/en/download/Venda#ven_community_2017), [Flores](https://github.com/facebookresearch/flores), [CC-100](https://data.statmt.org/cc-100/), [Opus](https://opus.nlpl.eu/opus-100.php) and various South African government websites. The training data
+ The models have been trained on a large corpus of text data collected from various sources, including [SADiLaR](https://repo.sadilar.org/handle/20.500.12185/7), [Leipzig Corpora Collection](https://wortschatz.uni-leipzig.de/en/download/Venda#ven_community_2017), [Flores](https://github.com/facebookresearch/flores), [CC-100](https://data.statmt.org/cc-100/), [Opus](https://opus.nlpl.eu/opus-100.php) and various South African government websites. The training data
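
The YAML front matter added in this commit is the metadata the Hugging Face Hub indexes for search and filtering. As a minimal sketch of how that metadata is consumed (assuming standard `huggingface_hub` behaviour; the `tshivenda` filter value is taken from the front matter above, not from any Zabantu-specific documentation), models carrying these tags can be listed programmatically:

```python
# Discovery sketch: list Hub models carrying the "tshivenda" tag declared
# in the front matter above. Assumes standard huggingface_hub behaviour.
from huggingface_hub import HfApi

api = HfApi()
for model in api.list_models(filter="tshivenda"):
    print(model.id)
```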
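Since the front matter declares `library_name: transformers` and the README describes masked language models, a checkpoint would typically be queried through the fill-mask pipeline. The model id and input sentence below are placeholders for illustration, not confirmed names from this repository:

```python
# Hypothetical usage sketch for a Zabantu masked-language-model checkpoint.
# "user/zabantu-checkpoint" is a placeholder model id, not a real repo name.
from transformers import pipeline

unmasker = pipeline("fill-mask", model="user/zabantu-checkpoint")

# XLM-R-style tokenizers use <mask> as the mask token; substitute a real
# Tshivenda sentence here.
print(unmasker("A Tshivenda sentence with a <mask> token."))
```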
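The README describes training XLM-R-architecture models from scratch on this corpus. Below is a minimal sketch of what such a setup can look like with `transformers`; the tokenizer reuse, model size, and masking rate are illustrative assumptions, not the authors' actual recipe:

```python
# Illustrative from-scratch MLM setup; hyperparameters are assumptions.
from transformers import (
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    XLMRobertaConfig,
    XLMRobertaForMaskedLM,
)

# Placeholder: a real run would train a tokenizer on the Bantu corpus instead.
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

config = XLMRobertaConfig(
    vocab_size=tokenizer.vocab_size,
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
)
model = XLMRobertaForMaskedLM(config)  # randomly initialised, i.e. "from scratch"

# Dynamic masking of 15% of tokens at batch time, as in RoBERTa/XLM-R.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
```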