Ndamulelo Nemakhavhani committed
Commit c0d385d
1 Parent(s): 667902f

Update README.md

Files changed (1)
  1. README.md +25 -15
README.md CHANGED
@@ -14,28 +14,32 @@ tags:
 - masked-language-model
 - south africa
 ---
-# Zabantu - Exploring Multilingual Language Model Training for South African Bantu Languages
-
-Zabantu ("Za" for South Africa, "bantu" for Bantu languages) is a collection of masked language models trained from scratch on a compact dataset spanning several Bantu languages spoken in South Africa. The models are inspired by AfriBERTa, which demonstrated the effectiveness of training the XLM-R architecture on a smaller dataset. The focus of this work is to advance Tshivenda NLP applications and to serve as a benchmark for future work on Bantu languages.
 
-## Model Overview
+# Zabantu - Exploring Multilingual Language Model Training for South African Bantu Languages
 
-This model card provides an overview of the multilingual language models developed for South African languages, with a specific focus on advancing Tshivenda natural language processing (NLP) coverage. Zabantu-XLMR refers to a family of models trained on different combinations of South African Bantu languages. These include:
+> Zabantu ("Za" for South Africa, "bantu" for Bantu languages) is a collection of masked language models trained from scratch on a compact dataset spanning several Bantu languages spoken in South Africa. The models are inspired by AfriBERTa, which demonstrated the effectiveness of training the XLM-R architecture on a smaller dataset. The focus of this work is to advance Tshivenda NLP applications and to serve as a benchmark for future work on Bantu languages.
 
-- Zabantu-VEN: A monolingual language model trained on 73k raw sentences in Tshivenda
-- Zabantu-NSO: A monolingual language model trained on 179k raw sentences in Sepedi
-- Zabantu-NSO+VEN: A bilingual language model trained on 179k raw sentences in Sepedi and 73k sentences in Tshivenda
-- Zabantu-SOT+VEN: A multilingual language model trained on 479k raw sentences from Sesotho, Sepedi, Setswana, and Tshivenda
-- Zabantu-BANTU: A multilingual language model trained on 1.4M raw sentences from 9 South African Bantu languages
 
-## Model Details
+# Model Details
 
-- **Model Name:** Zabantu-XLMR
-- **Model Version:** 1.0.0
+- **Model Name:** Zabantu-XLM-Roberta
+- **Model Version:** 0.0.1
 - **Model Architecture:** [XLM-RoBERTa architecture](https://arxiv.org/abs/1911.02116)
 - **Model Size:** 80 - 250 million parameters
 - **Language Support:** Tshivenda, Nguni languages (Zulu, Xhosa, Swati), Sotho languages (Northern Sotho, Southern Sotho, Setswana), and Xitsonga.
 
+
+## Model Variants
+
+This model card provides an overview of the multilingual language models developed for South African languages, with a specific focus on advancing Tshivenda natural language processing (NLP) coverage. Zabantu-XLMR refers to a family of models trained on different combinations of South African Bantu languages. These include:
+
+- [Zabantu-VEN](https://huggingface.co/dsfsi/zabantu-ven-120m): A monolingual language model trained on 73k raw sentences in Tshivenda
+- [Zabantu-NSO](https://huggingface.co/dsfsi/zabantu-nso-80m): A monolingual language model trained on 179k raw sentences in Sepedi
+- [Zabantu-NSO+VEN](https://huggingface.co/dsfsi/zabantu-nso-ven-170m): A bilingual language model trained on 179k raw sentences in Sepedi and 73k sentences in Tshivenda
+- [Zabantu-SOT+VEN](https://huggingface.co/dsfsi/zabantu-sot-ven-170m): A multilingual language model trained on 479k raw sentences from Sesotho, Sepedi, Setswana, and Tshivenda
+- [Zabantu-BANTU](https://huggingface.co/dsfsi/zabantu-bantu-250m): A multilingual language model trained on 1.4M raw sentences from 9 South African Bantu languages
+
+
 ## Intended Use
 
 The Zabantu models are intended to be used for various NLP tasks involving Tshivenda and related South African languages. In addition, the model can be fine-tuned on a variety of downstream tasks, such as:
@@ -70,6 +74,12 @@ We also acknowledge the potential to further improve the model by training it on
 
 As with any language model, the generated output should be carefully reviewed and post-processed to ensure accuracy and cultural sensitivity.
 
-## Training Data
+# Training Data
+
+The models have been trained on a large corpus of text data collected from various sources, including [SADiLaR](https://repo.sadilar.org/handle/20.500.12185/7), the [Leipzig Corpora Collection](https://wortschatz.uni-leipzig.de/en/download/Venda#ven_community_2017), [Flores](https://github.com/facebookresearch/flores), [CC-100](https://data.statmt.org/cc-100/), [Opus](https://opus.nlpl.eu/opus-100.php), and various South African government websites. The training data covers a wide range of topics and domains, notably religion, politics, academia, and health (mostly COVID-19).
+
+<hr/>
+
+# Closing Remarks
 
-The models have been trained on a large corpus of text data collected from various sources, including [SADiLaR](https://repo.sadilar.org/handle/20.500.12185/7), the [Leipzig Corpora Collection](https://wortschatz.uni-leipzig.de/en/download/Venda#ven_community_2017), [Flores](https://github.com/facebookresearch/flores), [CC-100](https://data.statmt.org/cc-100/), [Opus](https://opus.nlpl.eu/opus-100.php), and various South African government websites. The training data
+The Zabantu models provide a valuable resource for advancing Tshivenda NLP coverage and promoting cross-lingual learning techniques for South African languages. They have the potential to enhance various NLP applications, foster linguistic diversity, and contribute to the development of language technologies in the South African context.
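To make the masked-language-model usage concrete, here is a minimal sketch (an editorial addition, not part of the commit above). It assumes the checkpoints linked in the diff load through the standard `transformers` fill-mask pipeline, as is typical for XLM-RoBERTa models, and it uses an English placeholder sentence rather than real Tshivenda text:

```python
from transformers import pipeline

# Load one of the Zabantu checkpoints linked in the README diff above;
# any of the other dsfsi/zabantu-* variants can be substituted.
unmasker = pipeline("fill-mask", model="dsfsi/zabantu-ven-120m")

# XLM-RoBERTa tokenizers use "<mask>" as the mask token. The sentence
# below is a placeholder; replace it with a Tshivenda sentence that
# contains exactly one <mask> token.
for prediction in unmasker("The weather today is <mask>."):
    print(f"{prediction['token_str']!r}  score={prediction['score']:.4f}")
```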
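The Intended Use section notes that the models can be fine-tuned on downstream tasks. The sketch below is a hypothetical starting point for one such task, assuming a sentence-classification setup with three labels; the checkpoint choice, label count, and input text are illustrative only:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "dsfsi/zabantu-nso-ven-170m"  # any variant from the list above
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Wrap the pretrained encoder with a freshly initialised classification
# head; the head is random until fine-tuned on labelled task data.
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=3)

batch = tokenizer(["placeholder Sepedi or Tshivenda sentence"],
                  return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**batch).logits
print(logits.shape)  # torch.Size([1, 3])
```

From here, standard `Trainer`-based fine-tuning applies as for any XLM-R checkpoint.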