Ndamulelo Nemakhavhani committed on
Commit
804387b
1 Parent(s): 5962540

Adding model card


More information on the model is available in this presentation on YouTube:

Files changed (1)
  1. README.md +59 -0
README.md ADDED
@@ -0,0 +1,59 @@
# Zabantu - Exploring Multilingual Language Model Training for South African Bantu Languages

Zabantu ("Za" for South Africa, "bantu" for Bantu languages) is a collection of masked language models trained from scratch on a compact dataset comprising various subsets of the Bantu languages spoken in South Africa. These models are inspired by AfriBERTa, which demonstrated the effectiveness of training the XLM-R architecture on a smaller dataset. The focus of this work is to use these models to advance NLP applications in Tshivenda and to serve as a benchmark for future work covering Bantu languages.

## Model Overview

This model card provides an overview of the multilingual language models developed for South African languages, with a specific focus on advancing Tshivenda natural language processing (NLP) coverage. Zabantu-XLMR refers to a fleet of models trained on different combinations of South African Bantu languages. These include:

- Zabantu-VEN: a monolingual language model trained on 73k raw sentences in Tshivenda
- Zabantu-NSO: a monolingual language model trained on 179k raw sentences in Sepedi
- Zabantu-NSO+VEN: a bilingual language model trained on 179k raw sentences in Sepedi and 73k sentences in Tshivenda
- Zabantu-SOT+VEN: a multilingual language model trained on 479k raw sentences from Sesotho, Sepedi, Setswana, and Tshivenda
- Zabantu-BANTU: a multilingual language model trained on 1.4M raw sentences from nine South African Bantu languages

## Model Details

- **Model Name:** Zabantu-XLMR
- **Model Version:** 1.0.0
- **Model Architecture:** [XLM-RoBERTa architecture](https://arxiv.org/abs/1911.02116)
- **Model Size:** 80-250 million parameters
- **Language Support:** Tshivenda, Nguni languages (Zulu, Xhosa, Swati), Sotho languages (Northern Sotho, Southern Sotho, Setswana), and Xitsonga
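
Since the Zabantu models are XLM-RoBERTa-style masked language models, they can be queried directly with the `fill-mask` pipeline. A minimal sketch follows; the checkpoint identifier and the example sentence are placeholders, so substitute the published name of the Zabantu variant you want to use:

```python
from transformers import pipeline

# "zabantu-xlm-roberta" is a placeholder id -- replace it with the actual
# Zabantu checkpoint name on the Hugging Face Hub.
unmasker = pipeline("fill-mask", model="zabantu-xlm-roberta")

# XLM-RoBERTa tokenizers use "<mask>" as the mask token.
predictions = unmasker("Tshivenda ndi luambo lwa <mask>.")  # illustrative sentence
for p in predictions:
    print(p["token_str"], round(p["score"], 4))
```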

## Intended Use

The Zabantu models are intended for various NLP tasks involving Tshivenda and related South African languages. In addition, they can be fine-tuned on a variety of downstream tasks (a fine-tuning sketch follows this list), such as:

- Text classification and sentiment analysis in Tshivenda and related languages.
- Named Entity Recognition (NER) for identifying entities in Tshivenda text.
- Machine translation between Tshivenda and other South African languages.
- Cross-lingual document retrieval and question answering.
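
Below is a hedged fine-tuning sketch for the first task (text classification) using the standard Hugging Face `Trainer` API. The checkpoint id, data files, column names, and label count are assumptions for illustration, not part of the Zabantu release:

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

checkpoint = "zabantu-xlm-roberta"  # placeholder -- use the actual checkpoint name
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=5)

# Assumed local CSV files with "text" and "label" columns.
dataset = load_dataset("csv", data_files={"train": "train.csv", "test": "test.csv"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=256)

dataset = dataset.map(tokenize, batched=True)

args = TrainingArguments(output_dir="zabantu-news-topic",
                         num_train_epochs=3,
                         per_device_train_batch_size=16)
trainer = Trainer(model=model, args=args,
                  train_dataset=dataset["train"],
                  eval_dataset=dataset["test"],
                  tokenizer=tokenizer)  # enables dynamic padding via the default collator
trainer.train()
```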

## Performance and Limitations

- **Performance:** The Zabantu models demonstrate promising performance on various NLP tasks, including news topic classification, with competitive results against similar pre-trained cross-lingual models such as [AfriBERTa](https://huggingface.co/castorini/afriberta_base) and [AfroXLMR](https://huggingface.co/Davlan/afro-xlmr-base).

**Monolingual test F1 scores on News Topic Classification**

| Weighted F1 [%] | Afriberta-large | Afroxlmr | zabantu-nsoven | zabantu-sotven | zabantu-bantu |
|-----------------|-----------------|----------|----------------|----------------|---------------|
| nso             | 71.4            | 71.6     | 74.3           | 69             | 70.6          |
| ven             | 74.3            | 74.1     | 77             | 76             | 75.6          |

**Few-shot (50 shots) test F1 scores on News Topic Classification**

| Weighted F1 [%] | Afriberta | Afroxlmr | zabantu-nsoven | zabantu-sotven | zabantu-bantu |
|-----------------|-----------|----------|----------------|----------------|---------------|
| ven             | 60        | 62       | 66             | 69             | 55            |
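
The tables report weighted F1, i.e. per-class F1 averaged with weights proportional to class support. A toy illustration of the metric (not the actual evaluation code) using scikit-learn:

```python
from sklearn.metrics import f1_score

# Toy labels: weighted F1 averages the per-class F1 scores, weighting each
# class by how often it appears in the reference labels.
y_true = [0, 0, 0, 1, 2, 2]
y_pred = [0, 1, 0, 1, 2, 0]
print(f"weighted F1: {f1_score(y_true, y_pred, average='weighted'):.3f}")
```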

- **Limitations:**

  Although efforts have been made to include a wide range of South African languages, the models' coverage may still be limited for certain dialects. We note that the training set was largely dominated by Setswana and isiXhosa.

  We also acknowledge the potential to further improve the models by training on more data, including additional domains and topics.

  As with any language model, the generated output should be carefully reviewed and post-processed to ensure accuracy and cultural sensitivity.

## Training Data

The models have been trained on a large corpus of text data collected from various sources, including [SADiLaR](https://repo.sadilar.org/handle/20.500.12185/7), the [Leipzig Corpora Collection](https://wortschatz.uni-leipzig.de/en/download/Venda#ven_community_2017), [Flores](https://github.com/facebookresearch/flores), [CC-100](https://data.statmt.org/cc-100/), [Opus](https://opus.nlpl.eu/opus-100.php), and various South African government websites. The training data