Ndamulelo Nemakhavhani committed on
Commit
804387b
1 Parent(s): 5962540

Adding model card


More information on the model is available in this presentation on YouTube:

Files changed (1)
  1. README.md +59 -0
README.md ADDED
@@ -0,0 +1,59 @@
# Zabantu - Exploring Multilingual Language Model Training for South African Bantu Languages

Zabantu ("Za" for South Africa, "bantu" for Bantu languages) is a collection of masked language models trained from scratch on a compact dataset comprising various subsets of the Bantu languages spoken in South Africa. These models are inspired by AfriBERTa, which demonstrated the effectiveness of training the XLM-R architecture on a smaller dataset. The focus of this work is to use these models to advance NLP applications in Tshivenda and to serve as a benchmark for future work covering Bantu languages.

## Model Overview

This model card provides an overview of the multilingual language models developed for South African languages, with a specific focus on advancing Tshivenda natural language processing (NLP) coverage. Zabantu-XLMR refers to a fleet of models trained on different combinations of South African Bantu languages. These include:

- Zabantu-VEN: a monolingual language model trained on 73k raw sentences in Tshivenda
- Zabantu-NSO: a monolingual language model trained on 179k raw sentences in Sepedi
- Zabantu-NSO+VEN: a bilingual language model trained on 179k raw sentences in Sepedi and 73k sentences in Tshivenda
- Zabantu-SOT+VEN: a multilingual language model trained on 479k raw sentences from Sesotho, Sepedi, Setswana, and Tshivenda
- Zabantu-BANTU: a multilingual language model trained on 1.4M raw sentences from nine South African Bantu languages

## Model Details

- **Model Name:** Zabantu-XLMR
- **Model Version:** 1.0.0
- **Model Architecture:** [XLM-RoBERTa architecture](https://arxiv.org/abs/1911.02116)
- **Model Size:** 80-250 million parameters
- **Language Support:** Tshivenda, Nguni languages (Zulu, Xhosa, Swati), Sotho languages (Northern Sotho, Southern Sotho, Setswana), and Xitsonga
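
Since the Zabantu models are XLM-RoBERTa-style masked language models, they can be queried directly with the `fill-mask` pipeline. A minimal sketch follows; the checkpoint identifier and the example sentence are placeholders, so substitute the published name of the Zabantu variant you want to use:

```python
from transformers import pipeline

# "zabantu-xlm-roberta" is a placeholder id -- replace it with the actual
# Zabantu checkpoint name on the Hugging Face Hub.
unmasker = pipeline("fill-mask", model="zabantu-xlm-roberta")

# XLM-RoBERTa tokenizers use "<mask>" as the mask token.
predictions = unmasker("Tshivenda ndi luambo lwa <mask>.")  # illustrative sentence
for p in predictions:
    print(p["token_str"], round(p["score"], 4))
```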

## Intended Use

The Zabantu models are intended for various NLP tasks involving Tshivenda and related South African languages. In addition, they can be fine-tuned on a variety of downstream tasks (a fine-tuning sketch follows this list), such as:

- Text classification and sentiment analysis in Tshivenda and related languages.
- Named Entity Recognition (NER) for identifying entities in Tshivenda text.
- Machine translation between Tshivenda and other South African languages.
- Cross-lingual document retrieval and question answering.
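
Below is a hedged fine-tuning sketch for the first task (text classification) using the standard Hugging Face `Trainer` API. The checkpoint id, data files, column names, and label count are assumptions for illustration, not part of the Zabantu release:

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

checkpoint = "zabantu-xlm-roberta"  # placeholder -- use the actual checkpoint name
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=5)

# Assumed local CSV files with "text" and "label" columns.
dataset = load_dataset("csv", data_files={"train": "train.csv", "test": "test.csv"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=256)

dataset = dataset.map(tokenize, batched=True)

args = TrainingArguments(output_dir="zabantu-news-topic",
                         num_train_epochs=3,
                         per_device_train_batch_size=16)
trainer = Trainer(model=model, args=args,
                  train_dataset=dataset["train"],
                  eval_dataset=dataset["test"],
                  tokenizer=tokenizer)  # enables dynamic padding via the default collator
trainer.train()
```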

## Performance and Limitations

- **Performance:** The Zabantu models demonstrate promising performance on various NLP tasks, including news topic classification, with competitive results against similar pre-trained cross-lingual models such as [AfriBERTa](https://huggingface.co/castorini/afriberta_base) and [AfroXLMR](https://huggingface.co/Davlan/afro-xlmr-base).

**Monolingual test F1 scores on News Topic Classification**

| Weighted F1 [%] | Afriberta-large | Afroxlmr | zabantu-nsoven | zabantu-sotven | zabantu-bantu |
|-----------------|-----------------|----------|----------------|----------------|---------------|
| nso             | 71.4            | 71.6     | 74.3           | 69             | 70.6          |
| ven             | 74.3            | 74.1     | 77             | 76             | 75.6          |

**Few-shot (50 shots) test F1 scores on News Topic Classification**

| Weighted F1 [%] | Afriberta | Afroxlmr | zabantu-nsoven | zabantu-sotven | zabantu-bantu |
|-----------------|-----------|----------|----------------|----------------|---------------|
| ven             | 60        | 62       | 66             | 69             | 55            |
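
The tables report weighted F1, i.e. per-class F1 averaged with weights proportional to class support. A toy illustration of the metric (not the actual evaluation code) using scikit-learn:

```python
from sklearn.metrics import f1_score

# Toy labels: weighted F1 averages the per-class F1 scores, weighting each
# class by how often it appears in the reference labels.
y_true = [0, 0, 0, 1, 2, 2]
y_pred = [0, 1, 0, 1, 2, 0]
print(f"weighted F1: {f1_score(y_true, y_pred, average='weighted'):.3f}")
```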

- **Limitations:**

  Although efforts have been made to include a wide range of South African languages, the models' coverage may still be limited for certain dialects. We note that the training set was largely dominated by Setswana and isiXhosa.

  We also acknowledge the potential to further improve the models by training on more data, including additional domains and topics.

  As with any language model, the generated output should be carefully reviewed and post-processed to ensure accuracy and cultural sensitivity.

## Training Data

The models have been trained on a large corpus of text data collected from various sources, including [SADiLaR](https://repo.sadilar.org/handle/20.500.12185/7), the [Leipzig Corpora Collection](https://wortschatz.uni-leipzig.de/en/download/Venda#ven_community_2017), [Flores](https://github.com/facebookresearch/flores), [CC-100](https://data.statmt.org/cc-100/), [Opus](https://opus.nlpl.eu/opus-100.php), and various South African government websites. The training data