--- annotations_creators: - crowdsourced language: - amh - orm - lin - hau - ibo - kin - lug - luo - pcm - swa - wol - yor - bam - bbj - ewe - fon - mos - nya - sna - tsn - twi - xho - zul language_creators: - crowdsourced license: - cc-by-4.0 multilinguality: - monolingual pretty_name: afrolm-dataset size_categories: - 1M YOSM) | |:---: |:---: |:---: | :---: |:---: | :---: | `AfroLM-Large` | **80.13** | **83.26** | **82.90/91.00** | **85.40** | **68.70** | `AfriBERTa` | 79.10 | 81.31 | 83.22/90.86 | 82.70 | 65.90 | `mBERT` | 71.55 | 80.68 | --- | --- | --- | `XLMR-base` | 79.16 | 83.09 | --- | --- | --- | `AfroXLMR-base` | `81.90` | `84.55` | --- | --- | --- | - (*) The evaluation was made on the 11 additional languages of the dataset. - Bold numbers represent the performance of the model with the **smallest pretrained data**. ## Pretrained Models and Dataset **Models:**: [AfroLM-Large](https://huggingface.co/bonadossou/afrolm_active_learning) and **Dataset**: [AfroLM Dataset](https://huggingface.co/datasets/bonadossou/afrolm_active_learning_dataset) ## HuggingFace usage of AfroLM-large ```python from transformers import XLMRobertaModel, XLMRobertaTokenizer model = XLMRobertaModel.from_pretrained("bonadossou/afrolm_active_learning") tokenizer = XLMRobertaTokenizer.from_pretrained("bonadossou/afrolm_active_learning") tokenizer.model_max_length = 256 ``` `Autotokenizer` class does not successfully load our tokenizer. So we recommend to use directly the `XLMRobertaTokenizer` class. Depending on your task, you will load the according mode of the model. Read the [XLMRoberta Documentation](https://huggingface.co/docs/transformers/model_doc/xlm-roberta) ## Reproducing our result: Training and Evaluation - To train the network, run `python active_learning.py`. You can also wrap it around a `bash` script. - For the evaluation: - NER Classification: `bash ner_experiments.sh` - Text Classification & Sentiment Analysis: `bash text_classification_all.sh` ## Citation ``@inproceedings{dossou-etal-2022-afrolm, title = "{A}fro{LM}: A Self-Active Learning-based Multilingual Pretrained Language Model for 23 {A}frican Languages", author = "Dossou, Bonaventure F. P. and Tonja, Atnafu Lambebo and Yousuf, Oreen and Osei, Salomey and Oppong, Abigail and Shode, Iyanuoluwa and Awoyomi, Oluwabusayo Olufunke and Emezue, Chris", booktitle = "Proceedings of The Third Workshop on Simple and Efficient Natural Language Processing (SustaiNLP)", month = dec, year = "2022", address = "Abu Dhabi, United Arab Emirates (Hybrid)", publisher = "Association for Computational Linguistics", url = "https://aclanthology.org/2022.sustainlp-1.11", pages = "52--64"}`` We will share the official proceeding citation as soon as possible. Stay tuned, and if you have liked our work, give it a star. ## Reach out Do you have a question? Please create an issue and we will reach out as soon as possible