---
language:
- amh
- orm
- lin
- hau
- ibo
- kin
- lug
- luo
- pcm
- swa
- wol
- yor
- bam
- bbj
- ewe
- fon
- mos
- nya
- sna
- tsn
- twi
- xho
- zul
- multilingual
license:
- cc-by-4.0
tags:
- afrolm
- active learning
- language modeling
- research papers
- natural language processing
- self-active learning
annotations_creators:
- crowdsourced
language_creators:
- crowdsourced
multilinguality:
- multilingual
pretty_name: afrolm-dataset
size_categories:
- 1M<n<10M
---

# AfroLM: A Self-Active Learning-based Multilingual Pretrained Language Model for 23 African Languages

AfroLM is a multilingual language model pretrained from scratch on 23 African languages with a self-active learning framework ([arXiv:2211.03263](https://arxiv.org/abs/2211.03263)).

## Evaluation Results

| Model | MasakhaNER | MasakhaNER2.0 (*) | Text Classification (Yoruba/Hausa) | Sentiment Analysis (YOSM) | OOD Sentiment Analysis (Twitter -> YOSM) |
|:---: |:---: |:---: | :---: |:---: | :---: |
| `AfroLM-Large` | **80.13** | **83.26** | **82.90/91.00** | **85.40** | **68.70** |
| `AfriBERTa` | 79.10 | 81.31 | 83.22/90.86 | 82.70 | 65.90 |
| `mBERT` | 71.55 | 80.68 | --- | --- | --- |
| `XLMR-base` | 79.16 | 83.09 | --- | --- | --- |
| `AfroXLMR-base` | `81.90` | `84.55` | --- | --- | --- |

- (*) The evaluation was performed on the 11 additional languages of the dataset.
- Bold numbers represent the performance of the model pretrained on the **smallest amount of data**.

## Pretrained Models and Dataset

**Model:** [AfroLM-Large](https://huggingface.co/bonadossou/afrolm_active_learning) and **Dataset:** [AfroLM Dataset](https://huggingface.co/datasets/bonadossou/afrolm_active_learning_dataset)

## HuggingFace usage of AfroLM-Large

```python
from transformers import XLMRobertaModel, XLMRobertaTokenizer

model = XLMRobertaModel.from_pretrained("bonadossou/afrolm_active_learning")
tokenizer = XLMRobertaTokenizer.from_pretrained("bonadossou/afrolm_active_learning")
tokenizer.model_max_length = 256
```

The `AutoTokenizer` class does not load our tokenizer correctly, so we recommend using the `XLMRobertaTokenizer` class directly. Depending on your task, load the corresponding variant of the model; see the [XLM-RoBERTa documentation](https://huggingface.co/docs/transformers/model_doc/xlm-roberta). Two short usage sketches are included at the end of this card.

## Reproducing our results: Training and Evaluation

- To train the network, run `python active_learning.py`. You can also wrap it in a `bash` script.
- For the evaluation:
  - NER Classification: `bash ner_experiments.sh`
  - Text Classification & Sentiment Analysis: `bash text_classification_all.sh`

## Citation

**Arxiv citation:**

```bibtex
@misc{dossou2022afrolm,
  title={AfroLM: A Self-Active Learning-based Multilingual Pretrained Language Model for 23 African Languages},
  author={Bonaventure F. P. Dossou and Atnafu Lambebo Tonja and Oreen Yousuf and Salomey Osei and Abigail Oppong and Iyanuoluwa Shode and Oluwabusayo Olufunke Awoyomi and Chris Chinenye Emezue},
  year={2022},
  eprint={2211.03263},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```

We will share the official proceedings citation as soon as possible. Stay tuned, and if you like our work, give it a star.

## Reach out

Do you have a question? Please create an issue, and we will get back to you as soon as possible.
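
## Usage sketches

Below is a minimal, self-contained sketch of embedding a sentence with the classes recommended above. The Swahili example sentence and the mean-pooling step are illustrative assumptions, not part of the official pipeline.

```python
import torch
from transformers import XLMRobertaModel, XLMRobertaTokenizer

# Load AfroLM-Large with the XLM-R classes recommended above
# (AutoTokenizer does not load this tokenizer correctly).
model = XLMRobertaModel.from_pretrained("bonadossou/afrolm_active_learning")
tokenizer = XLMRobertaTokenizer.from_pretrained("bonadossou/afrolm_active_learning")
tokenizer.model_max_length = 256

# Illustrative Swahili input; any of the 23 covered languages works.
sentence = "Ninapenda lugha za Kiafrika."
inputs = tokenizer(sentence, return_tensors="pt", truncation=True)

model.eval()
with torch.no_grad():
    outputs = model(**inputs)

# Mean-pool the final hidden states into a single sentence vector
# (a common heuristic; the paper does not prescribe this step).
embedding = outputs.last_hidden_state.mean(dim=1)
print(embedding.shape)  # torch.Size([1, hidden_size])
```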
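
Similarly, since the model class depends on the downstream task, a token-classification head (e.g., for MasakhaNER-style NER fine-tuning) might be loaded as below. The `num_labels` value is a placeholder assumption; set it to the size of your label set.

```python
from transformers import XLMRobertaForTokenClassification, XLMRobertaTokenizer

# Placeholder label count for an NER tag set; adjust to your dataset.
model = XLMRobertaForTokenClassification.from_pretrained(
    "bonadossou/afrolm_active_learning", num_labels=9
)
tokenizer = XLMRobertaTokenizer.from_pretrained("bonadossou/afrolm_active_learning")
```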