---
annotations_creators:
- crowdsourced
language:
- amh
- orm
- lin
- hau
- ibo
- kin
- lug
- luo
- pcm
- swa
- wol
- yor
- bam
- bbj
- ewe
- fon
- mos
- nya
- sna
- tsn
- twi
- xho
- zul
language_creators:
- crowdsourced
license:
- cc-by-4.0
multilinguality:
- monolingual
pretty_name: afrolm-dataset
size_categories:
- 1M<n<10M
source_datasets:
- original
tags:
- afrolm
- active learning
- language modeling
- research papers
- natural language processing
- self-active learning
task_categories:
- fill-mask
task_ids:
- masked-language-modeling
---
# AfroLM: A Self-Active Learning-based Multilingual Pretrained Language Model for 23 African Languages
- [GitHub Repository of the Paper](https://github.com/bonaventuredossou/MLM_AL)

This repository contains the model for our paper [`AfroLM: A Self-Active Learning-based Multilingual Pretrained Language Model for 23 African Languages`](https://arxiv.org/pdf/2211.03263.pdf), which appears in the Third Workshop on Simple and Efficient Natural Language Processing (SustaiNLP) at EMNLP 2022.

## Our self-active learning framework
![Model](afrolm.png)

## Languages Covered
AfroLM has been pretrained from scratch on 23 African Languages: Amharic, Afan Oromo, Bambara, Ghomalá, Éwé, Fon, Hausa, Ìgbò, Kinyarwanda, Lingala, Luganda, Luo, Mooré, Chewa, Naija, Shona, Swahili, Setswana, Twi, Wolof, Xhosa, Yorùbá, and Zulu.

## Evaluation Results
AfroLM was evaluated on the MasakhaNER1.0 (10 African languages) and MasakhaNER2.0 (21 African languages) datasets, as well as on text classification and sentiment analysis. AfroLM outperformed AfriBERTa, mBERT, and XLMR-base, and was very competitive with AfroXLMR. AfroLM is also very data efficient: it was pretrained on a dataset more than 14x smaller than those of its competitors. Below are the average F1-score performances of the models across the datasets; please consult our paper for per-language results.

| Model | MasakhaNER | MasakhaNER2.0* | Text Classification (Yoruba/Hausa) | Sentiment Analysis (YOSM) | OOD Sentiment Analysis (Twitter -> YOSM) |
|:---:|:---:|:---:|:---:|:---:|:---:|
| `AfroLM-Large` | **80.13** | **83.26** | **82.90/91.00** | **85.40** | **68.70** |
| `AfriBERTa` | 79.10 | 81.31 | 83.22/90.86 | 82.70 | 65.90 |
| `mBERT` | 71.55 | 80.68 | --- | --- | --- |
| `XLMR-base` | 79.16 | 83.09 | --- | --- | --- |
| `AfroXLMR-base` | `81.90` | `84.55` | --- | --- | --- |

- (*) The evaluation was performed on the 11 additional languages of the dataset.
- Bold numbers represent the performance of the model with the **smallest pretrained data**.
## Pretrained Models and Dataset

**Model:** [AfroLM-Large](https://huggingface.co/bonadossou/afrolm_active_learning) and **Dataset:** [AfroLM Dataset](https://huggingface.co/datasets/bonadossou/afrolm_active_learning_dataset)
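A minimal sketch for loading the dataset with the `datasets` library is below; calling `load_dataset` without a configuration name is an assumption, so check the dataset card for the exact configurations and splits.

```python
from datasets import load_dataset

# Load the AfroLM pretraining corpus from the Hugging Face Hub.
# Passing no configuration name is an assumption; consult the dataset card
# if a specific language configuration or split is required.
dataset = load_dataset("bonadossou/afrolm_active_learning_dataset")
print(dataset)
```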

## HuggingFace usage of AfroLM-large
```python
from transformers import XLMRobertaModel, XLMRobertaTokenizer

# AfroLM uses the XLM-RoBERTa architecture, so the XLMRoberta* classes load it directly.
model = XLMRobertaModel.from_pretrained("bonadossou/afrolm_active_learning")
tokenizer = XLMRobertaTokenizer.from_pretrained("bonadossou/afrolm_active_learning")
tokenizer.model_max_length = 256  # cap tokenized inputs at 256 tokens
```
The `AutoTokenizer` class does not load our tokenizer correctly, so we recommend using the `XLMRobertaTokenizer` class directly. Depending on your task, load the corresponding head of the model; see the [XLMRoberta Documentation](https://huggingface.co/docs/transformers/model_doc/xlm-roberta). A minimal masked-language-modeling example is sketched below.
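For instance, for the `fill-mask` task listed in this card, a sketch could look as follows (this assumes the checkpoint ships the masked-language-modeling head; the Yorùbá input sentence is purely illustrative):

```python
from transformers import XLMRobertaForMaskedLM, XLMRobertaTokenizer, pipeline

# Load the masked-language-modeling head of AfroLM-Large (assumes the MLM
# head weights are included in the checkpoint).
tokenizer = XLMRobertaTokenizer.from_pretrained("bonadossou/afrolm_active_learning")
tokenizer.model_max_length = 256
model = XLMRobertaForMaskedLM.from_pretrained("bonadossou/afrolm_active_learning")

# Fill in the masked token of an illustrative Yorùbá sentence.
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
print(fill_mask(f"Mo fẹ́ràn {tokenizer.mask_token} mi."))
```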

## Reproducing our result: Training and Evaluation

- To train the network, run `python active_learning.py`. You can also wrap it in a `bash` script.
- For the evaluation:
    - NER Classification: `bash ner_experiments.sh`
    - Text Classification & Sentiment Analysis: `bash text_classification_all.sh`
    

## Citation
 
```bibtex
@inproceedings{dossou-etal-2022-afrolm,
    title = "{A}fro{LM}: A Self-Active Learning-based Multilingual Pretrained Language Model for 23 {A}frican Languages",
    author = "Dossou, Bonaventure F. P.  and
      Tonja, Atnafu Lambebo  and
      Yousuf, Oreen  and
      Osei, Salomey  and
      Oppong, Abigail  and
      Shode, Iyanuoluwa  and
      Awoyomi, Oluwabusayo Olufunke  and
      Emezue, Chris",
    booktitle = "Proceedings of The Third Workshop on Simple and Efficient Natural Language Processing (SustaiNLP)",
    month = dec,
    year = "2022",
    address = "Abu Dhabi, United Arab Emirates (Hybrid)",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.sustainlp-1.11",
    pages = "52--64"
}
```

We will share the official proceedings citation as soon as possible. Stay tuned, and if you like our work, please give it a star.

## Reach out

Do you have a question? Please create an issue, and we will reach out as soon as possible.