---
language:
- en
- ha
- yo
- ig
- pcm
pipeline_tag: fill-mask
---
# NaijaXLM-T-base
This is an XLM-RoBERTa-base model further pretrained on 2.2 billion Nigerian tweets, described and evaluated in the reference paper (TODO). This model was developed together with @pvcastro.
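The model can be queried through the Hugging Face Transformers fill-mask pipeline, as sketched below. The model identifier used here is an assumption; replace it with this repository's actual name on the Hub.

```python
from transformers import pipeline

# Hypothetical model ID; replace with this repository's actual name on the Hub.
fill_mask = pipeline("fill-mask", model="manueltonneau/naija-xlm-t-base")

# XLM-R checkpoints use <mask> as the mask token.
print(fill_mask("I dey go <mask> tomorrow."))
```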
## Model Details
### Model Description
- **Model type:** xlm-roberta
- **Language(s) (NLP):** (Nigerian) English, Nigerian Pidgin, Hausa, Yoruba, Igbo
- **Finetuned from model:** xlm-roberta-base
### Model Sources
- **Repository:** https://github.com/manueltonneau/hate_speech_nigeria
- **Paper:** TODO
## Training Details
### Training Data
The model was further pretrained on 2.2 billion tweets posted between March 2007 and July 2023, comprising the timelines of 2.8 million Twitter users with a profile location in Nigeria.
### Training Procedure
We performed adaptive fine-tuning of XLM-R on the Nigerian Twitter dataset. We kept the same vocabulary as XLM-R and trained the model for one epoch, using 1% of the dataset as a validation set. Training was conducted in a distributed environment for approximately 10 days, on 4 nodes with 4 RTX 8000 GPUs each and a total batch size of 576.
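For illustration, the sketch below shows this kind of continued masked-language-model pretraining with Hugging Face Transformers. It is a minimal sketch, not the actual training script: the corpus file `nigerian_tweets.txt`, the sequence length, the masking probability, and the checkpointing settings are assumptions; only the single epoch, the 1% validation split, the unchanged XLM-R vocabulary, and the total batch size of 576 come from this card.

```python
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Start from the original XLM-R checkpoint and keep its vocabulary unchanged.
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForMaskedLM.from_pretrained("xlm-roberta-base")

# Hypothetical corpus file with one tweet per line; the 2.2B-tweet dataset
# itself is not publicly released.
dataset = load_dataset("text", data_files={"train": "nigerian_tweets.txt"})
dataset = dataset["train"].train_test_split(test_size=0.01)  # 1% validation

def tokenize(batch):
    # Sequence length of 128 is an assumption suited to tweet-length texts.
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# Standard MLM objective; 15% masking is the common default, assumed here.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="naija-xlm-t-base",
    num_train_epochs=1,               # one epoch, as stated above
    per_device_train_batch_size=36,   # 36 x (4 nodes x 4 GPUs) = 576 total
    evaluation_strategy="steps",
    eval_steps=10_000,
    save_steps=10_000,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    data_collator=collator,
)
trainer.train()
```

Launched with `torchrun` (or a similar distributed launcher) across the 16 GPUs, the per-device batch size of 36 reproduces the total batch size of 576 reported above.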
## Evaluation
## BibTeX entry and citation information

Please cite the reference paper (TODO) if you use this model.

```bibtex
@inproceedings{XXX}
```