---
language:
- en
- ha
- yo
- ig
- pcm
pipeline_tag: fill-mask
---

# NaijaXLM-T-base

This is an XLM-RoBERTa-base model further pretrained on 2.2 billion Nigerian tweets, described and evaluated in the [reference paper](https://arxiv.org/abs/2403.19260). This model was developed together with [@pvcastro](https://huggingface.co/pvcastro).

## Model Details

### Model Description

- **Model type:** xlm-roberta
- **Language(s) (NLP):** (Nigerian) English, Nigerian Pidgin, Hausa, Yoruba, Igbo
- **Finetuned from model:** xlm-roberta-base

### Model Sources

- **Repository:** https://github.com/manueltonneau/hate_speech_nigeria
- **Paper:** https://arxiv.org/abs/2403.19260

## Training Details

### Training Data

The model was further pretrained on 2.2 billion tweets posted between March 2007 and July 2023, forming the timelines of 2.8 million Twitter users with a profile location in Nigeria.

### Training Procedure

We performed adaptive fine-tuning of XLM-R on the Nigerian Twitter dataset. We kept the same vocabulary as XLM-R and trained the model until convergence for a total of one epoch, using 1% of the dataset as the validation set. Training was conducted in a distributed environment for approximately 10 days, on 4 nodes with 4 RTX 8000 GPUs each (16 GPUs in total) and a total batch size of 576.

## Evaluation

Evaluation results are reported in the [reference paper](https://arxiv.org/abs/2403.19260).

## BibTeX entry and citation information

Please cite the [reference paper](https://arxiv.org/abs/2403.19260) if you use this model.

```bibtex
@article{tonneau2024naijahate,
  title={NaijaHate: Evaluating Hate Speech Detection on Nigerian Twitter Using Representative Data},
  author={Tonneau, Manuel and de Castro, Pedro Vitor Quinta and Lasri, Karim and Farouq, Ibrahim and Subramanian, Lakshminarayanan and Orozco-Olvera, Victor and Fraiberger, Samuel},
  journal={arXiv preprint arXiv:2403.19260},
  year={2024}
}
```
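## How to use

Since the card is tagged `fill-mask`, the checkpoint can be loaded with the `transformers` fill-mask pipeline. The snippet below is a minimal sketch, not an official example: it uses `xlm-roberta-base` (the model this checkpoint was further pretrained from) as a stand-in, because this card does not state the model's Hub id; substitute the actual repository id to query NaijaXLM-T-base itself.

```python
from transformers import pipeline

# NOTE: "xlm-roberta-base" is a stand-in used here for illustration;
# replace it with the Hub id of the NaijaXLM-T-base checkpoint.
fill_mask = pipeline("fill-mask", model="xlm-roberta-base")

# XLM-R tokenizers use "<mask>" as the mask token.
predictions = fill_mask("Lagos is the largest city in <mask>.")

# Each prediction is a dict with the filled sequence, the predicted
# token string, and its score.
for p in predictions:
    print(f"{p['token_str']!r}: {p['score']:.3f}")
```

The pipeline returns the top candidate tokens for the masked position, which is a quick way to sanity-check what a further-pretrained masked language model has picked up from its domain data.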