mbert_c19: An mBERT model pretrained on 1.5 million COVID-19 multi-dialect Arabic tweets

mBERT COVID-19 is a further-pretrained version of the mBERT model (https://huggingface.co/bert-base-multilingual-cased). The additional pretraining used 1.5 million multi-dialect Arabic tweets about the COVID-19 pandemic, taken from the “Large Arabic Twitter Dataset on COVID-19” (https://arxiv.org/abs/2004.04315). In the results reported below, the model scores higher than the base mBERT on all four classification tasks involving multi-dialect Arabic tweets about COVID-19.

Classification results on several tasks, including fake-news and hate-speech detection, for arabert_c19 and mbert_ar_c19 alongside the baseline models (for more details, refer to the paper, arXiv:2105.03143):

| Task | arabert | mbert | distilbert multi | arabert Covid-19 | mbert Covid-19 |
|---|---|---|---|---|---|
| Contains hate (Binary) | 0.8346 | 0.6675 | 0.7145 | 0.8649 | 0.8492 |
| Talk about a cure (Binary) | 0.8193 | 0.7406 | 0.7127 | 0.9055 | 0.9176 |
| News or opinion (Binary) | 0.8987 | 0.8332 | 0.8099 | 0.9163 | 0.9116 |
| Contains fake information (Binary) | 0.6415 | 0.5428 | 0.4743 | 0.7739 | 0.7228 |
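To make the comparison easier to read, the gains of the COVID-19 models over their base models can be computed directly from the scores above. The following is a minimal sketch: the numbers are copied from the results, while the names `SCORES` and `gain` are illustrative and not part of any released code. This card does not name the metric, so the script only reports differences.

```python
# Scores copied from the results above; column order:
# arabert, mbert, distilbert multi, arabert Covid-19, mbert Covid-19
MODELS = ["arabert", "mbert", "distilbert multi", "arabert Covid-19", "mbert Covid-19"]

SCORES = {
    "Contains hate": [0.8346, 0.6675, 0.7145, 0.8649, 0.8492],
    "Talk about a cure": [0.8193, 0.7406, 0.7127, 0.9055, 0.9176],
    "News or opinion": [0.8987, 0.8332, 0.8099, 0.9163, 0.9116],
    "Contains fake information": [0.6415, 0.5428, 0.4743, 0.7739, 0.7228],
}


def gain(task: str, adapted: str, base: str) -> float:
    """Return the score difference (adapted minus base) for one task."""
    row = SCORES[task]
    return round(row[MODELS.index(adapted)] - row[MODELS.index(base)], 4)


for task in SCORES:
    mbert_gain = gain(task, "mbert Covid-19", "mbert")
    arabert_gain = gain(task, "arabert Covid-19", "arabert")
    print(f"{task}: mbert {mbert_gain:+.4f}, arabert {arabert_gain:+.4f}")
```

On these numbers, both COVID-19 models improve on their base models for every task, with the larger gains on the mBERT side.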

Preprocessing

from arabert.preprocess import ArabertPreprocessor

# Build the preprocessor for this model
model_name = "moha/mbert_ar_c19"
arabert_prep = ArabertPreprocessor(model_name=model_name)

# Example text: hand-washing advice for preventing the spread of the coronavirus
text = "للوقايه من عدم انتشار كورونا عليك اولا غسل اليدين بالماء والصابون وتكون عملية الغسل دقيقه تشمل راحة اليد الأصابع التركيز على الإبهام"
print(arabert_prep.preprocess(text))
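Once a text is preprocessed, it can be passed to the model for masked-token prediction. The sketch below assumes the `transformers` library is installed and that the model is loaded through its standard `fill-mask` pipeline. The helper names `mask_word` and `top_predictions` are illustrative and not part of this model's released code.

```python
def mask_word(text: str, word: str, mask_token: str = "[MASK]") -> str:
    """Replace the first occurrence of the substring `word` with the mask token."""
    if word not in text:
        raise ValueError(f"{word!r} does not occur in the text")
    return text.replace(word, mask_token, 1)


def top_predictions(masked_text: str, model_name: str = "moha/mbert_ar_c19", top_k: int = 5):
    """Return (token, score) pairs predicted for the single masked position.

    The import is kept inside the function so that the helper above can be
    used without transformers installed. The first call downloads the model.
    """
    from transformers import pipeline

    fill_mask = pipeline("fill-mask", model=model_name)
    predictions = fill_mask(masked_text, top_k=top_k)
    return [(p["token_str"], p["score"]) for p in predictions]
```

For example, `top_predictions(mask_word("غسل اليدين بالماء والصابون", "بالماء"))` would mask one word of a hand-washing phrase and return the model's five most likely fillers. The input must contain exactly one mask token for the list comprehension above to apply.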

Citation

Please cite as:

@misc{ameur2021aracovid19mfh,
      title={AraCOVID19-MFH: Arabic COVID-19 Multi-label Fake News and Hate Speech Detection Dataset}, 
      author={Mohamed Seghir Hadj Ameur and Hassina Aliane},
      year={2021},
      eprint={2105.03143},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Contacts

Hadj Ameur: Github | mohamedhadjameur@gmail.com | mhadjameur@cerist.dz
