arabert_c19: An Arabert model pretrained on 1.5 million COVID-19 multi-dialect Arabic tweets

ARABERT COVID-19 Arxiv URL is a pretrained (fine-tuned) version of the AraBERT v2 model (https://huggingface.co/aubmindlab/bert-base-arabertv02). The pretraining was done using 1.5 million multi-dialect Arabic tweets regarding the COVID-19 pandemic from the “Large Arabic Twitter Dataset on COVID-19” (https://arxiv.org/abs/2004.04315). The model can achieve better results for the tasks that deal with multi-dialect Arabic tweets in relation to the COVID-19 pandemic.

Classification results for multiple tasks including fake-news and hate speech detection when using arabert_c19 and mbert_ar_c19:

For more details refer to the paper (link)

arabert mbert distilbert multi arabert Covid-19 mbert Covid-19
Contains hate (Binary) 0.8346 0.6675 0.7145 0.8649 0.8492
Talk about a cure (Binary) 0.8193 0.7406 0.7127 0.9055 0.9176
News or opinion (Binary) 0.8987 0.8332 0.8099 0.9163 0.9116
Contains fake information (Binary) 0.6415 0.5428 0.4743 0.7739 0.7228

Preprocessing

from arabert.preprocess import ArabertPreprocessor
model_name="moha/arabert_c19"
arabert_prep = ArabertPreprocessor(model_name=model_name)
text = "للوقايه من عدم انتشار كورونا عليك اولا غسل اليدين بالماء والصابون وتكون عملية الغسل دقيقه تشمل راحة اليد الأصابع التركيز على الإبهام"
arabert_prep.preprocess(text)

Citation

Please cite as:

@misc{ameur2021aracovid19mfh,
      title={AraCOVID19-MFH: Arabic COVID-19 Multi-label Fake News and Hate Speech Detection Dataset}, 
      author={Mohamed Seghir Hadj Ameur and Hassina Aliane},
      year={2021},
      eprint={2105.03143},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Contacts

Hadj Ameur: Github | mohamedhadjameur@gmail.com | mhadjameur@cerist.dz

Downloads last month
414
Inference Examples
This model does not have enough activity to be deployed to Inference API (serverless) yet. Increase its social visibility and check back later, or deploy to Inference Endpoints (dedicated) instead.