mbert_ar_c19: An mBERT model pretrained on 1.5 million COVID-19 multi-dialect Arabic tweets

mBERT COVID-19 (mbert_ar_c19) is a further-pretrained (fine-tuned) version of the mBERT model (https://huggingface.co/bert-base-multilingual-cased). The additional pretraining used 1.5 million multi-dialect Arabic tweets about the COVID-19 pandemic, taken from the “Large Arabic Twitter Dataset on COVID-19” (https://arxiv.org/abs/2004.04315). The model is intended for tasks that involve multi-dialect Arabic tweets about the COVID-19 pandemic, where it scores higher than the base mBERT model on the classification tasks reported in this card.

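Classification results for multiple tasks, including fake-news and hate-speech detection, comparing the base models with arabert_c19 ("arabert Covid-19") and mbert_ar_c19 ("mbert Covid-19"):

| Task                               | arabert | mbert  | distilbert multi | arabert Covid-19 | mbert Covid-19 |
|------------------------------------|---------|--------|------------------|------------------|----------------|
| Contains hate (Binary)             | 0.8346  | 0.6675 | 0.7145           | 0.8649           | 0.8492         |
| Talk about a cure (Binary)         | 0.8193  | 0.7406 | 0.7127           | 0.9055           | 0.9176         |
| News or opinion (Binary)           | 0.8987  | 0.8332 | 0.8099           | 0.9163           | 0.9116         |
| Contains fake information (Binary) | 0.6415  | 0.5428 | 0.4743           | 0.7739           | 0.7228         |

For more details, refer to the paper listed under Citation.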
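To make the comparison concrete, the short sketch below computes the per-task gain of the "mbert Covid-19" column over the base "mbert" column. The scores are copied from the results above; the variable names are illustrative only, and the card does not name the metric, so the differences are plain score deltas.

```python
# Scores copied from the results in this card: (mbert, mbert Covid-19) per task.
# The metric is not named in the card, so these are treated as plain scores.
scores = {
    "Contains hate": (0.6675, 0.8492),
    "Talk about a cure": (0.7406, 0.9176),
    "News or opinion": (0.8332, 0.9116),
    "Contains fake information": (0.5428, 0.7228),
}

# Absolute gain of the COVID-19 variant over the base model, per task.
gains = {
    task: round(adapted - base, 4)
    for task, (base, adapted) in scores.items()
}

# Unweighted mean gain across the four tasks.
mean_gain = sum(gains.values()) / len(gains)

for task, gain in gains.items():
    print(f"{task}: +{gain:.4f}")
print(f"Mean gain: +{mean_gain:.4f}")
```

On these numbers the gain is about 0.18 on three of the four tasks and about 0.08 on "News or opinion", where the base mBERT score was already the highest.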

Preprocessing

from arabert.preprocess import ArabertPreprocessor

# Model identifier used to configure the preprocessor
model_name = "moha/mbert_ar_c19"
arabert_prep = ArabertPreprocessor(model_name=model_name)

text = "للوقايه من عدم انتشار كورونا عليك اولا غسل اليدين بالماء والصابون وتكون عملية الغسل دقيقه تشمل راحة اليد الأصابع التركيز على الإبهام"
# Normalize the tweet text before tokenization
print(arabert_prep.preprocess(text))
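Once the text is preprocessed, the model can presumably be loaded like any other BERT checkpoint. The following is a minimal sketch, assuming the checkpoint is published under the identifier used above and is loadable with the `transformers` Auto classes. The helper name `embed_tweets`, the `max_length` value, and the use of the [CLS] vector as a tweet representation are illustrative choices, not the authors' documented procedure.

```python
MODEL_NAME = "moha/mbert_ar_c19"  # identifier taken from the preprocessing snippet


def embed_tweets(texts, model_name=MODEL_NAME, max_length=128):
    """Return one [CLS] vector per input text (hypothetical helper)."""
    # Imported here so the heavy dependencies are only needed when the function runs.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()  # disable dropout for inference

    # Pad and truncate so the tweets form one rectangular batch.
    batch = tokenizer(
        texts,
        padding=True,
        truncation=True,
        max_length=max_length,  # assumed limit, not taken from the card
        return_tensors="pt",
    )
    with torch.no_grad():
        outputs = model(**batch)

    # last_hidden_state has shape (batch, seq_len, hidden); position 0 is [CLS].
    return outputs.last_hidden_state[:, 0, :]
```

Calling `embed_tweets([arabert_prep.preprocess(text)])` should return a tensor with one row per tweet. For the classification tasks listed in this card, a task-specific head would still need to be fine-tuned on labeled data.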

Citation

Please cite as:

@misc{ameur2021aracovid19mfh,
      title={AraCOVID19-MFH: Arabic COVID-19 Multi-label Fake News and Hate Speech Detection Dataset}, 
      author={Mohamed Seghir Hadj Ameur and Hassina Aliane},
      year={2021},
      eprint={2105.03143},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Contacts

Hadj Ameur: Github | mohamedhadjameur@gmail.com | mhadjameur@cerist.dz
