QARiB: QCRI Arabic and Dialectal BERT

About QARiB

The QCRI Arabic and Dialectal BERT (QARiB) model was trained on a collection of ~420 million tweets and ~180 million sentences of text. The tweets were collected through the Twitter API using the language filter lang:ar. The text data was a combination of the Arabic GigaWord corpus, the Abulkhair Arabic Corpus, and OPUS.
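The card does not include the collection code. As a rough illustration only, a comparable lang:ar filtered crawl could be run today against Twitter's v2 search endpoint, e.g. via the tweepy client; the original QARiB crawl predates this API, and the credential below is a placeholder:

import tweepy  # assumption: tweepy >= 4.x; not part of the original QARiB pipeline

client = tweepy.Client(bearer_token="YOUR_BEARER_TOKEN")  # placeholder credential

# "lang:ar" is the same language filter mentioned above.
response = client.search_recent_tweets(query="lang:ar", max_results=100)
for tweet in response.data or []:
    print(tweet.text)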

bert-base-qarib60_860k

  • Data size: 60 GB
  • Number of Iterations: 860k
  • Loss: 2.2454472

Training QARiB

The training of the model was performed using Google's original TensorFlow code on Google Cloud TPU v2. We used a Google Cloud Storage bucket for persistent storage of training data and models. See more details in Training QARiB.

Using QARiB

You can use the raw model for either masked language modeling or next sentence prediction, but it is mostly intended to be fine-tuned on a downstream task. See the model hub to look for fine-tuned versions on a task that interests you. For more details, see Using QARiB.
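For example, the checkpoint can be loaded for further fine-tuning with the standard transformers auto classes (a minimal sketch; the model identifier is the public Hugging Face ID listed in the download section below):

from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("qarib/bert-base-qarib60_860k")
model = AutoModelForMaskedLM.from_pretrained("qarib/bert-base-qarib60_860k")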

How to use

You can use this model directly with a pipeline for masked language modeling. The first two prompts below are dialectal Arabic and the last two are Modern Standard Arabic:

>>> from transformers import pipeline
>>> fill_mask = pipeline("fill-mask", model="qarib/bert-base-qarib60_860k")

>>> fill_mask("شو عندكم يا [MASK]")
[{'sequence': '[CLS] شو عندكم يا عرب [SEP]', 'score': 0.0990147516131401, 'token': 2355, 'token_str': 'عرب'},
 {'sequence': '[CLS] شو عندكم يا جماعة [SEP]', 'score': 0.051633741706609726, 'token': 2308, 'token_str': 'جماعة'},
 {'sequence': '[CLS] شو عندكم يا شباب [SEP]', 'score': 0.046871256083250046, 'token': 939, 'token_str': 'شباب'},
 {'sequence': '[CLS] شو عندكم يا رفاق [SEP]', 'score': 0.03598872944712639, 'token': 7664, 'token_str': 'رفاق'},
 {'sequence': '[CLS] شو عندكم يا ناس [SEP]', 'score': 0.031996358186006546, 'token': 271, 'token_str': 'ناس'}]

>>> fill_mask("قللي وشفيييك يرحم [MASK]")
[{'sequence': '[CLS] قللي وشفيييك يرحم والديك [SEP]', 'score': 0.4152909517288208, 'token': 9650, 'token_str': 'والديك'},
 {'sequence': '[CLS] قللي وشفيييك يرحملي [SEP]', 'score': 0.07663793861865997, 'token': 294, 'token_str': '##لي'},
 {'sequence': '[CLS] قللي وشفيييك يرحم حالك [SEP]', 'score': 0.0453166700899601, 'token': 2663, 'token_str': 'حالك'},
 {'sequence': '[CLS] قللي وشفيييك يرحم امك [SEP]', 'score': 0.04390475153923035, 'token': 1942, 'token_str': 'امك'},
 {'sequence': '[CLS] قللي وشفيييك يرحمونك [SEP]', 'score': 0.027349254116415977, 'token': 3283, 'token_str': '##ونك'}]

>>> fill_mask("وقام المدير [MASK]")
[{'sequence': '[CLS] وقام المدير بالعمل [SEP]', 'score': 0.0678194984793663, 'token': 4230, 'token_str': 'بالعمل'},
 {'sequence': '[CLS] وقام المدير بذلك [SEP]', 'score': 0.05191086605191231, 'token': 984, 'token_str': 'بذلك'},
 {'sequence': '[CLS] وقام المدير بالاتصال [SEP]', 'score': 0.045264165848493576, 'token': 26096, 'token_str': 'بالاتصال'},
 {'sequence': '[CLS] وقام المدير بعمله [SEP]', 'score': 0.03732728958129883, 'token': 40486, 'token_str': 'بعمله'},
 {'sequence': '[CLS] وقام المدير بالامر [SEP]', 'score': 0.0246378555893898, 'token': 29124, 'token_str': 'بالامر'}]

>>> fill_mask("وقامت المديرة [MASK]")
[{'sequence': '[CLS] وقامت المديرة بذلك [SEP]', 'score': 0.23992691934108734, 'token': 984, 'token_str': 'بذلك'},
 {'sequence': '[CLS] وقامت المديرة بالامر [SEP]', 'score': 0.108805812895298, 'token': 29124, 'token_str': 'بالامر'},
 {'sequence': '[CLS] وقامت المديرة بالعمل [SEP]', 'score': 0.06639821827411652, 'token': 4230, 'token_str': 'بالعمل'},
 {'sequence': '[CLS] وقامت المديرة بالاتصال [SEP]', 'score': 0.05613093823194504, 'token': 26096, 'token_str': 'بالاتصال'},
 {'sequence': '[CLS] وقامت المديرة المديرة [SEP]', 'score': 0.021778125315904617, 'token': 41635, 'token_str': 'المديرة'}]
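Equivalently, the pipeline's top-5 predictions can be reproduced with a few lines of PyTorch. This is a minimal sketch using the public checkpoint ID, not code from the QARiB authors:

import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("qarib/bert-base-qarib60_860k")
model = AutoModelForMaskedLM.from_pretrained("qarib/bert-base-qarib60_860k")

inputs = tokenizer("شو عندكم يا [MASK]", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Locate the [MASK] position and take the five highest-scoring vocabulary ids.
mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero().item()
top5 = logits[0, mask_pos].topk(5).indices.tolist()
print(tokenizer.convert_ids_to_tokens(top5))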

Training procedure

The training of the model was performed using Google's original TensorFlow code on an eight-core Google Cloud TPU v2. We used a Google Cloud Storage bucket for persistent storage of training data and models.

Eval results

We evaluated QARiB models on five NLP downstream tasks:

  • Sentiment Analysis
  • Emotion Detection
  • Named-Entity Recognition (NER)
  • Offensive Language Detection
  • Dialect Identification

On these tasks, the QARiB models outperform multilingual BERT, AraBERT, and ArabicBERT; a sketch of such a fine-tuning setup follows below.
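The card does not ship the evaluation code. The sketch below shows how one of these tasks (binary sentiment analysis) could be set up on top of QARiB with the transformers Trainer; the toy dataset, label count, and hyperparameters are illustrative placeholders, not the settings behind the reported results:

import torch
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

MODEL_ID = "qarib/bert-base-qarib60_860k"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID, num_labels=2)

# Placeholder data: substitute a real labelled Arabic sentiment corpus.
train_texts = ["مثال إيجابي", "مثال سلبي"]  # "positive example", "negative example"
train_labels = [1, 0]
encodings = tokenizer(train_texts, truncation=True, padding=True)

class SentimentDataset(torch.utils.data.Dataset):
    """Wraps the tokenizer output and labels for the Trainer."""
    def __init__(self, encodings, labels):
        self.encodings, self.labels = encodings, labels
    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item
    def __len__(self):
        return len(self.labels)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="qarib-sentiment", num_train_epochs=3),
    train_dataset=SentimentDataset(encodings, train_labels),
)
trainer.train()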

Model Weights and Vocab Download

From the Hugging Face model hub: https://huggingface.co/qarib/bert-base-qarib60_860k
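Programmatically, the same weights and vocabulary can be fetched with the huggingface_hub client (a small sketch; from_pretrained, as shown above, downloads and caches them as well):

from huggingface_hub import snapshot_download

# Downloads all files of the checkpoint to the local cache and returns the path.
local_dir = snapshot_download("qarib/bert-base-qarib60_860k")
print(local_dir)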

Contacts

Ahmed Abdelali, Sabit Hassan, Hamdy Mubarak, Kareem Darwish and Younes Samih

Reference

@article{abdelali2020qarib,
  title={QARiB: QCRI Arabic and Dialectal BERT},
  author={Abdelali, Ahmed and Hassan, Sabit and Mubarak, Hamdy and Darwish, Kareem and Samih, Younes},
  url={https://github.com/qcri/QARIB},
  year={2020}
}