File size: 2,240 Bytes
587b188
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
04febdd
587b188
 
 
 
 
 
 
 
b81ea4c
587b188
b81ea4c
587b188
b81ea4c
587b188
 
b81ea4c
 
587b188
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
---
language: ar
tags:
- pytorch
- tf
- QARiB
- qarib
datasets:
- arabic_billion_words
- open_subtitles
- twitter
- Farasa
metrics:
- f1
widget:
 - text: "و+قام ال+مدير [MASK]"
---
# QARiB: QCRI Arabic and Dialectal BERT
## About QARiB Farasa
QCRI Arabic and Dialectal BERT  (QARiB) model, was trained on a collection of ~ 420 Million tweets and ~ 180 Million sentences of text.
For the tweets, the data was collected using twitter API and using language filter. `lang:ar`. For the text data, it was a combination from
[Arabic GigaWord](url), [Abulkhair Arabic Corpus]() and [OPUS](http://opus.nlpl.eu/).
QARiB: Is the Arabic name for "Boat".
## Model and Parameters:
- Data size: 14B tokens 
- Vocabulary: 64k
- Iterations:  10M
- Number of Layers: 12
## Training QARiB
See details in [Training QARiB](https://github.com/qcri/QARIB/Training_QARiB.md)
## Using QARiB
You can use the raw model for either masked language modeling or next sentence prediction, but it's mostly intended to be fine-tuned on a downstream task. See the model hub to look for fine-tuned versions on a task that interests you. For more details, see [Using QARiB](https://github.com/qcri/QARIB/Using_QARiB.md)

This model expects the data to be segmented. You may use [Farasa Segmenter](https://farasa-api.qcri.org/segmentation/) API.

### How to use
You can use this model directly with a pipeline for masked language modeling:
```python
>>>from transformers import pipeline
>>>fill_mask = pipeline("fill-mask", model="./models/bert-base-qarib_far")
>>> fill_mask("و+قام ال+مدير [MASK]")

>>> fill_mask("و+قام+ت ال+مدير+ة [MASK]")

>>> fill_mask("قللي وشفيييك يرحم [MASK]")

```
## Evaluations:


## Model Weights and Vocab Download
From Huggingface site: https://huggingface.co/qarib/bert-base-qarib_far
## Contacts
Ahmed Abdelali, Sabit Hassan, Hamdy Mubarak, Kareem Darwish and Younes Samih
## Reference
```
@article{abdelali2021pretraining,
    title={Pre-Training BERT on Arabic Tweets: Practical Considerations},
    author={Ahmed Abdelali and Sabit Hassan and Hamdy Mubarak and Kareem Darwish and Younes Samih},
    year={2021},
    eprint={2102.10684},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
```