Ahmed Abdelali committed on
Commit b6b135c
1 Parent(s): aabc069

push farasa base model

Files changed (5):
  1. README.md +84 -0
  2. config.json +19 -0
  3. model.ckpt.index +0 -0
  4. model.ckpt.meta +0 -0
  5. vocab.txt +0 -0
README.md ADDED
@@ -0,0 +1,84 @@
---
language: ar
tags:
- pytorch
- tf
- QARiB
- qarib
datasets:
- arabic_billion_words
- open_subtitles
- twitter
- Farasa
metrics:
- f1
widget:
- text: "و+قام ال+مدير [MASK]"
---
# QARiB: QCRI Arabic and Dialectal BERT
## About QARiB Farasa
The QCRI Arabic and Dialectal BERT (QARiB) model was trained on a collection of ~420 million tweets and ~180 million sentences of text.
The tweets were collected through the Twitter API using the language filter `lang:ar`. The text data is a combination of
[Arabic GigaWord](url), [Abulkhair Arabic Corpus]() and [OPUS](http://opus.nlpl.eu/).
QARiB is the Arabic word for "boat".
## Model and Parameters:
- Data size: 14B tokens
- Vocabulary: 64k
- Iterations: 10M
- Number of Layers: 12
## Training QARiB
See details in [Training QARiB](https://github.com/qcri/QARIB/Training_QARiB.md).
## Using QARiB
You can use the raw model for either masked language modeling or next sentence prediction, but it is mostly intended to be fine-tuned on a downstream task. See the model hub for fine-tuned versions on a task that interests you. For more details, see [Using QARiB](https://github.com/qcri/QARIB/Using_QARiB.md).

This model expects the input text to be segmented. You may use the [Farasa Segmenter](https://farasa-api.qcri.org/segmentation/) API, as in the sketch below.
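A minimal sketch of that pre-segmentation step follows. The endpoint URL is taken from the link above, but the request and response shape (`text`/`api_key` fields) are assumptions, not the documented Farasa contract; consult the API page for the actual parameters.

```python
# Hypothetical helper for pre-segmenting input via the Farasa API.
# The JSON body ("text", "api_key") and the response field ("text") are
# assumptions; check the Farasa API page for the actual contract.
import requests

FARASA_URL = "https://farasa-api.qcri.org/segmentation/"

def farasa_segment(text: str, api_key: str) -> str:
    resp = requests.post(FARASA_URL, json={"text": text, "api_key": api_key})
    resp.raise_for_status()
    return resp.json()["text"]  # assumed response field

# e.g. farasa_segment("وقام المدير", api_key="...") -> "و+قام ال+مدير"
```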
### How to use
You can use this model directly with a pipeline for masked language modeling:
```python
>>> from transformers import pipeline
>>> fill_mask = pipeline("fill-mask", model="qarib/bert-base-qarib_far")
>>> fill_mask("و+قام ال+مدير [MASK]")
[
 {'sequence': '[CLS] وقام المدير بالعمل [SEP]', 'score': 0.0678194984793663, 'token': 4230, 'token_str': 'بالعمل'},
 {'sequence': '[CLS] وقام المدير بذلك [SEP]', 'score': 0.05191086605191231, 'token': 984, 'token_str': 'بذلك'},
 {'sequence': '[CLS] وقام المدير بالاتصال [SEP]', 'score': 0.045264165848493576, 'token': 26096, 'token_str': 'بالاتصال'},
 {'sequence': '[CLS] وقام المدير بعمله [SEP]', 'score': 0.03732728958129883, 'token': 40486, 'token_str': 'بعمله'},
 {'sequence': '[CLS] وقام المدير بالامر [SEP]', 'score': 0.0246378555893898, 'token': 29124, 'token_str': 'بالامر'}
]
>>> fill_mask("و+قام+ت ال+مدير+ة [MASK]")
[{'sequence': '[CLS] وقامت المديرة بذلك [SEP]', 'score': 0.23992691934108734, 'token': 984, 'token_str': 'بذلك'},
 {'sequence': '[CLS] وقامت المديرة بالامر [SEP]', 'score': 0.108805812895298, 'token': 29124, 'token_str': 'بالامر'},
 {'sequence': '[CLS] وقامت المديرة بالعمل [SEP]', 'score': 0.06639821827411652, 'token': 4230, 'token_str': 'بالعمل'},
 {'sequence': '[CLS] وقامت المديرة بالاتصال [SEP]', 'score': 0.05613093823194504, 'token': 26096, 'token_str': 'بالاتصال'},
 {'sequence': '[CLS] وقامت المديرة المديرة [SEP]', 'score': 0.021778125315904617, 'token': 41635, 'token_str': 'المديرة'}]
>>> fill_mask("قللي وشفيييك يرحم [MASK]")
[{'sequence': '[CLS] قللي وشفيييك يرحم والديك [SEP]', 'score': 0.4152909517288208, 'token': 9650, 'token_str': 'والديك'},
 {'sequence': '[CLS] قللي وشفيييك يرحملي [SEP]', 'score': 0.07663793861865997, 'token': 294, 'token_str': '##لي'},
 {'sequence': '[CLS] قللي وشفيييك يرحم حالك [SEP]', 'score': 0.0453166700899601, 'token': 2663, 'token_str': 'حالك'},
 {'sequence': '[CLS] قللي وشفيييك يرحم امك [SEP]', 'score': 0.04390475153923035, 'token': 1942, 'token_str': 'امك'},
 {'sequence': '[CLS] قللي وشفيييك يرحمونك [SEP]', 'score': 0.027349254116415977, 'token': 3283, 'token_str': '##ونك'}]
```
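For fine-tuning on a downstream task, the sketch below loads the checkpoint for sequence classification. The hub id comes from the download section further down; `num_labels=3` is an illustrative placeholder, and PyTorch weights are assumed to be published on the hub.

```python
# Sketch: loading the checkpoint for a downstream classification task.
# num_labels=3 is a placeholder, not a value from this repository.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("qarib/bert-base-qarib_far")
model = AutoModelForSequenceClassification.from_pretrained(
    "qarib/bert-base-qarib_far", num_labels=3
)

# Inputs must be Farasa-segmented, matching the pretraining data.
inputs = tokenizer("و+قام ال+مدير ب+ال+عمل", return_tensors="pt")
logits = model(**inputs).logits  # shape: (1, num_labels)
```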
## Evaluations:
Scores are F1 (the `metrics` field above); the best result per task is in bold.

|**Experiment** |**mBERT**|**AraBERT0.1**|**AraBERT1.0**|**ArabicBERT**|**QARiB**|
|---------------|---------|--------------|--------------|--------------|---------|
|Dialect Identification | 6.06% | 59.92% | 59.85% | 61.70% | **65.21%** |
|Emotion Detection | 27.90% | 43.89% | 42.37% | 41.65% | **44.35%** |
|Named-Entity Recognition (NER) | 49.38% | 64.97% | **66.63%** | 64.04% | 61.62% |
|Offensive Language Detection | 83.14% | 88.07% | 88.97% | 88.19% | **91.94%** |
|Sentiment Analysis | 86.61% | 90.80% | **93.58%** | 83.27% | 93.31% |
## Model Weights and Vocab Download
From the Huggingface site: https://huggingface.co/qarib/bert-base-qarib_far
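As a sketch, the whole repository (weights, config and vocab) can also be fetched programmatically with `huggingface_hub`:

```python
# Sketch: download all files of the model repo to the local cache.
from huggingface_hub import snapshot_download

local_dir = snapshot_download("qarib/bert-base-qarib_far")
print(local_dir)  # path containing config.json, vocab.txt, checkpoint files
```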
## Contacts
Ahmed Abdelali, Sabit Hassan, Hamdy Mubarak, Kareem Darwish and Younes Samih
## Reference
```
@article{abdelali2021pretraining,
  title={Pre-Training BERT on Arabic Tweets: Practical Considerations},
  author={Ahmed Abdelali and Sabit Hassan and Hamdy Mubarak and Kareem Darwish and Younes Samih},
  year={2021},
  eprint={2102.10684},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```
config.json ADDED
@@ -0,0 +1,19 @@
{
  "attention_probs_dropout_prob": 0.1,
  "directionality": "bidi",
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "max_position_embeddings": 512,
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pooler_fc_size": 768,
  "pooler_num_attention_heads": 12,
  "pooler_num_fc_layers": 3,
  "pooler_size_per_head": 128,
  "pooler_type": "first_token_transform",
  "type_vocab_size": 2,
  "vocab_size": 64000
}
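These are standard BERT-base hyperparameters plus the 64k vocabulary; the `pooler_*` and `directionality` fields come from the original TF BERT config format. A minimal sketch of the architecture this config describes, built explicitly in `transformers`:

```python
# Sketch: the architecture described by config.json above, built explicitly.
from transformers import BertConfig, BertForMaskedLM

config = BertConfig(
    vocab_size=64000,
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
    intermediate_size=3072,
    hidden_act="gelu",
    max_position_embeddings=512,
    type_vocab_size=2,
)
model = BertForMaskedLM(config)  # randomly initialized; weights load separately
```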
model.ckpt.index ADDED
Binary file (9.38 kB).
model.ckpt.meta ADDED
Binary file (4.71 MB).
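The `model.ckpt.*` files are a TensorFlow checkpoint. Below is a hedged sketch of converting it to a PyTorch `pytorch_model.bin`, mirroring transformers' BERT conversion utility and assuming the checkpoint's `.data` shard is also available locally (it is not part of this commit):

```python
# Sketch: TF-to-PyTorch conversion via transformers' BERT loader.
# Requires tensorflow to be installed for reading the checkpoint.
import torch
from transformers import BertConfig, BertForPreTraining, load_tf_weights_in_bert

config = BertConfig.from_json_file("config.json")
model = BertForPreTraining(config)
load_tf_weights_in_bert(model, config, "model.ckpt")  # prefix of .index/.meta/.data
torch.save(model.state_dict(), "pytorch_model.bin")
```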
vocab.txt ADDED
The diff for this file is too large to render.