julien-c (HF staff) committed
Commit: d4e7a8c
Parent(s): aae1c4f

Migrate model card from transformers-repo


Read announcement at https://discuss.huggingface.co/t/announcement-all-model-cards-will-be-migrated-to-hf-co-model-repos/2755
Original file history: https://github.com/huggingface/transformers/commits/master/model_cards/qarib/bert-base-qarib60_1790k/README.md

Files changed (1)
  1. README.md +96 -0
README.md ADDED
@@ -0,0 +1,96 @@
---
language: ar
tags:
- qarib
license: apache-2.0
datasets:
- Arabic GigaWord
- Abulkhair Arabic Corpus
- opus
- Twitter data
---

# QARiB: QCRI Arabic and Dialectal BERT

## About QARiB
The QCRI Arabic and Dialectal BERT (QARiB) model was trained on a collection of ~420 million tweets and ~180 million sentences of text.
The tweets were collected through the Twitter API using the language filter `lang:ar`. The text data is a combination of the
Arabic GigaWord corpus, the Abulkhair Arabic Corpus, and [OPUS](http://opus.nlpl.eu/).

### bert-base-qarib60_1790k
- Data size: 60 GB
- Number of iterations: 1790k
- Loss: 1.8764963

## Training QARiB
The model was trained using Google's original TensorFlow code on a Google Cloud TPU v2.
We used a Google Cloud Storage bucket for persistent storage of training data and models.
See [Training QARiB](../Training_QARiB.md) for more details.

## Using QARiB

You can use the raw model for either masked language modeling or next sentence prediction, but it's mostly intended to be fine-tuned on a downstream task. See the model hub to look for fine-tuned versions on a task that interests you. For more details, see [Using QARiB](../Using_QARiB.md).

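A minimal sketch of that starting point, assuming the checkpoint is loaded from this repository under the id `qarib/bert-base-qarib60_1790k`:

```python
from transformers import AutoModel, AutoTokenizer

# Assumed Hub id for this checkpoint; point this at a local directory if you keep the weights offline.
model_name = "qarib/bert-base-qarib60_1790k"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Encode an Arabic sentence and extract contextual embeddings,
# e.g. as input features for a downstream classifier.
inputs = tokenizer("شو عندكم يا شباب", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch_size, sequence_length, hidden_size)
```

The same `model_name` also works with the task-specific Auto classes (for example `AutoModelForSequenceClassification`) when fine-tuning.
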
### How to use
You can use this model directly with a pipeline for masked language modeling:

```python
>>> from transformers import pipeline
>>> fill_mask = pipeline("fill-mask", model="qarib/bert-base-qarib60_1790k")

>>> fill_mask("شو عندكم يا [MASK]")
[{'sequence': '[CLS] شو عندكم يا عرب [SEP]', 'score': 0.0990147516131401, 'token': 2355, 'token_str': 'عرب'},
{'sequence': '[CLS] شو عندكم يا جماعة [SEP]', 'score': 0.051633741706609726, 'token': 2308, 'token_str': 'جماعة'},
{'sequence': '[CLS] شو عندكم يا شباب [SEP]', 'score': 0.046871256083250046, 'token': 939, 'token_str': 'شباب'},
{'sequence': '[CLS] شو عندكم يا رفاق [SEP]', 'score': 0.03598872944712639, 'token': 7664, 'token_str': 'رفاق'},
{'sequence': '[CLS] شو عندكم يا ناس [SEP]', 'score': 0.031996358186006546, 'token': 271, 'token_str': 'ناس'}]

>>> fill_mask("قللي وشفيييك يرحم [MASK]")
[{'sequence': '[CLS] قللي وشفيييك يرحم والديك [SEP]', 'score': 0.4152909517288208, 'token': 9650, 'token_str': 'والديك'},
{'sequence': '[CLS] قللي وشفيييك يرحملي [SEP]', 'score': 0.07663793861865997, 'token': 294, 'token_str': '##لي'},
{'sequence': '[CLS] قللي وشفيييك يرحم حالك [SEP]', 'score': 0.0453166700899601, 'token': 2663, 'token_str': 'حالك'},
{'sequence': '[CLS] قللي وشفيييك يرحم امك [SEP]', 'score': 0.04390475153923035, 'token': 1942, 'token_str': 'امك'},
{'sequence': '[CLS] قللي وشفيييك يرحمونك [SEP]', 'score': 0.027349254116415977, 'token': 3283, 'token_str': '##ونك'}]

>>> fill_mask("وقام المدير [MASK]")
[{'sequence': '[CLS] وقام المدير بالعمل [SEP]', 'score': 0.0678194984793663, 'token': 4230, 'token_str': 'بالعمل'},
{'sequence': '[CLS] وقام المدير بذلك [SEP]', 'score': 0.05191086605191231, 'token': 984, 'token_str': 'بذلك'},
{'sequence': '[CLS] وقام المدير بالاتصال [SEP]', 'score': 0.045264165848493576, 'token': 26096, 'token_str': 'بالاتصال'},
{'sequence': '[CLS] وقام المدير بعمله [SEP]', 'score': 0.03732728958129883, 'token': 40486, 'token_str': 'بعمله'},
{'sequence': '[CLS] وقام المدير بالامر [SEP]', 'score': 0.0246378555893898, 'token': 29124, 'token_str': 'بالامر'}]

>>> fill_mask("وقامت المديرة [MASK]")
[{'sequence': '[CLS] وقامت المديرة بذلك [SEP]', 'score': 0.23992691934108734, 'token': 984, 'token_str': 'بذلك'},
{'sequence': '[CLS] وقامت المديرة بالامر [SEP]', 'score': 0.108805812895298, 'token': 29124, 'token_str': 'بالامر'},
{'sequence': '[CLS] وقامت المديرة بالعمل [SEP]', 'score': 0.06639821827411652, 'token': 4230, 'token_str': 'بالعمل'},
{'sequence': '[CLS] وقامت المديرة بالاتصال [SEP]', 'score': 0.05613093823194504, 'token': 26096, 'token_str': 'بالاتصال'},
{'sequence': '[CLS] وقامت المديرة المديرة [SEP]', 'score': 0.021778125315904617, 'token': 41635, 'token_str': 'المديرة'}]
```
## Training procedure

The model was trained using Google's original TensorFlow code on an eight-core Google Cloud TPU v2.
We used a Google Cloud Storage bucket for persistent storage of training data and models.

## Eval results

We evaluated QARiB models on five NLP downstream tasks:
- Sentiment Analysis
- Emotion Detection
- Named-Entity Recognition (NER)
- Offensive Language Detection
- Dialect Identification

On these tasks, the QARiB models outperform multilingual BERT, AraBERT, and ArabicBERT.
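
As an illustration of that fine-tuning setup, here is a minimal sketch of adapting QARiB to a sentence classification task such as sentiment analysis; the toy dataset, label set, and hyperparameters below are placeholders, not the configuration behind the reported results.

```python
from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

model_name = "qarib/bert-base-qarib60_1790k"  # assumed Hub id for this checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Toy placeholder data: replace with a real labelled Arabic sentiment dataset.
train_data = Dataset.from_dict({
    "text": ["الفيلم رائع جدا", "الخدمة سيئة للغاية"],
    "label": [1, 0],
})

def tokenize(batch):
    # Pad/truncate to a fixed length so the default collator can batch examples.
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=64)

train_data = train_data.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="qarib-sentiment", num_train_epochs=3),
    train_dataset=train_data,
)
trainer.train()
```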

## Model Weights and Vocab Download
TBD

## Contacts

Ahmed Abdelali, Sabit Hassan, Hamdy Mubarak, Kareem Darwish and Younes Samih