zohaib99k committed on
Commit
4c2f4d6
1 Parent(s): 1839f80

Upload 7 files

Trained on Arabic SQuADv2

README.md ADDED
@@ -0,0 +1,94 @@
---
datasets:
- ZeyadAhmed/Arabic-SQuADv2.0
language:
- ar
metrics:
- name: exact_match
  type: exact_match
  value: 65.12
- name: F1
  type: f1
  value: 71.49
---

# AraElectra for Question Answering on Arabic-SQuADv2

This is the [AraElectra](https://huggingface.co/aubmindlab/araelectra-base-discriminator) model, fine-tuned on the [Arabic-SQuADv2.0](https://huggingface.co/datasets/ZeyadAhmed/Arabic-SQuADv2.0) dataset for extractive Question Answering. It was trained on question-answer pairs, including unanswerable questions, and is used together with the [AraElectra Classifier](https://huggingface.co/ZeyadAhmed/AraElectra-Arabic-SQuADv2-CLS), which predicts whether a question is unanswerable.

## Overview
**Language model:** AraElectra <br>
**Language:** Arabic <br>
**Downstream task:** Extractive QA <br>
**Training data:** Arabic-SQuADv2.0 <br>
**Eval data:** Arabic-SQuADv2.0 <br>
**Test data:** Arabic-SQuADv2.0 <br>
**Code:** [See More Info on GitHub](https://github.com/zeyadahmed10/Arabic-MRC) <br>
**Infrastructure:** 1x Tesla K80

## Hyperparameters

```
batch_size = 8
n_epochs = 4
base_LM_model = "AraElectra"
learning_rate = 3e-5
optimizer = AdamW
padding = dynamic
```
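
A minimal sketch of how these hyperparameters could map onto the `transformers` Trainer API. This is an illustration, not the repo's training code (which lives in the GitHub link above); `qa_model`, `tokenizer`, and `train_dataset` are placeholders you would load and preprocess first:

```python
from transformers import Trainer, TrainingArguments, DataCollatorWithPadding

args = TrainingArguments(
    output_dir="araelectra-arabic-squadv2-qa",
    per_device_train_batch_size=8,  # batch_size = 8
    num_train_epochs=4,             # n_epochs = 4
    learning_rate=3e-5,             # learning_rate = 3e-5
)
trainer = Trainer(
    model=qa_model,                                    # placeholder: the AraElectra QA model
    args=args,
    train_dataset=train_dataset,                       # placeholder: tokenized Arabic-SQuADv2.0
    data_collator=DataCollatorWithPadding(tokenizer),  # padding = dynamic
)
# AdamW is the Trainer's default optimizer, matching `optimizer = AdamW` above.
```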

## Online Demo on Arabic Wikipedia and User-Provided Contexts
See the model in action, hosted on Streamlit: [![Open in Streamlit](https://static.streamlit.io/badges/streamlit_badge_black_white.svg)](https://share.streamlit.io/wissamantoun/arabic-wikipedia-qa-streamlit/main)

## Usage
For best results, use the AraBERT [preprocessor](https://github.com/aub-mind/arabert/blob/master/preprocess.py) by aub-mind:
```python
from transformers import ElectraForQuestionAnswering, ElectraForSequenceClassification, AutoTokenizer, pipeline
from preprocess import ArabertPreprocessor

prep_object = ArabertPreprocessor("araelectra-base-discriminator")
question = prep_object('ما هي جامعة الدول العربية ؟')  # "What is the Arab League?"
context = prep_object('''
جامعة الدول العربية هي منظمة إقليمية تضم دولاً عربية في آسيا وأفريقيا.
ينص ميثاقها على التنسيق بين الدول الأعضاء في الشؤون الاقتصادية، ومن ضمنها العلاقات التجارية الاتصالات، العلاقات الثقافية، الجنسيات ووثائق وأذونات السفر والعلاقات الاجتماعية والصحة. المقر الدائم لجامعة الدول العربية يقع في القاهرة، عاصمة مصر (تونس من 1979 إلى 1990).
''')  # A short Arabic Wikipedia passage about the Arab League.

# a) Get predictions
qa_modelname = 'ZeyadAhmed/AraElectra-Arabic-SQuADv2-QA'
cls_modelname = 'ZeyadAhmed/AraElectra-Arabic-SQuADv2-CLS'
qa_pipe = pipeline('question-answering', model=qa_modelname, tokenizer=qa_modelname)
cls_pipe = pipeline('text-classification', model=cls_modelname, tokenizer=cls_modelname)
QA_input = {
    'question': question,
    'context': context
}
CLS_input = {
    'text': question,
    'text_pair': context
}
qa_res = qa_pipe(QA_input)
cls_res = cls_pipe(CLS_input)
threshold = 0.5  # hyperparameter, can be tweaked
# The classifier outputs a probability for label 0 (the question can be
# answered) and label 1 (it cannot). If the label-1 probability exceeds the
# threshold, treat the answer as an empty string; otherwise use qa_res.

# b) Load model & tokenizer directly
qa_model = ElectraForQuestionAnswering.from_pretrained(qa_modelname)
cls_model = ElectraForSequenceClassification.from_pretrained(cls_modelname)
tokenizer = AutoTokenizer.from_pretrained(qa_modelname)
```
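
The two pipeline outputs can be combined as described in the comments above. A minimal sketch, assuming the classifier's top prediction comes back as `'LABEL_1'` when the question is judged unanswerable (per the label convention noted above):

```python
def answer(question, context, threshold=0.5):
    """Return the extracted answer, or '' when the classifier deems the
    question unanswerable with probability above `threshold`."""
    qa_res = qa_pipe({'question': question, 'context': context})
    cls_res = cls_pipe({'text': question, 'text_pair': context})[0]
    if cls_res['label'] == 'LABEL_1' and cls_res['score'] > threshold:
        return ''
    return qa_res['answer']

print(answer(question, context))
```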

## Performance
Evaluated on the Arabic-SQuADv2.0 test set with the [official eval script](https://worksheets.codalab.org/rest/bundles/0x6b567e1cf2e041ec80d7098f031c5c9e/contents/blob/), with small preprocessing changes to fit Arabic; see [the modified eval script](https://github.com/zeyadahmed10/Arabic-MRC/blob/main/evaluatev2.py).

```
"exact": 65.11555277951281,
"f1": 71.49042547237256,
"total": 9606,
"HasAns_exact": 56.14535768645358,
"HasAns_f1": 67.79623803036668,
"HasAns_total": 5256,
"NoAns_exact": 75.95402298850574,
"NoAns_f1": 75.95402298850574,
"NoAns_total": 4350
```
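
As a quick consistency check, the overall exact-match score above is the size-weighted average of the HasAns and NoAns splits:

```python
# Weighted average of the two splits reproduces the overall "exact" score.
has_exact, has_total = 56.14535768645358, 5256
no_exact, no_total = 75.95402298850574, 4350

overall_exact = (has_exact * has_total + no_exact * no_total) / (has_total + no_total)
print(round(overall_exact, 2))  # 65.12, matching "exact" above
```
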
config.json ADDED
@@ -0,0 +1,31 @@
{
  "_name_or_path": "/content/drive/MyDrive/AraElectra-ASQuADv2-QA",
  "architectures": [
    "ElectraForQuestionAnswering"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "embedding_size": 768,
  "generator_hidden_size": 0.33333,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "electra",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "summary_activation": "gelu",
  "summary_last_dropout": 0.1,
  "summary_type": "first",
  "summary_use_proj": true,
  "torch_dtype": "float32",
  "transformers_version": "4.20.1",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 64000
}
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:26c01a5524d2fcc36a627877a8ff9f3f02d5ad00b334a7f387b5def64dd47cff
size 538485425
special_tokens_map.json ADDED
@@ -0,0 +1,7 @@
{
  "cls_token": "[CLS]",
  "mask_token": "[MASK]",
  "pad_token": "[PAD]",
  "sep_token": "[SEP]",
  "unk_token": "[UNK]"
}
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,20 @@
{
  "cls_token": "[CLS]",
  "do_basic_tokenize": true,
  "do_lower_case": false,
  "mask_token": "[MASK]",
  "max_len": 512,
  "name_or_path": "/content/drive/MyDrive/AraElectra-ASQuADv2-QA",
  "never_split": [
    "[بريد]",
    "[مستخدم]",
    "[رابط]"
  ],
  "pad_token": "[PAD]",
  "sep_token": "[SEP]",
  "special_tokens_map_file": null,
  "strip_accents": null,
  "tokenize_chinese_chars": true,
  "tokenizer_class": "ElectraTokenizer",
  "unk_token": "[UNK]"
}
vocab.txt ADDED
The diff for this file is too large to render. See raw diff