KadriMufti committed on
Commit
965c7e3
1 Parent(s): 477a4c1

Upload 9 files

README.md ADDED
@@ -0,0 +1,41 @@
## Purpose

This model is a query classifier for the Arabic language. It returns 0 for a keyword query and 1 for a fully formed question.

It was built in three steps:

1. Take the Kaggle training data that Sharukh used, keeping only the 'dev.csv' file, which is more than sufficient, and split it into new train, validation, and test sets. Translate it into Arabic using the Seq2Seq translation model "facebook/m2m100_1.2B". The priority was syntactically correct translations, not necessarily semantically correct ones: the words of each keyword query were translated individually and recombined into one string, while questions were translated as-is. The question translations were sometimes a mix of Arabic and English (due, I think, to the details of the m2m model's vocabulary size and tokenizer). About 28% of the training data had question marks written explicitly.

2. Use the model [ARBERT](https://huggingface.co/UBC-NLP/ARBERT) as the base, and fine-tune it on the above data.

3. Distill the fine-tuned model into a smaller one. I was not very successful in reducing the size significantly, although I did reduce the number of hidden layers from 12 to 4.
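
The word-by-word versus whole-sentence strategy from step 1 can be sketched as below. This is a minimal illustration only: `build_arabic_example` is a hypothetical helper, and the toy lookup table stands in for a real call to the m2m100 translation model.

```python
def build_arabic_example(text: str, is_question: bool, translate) -> str:
    """Translate one training example into Arabic.

    `translate` is any callable mapping an English string to Arabic.
    Keyword queries are translated word by word and recombined into one
    string (favoring syntactically plausible output); questions are
    translated whole.
    """
    if is_question:
        return translate(text)
    # Translate each word independently, then rejoin into a single string.
    return " ".join(translate(word) for word in text.split())


# Toy lookup table standing in for the real translation model:
toy = {
    "weather": "طقس",
    "today": "اليوم",
    "What is the weather today?": "ما هو الطقس اليوم؟",
}
fake_translate = lambda s: toy.get(s, s)

keyword_query = build_arabic_example("weather today", is_question=False, translate=fake_translate)
question = build_arabic_example("What is the weather today?", is_question=True, translate=fake_translate)
```

In the real pipeline, `translate` would wrap a generate call on "facebook/m2m100_1.2B" with the target language set to Arabic.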
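
The distillation in step 3 follows the standard setup described in the sources credited below: the student is trained on a blend of hard-label cross-entropy and a temperature-scaled divergence against the teacher's logits. Here is a minimal NumPy sketch of that loss; the temperature and alpha values are illustrative, not the ones actually used for this model.

```python
import numpy as np

def softmax(z, T=1.0):
    # Numerically stable temperature-scaled softmax.
    z = np.asarray(z, dtype=float) / T
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend hard-label cross-entropy with soft-label KL divergence.

    T softens both distributions; alpha weights the hard vs. soft terms.
    """
    p_student = softmax(student_logits, T)
    p_teacher = softmax(teacher_logits, T)
    # KL(teacher || student), scaled by T^2 as is conventional.
    kl = np.sum(p_teacher * np.log(p_teacher / p_student), axis=-1).mean() * T**2
    # Standard cross-entropy on the true 0/1 labels.
    p_hard = softmax(student_logits)
    ce = -np.log(p_hard[np.arange(len(labels)), labels]).mean()
    return alpha * ce + (1 - alpha) * kl

student = np.array([[2.0, -1.0], [0.5, 1.5]])   # illustrative logits
teacher = np.array([[3.0, -2.0], [0.2, 2.2]])
labels = np.array([0, 1])
loss = distillation_loss(student, teacher, labels)
```

The KL term goes to zero as the student's logits approach the teacher's, so training pulls the 4-layer student toward the 12-layer teacher's output distribution.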
Results of testing the distilled model:

    {'accuracy': 0.9812329107631121,
     'precision': 0.9833664349553128,
     'recall': 0.9792336217552534,
     'roc_auc': 0.98124390410432,
     'f1': 0.9812956769478509,
     'matthews': 0.9624741598127332,
     'mse': 0.018767089236887895,
     'brier': 0.018767089236887895}
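
Note that 'mse' and 'brier' are identical above. That is expected rather than a bug: for a binary task with 0/1 labels, the Brier score is by definition the mean squared error between the predicted positive-class probability and the label, so the two metrics always coincide. A quick check with illustrative (not real) values:

```python
# Illustrative 0/1 labels and predicted P(question) values, not real data.
labels = [1, 0, 1, 1, 0]
probs = [0.9, 0.2, 0.6, 0.99, 0.4]

# MSE treats the 0/1 label as the regression target for the predicted
# probability; the binary Brier score is defined by exactly the same
# formula, so the two metrics necessarily agree.
mse = sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(labels)
brier = sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(labels)
```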


## Thanks

This model was inspired by this GitHub [thread](https://github.com/deepset-ai/haystack/issues/611), in which building a query classifier model is discussed, and by [Sharukh Khan's](https://github.com/shahrukhx01) resulting English model based on DistilBERT.

Regarding the model distillation, I owe thanks to the following sources:

[Knowledge Distillation article by Phil Schmid](https://www.philschmid.de/knowledge-distillation-bert-transformers)

Articles by Remi Reboul:

https://towardsdatascience.com/distillation-of-bert-like-models-the-theory-32e19a02641f

https://towardsdatascience.com/distillation-of-bert-like-models-the-code-73c31e8c2b0a
fingerprint.pb ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:06d7d1dc9501deb3e07d530bd9df67f51cf4f44836a78d6d72f6f7a1e7801936
size 54
keras_metadata.pb ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:bd017f0815a10788f3b9632699b0183138f3be61744f1a9870dc40af5774be58
size 65206
saved_model.pb ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:c1e0f94554c9dfd0338dfd5227642814901c7c1dea9ee162a6fc9302942d0f55
size 2943124
special_tokens_map.json ADDED
@@ -0,0 +1,7 @@
{
  "cls_token": "[CLS]",
  "mask_token": "[MASK]",
  "pad_token": "[PAD]",
  "sep_token": "[SEP]",
  "unk_token": "[UNK]"
}
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,16 @@
{
  "cls_token": "[CLS]",
  "do_basic_tokenize": true,
  "do_lower_case": true,
  "mask_token": "[MASK]",
  "model_max_length": 1000000000000000019884624838656,
  "name_or_path": "UBC-NLP/ARBERT",
  "never_split": null,
  "pad_token": "[PAD]",
  "sep_token": "[SEP]",
  "special_tokens_map_file": null,
  "strip_accents": null,
  "tokenize_chinese_chars": true,
  "tokenizer_class": "BertTokenizer",
  "unk_token": "[UNK]"
}
training_args.bin ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:67d320e32bfafbad2636883057fb568bda6bc2a821ba0370eee48c678655bff7
size 3707
vocab.txt ADDED
The diff for this file is too large to render. See raw diff