rbattle committed on
Commit
44493f4
1 Parent(s): fe80d3e

Update README.md

Files changed (1)
  1. README.md +93 -0
README.md CHANGED
@@ -1,3 +1,96 @@
  ---
  license: apache-2.0
+ datasets:
+ - mrqa
+ language:
+ - en
+ metrics:
+ - exact_match
+ - f1
  ---
+
+
+ This model release is part of a joint research project with Howard University.
+
+
+ # Model Details
+
+ - **Model name:** BERT-Base-MRQA
+ - **Model type:** Extractive Question Answering
+ - **Parent model:** [BERT-Base-uncased](https://huggingface.co/bert-base-uncased)
+ - **Training dataset:** [MRQA](https://huggingface.co/datasets/mrqa) (Machine Reading for Question Answering)
+ - **Training data size:** 516,819 examples
+ - **Training time:** 8:39:10 (h:mm:ss) on one NVIDIA V100 32 GB GPU
+ - **Language:** English
+ - **Framework:** PyTorch
+ - **Model version:** 1.0
+
+
+ # Intended Use
+
+ This model answers questions by extracting the answer span from a supplied context passage. It can be used as the question-answering component of search engines, chatbots, customer service systems, and other applications that require natural language understanding.
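Because the model is extractive, every answer it returns is a contiguous span of the context. The model card does not describe its post-processing, so the following is only a minimal sketch of the common approach: given a start score and an end score per context token, pick the span with the highest combined score. The function name `best_span`, the `max_answer_len` limit, and the toy scores are all illustrative assumptions, not values from this model.

```python
from typing import List, Tuple


def best_span(
    start_scores: List[float],
    end_scores: List[float],
    max_answer_len: int = 30,  # hypothetical cap on answer length, in tokens
) -> Tuple[int, int]:
    """Return the (start, end) token indices that maximise
    start_scores[start] + end_scores[end], with start <= end and
    a span of at most max_answer_len tokens."""
    best = (0, 0)
    best_score = float("-inf")
    for s, s_score in enumerate(start_scores):
        last = min(s + max_answer_len, len(end_scores))
        for e in range(s, last):
            score = s_score + end_scores[e]
            if score > best_score:
                best_score = score
                best = (s, e)
    return best


# Toy example with made-up scores (not real model output).
tokens = ["the", "eiffel", "tower", "is", "in", "paris", "."]
start_scores = [0.1, 0.2, 0.0, 0.1, 0.3, 4.0, 0.0]
end_scores = [0.0, 0.1, 0.3, 0.0, 0.1, 3.5, 0.2]

s, e = best_span(start_scores, end_scores)
answer = " ".join(tokens[s : e + 1])
print(answer)
```

A real pipeline would also map token indices back to character offsets in the original passage and mask out question tokens; both steps are omitted here for brevity.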
+
+
+ # Training Details
+
+ The model was trained for one epoch on the MRQA training set.
+
+
+ ## Training Hyperparameters
+
+ ```python
+ from transformers import TrainingArguments
+
+ # Fine-tuning configuration; options not listed keep their library defaults.
+ args = TrainingArguments(
+     output_dir="bert-base-mrqa",      # directory for checkpoints
+     save_strategy="epoch",            # save a checkpoint at the end of each epoch
+     learning_rate=1e-5,
+     num_train_epochs=1,
+     weight_decay=0.01,
+     per_device_train_batch_size=16,
+ )
+ ```
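As a rough sanity check, the figures reported in this card can be combined to estimate the number of optimizer steps and the training throughput. The sketch below assumes a single GPU, no gradient accumulation, and exactly one training feature per example; long contexts are often split into several features during tokenization, so the true step count may well be higher. The helper `hms_to_seconds` is a hypothetical name introduced for illustration.

```python
import math


def hms_to_seconds(hms: str) -> int:
    """Convert 'h:mm:ss' or 'm:ss' into a number of seconds."""
    seconds = 0
    for part in hms.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


num_examples = 516_819   # training data size from Model Details
batch_size = 16          # per_device_train_batch_size, one GPU assumed
epochs = 1               # num_train_epochs

# The last partial batch still counts as a step, hence the ceiling.
steps = math.ceil(num_examples / batch_size) * epochs

train_seconds = hms_to_seconds("8:39:10")
examples_per_second = num_examples / train_seconds

print(steps, train_seconds, round(examples_per_second, 2))
```

Under these assumptions the run comes to about 32,302 steps at roughly 16.6 examples per second.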
+
+
+ # Evaluation Metrics
+
+ The model was evaluated using standard metrics for question-answering models:
+
+ - Exact match (EM): The percentage of questions for which the predicted answer exactly matches the ground-truth answer.
+
+ - F1 score: The harmonic mean of token-level precision and recall, which measures the overlap between the predicted answer and the ground-truth answer.
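To make the two metrics concrete, here is a minimal sketch of how they are commonly computed for a single prediction. It assumes the SQuAD-style normalization (lowercasing, stripping punctuation and English articles); the card does not state which normalization was actually used for the reported scores, so treat that part as an assumption. The function names are illustrative.

```python
import re
import string
from collections import Counter

_PUNCT = set(string.punctuation)


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in _PUNCT)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, truth: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(truth))


def f1_score(prediction: str, truth: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize(prediction).split()
    truth_tokens = normalize(truth).split()
    if not pred_tokens or not truth_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == truth_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(truth_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("the Eiffel Tower", "Eiffel Tower"))
print(f1_score("Eiffel Tower in Paris", "Eiffel Tower"))
```

The second call illustrates partial credit: two of the four predicted tokens overlap with the two reference tokens, giving precision 0.5, recall 1.0, and F1 of about 0.67, while EM for the same pair would be 0. Dataset-level scores are the average over all questions, usually taking the maximum over multiple reference answers where available.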
+
+
+ # Model Family Performance
+
+ | Parent Language Model | Number of Parameters | Training Time ([h:]mm:ss) | Eval Time ([h:]mm:ss) | Test Time ([h:]mm:ss) | Eval EM | Eval F1 | Test EM | Test F1 |
+ |---|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
+ | BERT-Tiny | 4,369,666 | 26:11 | 0:41 | 0:04 | 22.78 | 32.42 | 10.18 | 18.72 |
+ | BERT-Base | 108,893,186 | 8:39:10 | 18:42 | 2:13 | 64.48 | 76.14 | 48.89 | 59.89 |
+ | BERT-Large | 334,094,338 | 28:35:38 | 1:00:56 | 7:14 | 69.52 | 80.50 | 55.00 | 65.78 |
+ | DeBERTa-v3-Extra-Small | 70,682,882 | 5:19:05 | 11:29 | 1:16 | 65.58 | 77.17 | 50.92 | 62.58 |
+ | DeBERTa-v3-Base | 183,833,090 | 12:13:41 | 28:18 | 3:09 | 71.43 | 82.59 | 59.49 | 70.46 |
+ | DeBERTa-v3-Large | 434,014,210 | 38:36:13 | 1:25:47 | 9:33 | **76.08** | **86.23** | **64.27** | **75.22** |
+ | ELECTRA-Small | 13,483,522 | 2:16:36 | 3:55 | 0:27 | 57.63 | 69.38 | 38.68 | 51.56 |
+ | ELECTRA-Base | 108,893,186 | 8:40:57 | 18:41 | 2:12 | 68.78 | 80.16 | 54.70 | 65.80 |
+ | ELECTRA-Large-Discriminator | 334,094,338 | 28:31:59 | 1:00:40 | 7:13 | 74.15 | 84.96 | 62.35 | 73.28 |
+ | MiniLMv2-L6-H384-from-BERT-Large | 22,566,146 | 2:12:48 | 4:23 | 0:40 | 59.31 | 71.09 | 41.78 | 53.30 |
+ | MiniLMv2-L6-H768-from-BERT-Large | 66,365,954 | 4:42:59 | 10:01 | 1:10 | 64.27 | 75.84 | 49.05 | 59.82 |
+ | MiniLMv2-L6-H384-from-RoBERTa-Large | 30,147,842 | 2:15:10 | 4:19 | 0:30 | 59.27 | 70.64 | 42.95 | 54.03 |
+ | MiniLMv2-L12-H384-from-RoBERTa-Large | 40,794,626 | 4:14:22 | 8:27 | 0:58 | 64.58 | 76.23 | 51.28 | 62.83 |
+ | MiniLMv2-L6-H768-from-RoBERTa-Large | 81,529,346 | 4:39:02 | 9:34 | 1:06 | 65.80 | 77.17 | 51.72 | 63.27 |
+ | RoBERTa-Base | 124,056,578 | 8:50:29 | 18:59 | 2:11 | 69.06 | 80.08 | 55.53 | 66.49 |
+ | RoBERTa-Large | 354,312,194 | 29:16:06 | 1:01:10 | 7:04 | 74.08 | 84.38 | 62.20 | 72.88 |
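The drop from the eval columns to the test columns can be read directly off the table. As a small worked example, the sketch below computes that gap for three rows; the numbers are copied from the table, while the dictionary layout and the function name `eval_test_gap` are illustrative choices.

```python
# (Eval EM, Eval F1, Test EM, Test F1), copied from the table above.
scores = {
    "BERT-Tiny": (22.78, 32.42, 10.18, 18.72),
    "BERT-Base": (64.48, 76.14, 48.89, 59.89),
    "DeBERTa-v3-Large": (76.08, 86.23, 64.27, 75.22),
}


def eval_test_gap(row):
    """Return (EM gap, F1 gap) in points, eval minus test."""
    eval_em, eval_f1, test_em, test_f1 = row
    return round(eval_em - test_em, 2), round(eval_f1 - test_f1, 2)


gaps = {name: eval_test_gap(row) for name, row in scores.items()}
for name, (em_gap, f1_gap) in gaps.items():
    print(f"{name}: EM gap {em_gap:.2f}, F1 gap {f1_gap:.2f}")
```

For BERT-Base this gives a gap of 15.59 EM points and 16.25 F1 points; for DeBERTa-v3-Large it is 11.81 and 11.01 points.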
+
+
+ # Limitations and Bias
+
+ The model was trained on a large and diverse dataset, but it still has limitations and may be biased in certain areas. Known limitations include:
+
+ - Language: The model is designed to work with English text only and may not perform as well on other languages.
+
+ - Domain-specific knowledge: The model was trained on a general-purpose dataset and may not perform well on questions that require domain-specific knowledge.
+
+ - Out-of-distribution questions: The model may struggle with questions that are outside the scope of the MRQA training data. This is best demonstrated by the gap between its eval and test scores: F1 drops from 76.14 on the eval set to 59.89 on the test set.
+
+ In addition, the model may reflect biases in the data it was trained on. The dataset includes questions from a variety of sources, but it may not be representative of all populations or perspectives. As a result, the model may perform better or worse on certain types of questions or texts.
+
+