---
license: apache-2.0
datasets:
- mrqa
language:
- en
metrics:
- exact_match
- f1

model-index:
- name: VMware/TinyRoBERTa-MRQA
  results:
  - task:
      type: Question-Answering
    dataset:
      type: mrqa-2019
      name: mrqa
    metrics:
      - type: exact_match
        value: 69.38
        name: Eval EM
      - type: f1
        value: 80.07
        name: Eval F1
      - type: exact_match
        value: 53.29
        name: Test EM
      - type: f1
        value: 64.16
        name: Test F1
---

# TinyRoBERTa-MRQA

This is the *distilled* version of the [VMware/roberta-large-mrqa](https://huggingface.co/VMware/roberta-large-mrqa) model. Its prediction quality is comparable to that of RoBERTa-Base fine-tuned on MRQA, and it runs about twice as fast at evaluation time (see the Model Family Performance table).

## Overview
- **Model name:** tinyroberta-mrqa 
- **Model type:** Extractive Question Answering
- **Teacher Model:** [VMware/roberta-large-mrqa](https://huggingface.co/VMware/roberta-large-mrqa)
- **Training dataset:** [MRQA](https://huggingface.co/datasets/mrqa) (Machine Reading for Question Answering)
- **Training data size:** 516,819 examples
- **Language:** English
- **Framework:** PyTorch
- **Model version:** 1.0 

## Hyperparameters

### Distillation Hyperparameters
```
batch_size = 96
n_epochs = 4
base_LM_model = "deepset/tinyroberta-squad2-step1"
max_seq_len = 384
learning_rate = 3e-5
lr_schedule = LinearWarmup
warmup_proportion = 0.2
doc_stride = 128
max_query_length = 64
distillation_loss_weight = 0.75
temperature = 1.5
teacher = "VMware/roberta-large-mrqa"
```
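
To illustrate how `max_seq_len`, `max_query_length`, and `doc_stride` interact, here is a minimal sketch of splitting a long context into overlapping windows. It assumes that `doc_stride` is the distance between the starts of consecutive windows and that four special tokens are reserved per sequence; both are assumptions, and the function name `passage_windows` is hypothetical rather than part of the training code.

```python
def passage_windows(n_context_tokens, passage_len, doc_stride=128):
    """Return (start, end) token offsets of overlapping context windows.

    Assumption: doc_stride is the step between consecutive window starts.
    """
    if passage_len <= 0 or doc_stride <= 0:
        raise ValueError("passage_len and doc_stride must be positive")
    windows = []
    start = 0
    while True:
        end = min(start + passage_len, n_context_tokens)
        windows.append((start, end))
        if end >= n_context_tokens:
            break
        start += doc_stride
    return windows


# Hypothetical budget: 384 total tokens minus up to 64 question tokens and
# 4 special tokens leaves 316 context tokens per window.
passage_len = 384 - 64 - 4
print(passage_windows(700, passage_len))
# [(0, 316), (128, 444), (256, 572), (384, 700)]
```

Under these assumptions, consecutive windows overlap heavily, so an answer span that straddles one window boundary is likely to appear whole in a neighbouring window.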
### Fine-tuning Hyperparameters

After distillation, the model was fine-tuned on the MRQA training set.
```
learning_rate = 1e-5
num_train_epochs = 3
weight_decay = 0.01
per_device_train_batch_size = 16
n_gpus = 1
```

## Distillation
This model is inspired by [deepset/tinyroberta-squad2](https://huggingface.co/deepset/tinyroberta-squad2) and the TinyBERT paper.
We start with a base checkpoint of [deepset/roberta-base-squad2](https://huggingface.co/deepset/roberta-base-squad2) and perform further task prediction layer distillation, using [VMware/roberta-large-mrqa](https://huggingface.co/VMware/roberta-large-mrqa) as the teacher.
We then fine-tune the resulting model on MRQA.
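
As a rough illustration of prediction layer distillation, the sketch below combines a temperature-softened KL term between teacher and student logits with the usual cross-entropy on the gold label, using `temperature = 1.5` and `distillation_loss_weight = 0.75` from the hyperparameters above. The exact loss used to train this model is not stated here, so the formula (including the `temperature ** 2` scaling) is an assumption, and all function names are hypothetical. For extractive QA, such a loss would typically be applied to the start logits and the end logits separately.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def distillation_loss(student_logits, teacher_logits, gold_index,
                      temperature=1.5, distillation_loss_weight=0.75):
    # Soft target term: match the teacher's softened distribution.
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    soft = kl_divergence(p_teacher, p_student) * temperature ** 2
    # Hard target term: cross-entropy against the gold position.
    hard = -math.log(softmax(student_logits)[gold_index])
    return distillation_loss_weight * soft + (1 - distillation_loss_weight) * hard


# Toy start-position logits over a five-token context.
teacher = [0.2, 4.0, 0.5, 0.1, 0.3]
student = [0.5, 2.0, 1.5, 0.2, 0.1]
print(distillation_loss(student, teacher, gold_index=1))
```

When the student reproduces the teacher's logits exactly, the soft term vanishes and only the down-weighted cross-entropy remains.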

## Usage

### In Transformers
```python
from transformers import AutoModelForQuestionAnswering, AutoTokenizer, pipeline

model_name = "VMware/tinyroberta-mrqa"

# a) Get predictions
nlp = pipeline('question-answering', model=model_name, tokenizer=model_name)
QA_input = {
    "context": "We present the results of the Machine Reading for Question Answering (MRQA) 2019 shared task on evaluating the generalization capabilities of reading comprehension systems. In this task, we adapted and unified 18 distinct question answering datasets into the same format. Among them, six datasets were made available for training, six datasets were made available for development, and the final six were hidden for final evaluation. Ten teams submitted systems, which explored various ideas including data sampling, multi-task learning, adversarial training and ensembling. The best system achieved an average F1 score of 72.5 on the 12 held-out datasets, 10.7 absolute points higher than our initial baseline based on BERT.",
    "question": "What is MRQA?",
}
# res is a dict with 'score', 'start', 'end', and 'answer' keys
res = nlp(QA_input)

# b) Load model & tokenizer
model = AutoModelForQuestionAnswering.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
```
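
When the model is loaded directly as in (b), its output is a pair of start and end logits over the input tokens, and the answer span still has to be decoded from them. The following is a minimal, framework-free sketch of that decoding step; the function name `best_span`, the `max_answer_len` value, and the toy logits are all hypothetical, and the `pipeline` in (a) applies its own, more complete post-processing.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Pick the (start, end) token pair with the highest summed logit.

    Only spans with start <= end and at most max_answer_len tokens count.
    """
    best = (0, 0)
    best_score = float("-inf")
    for s, s_logit in enumerate(start_logits):
        for e in range(s, min(s + max_answer_len, len(end_logits))):
            score = s_logit + end_logits[e]
            if score > best_score:
                best_score = score
                best = (s, e)
    return best[0], best[1], best_score


# Toy example: made-up logits over five context tokens.
tokens = ["MRQA", "is", "a", "shared", "task"]
start_logits = [0.1, 0.2, 2.5, 0.3, 0.1]
end_logits = [0.0, 0.1, 0.2, 0.4, 3.0]

start, end, score = best_span(start_logits, end_logits)
print(" ".join(tokens[start:end + 1]), score)  # a shared task 5.5
```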

## Model Family Performance

| Parent Language Model | Number of Parameters | Training Time | Eval Time | Test Time | Eval EM | Eval F1 | Test EM | Test F1 |
|---|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
| BERT-Tiny | 4,369,666 | 26:11 | 0:41 | 0:04 | 22.78 | 32.42 | 10.18 | 18.72 |
| BERT-Base | 108,893,186 | 8:39:10 | 18:42 | 2:13 | 64.48 | 76.14 | 48.89 | 59.89 |
| BERT-Large | 334,094,338 | 28:35:38 | 1:00:56 | 7:14 | 69.52 | 80.50 | 55.00 | 65.78 |
| DeBERTa-v3-Extra-Small | 70,682,882 | 5:19:05 | 11:29 | 1:16 | 65.58 | 77.17 | 50.92 | 62.58 |
| DeBERTa-v3-Base | 183,833,090 | 12:13:41 | 28:18 | 3:09 | 71.43 | 82.59 | 59.49 | 70.46 |
| DeBERTa-v3-Large | 434,014,210 | 38:36:13 | 1:25:47 | 9:33 | **76.08** | **86.23** | **64.27** | **75.22** |
| ELECTRA-Small | 13,483,522 | 2:16:36 | 3:55 | 0:27 | 57.63 | 69.38 | 38.68 | 51.56 |
| ELECTRA-Base | 108,893,186 | 8:40:57 | 18:41 | 2:12 | 68.78 | 80.16 | 54.70 | 65.80 |
| ELECTRA-Large-Discriminator | 334,094,338 | 28:31:59 | 1:00:40 | 7:13 | 74.15 | 84.96 | 62.35 | 73.28 |
| MiniLMv2-L6-H384-from-BERT-Large | 22,566,146 | 2:12:48 | 4:23 | 0:40 | 59.31 | 71.09 | 41.78 | 53.30 |
| MiniLMv2-L6-H768-from-BERT-Large | 66,365,954 | 4:42:59 | 10:01 | 1:10 | 64.27 | 75.84 | 49.05 | 59.82 |
| MiniLMv2-L6-H384-from-RoBERTa-Large | 30,147,842 | 2:15:10 | 4:19 | 0:30 | 59.27 | 70.64 | 42.95 | 54.03 |
| MiniLMv2-L12-H384-from-RoBERTa-Large | 40,794,626 | 4:14:22 | 8:27 | 0:58 | 64.58 | 76.23 | 51.28 | 62.83 |
| MiniLMv2-L6-H768-from-RoBERTa-Large | 81,529,346 | 4:39:02 | 9:34 | 1:06 | 65.80 | 77.17 | 51.72 | 63.27 |
| RoBERTa-Base | 124,056,578 | 8:50:29 | 18:59 | 2:11 | 69.06 | 80.08 | 55.53 | 66.49 |
| RoBERTa-Large | 354,312,194 | 29:16:06 | 1:01:10 | 7:04 | 74.08 | 84.38 | 62.20 | 72.88 |
| TinyRoBERTa | 81,529,346 | 4:27:06\* | 9:54 | 1:04 | 69.38 | 80.07 | 53.29 | 64.16 |

\*: Training times aren't perfectly comparable because TinyRoBERTa was distilled from [VMware/roberta-large-mrqa](https://huggingface.co/VMware/roberta-large-mrqa), which had already been trained on MRQA.
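
For readers unfamiliar with the EM and F1 columns, the sketch below shows how exact match and token-level F1 are commonly computed for extractive QA, following the SQuAD-style normalization (lowercasing, stripping punctuation and English articles). Whether the scores in the table were produced with exactly this procedure is an assumption; the reported values are percentages averaged over the evaluation set.

```python
import re
import string
from collections import Counter


def normalize_answer(text):
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    punctuation = set(string.punctuation)
    text = "".join(ch for ch in text if ch not in punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, truth):
    return float(normalize_answer(prediction) == normalize_answer(truth))


def f1_score(prediction, truth):
    pred_tokens = normalize_answer(prediction).split()
    truth_tokens = normalize_answer(truth).split()
    if not pred_tokens or not truth_tokens:
        # If either side is empty, F1 is 1 only when both are empty.
        return float(pred_tokens == truth_tokens)
    common = Counter(pred_tokens) & Counter(truth_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("the MRQA 2019 shared task", "MRQA 2019 Shared Task"))  # 1.0
print(f1_score("shared task on generalization", "MRQA 2019 shared task"))  # 0.5
```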

## Limitations and Bias

The model was trained on a large and diverse dataset, but it may still have limitations and biases in certain areas. Some limitations include:

- Language: The model is designed to work with English text only and may not perform as well on other languages.

- Domain-specific knowledge: The model has been trained on a general dataset and may not perform well on questions that require domain-specific knowledge.

- Out-of-distribution questions: The model may struggle with questions that are outside the scope of the MRQA dataset. The gap between its eval and test scores (80.07 vs. 64.16 F1) illustrates this.

In addition, the model may reflect biases in the data it was trained on. The dataset includes questions from a variety of sources, but it may not be representative of all populations or perspectives. As a result, the model may perform better or worse for certain types of questions or on certain types of texts.


