---
license: apache-2.0
tags:
- question-answering
- squad
- transformers
- nlp
datasets:
- squad
language:
- en
metrics:
- exact_match
- f1
library_name: transformers
pipeline_tag: question-answering
model-index:
- name: roberta-base-qa-v1
  results:
  - task:
      type: question-answering
      name: question-answering
    dataset:
      name: squad (a subset, not official dataset)
      type: squad
    metrics:
    - type: f1
      value: 78.28
      name: f1
      verified: false
    - type: exact-match
      value: 66.00
      name: exact-match
      verified: false
---

# Model card for SaraPiscitelli/roberta-base-qa-v1
This model is a **fine-tuned** version of the base transformer model [roberta-base](https://huggingface.co/roberta-base).   
It was fine-tuned for the **extractive question answering** task on the [squad dataset](https://huggingface.co/datasets/squad).  
You can access the training code [here](https://github.com/sarapiscitelli/nlp-tasks/blob/main/scripts/train/question_answering.py) and the evaluation code [here](https://github.com/sarapiscitelli/nlp-tasks/blob/main/scripts/evaluation/question_answering.py).  

### Model Description

- **Developed by:**  Sara Piscitelli
- **Model type:** Transformer Encoder - RobertaForQuestionAnswering (124,056,578 parameters)
- **Language(s) (NLP):** English
- **License:** Apache 2.0
- **Finetuned from model:** [roberta-base](https://huggingface.co/roberta-base)
- **Maximum input tokens:** 512

### Model Sources 

- **training code:** [here](https://github.com/sarapiscitelli/nlp-tasks/blob/main/scripts/train/question_answering.py)
- **evaluation code:** [here](https://github.com/sarapiscitelli/nlp-tasks/blob/main/scripts/evaluation/question_answering.py).  

## Uses
The model can be used for extractive question answering, where both the context and the question are provided.  

### Recommendations
This is a basic standard model; some results may be inaccurate.  
Refer to the evaluation metrics for a better understanding of its performance.

## How to Get Started with the Model

You can use the Hugging Face pipeline:   
```
from transformers import pipeline

qa_model = pipeline("question-answering", model="SaraPiscitelli/roberta-base-qa-v1")

question = "Which name is also used to describe the Amazon rainforest in English?"
context = """The Amazon rainforest (Portuguese: Floresta Amazônica or Amazônia; Spanish: Selva Amazónica, Amazonía or usually Amazonia; French: Forêt amazonienne; Dutch: Amazoneregenwoud), also known in English as Amazonia or the Amazon Jungle, is a moist broadleaf forest that covers most of the Amazon basin of South America. This basin encompasses 7,000,000 square kilometres (2,700,000 sq mi), of which 5,500,000 square kilometres (2,100,000 sq mi) are covered by the rainforest. This region includes territory belonging to nine nations. The majority of the forest is contained within Brazil, with 60% of the rainforest, followed by Peru with 13%, Colombia with 10%, and with minor amounts in Venezuela, Ecuador, Bolivia, Guyana, Suriname and French Guiana. States or departments in four nations contain "Amazonas" in their names. The Amazon represents over half of the planet's remaining rainforests, and comprises the largest and most biodiverse tract of tropical rainforest in the world, with an estimated 390 billion individual trees divided into 16,000 species."""
print(qa_model(question = question, context = context)['answer'])
```
or load it directly:   
```
import torch

from typing import List, Optional
from transformers import AutoModelForQuestionAnswering, AutoTokenizer

class InferenceModel:

    def __init__(self, model_name_or_checkpoint_path: str,
                 tokenizer_name: Optional[str] = None,
                 device_type: Optional[str] = None) -> None:
        # Fall back to the tokenizer shipped with the model when none is given.
        if tokenizer_name is None:
            tokenizer_name = model_name_or_checkpoint_path
        # Pick the best available device: CUDA, then Apple MPS, then CPU.
        if device_type is None:
            device_type = "cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu"

        self.model = AutoModelForQuestionAnswering.from_pretrained(model_name_or_checkpoint_path).to(device_type)
        self.model.eval()
        self.tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)

    def inference(self, questions: List[str], contexts: List[str]) -> List[str]:
        # Tokenize each (question, context) pair and pad to the longest pair in the batch.
        inputs = self.tokenizer(questions, contexts,
                                padding="longest",
                                return_tensors="pt").to(self.model.device)
        with torch.no_grad():
            outputs = self.model(**inputs)
        # outputs.start_logits.shape == outputs.end_logits.shape == inputs['input_ids'].shape,
        # i.e. (batch_size, input_length)
        answer_start_index: List[int] = outputs.start_logits.argmax(dim=-1).tolist()
        answer_end_index: List[int] = outputs.end_logits.argmax(dim=-1).tolist()
        # Decode the tokens between the most likely start and end positions (inclusive).
        answers: List[str] = [self.tokenizer.decode(inputs.input_ids[i, answer_start_index[i] : answer_end_index[i] + 1])
                              for i in range(len(questions))]
        return answers


model = InferenceModel("SaraPiscitelli/roberta-base-qa-v1")
question = "Which name is also used to describe the Amazon rainforest in English?"
context = """The Amazon rainforest (Portuguese: Floresta Amazônica or Amazônia; Spanish: Selva Amazónica, Amazonía or usually Amazonia; French: Forêt amazonienne; Dutch: Amazoneregenwoud), also known in English as Amazonia or the Amazon Jungle, is a moist broadleaf forest that covers most of the Amazon basin of South America. This basin encompasses 7,000,000 square kilometres (2,700,000 sq mi), of which 5,500,000 square kilometres (2,100,000 sq mi) are covered by the rainforest. This region includes territory belonging to nine nations. The majority of the forest is contained within Brazil, with 60% of the rainforest, followed by Peru with 13%, Colombia with 10%, and with minor amounts in Venezuela, Ecuador, Bolivia, Guyana, Suriname and French Guiana. States or departments in four nations contain "Amazonas" in their names. The Amazon represents over half of the planet's remaining rainforests, and comprises the largest and most biodiverse tract of tropical rainforest in the world, with an estimated 390 billion individual trees divided into 16,000 species."""
print(model.inference(questions=[question], contexts=[context])[0])
```
In both cases, the answer will be printed out:  "Amazonia or the Amazon Jungle"
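
When the start and end positions are chosen by independent argmax, the end index can land before the start index, which yields an empty answer. The sketch below shows one common alternative: it scores candidate pairs and keeps the best one with `start <= end`. The function name `best_span` and the parameters `max_answer_len` and `top_k` are illustrative assumptions and are not part of this model's code.

```python
from typing import List, Tuple


def best_span(start_logits: List[float], end_logits: List[float],
              max_answer_len: int = 30, top_k: int = 20) -> Tuple[int, int]:
    """Return the (start, end) pair that maximises start_logit + end_logit
    subject to start <= end < start + max_answer_len.
    Falls back to (0, 0) if no candidate pair is valid."""
    # Keep only the top_k highest-scoring start and end positions.
    starts = sorted(range(len(start_logits)), key=lambda i: start_logits[i], reverse=True)[:top_k]
    ends = sorted(range(len(end_logits)), key=lambda i: end_logits[i], reverse=True)[:top_k]

    best, best_score = (0, 0), float("-inf")
    for s in starts:
        for e in ends:
            if s <= e < s + max_answer_len:
                score = start_logits[s] + end_logits[e]
                if score > best_score:
                    best, best_score = (s, e), score
    return best
```

It could be applied per example to the model's start and end logits after converting them to Python lists with `.tolist()`. The returned indices are token positions and can be decoded in the same way as in the example above.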

## Training Details

### Training Data
- [squad dataset](https://huggingface.co/datasets/squad).   
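To retrieve the dataset and select the subsets used (the first 30,000 training examples and the first 2,000 validation examples), use the following code: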
```
from datasets import load_dataset

squad = load_dataset("squad")
squad['train'] = squad['train'].select(range(30000))
squad['test'] = squad['validation']
squad['validation'] = squad['validation'].select(range(2000))
```

The dataset used after preprocessing is listed below:  

- Train Dataset({   
      features: ['id', 'title', 'context', 'question', 'answers'],   
      num_rows: 8207   
  })   

- Validation dataset({   
      features: ['id', 'title', 'context', 'question', 'answers'],   
      num_rows: 637   
  })   
   
#### Preprocessing

All samples with **more than 512 tokens have been removed**.  
This was necessary due to the maximum input token limit accepted by the RoBERTa-base model.
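
The filtering step can be sketched as follows. This is a minimal illustration and not the training script itself: the helper names `count_tokens` and `keep_short_examples` are assumptions, and the tokenizer is expected to behave like a Hugging Face tokenizer that returns `input_ids` for a (question, context) pair.

```python
from typing import Any, Dict, List


def count_tokens(tokenizer: Any, question: str, context: str) -> int:
    """Number of input ids for the encoded (question, context) pair."""
    return len(tokenizer(question, context)["input_ids"])


def keep_short_examples(examples: List[Dict[str, Any]], tokenizer: Any,
                        max_tokens: int = 512) -> List[Dict[str, Any]]:
    """Drop every example whose encoded length exceeds max_tokens."""
    return [ex for ex in examples
            if count_tokens(tokenizer, ex["question"], ex["context"]) <= max_tokens]
```

With a `datasets` dataset, the same check could be passed to `Dataset.filter`, for example `squad["train"].filter(lambda ex: count_tokens(tokenizer, ex["question"], ex["context"]) <= 512)`, where `tokenizer` is loaded with `AutoTokenizer.from_pretrained("roberta-base")`.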

#### Training Hyperparameters

- **Training regime:** fp32
- **base_model_name_or_path:** roberta-base
- **max_tokens_length:** 512
- **training_arguments:** TrainingArguments(
    output_dir=results_dir,
    num_train_epochs=5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=1,
    learning_rate=0.00001,
    lr_scheduler_type="linear",
    optim="adamw_torch",
    eval_accumulation_steps=1,
    evaluation_strategy="steps",
    eval_steps=0.2,
    save_strategy="steps",
    save_steps=0.2,
    logging_strategy="steps",
    logging_steps=1,
    report_to="tensorboard",
    do_train=True,
    do_eval=True,
    max_grad_norm=0.3,
    warmup_ratio=0.03,
    #group_by_length=True,
    dataloader_drop_last=False,
    fp16=False,
    bf16=False
)
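
The fractional values above (`eval_steps=0.2`, `save_steps=0.2`, `warmup_ratio=0.03`) are ratios of the total number of training steps. The sketch below estimates the resulting step counts, assuming a single device, the 8,207 training rows listed earlier, and `gradient_accumulation_steps=1`. The helper `schedule_steps` is hypothetical and only reproduces the arithmetic.

```python
import math
from typing import Dict


def schedule_steps(num_examples: int, batch_size: int, epochs: int,
                   eval_ratio: float, warmup_ratio: float) -> Dict[str, int]:
    # dataloader_drop_last=False keeps the final partial batch, hence the ceil.
    steps_per_epoch = math.ceil(num_examples / batch_size)
    total_steps = steps_per_epoch * epochs
    return {
        "total_steps": total_steps,
        # Ratios below 1 are converted to a step count, rounded up.
        "eval_every": math.ceil(total_steps * eval_ratio),
        "warmup_steps": math.ceil(total_steps * warmup_ratio),
    }


steps = schedule_steps(num_examples=8207, batch_size=8, epochs=5,
                       eval_ratio=0.2, warmup_ratio=0.03)
print(steps)
```

Under these assumptions, training runs for about 5,130 optimizer steps, evaluates and saves roughly once per epoch (every 1,026 steps), and warms up the learning rate over about 154 steps.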


### Testing Data & Evaluation Metrics

#### Testing Data
To retrieve the dataset, use the following code:
```
from datasets import load_dataset

squad = load_dataset("squad")
squad['test'] = squad['validation'] 
```  

Test Dataset({   
    features: ['id', 'title', 'context', 'question', 'answers'],   
    num_rows: 10570   
})

#### Metrics

The model was evaluated with the `squad_v2` metric from the `evaluate` library:  
```
import evaluate
metric_eval = evaluate.load("squad_v2")
```
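
For intuition about what the two reported scores measure, here is a simplified, self-contained sketch of SQuAD-style exact match and token-level F1 for a single prediction and a single reference. It follows the usual SQuAD answer normalization (lowercasing, stripping punctuation and English articles), but it is an illustration and not the `evaluate` implementation; the function names are assumptions.

```python
import re
import string
from collections import Counter


def normalize_answer(text: str) -> str:
    """Lowercase, strip punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> float:
    return float(normalize_answer(prediction) == normalize_answer(reference))


def f1_score(prediction: str, reference: str) -> float:
    pred_tokens = normalize_answer(prediction).split()
    ref_tokens = normalize_answer(reference).split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; only one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts the tokens shared by both answers.
    num_same = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

The reported numbers are averages over all evaluated questions, scaled to 0-100, taking the best score over the reference answers available for each question.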

### Results

{'exact-match': 66.00660066006601,  
'f1': 78.28040573606134,  
'total': 909,  
'HasAns_exact': 66.00660066006601,  
'HasAns_f1': 78.28040573606134,  
'HasAns_total': 909,   
'best_exact': 66.00660066006601,   
'best_exact_thresh': 0.0,  
'best_f1': 78.28040573606134,   
'best_f1_thresh': 0.0}