---
language: en
datasets:
- c4
- wikipedia
metrics:
- f1
---

# T5-V1.1-large-rss
This model is [T5-v1.1-large](https://huggingface.co/google/t5-v1_1-large) finetuned on the RSS dataset. The model was finetuned as part of 
["How Optimal is Greedy Decoding for Extractive Question Answering?"](https://arxiv.org/abs/2108.05857), while the RSS pretraining method was introduced in [this paper](https://arxiv.org/pdf/2101.00438.pdf).

## Model description
The original [T5-v1.1-large](https://huggingface.co/google/t5-v1_1-large) was only pre-trained on C4, excluding any supervised training. Our version is further trained with the Recurrent Span Selection (RSS) scheme, using a sample from the dataset used to pretrain [Splinter](https://huggingface.co/tau/splinter-large):
* contexts in which a span occurs more than once are detected
* a single instance of the recurring span is masked
* the model is trained (with teacher forcing) to predict the masked span

This training scheme naturally matches the extractive question answering task.

During training, the masked span is replaced with `<extra_id_0>` and the labels are formatted as `<extra_id_0>span<extra_id_1>`. Unlike [Splinter](https://huggingface.co/tau/splinter-large), only one span is masked at a time.
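
As a concrete illustration, the sketch below builds one such (input, label) pair from raw text. The recurring-span heuristic and the function name are simplifications for illustration only, not the authors' preprocessing code.

```python
from collections import Counter

def build_rss_example(context: str):
    """Toy sketch of building one RSS training pair: pick a capitalized token
    that occurs more than once, mask a single occurrence of it, and emit the
    label in the sentinel format described above. The actual RSS preprocessing
    (span detection, filtering, sampling) is more involved."""
    tokens = context.split()
    counts = Counter(tokens)
    recurring = [t for t in tokens if counts[t] > 1 and t.istitle()]
    if not recurring:
        return None
    span = recurring[0]
    source = context.replace(span, '<extra_id_0>', 1)   # mask one occurrence
    target = f'<extra_id_0>{span}<extra_id_1>'          # sentinel-wrapped label
    return source, target

context = ('Barack Obama served as the 44th president of the United States. '
           'Obama was inaugurated in January 2009.')
print(build_rss_example(context))
```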

## Intended uses & limitations
This model naturally fits tasks in which a span from the context should be copied verbatim, such as extractive question answering.
This checkpoint is primarily aimed at the zero-shot setting: further fine-tuning it on an annotated dataset yields results comparable to those of the original T5-v1.1-large.

### How to use
You can use this model directly, but it is recommended to format the input so that it is aligned with the training scheme, as a text-question prompt:
```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model = AutoModelForSeq2SeqLM.from_pretrained('tau/t5-v1_1-large-rss')
tokenizer = AutoTokenizer.from_pretrained('tau/t5-v1_1-large-rss')

passage = 'Barack Hussein Obama II is an American politician and attorney who served as the 44th president of the United States from 2009 to 2017. '
question = 'When was Obama inaugurated?'
# Format the input as in the RSS training scheme: the answer position is
# marked with the first sentinel token (<extra_id_0>).
text = f'Text: {passage}.\nQuestion: {question}\nAnswer:{tokenizer.additional_special_tokens[0]}.'
encoded_input = tokenizer(text, return_tensors='pt')
# Greedy decoding (num_beams=1); generation stops at the second sentinel
# token (<extra_id_1>), which closes the predicted span.
output_ids = model.generate(
    input_ids=encoded_input.input_ids,
    attention_mask=encoded_input.attention_mask,
    eos_token_id=tokenizer.additional_special_tokens_ids[1],
    num_beams=1,
    max_length=512,
    min_length=3,
)
tokenizer.decode(output_ids[0])
```
The generated answer is then `"<pad><extra_id_0> 2009<extra_id_1>"`, while the one generated by the original [T5-v1.1-large](https://huggingface.co/google/t5-v1_1-large) is `"<pad><extra_id_0> On January 20, 2009<extra_id_1>"` - a correct yet non-extractive answer.
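
To recover the plain answer string, the `<pad>` and sentinel tokens can simply be stripped; a minimal continuation of the snippet above:

```python
# Continuing the snippet above: decode while skipping <pad> and the sentinel
# tokens to obtain the bare answer span.
answer = tokenizer.decode(output_ids[0], skip_special_tokens=True).strip()
print(answer)  # '2009'
```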

### Limitations and bias
Although using the model with greedy decoding tends toward extractive outputs, it may sometimes produce non-extractive ones, e.g., with different casing, or an entirely different string (or substring) that carries a different meaning.
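
If strict extractiveness matters downstream, a simple guard (hypothetical, not part of the original snippet) is to verify the prediction against the passage, reusing `answer` and `passage` from the snippet above:

```python
def is_extractive(answer: str, passage: str) -> bool:
    # Exact, case-sensitive substring check against the source passage.
    return answer.strip() in passage

# Reusing `passage` and the decoded `answer` from the snippet above.
if not is_extractive(answer, passage):
    print('Warning: the generated answer is not an exact span of the passage')
```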

### Pretraining
The model was finetuned on 100,000 RSS examples for 3 epochs, using the Adafactor optimizer with a constant learning rate of 5e-5.
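
A rough sketch of a comparable optimizer setup with the `transformers` Adafactor implementation, for the `model` loaded above; only the optimizer choice and the constant 5e-5 learning rate come from this card, the remaining flags are assumptions:

```python
from transformers.optimization import Adafactor

# Assumed configuration: fixed learning rate, no internal schedule or scaling.
optimizer = Adafactor(
    model.parameters(),
    lr=5e-5,                # constant learning rate
    scale_parameter=False,  # use the fixed lr as-is
    relative_step=False,    # disable Adafactor's built-in schedule
    warmup_init=False,
)
```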

## Evaluation results
Evaluated on few-shot QA datasets in a zero-shot setting (no finetuning on annotated examples); all numbers are F1 scores:

|Model \ Dataset| SQuAD |TriviaQA | NaturalQs | NewsQA | SearchQA | HotpotQA | BioASQ | TextbookQA| 
|:-------------:|:-----:|:-------:|:---------:|:------:|:--------:|:--------:|:------:|:---------:| 
|T5             | 50.4  | 61.7    | 42.1      | 19.2   | 24.0     | 43.3     | 55.5   | 17.8      | 
|T5-rss         | 71.4  | 69.3    | 57.2      | 43.2   | 29.7     | 59.0     | 65.5   | 39.0      | 

The gap between the two models diminishes as more training examples are introduced; for additional results, see the [paper](https://arxiv.org/abs/2108.05857).
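
For reference, the F1 in the table is the standard SQuAD-style token-overlap score; a simplified sketch is shown below (the official evaluation scripts additionally normalize punctuation and articles):

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a predicted and a gold answer (simplified)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(token_f1('January 2009', '2009'))  # ~0.67
```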

### BibTeX entry and citation info
```bibtex
@inproceedings{ram-etal-2021-shot,
    title = "Few-Shot Question Answering by Pretraining Span Selection",
    author = "Ram, Ori  and
      Kirstain, Yuval  and
      Berant, Jonathan  and
      Globerson, Amir  and
      Levy, Omer",
    booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)",
    month = aug,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.acl-long.239",
    doi = "10.18653/v1/2021.acl-long.239",
    pages = "3066--3079",
}
@misc{castel2021optimal,
      title={How Optimal is Greedy Decoding for Extractive Question Answering?}, 
      author={Or Castel and Ori Ram and Avia Efrat and Omer Levy},
      year={2021},
      eprint={2108.05857},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

```