---
datasets:
- multi_nli
- snli
- scitail
metrics:
- accuracy
- f1
pipeline_tag: zero-shot-classification
language:
- en
model-index:
- name: AntoineBlanot/flan-t5-xxl-classif-3way
  results:
  - task:
      type: nli
      name: Natural Language Inference
    dataset:
      type: multi_nli
      name: MultiNLI
      split: validation_matched
    metrics:
      - type: accuracy
        value: 0.9230769230769231
        name: Validation (matched) accuracy
      - type: f1
        value: 0.9225172687920663
        name: Validation (matched) f1
  - task:
      type: nli
      name: Natural Language Inference
    dataset:
      type: multi_nli
      name: MultiNLI
      split: validation_mismatched
    metrics:
      - type: accuracy
        value: 0.9222945484133441
        name: Validation (mismatched) accuracy
      - type: f1
        value: 0.9216699467726924
        name: Validation (mismatched) f1
  - task:
      type: nli
      name: Natural Language Inference
    dataset:
      type: snli
      name: SNLI
      split: validation
    metrics:
      - type: accuracy
        value: 0.9418817313554155
        name: Validation accuracy
      - type: f1
        value: 0.9416213776111287
        name: Validation f1
  - task:
      type: nli
      name: Natural Language Inference
    dataset:
      type: scitail
      name: SciTail
      split: validation
    metrics:
      - type: accuracy
        value: 0.9662576687116564
        name: Validation accuracy
      - type: f1
        value: 0.6471347983817357
        name: Validation f1

---
# T5ForSequenceClassification
**T5ForSequenceClassification** adapts the original [T5](https://github.com/google-research/text-to-text-transfer-transformer) architecture for sequence classification tasks.

T5 was originally built for text-to-text tasks and excels at them.
It can handle any NLP task that has been converted to a text-to-text format, including sequence classification!
You can see [here](https://huggingface.co/google/flan-t5-base?text=Premise%3A++At+my+age+you+will+probably+have+learnt+one+lesson.+Hypothesis%3A++It%27s+not+certain+how+many+lessons+you%27ll+learn+by+your+thirties.+Does+the+premise+entail+the+hypothesis%3F) how the original T5 is used for sequence classification.
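For reference, a minimal sketch of that text-to-text approach (using `google/flan-t5-base`, the checkpoint from the link above) looks like this:

```python
from transformers import AutoTokenizer, T5ForConditionalGeneration

# Text-to-text NLI with the original Flan-T5: the class label is generated as text.
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")
model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-base")

prompt = (
    "Premise: At my age you will probably have learnt one lesson. "
    "Hypothesis: It's not certain how many lessons you'll learn by your thirties. "
    "Does the premise entail the hypothesis?"
)
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=5)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))  # free-form text such as "no"
```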

Our motivation for building **T5ForSequenceClassification** is that the full original T5 architecture is not needed for most NLU tasks. Indeed, NLU tasks generally do not require generating text, so a large decoder is unnecessary.
By removing the decoder we can *halve the original number of parameters* (and thus the computation cost) and *efficiently optimize* the network for the given task.

## Table of Contents

1. [Usage](#usage)
2. [Why use T5ForSequenceClassification?](#why-use-t5forsequenceclassification)
3. [T5ForClassification vs T5](#t5forclassification-vs-t5)
4. [Results](#results)

## Usage
**T5ForSequenceClassification** supports the task of zero-shot classification.
It can directly be used for:
- topic classification
- intent recognition
- boolean question answering
- sentiment analysis
- and any other task whose goal is to classify a text...

Since the *T5ForClassification* class is currently not supported by the transformers library, you cannot directly use this model from the Hub.
To use **T5ForSequenceClassification**, you will have to install additional packages and download the model weights.
You can find instructions [here](https://github.com/AntoineBlanot/zero-nlp).
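As an illustration of the zero-shot pattern, the sketch below frames each candidate label as an NLI hypothesis and picks the label with the highest entailment score. The `nli_model.predict` interface is an assumption made for readability, not the actual zero-nlp API; refer to the repository above for the real loading and inference code.

```python
# Illustrative sketch only: `nli_model` stands in for a loaded T5ForSequenceClassification
# instance, and its `predict` method (returning entailment/neutral/contradiction
# probabilities) is an assumed interface.

def zero_shot_classify(nli_model, text, candidate_labels,
                       hypothesis_template="This example is about {}."):
    """Score each candidate label by the entailment probability of its hypothesis."""
    scores = {}
    for label in candidate_labels:
        hypothesis = hypothesis_template.format(label)
        entailment, neutral, contradiction = nli_model.predict(premise=text, hypothesis=hypothesis)
        scores[label] = entailment
    best_label = max(scores, key=scores.get)
    return best_label, scores
```

Swapping the hypothesis template (e.g. "The sentiment of this text is {}.") adapts the same model to sentiment analysis, boolean question answering, and so on.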


## Why use T5ForSequenceClassification?
Models based on the [BERT](https://huggingface.co/bert-large-uncased) architecture, such as [RoBERTa](https://huggingface.co/roberta-large) and [DeBERTa](https://huggingface.co/microsoft/deberta-v2-xxlarge), have shown very strong performance on sequence classification tasks and are still widely used today.
However, those models only scale up to ~1.5B parameters (DeBERTa xxlarge), resulting in limited knowledge compared to bigger models.
Models based on the T5 architecture, on the other hand, scale up to ~11B parameters (t5-xxl), and this architecture keeps improving through very recent innovations ([mT5](https://huggingface.co/google/mt5-xxl), [Flan-T5](https://huggingface.co/google/flan-t5-xxl), [UL2](https://huggingface.co/google/ul2), [Flan-UL2](https://huggingface.co/google/flan-ul2), and probably more...).

## T5ForClassification vs T5
**T5ForClassification** Architecture:
- Encoder: same as original T5
- Decoder: only the first layer (for pooling purposes)
- Classification head: simple Linear layer on top of the decoder
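
As a rough, illustrative sketch (not the actual implementation from the zero-nlp repository, and using `google/flan-t5-base` as a stand-in checkpoint), this architecture can be built with `transformers` by keeping the encoder, truncating the decoder to its first block, and adding a linear head:

```python
import torch
import torch.nn as nn
from transformers import T5Model

class T5ForClassificationSketch(nn.Module):
    """Rough sketch: T5 encoder + first decoder block (as a pooler) + linear classification head."""

    def __init__(self, model_name="google/flan-t5-base", num_labels=3):
        super().__init__()
        t5 = T5Model.from_pretrained(model_name)
        self.encoder = t5.encoder
        t5.decoder.block = t5.decoder.block[:1]  # keep only the first decoder block
        self.decoder = t5.decoder
        self.classifier = nn.Linear(t5.config.d_model, num_labels)

    def forward(self, input_ids, attention_mask=None):
        enc = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        # A single decoder-start token (id 0 for T5) cross-attends over the encoder
        # states; its output acts as a pooled representation of the input sequence.
        start = torch.zeros(input_ids.size(0), 1, dtype=torch.long, device=input_ids.device)
        dec = self.decoder(
            input_ids=start,
            encoder_hidden_states=enc.last_hidden_state,
            encoder_attention_mask=attention_mask,
            use_cache=False,
        )
        return self.classifier(dec.last_hidden_state[:, 0])  # class logits
```

The single remaining decoder block plays the pooling role described above, and the linear head turns its output into class logits.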

Benefits and Drawbacks:
- (**+**) Keeps T5 encoding strength
- (**+**) Parameter count is halved
- (**+**) Interpretable outputs (class logits)
- (**+**) No generation mistakes and faster prediction (no generation latency)
- (**-**) Loses the text-to-text ability

## Results
Results on the validation data of **training tasks**:
| Dataset | Accuracy | F1 |
|:-------:|:--------:|:--:|
| MNLI (m)| 0.923 | 0.923 |
| MNLI (mm) | 0.922 | 0.922 |
| SNLI | 0.942 | 0.942 |
| SciTail | 0.966 | 0.647 |

Results on the validation data of **unseen tasks** (zero-shot):
| Dataset | Accuracy | F1 |
|:-------:|:--------:|:--:|
| ?| ? | ? |

Special thanks to [philschmid](https://huggingface.co/philschmid) for making a Flan-T5-xxl [checkpoint](https://huggingface.co/philschmid/flan-t5-xxl-sharded-fp16) in fp16.