---
datasets:
- multi_nli
- snli
- scitail
metrics:
- accuracy
- f1
pipeline_tag: zero-shot-classification
language:
- en
model-index:
- name: AntoineBlanot/flan-t5-xxl-classif-3way
  results:
  - task:
      type: nli # Required. Example: automatic-speech-recognition
      name: Natural Language Inference # Optional. Example: Speech Recognition
    dataset:
      type: multi_nli # Required. Example: common_voice. Use dataset id from https://hf.co/datasets
      name: MNLI # Required. A pretty name for the dataset. Example: Common Voice (French)
    metrics:
    - type: accuracy # Required. Example: wer. Use metric id from https://hf.co/metrics
      value: 92.2 # Required. Example: 20.90
---
# T5ForSequenceClassification
**T5ForSequenceClassification** adapts the original [T5](https://github.com/google-research/text-to-text-transfer-transformer) architecture for sequence classification tasks.
T5 was originally built for text-to-text tasks and excels at them.
It can handle any NLP task that has been converted to a text-to-text format, including sequence classification!
[This example](https://huggingface.co/google/flan-t5-base?text=Premise%3A++At+my+age+you+will+probably+have+learnt+one+lesson.+Hypothesis%3A++It%27s+not+certain+how+many+lessons+you%27ll+learn+by+your+thirties.+Does+the+premise+entail+the+hypothesis%3F) shows how the original T5 is used for a sequence classification task.
Our motivation for building **T5ForSequenceClassification** is that the full original T5 architecture is not needed for most NLU tasks. Indeed, NLU tasks generally do not require generating text, so a large decoder is unnecessary.
By removing most of the decoder we can *halve the original number of parameters* (and thus roughly halve the computation cost) and *efficiently optimize* the network for the given task.
## Table of Contents
0. [Usage](#usage)
1. [Why use T5ForSequenceClassification?](#why-use-t5forsequenceclassification)
2. [T5ForClassification vs T5](#t5forclassification-vs-t5)
## Usage
**T5ForSequenceClassification** supports the task of zero-shot classification.
It can be used directly for:
- topic classification
- intent recognition
- boolean question answering
- sentiment analysis
- and any other task whose goal is to classify a text
Since the *T5ForClassification* class is not currently supported by the transformers library, you cannot use this model directly on the Hub.
To use **T5ForSequenceClassification**, you will have to install additional packages and model weights.
You can find instructions [here](https://github.com/AntoineBlanot/zero-nlp).
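
The tasks listed above can all be framed as natural language inference: each candidate label is turned into a hypothesis, and the label whose hypothesis is most strongly entailed by the text wins. The sketch below illustrates that framing only. It assumes a 3-way head whose logits are ordered entailment, neutral, contradiction, and it uses a hypothetical `toy_nli_logits` function in place of the real model; the hypothesis template and all function names are illustrative and are not taken from the zero-nlp repository.

```python
import math
from typing import Callable, Dict, List, Sequence

# Assumed label order of the 3-way head; verify against the actual model config.
NLI_LABELS = ("entailment", "neutral", "contradiction")


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of class logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot_classify(
    text: str,
    candidate_labels: Sequence[str],
    nli_logits_fn: Callable[[str, str], Sequence[float]],
    template: str = "This example is about {}.",
) -> Dict[str, float]:
    """Score each candidate label by the entailment probability of its hypothesis."""
    entail_idx = NLI_LABELS.index("entailment")
    scores = {}
    for label in candidate_labels:
        hypothesis = template.format(label)
        probs = softmax(nli_logits_fn(text, hypothesis))
        scores[label] = probs[entail_idx]
    return scores


def toy_nli_logits(premise: str, hypothesis: str) -> List[float]:
    """Hypothetical stand-in for the model: entails when the label word occurs in the premise."""
    label = hypothesis.rstrip(".").split()[-1].lower()
    if label in premise.lower():
        return [2.0, 0.0, -1.0]
    return [-1.0, 0.0, 2.0]


scores = zero_shot_classify(
    "The new sports car has a powerful engine.",
    ["sports", "cooking"],
    toy_nli_logits,
)
best = max(scores, key=scores.get)
print(best, scores)
```

With the real model, `nli_logits_fn` would be replaced by a call that encodes the premise and hypothesis and returns the three class logits.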
## Why use T5ForSequenceClassification?
Models based on the [BERT](https://huggingface.co/bert-large-uncased) architecture like [RoBERTa](https://huggingface.co/roberta-large) and [DeBERTa](https://huggingface.co/microsoft/deberta-v2-xxlarge) have shown very strong performance on sequence classification tasks and are still widely used today.
However, those models only scale up to ~1.5B parameters (DeBERTa xxlarge), which limits their knowledge compared to bigger models.
On the other hand, models based on the T5 architecture scale up to ~11B parameters (t5-xxl), and innovations on this architecture are very recent and keep coming ([mT5](https://huggingface.co/google/mt5-xxl), [Flan-T5](https://huggingface.co/google/flan-t5-xxl), [UL2](https://huggingface.co/google/ul2), [Flan-UL2](https://huggingface.co/google/flan-ul2), and probably more...).
## T5ForClassification vs T5
**T5ForClassification** Architecture:
- Encoder: same as original T5
- Decoder: only the first layer (for pooling purposes)
- Classification head: simple Linear layer on top of the decoder
Benefits and Drawbacks:
- (**+**) Keeps T5 encoding strength
- (**+**) Parameter count is roughly halved
- (**+**) Interpretable outputs (class logits)
- (**+**) No generation mistakes and faster prediction (no generation latency)
- (**-**) Loses text-to-text ability
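
The "roughly half the parameters" point can be sanity-checked with a back-of-the-envelope count. The sketch below is only an estimate under stated assumptions: the dimensions are the ones we assume for an xxl-sized T5 v1.1 / Flan-T5 configuration (check the checkpoint's `config.json`), the feed-forward block is assumed to be gated, the output embedding of the full model is assumed to be untied, and small terms such as relative position biases are ignored. The `T5Dims` class and helper names are illustrative.

```python
from dataclasses import dataclass


@dataclass
class T5Dims:
    """Assumed xxl-sized dimensions; not read from any checkpoint."""
    d_model: int = 4096
    d_kv: int = 64
    num_heads: int = 64
    d_ff: int = 10240
    num_layers: int = 24  # per stack (encoder and decoder)
    vocab_size: int = 32128


def attention_params(c: T5Dims) -> int:
    # Q, K, V and output projections, no biases.
    return 4 * c.d_model * c.num_heads * c.d_kv


def ffn_params(c: T5Dims) -> int:
    # Gated feed-forward: two input projections and one output projection.
    return 3 * c.d_model * c.d_ff


def encoder_layer_params(c: T5Dims) -> int:
    # Self-attention + feed-forward, one layer norm each.
    return attention_params(c) + ffn_params(c) + 2 * c.d_model


def decoder_layer_params(c: T5Dims) -> int:
    # Self-attention + cross-attention + feed-forward, one layer norm each.
    return 2 * attention_params(c) + ffn_params(c) + 3 * c.d_model


def full_t5_params(c: T5Dims) -> int:
    embedding = c.vocab_size * c.d_model
    lm_head = c.vocab_size * c.d_model  # assumed untied from the input embedding
    encoder = c.num_layers * encoder_layer_params(c) + c.d_model
    decoder = c.num_layers * decoder_layer_params(c) + c.d_model
    return embedding + encoder + decoder + lm_head


def classification_params(c: T5Dims, num_classes: int = 3) -> int:
    # Full encoder, a single decoder layer, and a linear classification head.
    embedding = c.vocab_size * c.d_model
    encoder = c.num_layers * encoder_layer_params(c) + c.d_model
    decoder = decoder_layer_params(c) + c.d_model
    head = c.d_model * num_classes + num_classes
    return embedding + encoder + decoder + head


dims = T5Dims()
full = full_t5_params(dims)
reduced = classification_params(dims)
ratio = reduced / full
print(f"full: {full / 1e9:.2f}B, reduced: {reduced / 1e9:.2f}B, ratio: {ratio:.2f}")
```

Under these assumptions the reduced model comes out somewhat below half of the full model, because each decoder layer carries an extra cross-attention block and is therefore larger than an encoder layer.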
Special thanks to [philschmid](https://huggingface.co/philschmid) for making a Flan-T5-xxl [checkpoint](https://huggingface.co/philschmid/flan-t5-xxl-sharded-fp16) in fp16.