---
datasets:
- multi_nli
- snli
- scitail
metrics:
- accuracy
- f1
pipeline_tag: zero-shot-classification
language:
- en
model-index:
- name: AntoineBlanot/flan-t5-xxl-classif-3way
  results:
  - task:
      type: nli
      name: Natural Language Inference
    dataset:
      type: multi_nli
      name: MultiNLI
      split: validation_matched
    metrics:
    - type: accuracy
      value: 0.9230769230769231
      name: Validation (matched) accuracy
    - type: f1
      value: 0.9225172687920663
      name: Validation (matched) f1
  - task:
      type: nli
      name: Natural Language Inference
    dataset:
      type: multi_nli
      name: MultiNLI
      split: validation_mismatched
    metrics:
    - type: accuracy
      value: 0.9222945484133441
      name: Validation (mismatched) accuracy
    - type: f1
      value: 0.9216699467726924
      name: Validation (mismatched) f1
  - task:
      type: nli
      name: Natural Language Inference
    dataset:
      type: snli
      name: SNLI
      split: validation
    metrics:
    - type: accuracy
      value: 0.9418817313554155
      name: Validation accuracy
    - type: f1
      value: 0.9416213776111287
      name: Validation f1
  - task:
      type: nli
      name: Natural Language Inference
    dataset:
      type: scitail
      name: SciTail
      split: validation
    metrics:
    - type: accuracy
      value: 0.9662576687116564
      name: Validation accuracy
    - type: f1
      value: 0.6471347983817357
      name: Validation f1
---

# T5ForSequenceClassification

**T5ForSequenceClassification** adapts the original [T5](https://github.com/google-research/text-to-text-transfer-transformer) architecture for sequence classification tasks. T5 was originally built for text-to-text tasks and excels at them: it can handle any NLP task that has been converted to a text-to-text format, including sequence classification!
You can see [here](https://huggingface.co/google/flan-t5-base?text=Premise%3A++At+my+age+you+will+probably+have+learnt+one+lesson.+Hypothesis%3A++It%27s+not+certain+how+many+lessons+you%27ll+learn+by+your+thirties.+Does+the+premise+entail+the+hypothesis%3F) how the original T5 is used for sequence classification. Our motivation for building **T5ForSequenceClassification** is that the full original T5 architecture is not needed for most NLU tasks. Indeed, NLU tasks generally do not require text generation, so a large decoder is unnecessary. By removing the decoder, we can *halve the original number of parameters* (and thus the computation cost) and *efficiently optimize* the network for the given task.

## Table of Contents

0. [Usage](#usage)
1. [Why use T5ForSequenceClassification?](#why-use-t5forsequenceclassification)
2. [T5ForClassification vs T5](#t5forclassification-vs-t5)
3. [Results](#results)

## Usage

**T5ForSequenceClassification** supports the task of zero-shot classification. It can be used directly for:
- topic classification
- intent recognition
- boolean question answering
- sentiment analysis
- and any other task whose goal is to classify a text...

Since the *T5ForClassification* class is currently not supported by the transformers library, you cannot use this model directly on the Hub. To use **T5ForSequenceClassification**, you will have to install additional packages and model weights. You can find instructions [here](https://github.com/AntoineBlanot/zero-nlp).

## Why use T5ForSequenceClassification?

Models based on the [BERT](https://huggingface.co/bert-large-uncased) architecture like [RoBERTa](https://huggingface.co/roberta-large) and [DeBERTa](https://huggingface.co/microsoft/deberta-v2-xxlarge) have shown very strong performance on sequence classification tasks and are still widely used today. However, those models only scale up to ~1.5B parameters (DeBERTa xxlarge), resulting in limited knowledge compared to bigger models.
On the other hand, models based on the T5 architecture scale up to ~11B parameters (t5-xxl), and innovation on this architecture is recent and still ongoing ([mT5](https://huggingface.co/google/mt5-xxl), [Flan-T5](https://huggingface.co/google/flan-t5-xxl), [UL2](https://huggingface.co/google/ul2), [Flan-UL2](https://huggingface.co/google/flan-ul2), and probably more...).

## T5ForClassification vs T5

**T5ForClassification** architecture:
- Encoder: same as the original T5
- Decoder: only the first layer (for pooling purposes)
- Classification head: a simple linear layer on top of the decoder

Benefits and drawbacks:
- (**+**) Keeps T5's encoding strength
- (**+**) Half the parameter count
- (**+**) Interpretable outputs (class logits)
- (**+**) No generation mistakes and faster predictions (no generation latency)
- (**-**) Loses the text-to-text ability

## Results

Results on the validation data of **training tasks**:

| Dataset | Accuracy | F1 |
|:-------:|:--------:|:--:|
| MNLI (m) | 0.923 | 0.923 |
| MNLI (mm) | 0.922 | 0.922 |
| SNLI | 0.942 | 0.942 |
| SciTail | 0.966 | 0.647 |

Results on the validation data of **unseen tasks** (zero-shot):

| Dataset | Accuracy | F1 |
|:-------:|:--------:|:--:|
| ? | ? | ? |

Special thanks to [philschmid](https://huggingface.co/philschmid) for making a Flan-T5-xxl [checkpoint](https://huggingface.co/philschmid/flan-t5-xxl-sharded-fp16) in fp16.
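## Example: zero-shot classification via 3-way NLI

The sketch below illustrates how a 3-way NLI head (entailment / neutral / contradiction) can be turned into a zero-shot topic classifier: each candidate label is rewritten as a hypothesis and the label with the highest entailment probability wins. The model call is mocked with a toy logit function here; in practice the logits would come from the **T5ForSequenceClassification** weights loaded via the [zero-nlp](https://github.com/AntoineBlanot/zero-nlp) instructions. The hypothesis template, function names, and logit ordering are illustrative assumptions, not the repository's actual API.

```python
import math

ENTAILMENT = 0  # assumed index of the entailment logit in the 3-way head


def softmax(xs):
    """Numerically stable softmax over a list of logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot_classify(text, candidate_labels, logit_fn):
    """Score each label by the entailment probability of the pair
    (premise=text, hypothesis='This example is about <label>.')."""
    scores = []
    for label in candidate_labels:
        hypothesis = f"This example is about {label}."
        logits = logit_fn(text, hypothesis)  # 3 logits: ent / neu / con
        scores.append(softmax(logits)[ENTAILMENT])
    best = max(range(len(scores)), key=scores.__getitem__)
    return candidate_labels[best], scores


def toy_logits(text, hypothesis):
    """Stand-in for the real model: favors entailment for 'sports'."""
    return [2.0, 0.1, -1.0] if "sports" in hypothesis else [-1.0, 0.5, 1.0]


label, scores = zero_shot_classify(
    "The team won the championship last night.",
    ["sports", "politics"],
    toy_logits,
)
print(label)  # -> sports
```

The same premise/hypothesis framing works for intent recognition or sentiment analysis by changing the hypothesis template (e.g. "The sentiment of this text is {label}.").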