AntoineBlanot committed on
Commit dcc302f
1 Parent(s): 1ba8f2e

Update README.md

Files changed (1):
  1. README.md +17 -3
README.md CHANGED
@@ -20,9 +20,23 @@ You can find [here](https://huggingface.co/google/flan-t5-base?text=Premise%3A++
  Our motivation for building **T5ForSequenceClassification** is that the full original T5 architecture is not needed for most NLU tasks. Indeed, NLU tasks generally do not require generating text, so a large decoder is unnecessary.
  By removing the decoder we can *halve the original number of parameters* (and thus the computation cost) and *efficiently optimize* the network for the given task.

- # Why use T5ForSequenceClassification?
+ ## Why use T5ForSequenceClassification?
  Models based on the [BERT](https://huggingface.co/bert-large-uncased) architecture, such as [RoBERTa](https://huggingface.co/roberta-large) and [DeBERTa](https://huggingface.co/microsoft/deberta-v2-xxlarge), have shown very strong performance on sequence classification tasks and are still widely used today.
  However, those models only scale up to ~1.5B parameters (DeBERTa xxlarge), resulting in limited knowledge compared to bigger models.
- On the other hand, models based on the T5 architecture scale up to ~11B parameters (t5-xxl), and innovations with this architecture are very recent and keep coming (T5, [mT5](https://huggingface.co/google/mt5-xxl), [Flan-T5](https://huggingface.co/google/flan-t5-xxl), [UL2](https://huggingface.co/google/ul2), [Flan-UL2](https://huggingface.co/google/flan-ul2), and probably more...).
+ On the other hand, models based on the T5 architecture scale up to ~11B parameters (t5-xxl), and innovations with this architecture are very recent and keep coming ([mT5](https://huggingface.co/google/mt5-xxl), [Flan-T5](https://huggingface.co/google/flan-t5-xxl), [UL2](https://huggingface.co/google/ul2), [Flan-UL2](https://huggingface.co/google/flan-ul2), and probably more...).

- Model of philschmid/flan-t5-xxl-sharded-fp16 with a single decoder layer and a classification head on top.
+ ## T5ForClassification vs T5
+ **T5ForClassification** architecture:
+ - Encoder: same as the original T5
+ - Decoder: only the first layer (for pooling purposes)
+ - Classification head: a simple Linear layer on top of the decoder
+
+ Benefits and drawbacks:
+ - (**+**) Keeps T5's encoding strength
+ - (**+**) Halves the number of parameters
+ - (**+**) Interpretable outputs (class logits)
+ - (**+**) No generation mistakes and faster prediction (no generation latency)
+ - (**-**) Loses the text-to-text ability
+
+
+ Special thanks to [philschmid](https://huggingface.co/philschmid) for making a Flan-T5-xxl [checkpoint](https://huggingface.co/philschmid/flan-t5-xxl-sharded-fp16) in fp16.
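
The architecture added in the **T5ForClassification vs T5** section above (full T5 encoder, a single decoder layer used as a pooler, and a linear classification head) can be sketched with the Hugging Face `transformers` library. This is a minimal illustration under stated assumptions, not the repository's actual implementation: the class name `T5ForClassificationSketch`, the `model_name`/`num_labels` arguments, and the use of a single decoder-start token as the pooling query are assumptions made for this sketch.

```python
import torch
import torch.nn as nn
from transformers import T5Model


class T5ForClassificationSketch(nn.Module):
    """Illustrative sketch (not the repo's code): T5 encoder + first decoder
    layer used as a pooler + linear classification head on top."""

    def __init__(self, model_name: str = "t5-small", num_labels: int = 2):
        super().__init__()
        t5 = T5Model.from_pretrained(model_name)
        self.encoder = t5.encoder                         # encoder: same as the original T5
        t5.decoder.block = t5.decoder.block[:1]           # decoder: keep only the first layer
        self.decoder = t5.decoder
        self.classifier = nn.Linear(t5.config.d_model, num_labels)  # classification head

    def forward(self, input_ids, attention_mask=None):
        enc = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        # A single decoder-start token (pad id 0 for T5) cross-attends over the encoder
        # states, so its output serves as a pooled representation of the whole input.
        start = torch.zeros(input_ids.shape[0], 1, dtype=torch.long, device=input_ids.device)
        dec = self.decoder(
            input_ids=start,
            encoder_hidden_states=enc.last_hidden_state,
            encoder_attention_mask=attention_mask,
        )
        pooled = dec.last_hidden_state[:, 0]              # (batch, d_model)
        return self.classifier(pooled)                    # class logits: (batch, num_labels)
```

A tokenizer from the same checkpoint supplies `input_ids` and `attention_mask`, and the returned logits can be trained with a standard cross-entropy loss; pointing `model_name` at the fp16 Flan-T5-xxl checkpoint mentioned above would follow the same pattern, memory permitting.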