ybelkada HF staff committed on
Commit
8539e45
1 Parent(s): 07e8717

Create README.md

Files changed (1)
  1. README.md +246 -0
README.md ADDED
@@ -0,0 +1,246 @@
1
+ ---
2
+ language:
3
+ - en
4
+
5
+ tags:
6
+ - text2text-generation
7
+
8
+ widget:
9
+ - text: "summarize: Peter and Elizabeth took a taxi to attend the night party in the city. While in the party, Elizabeth collapsed and was rushed to the hospital. Since she was diagnosed with a brain injury, the doctor told Peter to stay beside her until she gets well. Therefore, Peter stayed with her at the hospital for 3 days without leaving."
10
+ example_title: "Summarization"
11
+
12
+ datasets:
13
+ - c4
14
+ - xsum
15
+
16
+
17
+ license: apache-2.0
18
+ ---
19
+
20
+ # Model Card for Switch Transformers Base - 8 experts
21
+
22
+ ![model image](https://s3.amazonaws.com/moonup/production/uploads/1666966931908-62441d1d9fdefb55a0b7d12c.png)
23
+
24
+ # Table of Contents
25
+
26
+ 0. [TL;DR](#tldr)
27
+ 1. [Model Details](#model-details)
28
+ 2. [Usage](#usage)
29
+ 3. [Uses](#uses)
30
+ 4. [Bias, Risks, and Limitations](#bias-risks-and-limitations)
31
+ 5. [Training Details](#training-details)
32
+ 6. [Evaluation](#evaluation)
33
+ 7. [Environmental Impact](#environmental-impact)
34
+ 8. [Citation](#citation)
35
+ 9. [Model Card Authors](#model-card-authors)
36
+
37
+ # TL;DR
38
+
39
+ Switch Transformers is a Mixture of Experts (MoE) model trained on a Masked Language Modeling (MLM) task. The model architecture is similar to the classic T5, but with the Feed Forward layers replaced by Sparse MLP layers containing "expert" MLPs. According to the [original paper](https://arxiv.org/pdf/2101.03961.pdf), the model enables faster training (better scaling properties) while outperforming T5 on fine-tuned tasks.
40
+ As mentioned in the first few lines of the abstract:
41
+ > we advance the current scale of language models by pre-training up to trillion parameter models on the “Colossal Clean Crawled Corpus”, and achieve a 4x speedup over the T5-XXL model.
42
+
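The sparse layer described above can be pictured with a small toy sketch. It assumes top-1 ("switch") routing as described in the paper: a router scores every expert for each token, the token is sent to the single best-scoring expert, and the expert output is scaled by the router probability. The names `make_expert` and `switch_layer`, the dimensions, and the random weights are illustrative assumptions; this is not the `transformers` implementation, and it omits expert capacity limits and the load-balancing loss.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def make_expert(d_model, d_ff, rng):
    """Build one toy expert: a two-layer ReLU MLP with random weights (hypothetical helper)."""
    w_in = [[rng.gauss(0, 0.1) for _ in range(d_ff)] for _ in range(d_model)]
    w_out = [[rng.gauss(0, 0.1) for _ in range(d_model)] for _ in range(d_ff)]

    def expert(x):
        hidden = [max(0.0, sum(x[i] * w_in[i][j] for i in range(d_model)))
                  for j in range(d_ff)]
        return [sum(hidden[j] * w_out[j][k] for j in range(d_ff))
                for k in range(d_model)]

    return expert


def switch_layer(tokens, router_weights, experts):
    """Send each token to its single highest-probability expert (top-1 routing)."""
    outputs, choices = [], []
    for x in tokens:
        logits = [sum(x[i] * router_weights[i][e] for i in range(len(x)))
                  for e in range(len(experts))]
        probs = softmax(logits)
        best = max(range(len(experts)), key=probs.__getitem__)
        expert_out = experts[best](x)
        # Scale the expert output by the router probability of the chosen expert.
        outputs.append([probs[best] * v for v in expert_out])
        choices.append(best)
    return outputs, choices


rng = random.Random(0)
d_model, d_ff, num_experts = 4, 8, 8  # 8 experts, mirroring this checkpoint's name
experts = [make_expert(d_model, d_ff, rng) for _ in range(num_experts)]
router_weights = [[rng.gauss(0, 1) for _ in range(num_experts)] for _ in range(d_model)]
tokens = [[rng.gauss(0, 1) for _ in range(d_model)] for _ in range(5)]

outputs, choices = switch_layer(tokens, router_weights, experts)
print(choices)  # index of the expert chosen for each of the 5 tokens
```

Only one expert runs per token, so the compute per token stays roughly that of a single dense Feed Forward layer even though the layer holds several experts' worth of parameters.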
43
+ **Disclaimer**: Content from **this** model card has been written by the Hugging Face team, and parts of it were copied from the [original paper](https://arxiv.org/pdf/2101.03961.pdf).
44
+
45
+ # Model Details
46
+
47
+ ## Model Description
48
+
49
+
50
+ - **Model type:** Language model
51
+ - **Language(s) (NLP):** English
52
+ - **License:** Apache 2.0
53
+ - **Related Models:** [All Switch Transformers Checkpoints](https://huggingface.co/models?search=switch)
54
+ - **Original Checkpoints:** [All Original Switch Transformers Checkpoints](https://github.com/google-research/t5x/blob/main/docs/models.md#mixture-of-experts-moe-checkpoints)
55
+ - **Resources for more information:**
56
+ - [Research paper](https://arxiv.org/pdf/2101.03961.pdf)
57
+ - [GitHub Repo](https://github.com/google-research/t5x)
58
+ - [Hugging Face Switch Transformers Docs (Similar to T5)](https://huggingface.co/docs/transformers/model_doc/switch_transformers)
59
+
60
+ # Usage
61
+
62
+ Note that these checkpoints have been trained on a Masked Language Modeling (MLM) task. Therefore, the checkpoints are not "ready-to-use" for downstream tasks. You may want to check `FLAN-T5` for running fine-tuned weights, or fine-tune your own MoE following [this notebook](https://colab.research.google.com/drive/1aGGVHZmtKmcNBbAwa9hbu58DDpIuB5O4?usp=sharing).
63
+
64
+ Find below some example scripts on how to use the model in `transformers`:
65
+
66
+ ## Using the PyTorch model
67
+
68
+ ### Running the model on a CPU
69
+
70
+ <details>
71
+ <summary> Click to expand </summary>
72
+
73
+ ```python
74
+
75
+ from transformers import AutoTokenizer, SwitchTransformersConditionalGeneration
76
+
77
+ tokenizer = AutoTokenizer.from_pretrained("google/switch-base-8")
78
+ model = SwitchTransformersConditionalGeneration.from_pretrained("google/switch-base-8")
79
+
80
+ input_text = "A <extra_id_0> walks into a bar a orders a <extra_id_1> with <extra_id_2> pinch of <extra_id_3>."
81
+ input_ids = tokenizer(input_text, return_tensors="pt").input_ids
82
+
83
+ outputs = model.generate(input_ids)
84
+ print(tokenizer.decode(outputs[0]))
85
+ # >>> <pad> <extra_id_0> man<extra_id_1> beer<extra_id_2> a<extra_id_3> salt<extra_id_4>.</s>
86
+ ```
87
+
88
+ </details>
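The decoded output in the example above lists each sentinel token followed by the text the model predicts for it. A minimal sketch of how such an output could be merged back into the input is shown below; `fill_sentinels` is a hypothetical helper written for illustration (it is not part of `transformers`), and it assumes the decoded string uses the `<extra_id_N>`, `<pad>`, and `</s>` tokens exactly as printed above.

```python
import re

SENTINEL = re.compile(r"(<extra_id_\d+>)")


def fill_sentinels(input_text, decoded_output):
    """Replace each sentinel in input_text with the span predicted for it."""
    # Drop the special tokens that wrap the generated sequence.
    decoded = decoded_output.replace("<pad>", "").replace("</s>", "")
    # Splitting on a capturing group keeps the sentinels:
    # [prefix, sentinel_0, span_0, sentinel_1, span_1, ...]
    parts = SENTINEL.split(decoded)
    fills = {parts[i]: parts[i + 1].strip() for i in range(1, len(parts), 2)}
    # Sentinels without a prediction are left untouched.
    return SENTINEL.sub(lambda m: fills.get(m.group(0), m.group(0)), input_text)


input_text = "A <extra_id_0> walks into a bar a orders a <extra_id_1> with <extra_id_2> pinch of <extra_id_3>."
decoded = "<pad> <extra_id_0> man<extra_id_1> beer<extra_id_2> a<extra_id_3> salt<extra_id_4>.</s>"

filled = fill_sentinels(input_text, decoded)
print(filled)  # prints: A man walks into a bar a orders a beer with a pinch of salt.
```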
89
+
90
+ ### Running the model on a GPU
91
+
92
+ <details>
93
+ <summary> Click to expand </summary>
94
+
95
+ ```python
96
+ # pip install accelerate
97
+ from transformers import AutoTokenizer, SwitchTransformersConditionalGeneration
98
+
99
+ tokenizer = AutoTokenizer.from_pretrained("google/switch-base-8")
100
+ model = SwitchTransformersConditionalGeneration.from_pretrained("google/switch-base-8", device_map="auto")
101
+
102
+ input_text = "A <extra_id_0> walks into a bar a orders a <extra_id_1> with <extra_id_2> pinch of <extra_id_3>."
103
+ input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to(0)
104
+
105
+ outputs = model.generate(input_ids)
106
+ print(tokenizer.decode(outputs[0]))
107
+ ```
108
+
109
+ </details>
110
+
111
+ ### Running the model on a GPU using different precisions
112
+
113
+ #### FP16
114
+
115
+ <details>
116
+ <summary> Click to expand </summary>
117
+
118
+ ```python
119
+ # pip install accelerate
120
+ import torch
+ from transformers import AutoTokenizer, SwitchTransformersConditionalGeneration
121
+
122
+ tokenizer = AutoTokenizer.from_pretrained("google/switch-base-8")
123
+ model = SwitchTransformersConditionalGeneration.from_pretrained("google/switch-base-8", device_map="auto", torch_dtype=torch.float16)
124
+
125
+ input_text = "A <extra_id_0> walks into a bar a orders a <extra_id_1> with <extra_id_2> pinch of <extra_id_3>."
126
+ input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to(0)
127
+
128
+ outputs = model.generate(input_ids)
129
+ print(tokenizer.decode(outputs[0]))
130
+ ```
131
+
132
+ </details>
133
+
134
+ #### INT8
135
+
136
+ <details>
137
+ <summary> Click to expand </summary>
138
+
139
+ ```python
140
+ # pip install bitsandbytes accelerate
141
+ from transformers import AutoTokenizer, SwitchTransformersConditionalGeneration
142
+
143
+ tokenizer = AutoTokenizer.from_pretrained("google/switch-base-8")
144
+ model = SwitchTransformersConditionalGeneration.from_pretrained("google/switch-base-8", device_map="auto", load_in_8bit=True)
145
+
146
+ input_text = "A <extra_id_0> walks into a bar a orders a <extra_id_1> with <extra_id_2> pinch of <extra_id_3>."
147
+ input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to(0)
148
+
149
+ outputs = model.generate(input_ids)
150
+ print(tokenizer.decode(outputs[0]))
151
+ ```
152
+
153
+ </details>
154
+
155
+ # Uses
156
+
157
+ ## Direct Use and Downstream Use
158
+
159
+ The authors of the related Flan-T5 models write in [their paper's model card](https://arxiv.org/pdf/2210.11416.pdf) that:
160
+
161
+ > The primary use is research on language models, including: research on zero-shot NLP tasks and in-context few-shot learning NLP tasks, such as reasoning, and question answering; advancing fairness and safety research, and understanding limitations of current large language models
162
+
163
+ See the [research paper](https://arxiv.org/pdf/2210.11416.pdf) for further details.
164
+
165
+ ## Out-of-Scope Use
166
+
167
+ More information needed.
168
+
169
+ # Bias, Risks, and Limitations
170
+
171
+ More information needed.
172
+
173
+ ## Ethical considerations and risks
174
+
175
+ More information needed.
176
+
177
+ ## Known Limitations
178
+
179
+ More information needed.
180
+
181
+ ## Sensitive Use
182
+
183
+ > Flan-T5 should not be applied for any unacceptable use cases, e.g., generation of abusive speech.
184
+
185
+ # Training Details
186
+
187
+ ## Training Data
188
+
189
+ The model was trained on a Masked Language Modeling task on the Colossal Clean Crawled Corpus (C4) dataset, following the same procedure as `T5`.
190
+
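For illustration, the T5-style masked objective replaces spans of the input with sentinel tokens and asks the model to produce the dropped spans. The sketch below builds one such input/target pair from hand-picked spans; `span_corrupt` is a hypothetical helper, and it assumes word-level spans chosen by the caller, whereas the actual pre-training pipeline operates on tokenized text with randomly sampled spans.

```python
def span_corrupt(words, spans):
    """Build a (corrupted input, target) pair in the T5 sentinel format.

    `spans` is a sorted list of non-overlapping (start, end) word indices,
    with `end` exclusive. Each span is replaced by one sentinel in the input,
    and the target lists every sentinel followed by the words it replaced.
    """
    inputs, targets = [], []
    cursor = 0
    for sentinel_id, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{sentinel_id}>"
        inputs.extend(words[cursor:start])  # keep the words before the span
        inputs.append(sentinel)             # mask the span in the input
        targets.append(sentinel)            # announce the span in the target
        targets.extend(words[start:end])    # followed by the dropped words
        cursor = end
    inputs.extend(words[cursor:])
    # A final sentinel marks the end of the target sequence.
    targets.append(f"<extra_id_{len(spans)}>")
    return " ".join(inputs), " ".join(targets)


words = "Thank you for inviting me to your party last week.".split()
corrupted, target = span_corrupt(words, [(2, 4), (8, 9)])
print(corrupted)  # Thank you <extra_id_0> me to your party <extra_id_1> week.
print(target)     # <extra_id_0> for inviting <extra_id_1> last <extra_id_2>
```

This is the same format the usage examples rely on: the prompt contains `<extra_id_N>` placeholders and the generated sequence supplies the text for each one.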
191
+
192
+ ## Training Procedure
193
+
194
+ The model card in the [Flan-T5 paper](https://arxiv.org/pdf/2210.11416.pdf) describes the related Flan-T5 models as follows:
195
+
196
+ > These models are based on pretrained T5 (Raffel et al., 2020) and fine-tuned with instructions for better zero-shot and few-shot performance. There is one fine-tuned Flan model per T5 model size.
197
+
198
+ The model has been trained on TPU v3 or TPU v4 pods, using the [`t5x`](https://github.com/google-research/t5x) codebase together with [`jax`](https://github.com/google/jax).
199
+
200
+
201
+ # Evaluation
202
+
203
+ ## Testing Data, Factors & Metrics
204
+
205
+ The authors evaluated the model on various tasks and compared the results against T5. See the table below for some quantitative evaluation:
206
+ ![image.png](https://s3.amazonaws.com/moonup/production/uploads/1666967660372-62441d1d9fdefb55a0b7d12c.png)
207
+ For full details, please check the [research paper](https://arxiv.org/pdf/2101.03961.pdf).
208
+
209
+ ## Results
210
+
211
+ For full results for Switch Transformers, see the [research paper](https://arxiv.org/pdf/2101.03961.pdf), Table 5.
212
+
213
+ # Environmental Impact
214
+
215
+ Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
216
+
217
+ - **Hardware Type:** Google Cloud TPU Pods - TPU v3 or TPU v4 | Number of chips ≥ 4.
218
+ - **Hours used:** More information needed
219
+ - **Cloud Provider:** GCP
220
+ - **Compute Region:** More information needed
221
+ - **Carbon Emitted:** More information needed
222
+
223
+ # Citation
224
+
225
+ **BibTeX:**
226
+
227
+ ```bibtex
228
+ @misc{https://doi.org/10.48550/arxiv.2101.03961,
229
+ doi = {10.48550/ARXIV.2101.03961},
230
+
231
+ url = {https://arxiv.org/abs/2101.03961},
232
+
233
+ author = {Fedus, William and Zoph, Barret and Shazeer, Noam},
234
+
235
+ keywords = {Machine Learning (cs.LG), Artificial Intelligence (cs.AI), FOS: Computer and information sciences},
236
+
237
+ title = {Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity},
238
+
239
+ publisher = {arXiv},
240
+
241
+ year = {2021},
242
+
243
+ copyright = {arXiv.org perpetual, non-exclusive license}
244
+ }
245
+
246
+ ```