Files changed (1)
  1. README.md +27 -0
README.md CHANGED
@@ -191,11 +191,14 @@ print(tokenizer.decode(outputs[0]))
 
 #### FP16
 
+ The original model was trained in `bfloat16`, therefore running such a large model in `float16` can lead to drastically reduced performance. We advise users to run this model in `bfloat16` or `float32` if they have enough compute resources. Check the next section on how to run the model in `bfloat16`.
+
 <details>
 <summary> Click to expand </summary>
 
 ```python
 # pip install accelerate
+ # Not recommended - we advise users to run this model in `bfloat16`
 import torch
 from transformers import T5Tokenizer, T5ForConditionalGeneration
 
@@ -211,8 +214,32 @@ print(tokenizer.decode(outputs[0]))
 
 </details>
 
+ #### BFLOAT16
+
+ <details>
+ <summary> Click to expand </summary>
+
+ ```python
+ # pip install accelerate
+ import torch
+ from transformers import T5Tokenizer, T5ForConditionalGeneration
+
+ tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-xxl")
+ model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-xxl", device_map="auto", torch_dtype=torch.bfloat16)
+
+ input_text = "translate English to German: How old are you?"
+ input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda")
+
+ outputs = model.generate(input_ids)
+ print(tokenizer.decode(outputs[0]))
+ ```
+
+ </details>
+
 #### INT8
 
+ The original model was trained in `bfloat16`, therefore running such a large model in `int8` (the underlying technique behind `int8` quantization first casts the weights to `float16`) can lead to drastically reduced performance. We advise users to run this model in `bfloat16` or `float32` if they have enough compute resources. Check the previous section on how to run the model in `bfloat16`.
+
 <details>
 <summary> Click to expand </summary>
 
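
The INT8 code block itself falls outside the lines shown in this diff. For context only, here is a minimal sketch of how 8-bit loading is typically done with `transformers` and `bitsandbytes`; the exact contents of the card's INT8 block are an assumption here, not part of this change.

```python
# pip install bitsandbytes accelerate
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-xxl")
# load_in_8bit quantizes the weights to 8-bit via bitsandbytes;
# as the note above warns, expect reduced quality versus bfloat16/float32.
model = T5ForConditionalGeneration.from_pretrained(
    "google/flan-t5-xxl", device_map="auto", load_in_8bit=True
)

input_text = "translate English to German: How old are you?"
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda")

outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0]))
```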