Update README.md (#5)
Update README.md (2c290064b0dc519eee5624dfd3625b729ac5aea2)
README.md
CHANGED
@@ -95,8 +95,13 @@ from transformers import AutoModelForCausalLM
 model = AutoModelForCausalLM.from_pretrained('replit/replit-code-v1-3b', trust_remote_code=True)
 ```
 
-To use the optimized Triton implementation of FlashAttention on GPUs with BF16 precision,
+To use the optimized Triton implementation of FlashAttention on GPUs with BF16 precision, first install the following dependencies:
+```
+flash-attn==0.2.8
+triton==2.0.0.dev20221202
+```
 
+Then, move the model to `bfloat16` and use it as follows:
 ```python
 from transformers import AutoModelForCausalLM
 
@@ -106,7 +111,7 @@ model.to(device='cuda:0', dtype=torch.bfloat16)
 
 # forward pass
 x = torch.tensor([[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]])
-x = x.to(device='cuda:0'
+x = x.to(device='cuda:0')
 y = model(x)
 
 ```
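For reference, a minimal end-to-end sketch of what the updated README walks through: load the model with `trust_remote_code`, move it to `bfloat16` on a GPU, and run a forward pass. This only assembles the lines visible in the diff; it assumes a CUDA device is available and that the pinned `flash-attn` and `triton` versions listed above are installed.

```python
import torch
from transformers import AutoModelForCausalLM

# Load the Replit 3B code model; its custom modeling code requires trust_remote_code=True.
model = AutoModelForCausalLM.from_pretrained('replit/replit-code-v1-3b', trust_remote_code=True)

# Move the model to the GPU in bfloat16, as in the README snippet's context line.
model.to(device='cuda:0', dtype=torch.bfloat16)

# Forward pass on a toy batch of token ids, moved to the same device (the bug this PR fixes
# was a missing closing parenthesis on this line).
x = torch.tensor([[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]])
x = x.to(device='cuda:0')
y = model(x)
```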