---

Quantizations of https://huggingface.co/google/gemma-2-27b-it

Update (July 8, 2024): **Requantized and reuploaded** using the latest llama.cpp release (b3325); everything should work as expected.

### Inference Clients/UIs

* [llama.cpp](https://github.com/ggerganov/llama.cpp)
* [JanAI](https://github.com/janhq/jan)
* [KoboldCPP](https://github.com/LostRuins/koboldcpp)
* [text-generation-webui](https://github.com/oobabooga/text-generation-webui)
* [ollama](https://github.com/ollama/ollama)
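
These GGUF files work with any of the clients listed above, and they can also be loaded straight from Python. The sketch below is not part of the original card: it uses the `llama-cpp-python` bindings for llama.cpp, and the quantization filename `gemma-2-27b-it-Q4_K_M.gguf` is only an assumed example, so substitute whichever file you actually downloaded.

```python
# pip install llama-cpp-python
from llama_cpp import Llama

# Load one of the GGUF quantizations from this repo (the filename is illustrative).
llm = Llama(
    model_path="gemma-2-27b-it-Q4_K_M.gguf",
    n_ctx=4096,       # context window size
    n_gpu_layers=-1,  # offload all layers to the GPU if llama.cpp was built with GPU support
)

# The chat template stored in the GGUF metadata is used to format the conversation.
out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Write me a poem about Machine Learning."}],
    max_tokens=256,
)
print(out["choices"][0]["message"]["content"])
```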

---

# From original readme

Below we share some code snippets on how to get quickly started with running the model.

#### Running the model on a single / multi GPU

> [!IMPORTANT]
> Given the model's instabilities with SDPA/FA2, model inference uses `eager` attention by default.

```python
# pip install accelerate
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-27b-it")
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-27b-it",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

input_text = "Write me a poem about Machine Learning."
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")

outputs = model.generate(**input_ids)
print(tokenizer.decode(outputs[0]))
```
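
The note above says inference defaults to `eager` attention for Gemma 2. If you want to make that explicit, or later compare it against another backend, Transformers accepts an `attn_implementation` argument; a minimal sketch, reusing the checkpoint from the snippet above:

```python
from transformers import AutoModelForCausalLM
import torch

# Request the eager (non-fused) attention path explicitly instead of relying on the default.
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-27b-it",
    device_map="auto",
    torch_dtype=torch.bfloat16,
    attn_implementation="eager",
)
```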

<a name="precisions"></a>
#### Running the model on a GPU using different precisions

The native weights of this model were exported in `bfloat16` precision.

You can also use `float32` if you skip the dtype, but no precision increase will occur (the model weights will just be upcast to `float32`). See the examples below.

* _Using `torch.float16`_

```python
# pip install accelerate
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-27b-it")
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-27b-it",
    device_map="auto",
    torch_dtype=torch.float16,
    revision="float16",
)

input_text = "Write me a poem about Machine Learning."
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")

outputs = model.generate(**input_ids)
print(tokenizer.decode(outputs[0]))
```

* _Using `torch.bfloat16`_

```python
# pip install accelerate
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-27b-it")
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-27b-it",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

input_text = "Write me a poem about Machine Learning."
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")

outputs = model.generate(**input_ids)
print(tokenizer.decode(outputs[0]))
```

* _Upcasting to `torch.float32`_

```python
# pip install accelerate
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-27b-it")
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-27b-it",
    device_map="auto",
)

input_text = "Write me a poem about Machine Learning."
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")

outputs = model.generate(**input_ids)
print(tokenizer.decode(outputs[0]))
```
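
A quick check that is not in the original card: after loading with any of the precision examples above, you can confirm what a given dtype choice actually costs by reading the parameter dtype and an approximate memory footprint from the model object.

```python
# Assumes `model` was loaded with one of the precision examples above.
print(model.dtype)  # e.g. torch.float16, torch.bfloat16, or torch.float32
print(f"~{model.get_memory_footprint() / 1e9:.1f} GB of weights in memory")
```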

* _Flash Attention 2_

> [!WARNING]
> Gemma 2 is currently incompatible with Flash Attention/SDPA; using it might result in unreliable generations. Use at your own risk.

First make sure to install `flash-attn` in your environment: `pip install flash-attn`

```diff
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-27b-it",
    device_map="auto",
    torch_dtype=torch.bfloat16,
+   attn_implementation="flash_attention_2",
)
```
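
The generation snippet below relies on a `prompt` variable that is built in the chat-template section of the original card, which is not reproduced in this excerpt. A minimal sketch of how such a prompt is typically constructed with the tokenizer's chat template (the example message is an assumption):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-27b-it")

# Format a conversation with Gemma's chat template and append the generation prompt.
chat = [{"role": "user", "content": "Write me a poem about Machine Learning."}]
prompt = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
```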

After the prompt is ready, generation can be performed like this:

```python
inputs = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt")
outputs = model.generate(input_ids=inputs.to(model.device), max_new_tokens=150)
print(tokenizer.decode(outputs[0]))
```
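
One extra note, not in the original card: `outputs[0]` contains the prompt tokens followed by the generated ones, so if you only want the model's reply you can slice the prompt off before decoding. A small sketch using the variables from the snippet above:

```python
# Keep only the tokens generated after the prompt, then decode them.
generated = outputs[0][inputs.shape[-1]:]
print(tokenizer.decode(generated, skip_special_tokens=True))
```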