duyntnet committed
Commit 84d161b
1 Parent(s): 5812d52

Upload README.md

Files changed (1)
  1. README.md +16 -44
README.md CHANGED
@@ -12,8 +12,16 @@ tags:
  ---
  Quantizations of https://huggingface.co/google/gemma-2-27b-it

- **Note**: All quants are created using latest [llama.cpp release](https://github.com/ggerganov/llama.cpp/releases) (b3266). This version (hopefully) fixes all Gemma 2 27B problems. You will need the latest version of llama.cpp to use these quants.
+ Update (July 8, 2024): **Requantized and reuploaded** using llama.cpp latest version (b3325), everything should work as expected.

+ ### Inference Clients/UIs
+ * [llama.cpp](https://github.com/ggerganov/llama.cpp)
+ * [JanAI](https://github.com/janhq/jan)
+ * [KoboldCPP](https://github.com/LostRuins/koboldcpp)
+ * [text-generation-webui](https://github.com/oobabooga/text-generation-webui)
+ * [ollama](https://github.com/ollama/ollama)
+
+ ---

  # From original readme

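All of the clients listed in the hunk above consume GGUF files directly. As a rough, unofficial sketch (not part of the upstream README), this is how one of these quants might be loaded through the llama-cpp-python bindings; the filename and parameter values below are placeholders, not a file guaranteed to exist in this repo.

```python
# pip install llama-cpp-python  (Python bindings for llama.cpp; assumed here, not listed in the README)
from llama_cpp import Llama

llm = Llama(
    model_path="gemma-2-27b-it-Q4_K_M.gguf",  # hypothetical local filename -- use whichever quant you downloaded
    n_ctx=4096,        # context window; adjust to taste
    n_gpu_layers=-1,   # offload all layers to GPU if VRAM allows, 0 for CPU-only
)

output = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Write me a poem about Machine Learning."}]
)
print(output["choices"][0]["message"]["content"])
```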
@@ -24,6 +32,8 @@ Below we share some code snippets on how to get quickly started with running the

  #### Running the model on a single / multi GPU

+ > [!IMPORTANT]
+ > Given the model instabilities with SDPA/ FA2, by default, the model inference would utilise `eager` attention.

  ```python
  # pip install accelerate
@@ -47,51 +57,10 @@ print(tokenizer.decode(outputs[0]))
  <a name="precisions"></a>
  #### Running the model on a GPU using different precisions

- The native weights of this model were exported in `bfloat16` precision. You can use `float16`, which may be faster on certain hardware, indicating the `torch_dtype` when loading the model. For convenience, the `float16` revision of the repo contains a copy of the weights already converted to that precision.
+ The native weights of this model were exported in `bfloat16` precision.

  You can also use `float32` if you skip the dtype, but no precision increase will occur (model weights will just be upcasted to `float32`). See examples below.

- * _Using `torch.float16`_
-
- ```python
- # pip install accelerate
- from transformers import AutoTokenizer, AutoModelForCausalLM
- import torch
-
- tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-27b-it")
- model = AutoModelForCausalLM.from_pretrained(
-     "google/gemma-2-27b-it",
-     device_map="auto",
-     torch_dtype=torch.float16,
-     revision="float16",
- )
-
- input_text = "Write me a poem about Machine Learning."
- input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")
-
- outputs = model.generate(**input_ids)
- print(tokenizer.decode(outputs[0]))
- ```
-
- * _Using `torch.bfloat16`_
-
- ```python
- # pip install accelerate
- from transformers import AutoTokenizer, AutoModelForCausalLM
-
- tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-27b-it")
- model = AutoModelForCausalLM.from_pretrained(
-     "google/gemma-2-27b-it",
-     device_map="auto",
-     torch_dtype=torch.bfloat16)
-
- input_text = "Write me a poem about Machine Learning."
- input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")
-
- outputs = model.generate(**input_ids)
- print(tokenizer.decode(outputs[0]))
- ```
-
  * _Upcasting to `torch.float32`_

  ```python
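The hunk above stops at the opening fence of the `float32` example. Judging from the prose ("if you skip the dtype") and the deleted `float16`/`bfloat16` snippets, the remainder presumably follows the same pattern; a reconstruction under that assumption, not a verbatim quote of the README:

```python
# pip install accelerate
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-27b-it")
# No torch_dtype given, so the bfloat16 weights are upcast to the default float32.
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-27b-it",
    device_map="auto",
)

input_text = "Write me a poem about Machine Learning."
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")

outputs = model.generate(**input_ids)
print(tokenizer.decode(outputs[0]))
```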
@@ -158,6 +127,9 @@ print(tokenizer.decode(outputs[0]))

  * _Flash Attention 2_

+ > [!WARNING]
+ > Gemma 2 is currently incompatible with Flash Attention/ SDPA, using it might result in unreliable generations. Use at your own risk.
+
  First make sure to install `flash-attn` in your environment `pip install flash-attn`

  ```diff
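The `diff` code block referenced above is truncated in this view. In transformers, Flash Attention 2 is selected through the `attn_implementation` argument of `from_pretrained`; a minimal sketch of what that change amounts to (subject to the compatibility warning quoted above), not the exact snippet from the README:

```python
import torch
from transformers import AutoModelForCausalLM

# Requires flash-attn to be installed and a compatible GPU.
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-27b-it",
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",  # instead of the default eager attention
).to(0)
```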
@@ -217,4 +189,4 @@ After the prompt is ready, generation can be performed like this:
  inputs = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt")
  outputs = model.generate(input_ids=inputs.to(model.device), max_new_tokens=150)
  print(tokenizer.decode(outputs[0]))
- ```
+ ```
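The final hunk assumes a `prompt` variable prepared earlier in the README with the tokenizer's chat template (that part of the file does not appear in this diff). A sketch of how such a prompt is typically built, assuming the standard transformers chat-template API:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "google/gemma-2-27b-it"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

chat = [{"role": "user", "content": "Write a hello world program"}]
# Render the chat into Gemma's prompt format and append the generation marker.
prompt = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)

inputs = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt")
outputs = model.generate(input_ids=inputs.to(model.device), max_new_tokens=150)
print(tokenizer.decode(outputs[0]))
```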
 