duyntnet committed
Commit 84d161b
1 Parent(s): 5812d52

Upload README.md

Files changed (1)
  1. README.md +16 -44
README.md CHANGED
@@ -12,8 +12,16 @@ tags:
  ---
  Quantizations of https://huggingface.co/google/gemma-2-27b-it

- **Note**: All quants are created using latest [llama.cpp release](https://github.com/ggerganov/llama.cpp/releases) (b3266). This version (hopefully) fixes all Gemma 2 27B problems. You will need the latest version of llama.cpp to use these quants.
+ Update (July 8, 2024): **Requantized and reuploaded** using llama.cpp latest version (b3325), everything should work as expected.

+ ### Inference Clients/UIs
+ * [llama.cpp](https://github.com/ggerganov/llama.cpp)
+ * [JanAI](https://github.com/janhq/jan)
+ * [KoboldCPP](https://github.com/LostRuins/koboldcpp)
+ * [text-generation-webui](https://github.com/oobabooga/text-generation-webui)
+ * [ollama](https://github.com/ollama/ollama)
+
+ ---

  # From original readme

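All of the clients listed in the hunk above consume GGUF files directly. As a rough, unofficial sketch (not part of the upstream README), this is how one of these quants might be loaded through the llama-cpp-python bindings; the filename and parameter values below are placeholders, not a file guaranteed to exist in this repo.

```python
# pip install llama-cpp-python  (Python bindings for llama.cpp; assumed here, not listed in the README)
from llama_cpp import Llama

llm = Llama(
    model_path="gemma-2-27b-it-Q4_K_M.gguf",  # hypothetical local filename -- use whichever quant you downloaded
    n_ctx=4096,        # context window; adjust to taste
    n_gpu_layers=-1,   # offload all layers to GPU if VRAM allows, 0 for CPU-only
)

output = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Write me a poem about Machine Learning."}]
)
print(output["choices"][0]["message"]["content"])
```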
@@ -24,6 +32,8 @@ Below we share some code snippets on how to get quickly started with running the

  #### Running the model on a single / multi GPU

+ > [!IMPORTANT]
+ > Given the model instabilities with SDPA/ FA2, by default, the model inference would utilise `eager` attention.

  ```python
  # pip install accelerate
@@ -47,51 +57,10 @@ print(tokenizer.decode(outputs[0]))
  <a name="precisions"></a>
  #### Running the model on a GPU using different precisions

- The native weights of this model were exported in `bfloat16` precision. You can use `float16`, which may be faster on certain hardware, indicating the `torch_dtype` when loading the model. For convenience, the `float16` revision of the repo contains a copy of the weights already converted to that precision.
+ The native weights of this model were exported in `bfloat16` precision.

  You can also use `float32` if you skip the dtype, but no precision increase will occur (model weights will just be upcasted to `float32`). See examples below.

- * _Using `torch.float16`_
-
- ```python
- # pip install accelerate
- from transformers import AutoTokenizer, AutoModelForCausalLM
- import torch
-
- tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-27b-it")
- model = AutoModelForCausalLM.from_pretrained(
-     "google/gemma-2-27b-it",
-     device_map="auto",
-     torch_dtype=torch.float16,
-     revision="float16",
- )
-
- input_text = "Write me a poem about Machine Learning."
- input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")
-
- outputs = model.generate(**input_ids)
- print(tokenizer.decode(outputs[0]))
- ```
-
- * _Using `torch.bfloat16`_
-
- ```python
- # pip install accelerate
- from transformers import AutoTokenizer, AutoModelForCausalLM
-
- tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-27b-it")
- model = AutoModelForCausalLM.from_pretrained(
-     "google/gemma-2-27b-it",
-     device_map="auto",
-     torch_dtype=torch.bfloat16)
-
- input_text = "Write me a poem about Machine Learning."
- input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")
-
- outputs = model.generate(**input_ids)
- print(tokenizer.decode(outputs[0]))
- ```
-
  * _Upcasting to `torch.float32`_

  ```python
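The hunk above stops at the opening fence of the `float32` example. Judging from the prose ("if you skip the dtype") and the deleted `float16`/`bfloat16` snippets, the remainder presumably follows the same pattern; a reconstruction under that assumption, not a verbatim quote of the README:

```python
# pip install accelerate
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-27b-it")
# No torch_dtype given, so the bfloat16 weights are upcast to the default float32.
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-27b-it",
    device_map="auto",
)

input_text = "Write me a poem about Machine Learning."
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")

outputs = model.generate(**input_ids)
print(tokenizer.decode(outputs[0]))
```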
@@ -158,6 +127,9 @@ print(tokenizer.decode(outputs[0]))

  * _Flash Attention 2_

+ > [!WARNING]
+ > Gemma 2 is currently incompatible with Flash Attention/ SDPA, using it might result in unreliable generations. Use at your own risk.
+
  First make sure to install `flash-attn` in your environment `pip install flash-attn`

  ```diff
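The `diff` code block referenced above is truncated in this view. In transformers, Flash Attention 2 is selected through the `attn_implementation` argument of `from_pretrained`; a minimal sketch of what that change amounts to (subject to the compatibility warning quoted above), not the exact snippet from the README:

```python
import torch
from transformers import AutoModelForCausalLM

# Requires flash-attn to be installed and a compatible GPU.
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-27b-it",
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",  # instead of the default eager attention
).to(0)
```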
@@ -217,4 +189,4 @@ After the prompt is ready, generation can be performed like this:
  inputs = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt")
  outputs = model.generate(input_ids=inputs.to(model.device), max_new_tokens=150)
  print(tokenizer.decode(outputs[0]))
- ```
+ ```
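The final hunk assumes a `prompt` variable prepared earlier in the README with the tokenizer's chat template (that part of the file does not appear in this diff). A sketch of how such a prompt is typically built, assuming the standard transformers chat-template API:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "google/gemma-2-27b-it"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

chat = [{"role": "user", "content": "Write a hello world program"}]
# Render the chat into Gemma's prompt format and append the generation marker.
prompt = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)

inputs = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt")
outputs = model.generate(input_ids=inputs.to(model.device), max_new_tokens=150)
print(tokenizer.decode(outputs[0]))
```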
 