gemma-2-27b-it-imatrix-GGUF / README.md

duyntnet

	---
	license: other
	language:
	- en
	pipeline_tag: text-generation
	inference: false
	tags:
	- transformers
	- gguf
	- imatrix
	- gemma-2-27b-it
	---
	Quantizations of https://huggingface.co/google/gemma-2-27b-it

	Note: All quants were created with [llama.cpp release](https://github.com/ggerganov/llama.cpp/releases) b3266, the latest at the time of writing. This version (hopefully) fixes all Gemma 2 27B problems. You will need llama.cpp b3266 or newer to use these quants.


	# From original readme

	### Usage

	Below are some code snippets to help you get started quickly with running the model. First, make sure to `pip install -U transformers`, then copy the snippet from the section that is relevant to your use case.


	#### Running the model on a single / multi GPU


	```python
	# pip install accelerate
	from transformers import AutoTokenizer, AutoModelForCausalLM
	import torch

	tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-27b-it")
	model = AutoModelForCausalLM.from_pretrained(
	    "google/gemma-2-27b-it",
	    device_map="auto",
	    torch_dtype=torch.bfloat16,
	)

	input_text = "Write me a poem about Machine Learning."
	input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")

	outputs = model.generate(**input_ids)
	print(tokenizer.decode(outputs[0]))
	```

	<a name="precisions"></a>
	#### Running the model on a GPU using different precisions

	The native weights of this model were exported in `bfloat16` precision. You can also use `float16`, which may be faster on certain hardware, by specifying the `torch_dtype` when loading the model. For convenience, the `float16` revision of the repo contains a copy of the weights already converted to that precision.

	You can also use `float32` if you omit the dtype, but this brings no increase in precision (the model weights are simply upcast to `float32`). See the examples below.

	* _Using `torch.float16`_

	```python
	# pip install accelerate
	from transformers import AutoTokenizer, AutoModelForCausalLM
	import torch

	tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-27b-it")
	model = AutoModelForCausalLM.from_pretrained(
	    "google/gemma-2-27b-it",
	    device_map="auto",
	    torch_dtype=torch.float16,
	    revision="float16",
	)

	input_text = "Write me a poem about Machine Learning."
	input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")

	outputs = model.generate(**input_ids)
	print(tokenizer.decode(outputs[0]))
	```

	* _Using `torch.bfloat16`_

	```python
	# pip install accelerate
	from transformers import AutoTokenizer, AutoModelForCausalLM
	import torch  # needed for torch.bfloat16 below

	tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-27b-it")
	model = AutoModelForCausalLM.from_pretrained(
	    "google/gemma-2-27b-it",
	    device_map="auto",
	    torch_dtype=torch.bfloat16,
	)

	input_text = "Write me a poem about Machine Learning."
	input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")

	outputs = model.generate(**input_ids)
	print(tokenizer.decode(outputs[0]))
	```

	* _Upcasting to `torch.float32`_

	```python
	# pip install accelerate
	from transformers import AutoTokenizer, AutoModelForCausalLM

	tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-27b-it")
	model = AutoModelForCausalLM.from_pretrained(
	    "google/gemma-2-27b-it",
	    device_map="auto",
	)

	input_text = "Write me a poem about Machine Learning."
	input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")

	outputs = model.generate(**input_ids)
	print(tokenizer.decode(outputs[0]))
	```
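
To illustrate why upcasting adds no precision, here is a minimal standard-library sketch. It assumes only that `bfloat16` keeps the top 16 bits of the `float32` bit pattern. The helper names are hypothetical, and truncation is used for simplicity where real converters typically round to nearest.

```python
import struct


def float32_bits(x: float) -> int:
    """Return the 32-bit IEEE 754 pattern of x after rounding to float32."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_to_float32(bits: int) -> float:
    """Interpret a 32-bit pattern as a float32 value."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def to_bfloat16_bits(x: float) -> int:
    """Keep the top 16 bits of the float32 pattern (simplified: truncation)."""
    return float32_bits(x) >> 16


def upcast_bfloat16_to_float32(bf16_bits: int) -> float:
    """Upcasting pads the 16 dropped mantissa bits with zeros."""
    return bits_to_float32(bf16_bits << 16)


original = 0.1
as_float32 = bits_to_float32(float32_bits(original))
stored = to_bfloat16_bits(original)
upcast = upcast_bfloat16_to_float32(stored)

print(f"float32 value:            {as_float32!r}")
print(f"bfloat16 -> float32:      {upcast!r}")
print(f"low 16 bits after upcast: {float32_bits(upcast) & 0xFFFF:#06x}")
```

The upcast value occupies 32 bits, but its low 16 bits are all zero, so it carries exactly the information the `bfloat16` weight had.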

	#### Quantized Versions through `bitsandbytes`

	* _Using 8-bit precision (int8)_

	```python
	# pip install bitsandbytes accelerate
	from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

	quantization_config = BitsAndBytesConfig(load_in_8bit=True)

	tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-27b-it")
	model = AutoModelForCausalLM.from_pretrained(
	    "google/gemma-2-27b-it",
	    quantization_config=quantization_config,
	)

	input_text = "Write me a poem about Machine Learning."
	input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")

	outputs = model.generate(**input_ids)
	print(tokenizer.decode(outputs[0]))
	```

	* _Using 4-bit precision_

	```python
	# pip install bitsandbytes accelerate
	from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

	quantization_config = BitsAndBytesConfig(load_in_4bit=True)

	tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-27b-it")
	model = AutoModelForCausalLM.from_pretrained(
	    "google/gemma-2-27b-it",
	    quantization_config=quantization_config,
	)

	input_text = "Write me a poem about Machine Learning."
	input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")

	outputs = model.generate(**input_ids)
	print(tokenizer.decode(outputs[0]))
	```
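
As a rough guide to choosing among the precisions above, the sketch below estimates the memory needed for the weights alone. It assumes about 27 billion parameters (inferred from the "27b" in the model name, not an exact count). It also ignores activations, the KV cache, quantization metadata, and any layers that `bitsandbytes` leaves unquantized, so real usage will be higher.

```python
# Assumption: ~27e9 parameters, inferred from the model name.
PARAM_COUNT = 27e9

# Nominal storage cost per parameter for each loading option.
BYTES_PER_PARAM = {
    "float32": 4.0,
    "bfloat16 / float16": 2.0,
    "int8 (load_in_8bit)": 1.0,
    "4-bit (load_in_4bit)": 0.5,
}


def weight_memory_gib(param_count: float, bytes_per_param: float) -> float:
    """Approximate weight storage in GiB (weights only, no runtime overhead)."""
    return param_count * bytes_per_param / 2**30


for name, nbytes in BYTES_PER_PARAM.items():
    print(f"{name:<22} ~{weight_memory_gib(PARAM_COUNT, nbytes):6.1f} GiB")
```

Under these assumptions, each halving of the bit width halves the weight footprint, which is the main reason to reach for the 8-bit or 4-bit configurations on smaller GPUs.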


	#### Other optimizations

	* _Flash Attention 2_

	First, make sure `flash-attn` is installed in your environment: `pip install flash-attn`.

	```diff
	model = AutoModelForCausalLM.from_pretrained(
	    model_id,
	    torch_dtype=torch.float16,
	+   attn_implementation="flash_attention_2"
	).to(0)
	```

	### Chat Template

	The instruction-tuned models use a chat template that must be adhered to for conversational use.
	The easiest way to apply it is using the tokenizer's built-in chat template, as shown in the following snippet.

	Let's load the model and apply the chat template to a conversation. In this example, we'll start with a single user interaction:

	```py
	from transformers import AutoTokenizer, AutoModelForCausalLM
	import torch

	model_id = "google/gemma-2-27b-it"
	dtype = torch.bfloat16

	tokenizer = AutoTokenizer.from_pretrained(model_id)
	model = AutoModelForCausalLM.from_pretrained(
	    model_id,
	    device_map="cuda",
	    torch_dtype=dtype,
	)

	chat = [
	    {"role": "user", "content": "Write a hello world program"},
	]
	prompt = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
	```

	At this point, the prompt contains the following text:

	```
	<bos><start_of_turn>user
	Write a hello world program<end_of_turn>
	<start_of_turn>model
	```

	As you can see, each turn is preceded by a `<start_of_turn>` delimiter and then the role of the entity
	(either `user`, for content supplied by the user, or `model` for LLM responses). Turns finish with
	the `<end_of_turn>` token.

	You can follow this format to build the prompt manually, if you need to do it without the tokenizer's
	chat template.
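
A minimal sketch of building the prompt by hand, following the turn format described above. The function name `build_gemma_prompt` is hypothetical, and the sketch accepts only the `user` and `model` roles. The tokenizer's own template may apply extra handling (for example, role mapping or whitespace trimming) that is not reproduced here.

```python
def build_gemma_prompt(chat, add_generation_prompt=True):
    """Format a list of {"role", "content"} dicts using the Gemma turn delimiters."""
    parts = ["<bos>"]
    for message in chat:
        role = message["role"]
        if role not in ("user", "model"):
            raise ValueError(f"unsupported role: {role!r}")
        # Each turn: start delimiter, role, newline, content, end delimiter.
        parts.append(f"<start_of_turn>{role}\n{message['content']}<end_of_turn>\n")
    if add_generation_prompt:
        # Open a model turn so that generation continues as the model's reply.
        parts.append("<start_of_turn>model\n")
    return "".join(parts)


chat = [{"role": "user", "content": "Write a hello world program"}]
prompt = build_gemma_prompt(chat)
print(prompt)
```

Because the prompt string already starts with `<bos>`, it should be tokenized with `add_special_tokens=False` so that the token is not added a second time.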

	After the prompt is ready, generation can be performed like this:

	```py
	inputs = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt")
	outputs = model.generate(input_ids=inputs.to(model.device), max_new_tokens=150)
	print(tokenizer.decode(outputs[0]))
	```
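
The decoded output contains the prompt followed by the model's reply. If only the reply is wanted, one option is to cut the decoded text at the turn delimiters, as in the sketch below. The helper name and the sample string are hypothetical, and treating `<eos>` as a stop marker is an assumption about how the end-of-sequence token is rendered.

```python
def extract_model_reply(decoded: str) -> str:
    """Return the text of the last model turn from a decoded generation."""
    marker = "<start_of_turn>model\n"
    start = decoded.rfind(marker)
    if start == -1:
        # No model turn found; return the text unchanged apart from whitespace.
        return decoded.strip()
    reply = decoded[start + len(marker):]
    # Cut at the first end-of-turn or end-of-sequence marker, if present.
    for stop in ("<end_of_turn>", "<eos>"):
        reply = reply.split(stop, 1)[0]
    return reply.strip()


# Hypothetical decoded text, for illustration only (not real model output).
sample = (
    "<bos><start_of_turn>user\n"
    "Write a hello world program<end_of_turn>\n"
    "<start_of_turn>model\n"
    'print("Hello, world!")<end_of_turn>'
)
print(extract_model_reply(sample))
```

An alternative that avoids string matching is to slice off the prompt tokens before decoding, using the length of the input tensor.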