rahuldshetty committed
Commit eb9a94e
1 Parent(s): 0e62e79

Update README.md

Files changed (1)
  1. README.md +66 -18
README.md CHANGED
@@ -20,13 +20,13 @@ GGUF Quantized version of [gemma-7b-it](https://huggingface.co/google/gemma-7b-i
 
 **Model Page**: [Gemma](https://ai.google.dev/gemma/docs)
 
- This model card corresponds to the 2B base version of the Gemma model. You can also visit the model card of the [7B base model](https://huggingface.co/google/gemma-7b), [7B instruct model](https://huggingface.co/google/gemma-7b-it), and [2B instruct model](https://huggingface.co/google/gemma-2b-it).
+ This model card corresponds to the 7B instruct version of the Gemma model. You can also visit the model cards of the [2B base model](https://huggingface.co/google/gemma-2b), [7B base model](https://huggingface.co/google/gemma-7b), and [2B instruct model](https://huggingface.co/google/gemma-2b-it).
 
 **Resources and Technical Documentation**:
 
 * [Responsible Generative AI Toolkit](https://ai.google.dev/responsible)
 * [Gemma on Kaggle](https://www.kaggle.com/models/google/gemma)
- * [Gemma on Vertex Model Garden](https://console.cloud.google.com/vertex-ai/publishers/google/model-garden/335?version=gemma-2b-gg-hf)
+ * [Gemma on Vertex Model Garden](https://console.cloud.google.com/vertex-ai/publishers/google/model-garden/335?version=gemma-7b-it-gg-hf)
 
 **Terms of Use**: [Terms](https://www.kaggle.com/models/google/gemma/license/consent)
 
@@ -52,10 +52,9 @@ state of the art AI models and helping foster innovation for everyone.
 
 Below we share some code snippets on how to quickly get started with running the model. First make sure to `pip install -U transformers`, then copy the snippet from the section that is relevant for your use case.
 
-
 #### Fine-tuning the model
 
- You can find fine-tuning scripts and notebook under the [`examples/` directory](https://huggingface.co/google/gemma-7b/tree/main/examples) of [`google/gemma-7b`](https://huggingface.co/google/gemma-7b) repository. To adapt it to this model, simply change the model-id to `google/gemma-2b`.
+ You can find fine-tuning scripts and notebooks under the [`examples/` directory](https://huggingface.co/google/gemma-7b/tree/main/examples) of the [`google/gemma-7b`](https://huggingface.co/google/gemma-7b) repository. To adapt them to this model, simply change the model id to `google/gemma-7b-it`.
 In that repository, we provide:
 
 * A script to perform Supervised Fine-Tuning (SFT) on UltraChat dataset using QLoRA
@@ -63,15 +62,14 @@ In that repository, we provide:
 * A notebook that you can run on a free-tier Google Colab instance to perform SFT on English quotes dataset
 
 
-
 #### Running the model on a CPU
 
 
 ```python
 from transformers import AutoTokenizer, AutoModelForCausalLM
 
- tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b")
- model = AutoModelForCausalLM.from_pretrained("google/gemma-2b")
+ tokenizer = AutoTokenizer.from_pretrained("google/gemma-7b-it")
+ model = AutoModelForCausalLM.from_pretrained("google/gemma-7b-it")
 
 input_text = "Write me a poem about Machine Learning."
 input_ids = tokenizer(input_text, return_tensors="pt")
@@ -88,8 +86,8 @@ print(tokenizer.decode(outputs[0]))
 # pip install accelerate
 from transformers import AutoTokenizer, AutoModelForCausalLM
 
- tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b")
- model = AutoModelForCausalLM.from_pretrained("google/gemma-2b", device_map="auto")
+ tokenizer = AutoTokenizer.from_pretrained("google/gemma-7b-it")
+ model = AutoModelForCausalLM.from_pretrained("google/gemma-7b-it", device_map="auto")
 
 input_text = "Write me a poem about Machine Learning."
 input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")
@@ -107,8 +105,8 @@ print(tokenizer.decode(outputs[0]))
 # pip install accelerate
 from transformers import AutoTokenizer, AutoModelForCausalLM
 
- tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b")
- model = AutoModelForCausalLM.from_pretrained("google/gemma-2b", device_map="auto", torch_dtype=torch.float16)
+ tokenizer = AutoTokenizer.from_pretrained("google/gemma-7b-it")
+ model = AutoModelForCausalLM.from_pretrained("google/gemma-7b-it", device_map="auto", torch_dtype=torch.float16)
 
 input_text = "Write me a poem about Machine Learning."
 input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")
@@ -123,8 +121,8 @@ print(tokenizer.decode(outputs[0]))
 # pip install accelerate
 from transformers import AutoTokenizer, AutoModelForCausalLM
 
- tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b")
- model = AutoModelForCausalLM.from_pretrained("google/gemma-2b", device_map="auto", torch_dtype=torch.bfloat16)
+ tokenizer = AutoTokenizer.from_pretrained("google/gemma-7b-it")
+ model = AutoModelForCausalLM.from_pretrained("google/gemma-7b-it", device_map="auto", torch_dtype=torch.bfloat16)
 
 input_text = "Write me a poem about Machine Learning."
 input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")
@@ -143,8 +141,8 @@ from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
 
 quantization_config = BitsAndBytesConfig(load_in_8bit=True)
 
- tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b")
- model = AutoModelForCausalLM.from_pretrained("google/gemma-2b", quantization_config=quantization_config)
+ tokenizer = AutoTokenizer.from_pretrained("google/gemma-7b-it")
+ model = AutoModelForCausalLM.from_pretrained("google/gemma-7b-it", quantization_config=quantization_config)
 
 input_text = "Write me a poem about Machine Learning."
 input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")
@@ -161,8 +159,8 @@ from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
 
 quantization_config = BitsAndBytesConfig(load_in_4bit=True)
 
- tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b")
- model = AutoModelForCausalLM.from_pretrained("google/gemma-2b", quantization_config=quantization_config)
+ tokenizer = AutoTokenizer.from_pretrained("google/gemma-7b-it")
+ model = AutoModelForCausalLM.from_pretrained("google/gemma-7b-it", quantization_config=quantization_config)
 
 input_text = "Write me a poem about Machine Learning."
 input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")
@@ -186,6 +184,56 @@ model = AutoModelForCausalLM.from_pretrained(
 ).to(0)
 ```
 
+ ### Chat Template
+
+ The instruction-tuned models use a chat template that must be adhered to for conversational use.
+ The easiest way to apply it is using the tokenizer's built-in chat template, as shown in the following snippet.
+
+ Let's load the model and apply the chat template to a conversation. In this example, we'll start with a single user interaction:
+
+ ```py
+ from transformers import AutoTokenizer, AutoModelForCausalLM
+ import transformers
+ import torch
+
+ model_id = "google/gemma-7b-it"
+ dtype = torch.bfloat16
+
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+ model = AutoModelForCausalLM.from_pretrained(
+     model_id,
+     device_map="cuda",
+     torch_dtype=dtype,
+ )
+
+ chat = [
+     { "role": "user", "content": "Write a hello world program" },
+ ]
+ prompt = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
+ ```
+
+ At this point, the prompt contains the following text:
+
+ ```
+ <start_of_turn>user
+ Write a hello world program<end_of_turn>
+ <start_of_turn>model
+ ```
+
+ As you can see, each turn is preceded by a `<start_of_turn>` delimiter and then the role of the entity
+ (either `user`, for content supplied by the user, or `model` for LLM responses). Turns finish with
+ the `<end_of_turn>` token.
+
+ You can follow this format to build the prompt manually, if you need to do it without the tokenizer's
+ chat template.
+
+ After the prompt is ready, generation can be performed like this:
+
+ ```py
+ inputs = tokenizer.encode(prompt, add_special_tokens=True, return_tensors="pt")
+ outputs = model.generate(input_ids=inputs.to(model.device), max_new_tokens=150)
+ ```
+
 ### Inputs and outputs
 
 * **Input:** Text string, such as a question, a prompt, or a document to be
@@ -260,7 +308,7 @@ several advantages in this domain:
 
 ### Software
 
- Training was done using [JAX](https://github.com/google/jax) and [ML Pathways](https://blog.google/technology/ai/introducing-pathways-next-generation-ai-architecture/ml-pathways).
+ Training was done using [JAX](https://github.com/google/jax) and [ML Pathways](https://blog.google/technology/ai/introducing-pathways-next-generation-ai-architecture).
 
 JAX allows researchers to take advantage of the latest generation of hardware,
 including TPUs, for faster and more efficient training of large models.
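The chat template section added above notes that the prompt can also be built by hand when `apply_chat_template` is not used. Below is a minimal sketch of doing that with `transformers`, assuming the `google/gemma-7b-it` checkpoint and the `<start_of_turn>`/`<end_of_turn>` markers shown in the diff; the variable names are illustrative and not part of the README.

```py
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "google/gemma-7b-it"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="cuda", torch_dtype=torch.bfloat16)

# Build the prompt manually using the turn markers described in the chat template section.
user_message = "Write a hello world program"
prompt = (
    "<start_of_turn>user\n"
    f"{user_message}<end_of_turn>\n"
    "<start_of_turn>model\n"
)

# tokenizer() adds special tokens by default, which typically supplies the leading <bos>
# that the built-in chat template would otherwise prepend.
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=150)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```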
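Since this repository distributes GGUF quantizations of gemma-7b-it rather than the original weights, the quantized files can also be run with a GGUF-compatible runtime such as `llama-cpp-python`. The sketch below is only an illustration under that assumption; the file name `gemma-7b-it.Q4_K_M.gguf` is a placeholder for whichever quantized file you download from this repo.

```py
# Sketch only: requires `pip install llama-cpp-python`; the GGUF file name is a placeholder.
from llama_cpp import Llama

llm = Llama(model_path="gemma-7b-it.Q4_K_M.gguf", n_ctx=2048)

# Reuse the Gemma turn format described in the chat template section above.
prompt = (
    "<start_of_turn>user\n"
    "Write a hello world program<end_of_turn>\n"
    "<start_of_turn>model\n"
)

output = llm(prompt, max_tokens=150, stop=["<end_of_turn>"])
print(output["choices"][0]["text"])
```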