LlamaFinetuneBase committed: "Upload README.md with huggingface_hub"

README.md CHANGED

@@ -8,11 +8,8 @@ extra_gated_prompt: >-
   Google’s usage license. To do this, please ensure you’re logged in to Hugging
   Face and click below. Requests are processed immediately.
 extra_gated_button_content: Acknowledge license
-base_model: google/gemma-2-27b
 ---
 
-
-
 # Gemma 2 model card
 
 **Model Page**: [Gemma](https://ai.google.dev/gemma/docs)
@@ -23,7 +20,7 @@ base_model: google/gemma-2-27b
 * [Gemma on Kaggle][kaggle-gemma]
 * [Gemma on Vertex Model Garden][vertex-mg-gemma]
 
-**Terms of Use**: [Terms](https://www.kaggle.com/models/google/gemma/license/consent/verify/huggingface?returnModelRepoId=google/gemma-2-27b-it)
+**Terms of Use**: [Terms](https://www.kaggle.com/models/google/gemma/license/consent/verify/huggingface?returnModelRepoId=google/gemma-2-27b)
 
 **Authors**: Google
 
@@ -60,19 +57,14 @@ from transformers import pipeline
 
 pipe = pipeline(
     "text-generation",
-    model="google/gemma-2-27b-it",
-    model_kwargs={"torch_dtype": torch.bfloat16},
+    model="google/gemma-2-27b",
     device="cuda",  # replace with "mps" to run on a Mac device
 )
 
-messages = [
-    {"role": "user", "content": "Who are you? Please, answer in pirate-speak."},
-]
-
-outputs = pipe(messages, max_new_tokens=256)
-assistant_response = outputs[0]["generated_text"][-1]["content"].strip()
-print(assistant_response)
-# Ahoy, matey! I be Gemma, a digital scallywag, a language-slingin' parrot of the digital seas. I be here to help ye with yer wordy woes, answer yer questions, and spin ye yarns of the digital world. So, what be yer pleasure, eh? 🦜
+text = "Once upon a time,"
+outputs = pipe(text, max_new_tokens=256)
+response = outputs[0]["generated_text"]
+print(response)
 ```
 
 #### Running the model on a single / multi GPU
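
For convenience, the updated pipeline example in the hunk above can be read as one self-contained script; this sketch just stitches the added lines together with the `from transformers import pipeline` context line referenced in the hunk header.

```python
# Assembled from the "+" lines above plus the surrounding import context.
# Requires: pip install -U transformers
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="google/gemma-2-27b",
    device="cuda",  # replace with "mps" to run on a Mac device
)

text = "Once upon a time,"
outputs = pipe(text, max_new_tokens=256)
response = outputs[0]["generated_text"]
print(response)
```

The plain-text prompt is consistent with the switch away from the chat-style `messages` list used by the removed instruction-tuned example.
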
@@ -82,47 +74,9 @@ print(assistant_response)
 from transformers import AutoTokenizer, AutoModelForCausalLM
 import torch
 
-tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-27b-it")
+tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-27b")
 model = AutoModelForCausalLM.from_pretrained(
-    "google/gemma-2-27b-it",
-    device_map="auto",
-    torch_dtype=torch.bfloat16,
-)
-
-input_text = "Write me a poem about Machine Learning."
-input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")
-
-outputs = model.generate(**input_ids, max_new_tokens=32)
-print(tokenizer.decode(outputs[0]))
-```
-
-You can ensure the correct chat template is applied by using `tokenizer.apply_chat_template` as follows:
-```python
-messages = [
-    {"role": "user", "content": "Write me a poem about Machine Learning."},
-]
-input_ids = tokenizer.apply_chat_template(messages, return_tensors="pt", return_dict=True).to("cuda")
-
-outputs = model.generate(**input_ids, max_new_tokens=256)
-print(tokenizer.decode(outputs[0]))
-```
-
-<a name="precisions"></a>
-#### Running the model on a GPU using different precisions
-
-The native weights of this model were exported in `bfloat16` precision.
-
-You can also use `float32` if you skip the dtype, but no precision increase will occur (model weights will just be upcasted to `float32`). See examples below.
-
-* _Upcasting to `torch.float32`_
-
-```python
-# pip install accelerate
-from transformers import AutoTokenizer, AutoModelForCausalLM
-
-tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-27b-it")
-model = AutoModelForCausalLM.from_pretrained(
-    "google/gemma-2-27b-it",
+    "google/gemma-2-27b",
     device_map="auto",
 )
 
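
The loading code that survives this hunk is followed by generation lines that sit outside the hunk as unchanged context. For reference, a self-contained sketch that pairs the new base-model ID with the prompt and `generate()` call visible among the removed lines; the pairing itself is an assumption, not something the diff states.

```python
# Sketch: the "+" loading lines from this hunk, completed with the prompt and
# generate() call that appear among the removed ("-") lines above.
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-27b")
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-27b",
    device_map="auto",
)

input_text = "Write me a poem about Machine Learning."
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")

outputs = model.generate(**input_ids, max_new_tokens=32)
print(tokenizer.decode(outputs[0]))
```
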
@@ -140,7 +94,7 @@ for running Gemma 2 through a command line interface, or CLI. Follow the [instal
 for getting started, then launch the CLI through the following command:
 
 ```shell
-local-gemma --model 27b --
+local-gemma --model "google/gemma-2-27b" --prompt "What is the capital of Mexico?"
 ```
 
 #### Quantized Versions through `bitsandbytes`
@@ -156,9 +110,9 @@ from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
 
 quantization_config = BitsAndBytesConfig(load_in_8bit=True)
 
-tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-27b-it")
+tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-27b")
 model = AutoModelForCausalLM.from_pretrained(
-    "google/gemma-2-27b-it",
+    "google/gemma-2-27b",
     quantization_config=quantization_config,
 )
 
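
The 8-bit hunk only shows loading. Below is a hedged sketch of running a prompt through the quantized model; the prompt, token budget and the `model.get_memory_footprint()` check are illustrative additions, not part of the diff, and `bitsandbytes` plus `accelerate` need to be installed.

```python
# Illustrative continuation of the 8-bit loading shown in the hunk above.
# Requires: pip install -U transformers accelerate bitsandbytes
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(load_in_8bit=True)

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-27b")
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-27b",
    quantization_config=quantization_config,
)

# Rough sanity check: 8-bit weights should take roughly half the bfloat16 footprint.
print(f"Memory footprint: {model.get_memory_footprint() / 1e9:.1f} GB")

input_text = "Write me a poem about Machine Learning."
input_ids = tokenizer(input_text, return_tensors="pt").to(model.device)
outputs = model.generate(**input_ids, max_new_tokens=32)
print(tokenizer.decode(outputs[0]))
```
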
@@ -181,9 +135,9 @@ from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
 
 quantization_config = BitsAndBytesConfig(load_in_4bit=True)
 
-tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-27b-it")
+tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-27b")
 model = AutoModelForCausalLM.from_pretrained(
-    "google/gemma-2-27b-it",
+    "google/gemma-2-27b",
     quantization_config=quantization_config,
 )
 
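
For the 4-bit path, `BitsAndBytesConfig` accepts additional knobs beyond `load_in_4bit=True`. A hedged sketch follows; the NF4 quantization type and bfloat16 compute dtype are optional settings chosen for illustration, not taken from this diff.

```python
# Optional 4-bit settings layered on top of the plain load_in_4bit=True above.
# Requires: pip install -U transformers accelerate bitsandbytes
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NF4 storage format for the weights
    bnb_4bit_compute_dtype=torch.bfloat16,  # run the matmuls in bfloat16
)

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-27b")
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-27b",
    quantization_config=quantization_config,
)
```

These settings only change how the 4-bit weights are stored and how compute is performed; generation afterwards works the same way as in the 8-bit example.
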
@@ -218,8 +172,8 @@ import torch
 torch.set_float32_matmul_precision("high")
 
 # load the model + tokenizer
-tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-27b-it")
-model = Gemma2ForCausalLM.from_pretrained("google/gemma-2-27b-it", torch_dtype=torch.bfloat16)
+tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-27b")
+model = Gemma2ForCausalLM.from_pretrained("google/gemma-2-27b", torch_dtype=torch.bfloat16)
 model.to("cuda")
 
 # apply the torch compile transformation
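
This hunk cuts off just before the compile step. Below is a sketch of one typical way to finish the recipe; the static KV-cache setting and the `torch.compile` flags are common choices for this pattern, not read from the diff.

```python
# Sketch: static KV cache + compiled forward pass (a typical torch.compile recipe).
import torch
from transformers import AutoTokenizer, Gemma2ForCausalLM

torch.set_float32_matmul_precision("high")

# load the model + tokenizer
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-27b")
model = Gemma2ForCausalLM.from_pretrained("google/gemma-2-27b", torch_dtype=torch.bfloat16)
model.to("cuda")

# apply the torch compile transformation
model.generation_config.cache_implementation = "static"
model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)

# the first call pays the compilation cost; later calls with the same shapes are faster
inputs = tokenizer("Write me a poem about Machine Learning.", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0]))
```
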
@@ -257,57 +211,6 @@ For more details, refer to the [Transformers documentation](https://huggingface.
 
 </details>
 
-### Chat Template
-
-The instruction-tuned models use a chat template that must be adhered to for conversational use.
-The easiest way to apply it is using the tokenizer's built-in chat template, as shown in the following snippet.
-
-Let's load the model and apply the chat template to a conversation. In this example, we'll start with a single user interaction:
-
-```py
-from transformers import AutoTokenizer, AutoModelForCausalLM
-import transformers
-import torch
-
-model_id = "google/gemma-2-27b-it"
-dtype = torch.bfloat16
-
-tokenizer = AutoTokenizer.from_pretrained(model_id)
-model = AutoModelForCausalLM.from_pretrained(
-    model_id,
-    device_map="cuda",
-    torch_dtype=dtype,
-)
-
-chat = [
-    { "role": "user", "content": "Write a hello world program" },
-]
-prompt = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
-```
-
-At this point, the prompt contains the following text:
-
-```
-<bos><start_of_turn>user
-Write a hello world program<end_of_turn>
-<start_of_turn>model
-```
-
-As you can see, each turn is preceded by a `<start_of_turn>` delimiter and then the role of the entity
-(either `user`, for content supplied by the user, or `model` for LLM responses). Turns finish with
-the `<end_of_turn>` token.
-
-You can follow this format to build the prompt manually, if you need to do it without the tokenizer's
-chat template.
-
-After the prompt is ready, generation can be performed like this:
-
-```py
-inputs = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt")
-outputs = model.generate(input_ids=inputs.to(model.device), max_new_tokens=150)
-print(tokenizer.decode(outputs[0]))
-```
-
 ### Inputs and outputs
 
 * **Input:** Text string, such as a question, a prompt, or a document to be
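
The removed Chat Template section states that the prompt format can be built manually when the tokenizer's template is not used. Here is a minimal sketch of that, reusing the turn delimiters and example message quoted in the removed text; loading the instruction-tuned tokenizer is an assumption carried over from that section.

```python
# Build the Gemma chat prompt by hand, mirroring the format quoted in the
# removed section: <bos><start_of_turn>user ... <end_of_turn>\n<start_of_turn>model
from transformers import AutoTokenizer

# The removed section targets the instruction-tuned checkpoint, so its
# tokenizer is used here (an assumption, not part of the updated README).
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-27b-it")

user_message = "Write a hello world program"
prompt = (
    "<bos><start_of_turn>user\n"
    f"{user_message}<end_of_turn>\n"
    "<start_of_turn>model\n"
)

# <bos> is already in the string, so skip the tokenizer's special tokens,
# exactly as the removed generation snippet does.
inputs = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt")
print(inputs.shape)
```

For real chat use, the tokenizer's `apply_chat_template` shown in the removed snippet remains the safer route, since it tracks any template changes shipped with the tokenizer.
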