---
language:
- en
license: apache-2.0
tags:
- text-generation-inference
- transformers
- unsloth
- gemma2
- trl
base_model: EpistemeAI/Athena-gemma-2-9b-it
pipeline_tag: text-generation
model-index:
- name: Athena-gemma-2-2b-it
  results:
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: IFEval (0-Shot)
      type: HuggingFaceH4/ifeval
      args:
        num_few_shot: 0
    metrics:
    - type: inst_level_strict_acc and prompt_level_strict_acc
      value: 29.22
      name: strict accuracy
    source:
      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=EpistemeAI/Athena-gemma-2-2b-it
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: BBH (3-Shot)
      type: BBH
      args:
        num_few_shot: 3
    metrics:
    - type: acc_norm
      value: 19.07
      name: normalized accuracy
    source:
      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=EpistemeAI/Athena-gemma-2-2b-it
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: MATH Lvl 5 (4-Shot)
      type: hendrycks/competition_math
      args:
        num_few_shot: 4
    metrics:
    - type: exact_match
      value: 3.32
      name: exact match
    source:
      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=EpistemeAI/Athena-gemma-2-2b-it
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: GPQA (0-shot)
      type: Idavidrein/gpqa
      args:
        num_few_shot: 0
    metrics:
    - type: acc_norm
      value: 2.35
      name: acc_norm
    source:
      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=EpistemeAI/Athena-gemma-2-2b-it
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: MuSR (0-shot)
      type: TAUR-Lab/MuSR
      args:
        num_few_shot: 0
    metrics:
    - type: acc_norm
      value: 14.49
      name: acc_norm
    source:
      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=EpistemeAI/Athena-gemma-2-2b-it
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: MMLU-PRO (5-shot)
      type: TIGER-Lab/MMLU-Pro
      config: main
      split: test
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 15.86
      name: accuracy
    source:
      url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?query=EpistemeAI/Athena-gemma-2-2b-it
      name: Open LLM Leaderboard
---

### Description

#### Athena Gemma 2 2b instruct is the new code- and reasoning-focused supervised fine-tuned version of Gemma 2 2b instruct.

## Original Model description

Gemma is a family of lightweight, state-of-the-art open models from Google, built from the same research and technology used to create the Gemini models. They are text-to-text, decoder-only large language models, available in English, with open weights for both pre-trained variants and instruction-tuned variants. Gemma models are well-suited for a variety of text generation tasks, including question answering, summarization, and reasoning. Their relatively small size makes it possible to deploy them in environments with limited resources such as a laptop, desktop, or your own cloud infrastructure, democratizing access to state-of-the-art AI models and helping foster innovation for everyone.

### Training data

Athena-Gemma 2 2b instruct was supervised fine-tuned (SFT) on EpistemeAI's own Alpaca-format dataset covering high-level reasoning, how-to-code instructions, and basic to advanced Python code.

### Usage

Below we share some code snippets on how to get quickly started with running the model.
First, install the Transformers library with:

```sh
pip install -U transformers
```

Then, copy the snippet from the section that is relevant for your use case.

#### Running with the `pipeline` API

```python
import torch
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="EpistemeAI/Athena-gemma-2-2b-it",
    model_kwargs={"torch_dtype": torch.bfloat16},
    device="cuda",  # replace with "mps" to run on a Mac device
)

messages = [
    {"role": "user", "content": "Who are you? Please, answer in pirate-speak."},
]

outputs = pipe(messages, max_new_tokens=256)
assistant_response = outputs[0]["generated_text"][-1]["content"].strip()
print(assistant_response)
# Ahoy, matey! I be Gemma, a digital scallywag, a language-slingin' parrot of the digital seas. I be here to help ye with yer wordy woes, answer yer questions, and spin ye yarns of the digital world. So, what be yer pleasure, eh? 🦜
```

#### Running the model on a single / multi GPU

```python
# pip install accelerate
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("EpistemeAI/Athena-gemma-2-2b-it")
model = AutoModelForCausalLM.from_pretrained(
    "EpistemeAI/Athena-gemma-2-2b-it",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

input_text = "Write bubble sort in python code."
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")

outputs = model.generate(**input_ids, max_new_tokens=32)
print(tokenizer.decode(outputs[0]))
```

You can ensure the correct chat template is applied by using `tokenizer.apply_chat_template` as follows:

```python
messages = [
    {"role": "user", "content": "Write me a poem about Machine Learning."},
]
input_ids = tokenizer.apply_chat_template(messages, return_tensors="pt", return_dict=True).to("cuda")

outputs = model.generate(**input_ids, max_new_tokens=256)
print(tokenizer.decode(outputs[0]))
```

#### Running the model on a GPU using different precisions

The native weights of this model were exported in `bfloat16` precision. You can also use `float32` if you skip the dtype, but no precision increase will occur (the model weights will just be upcast to `float32`). See examples below.

* _Upcasting to `torch.float32`_

```python
# pip install accelerate
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("EpistemeAI/Athena-gemma-2-2b-it")
model = AutoModelForCausalLM.from_pretrained(
    "EpistemeAI/Athena-gemma-2-2b-it",
    device_map="auto",
)

input_text = "Write me a poem about Machine Learning."
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")

outputs = model.generate(**input_ids, max_new_tokens=32)
print(tokenizer.decode(outputs[0]))
```

#### Running the model through a CLI

The [local-gemma](https://github.com/huggingface/local-gemma) repository contains a lightweight wrapper around Transformers for running Gemma 2 through a command line interface, or CLI. Follow the [installation instructions](https://github.com/huggingface/local-gemma#cli-usage) to get started, then launch the CLI through the following command:

```shell
local-gemma --model 2b --preset speed
```

#### Quantized Versions through `bitsandbytes`
* _Using 8-bit precision (int8)_

```python
# pip install bitsandbytes accelerate
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(load_in_8bit=True)

tokenizer = AutoTokenizer.from_pretrained("EpistemeAI/Athena-gemma-2-2b-it")
model = AutoModelForCausalLM.from_pretrained(
    "EpistemeAI/Athena-gemma-2-2b-it",
    quantization_config=quantization_config,
)

input_text = "Write a for loop in Python that prints hello world 10 times."
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")

outputs = model.generate(**input_ids, max_new_tokens=32)
print(tokenizer.decode(outputs[0]))
```
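To see how much memory the quantized checkpoint actually occupies, you can use the `get_memory_footprint` method that Transformers exposes on loaded models. A minimal check, assuming the 8-bit model loaded above:

```python
# Rough memory check for the 8-bit model loaded above.
# get_memory_footprint() returns the size of the model's parameters in bytes.
footprint_gb = model.get_memory_footprint() / 1024**3
print(f"Model memory footprint: {footprint_gb:.2f} GB")
```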
* _Using 4-bit precision_

```python
# pip install bitsandbytes accelerate
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(load_in_4bit=True)

tokenizer = AutoTokenizer.from_pretrained("EpistemeAI/Athena-gemma-2-2b-it")
model = AutoModelForCausalLM.from_pretrained(
    "EpistemeAI/Athena-gemma-2-2b-it",
    quantization_config=quantization_config,
)

input_text = "Write a for loop in Python that prints hello world 10 times."
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")

outputs = model.generate(**input_ids, max_new_tokens=32)
print(tokenizer.decode(outputs[0]))
```
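The default 4-bit settings can optionally be tuned further. The sketch below is an addition to the original card, using standard `bitsandbytes` options: NF4 quantization with a `bfloat16` compute dtype, which typically preserves more quality at a similar memory footprint.

```python
# pip install bitsandbytes accelerate
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# NF4 quantization with bf16 compute: standard bitsandbytes options,
# shown here as an optional refinement, not the card's recommended setup.
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4 data type
    bnb_4bit_compute_dtype=torch.bfloat16,  # matmuls run in bf16
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
)

model = AutoModelForCausalLM.from_pretrained(
    "EpistemeAI/Athena-gemma-2-2b-it",
    quantization_config=quantization_config,
)
```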
#### Advanced Usage
**Torch compile**

[Torch compile](https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html) is a method for speeding up the inference of PyTorch modules. The Gemma-2 model can be run up to 6x faster by leveraging torch compile. Note that two warm-up steps are required before the full inference speed is realised:

```python
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"

from transformers import AutoTokenizer, Gemma2ForCausalLM
from transformers.cache_utils import HybridCache
import torch

torch.set_float32_matmul_precision("high")

# load the model + tokenizer
tokenizer = AutoTokenizer.from_pretrained("EpistemeAI/Athena-gemma-2-2b-it")
model = Gemma2ForCausalLM.from_pretrained("EpistemeAI/Athena-gemma-2-2b-it", torch_dtype=torch.bfloat16)
model.to("cuda")

# apply the torch compile transformation
model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)

# pre-process inputs
input_text = "The theory of special relativity states "
model_inputs = tokenizer(input_text, return_tensors="pt").to("cuda")
prompt_length = model_inputs.input_ids.shape[1]

# set-up k/v cache
past_key_values = HybridCache(
    config=model.config,
    max_batch_size=1,
    max_cache_len=model.config.max_position_embeddings,
    device=model.device,
    dtype=model.dtype
)

# enable passing kv cache to generate
model._supports_cache_class = True
model.generation_config.cache_implementation = None

# two warm-up steps
for idx in range(2):
    outputs = model.generate(**model_inputs, past_key_values=past_key_values, do_sample=True, temperature=1.0, max_new_tokens=128)
    past_key_values.reset()

# fast run
outputs = model.generate(**model_inputs, past_key_values=past_key_values, do_sample=True, temperature=1.0, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

For more details, refer to the [Transformers documentation](https://huggingface.co/docs/transformers/main/en/llm_optims?static-kv=basic+usage%3A+generation_config).
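To see what the compile transformation buys you on your own hardware, you can time the fast run. The snippet below is an illustrative addition, not part of the original card; it reuses the `model_inputs`, `past_key_values`, and `prompt_length` variables set up above:

```python
import time

# time the compiled fast run, reusing the setup from the snippet above
past_key_values.reset()
torch.cuda.synchronize()
start = time.perf_counter()
outputs = model.generate(**model_inputs, past_key_values=past_key_values,
                         do_sample=True, temperature=1.0, max_new_tokens=128)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

new_tokens = outputs.shape[1] - prompt_length
print(f"{new_tokens} tokens in {elapsed:.2f}s -> {new_tokens / elapsed:.1f} tokens/sec")
```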
### Chat Template

The instruction-tuned models use a chat template that must be adhered to for conversational use. The easiest way to apply it is using the tokenizer's built-in chat template, as shown in the following snippet.

Let's load the model and apply the chat template to a conversation. In this example, we'll start with a single user interaction:

```py
from transformers import AutoTokenizer, AutoModelForCausalLM
import transformers
import torch

model_id = "EpistemeAI/Athena-gemma-2-2b-it"
dtype = torch.bfloat16

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="cuda",
    torch_dtype=dtype,
)

chat = [
    { "role": "user", "content": "Write a hello world program" },
]
prompt = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
```

At this point, the prompt contains the following text:

```
<start_of_turn>user
Write a hello world program<end_of_turn>
<start_of_turn>model
```

As you can see, each turn is preceded by a `<start_of_turn>` delimiter and then the role of the entity (either `user`, for content supplied by the user, or `model` for LLM responses). Turns finish with the `<end_of_turn>` token. You can follow this format to build the prompt manually, if you need to do it without the tokenizer's chat template.

After the prompt is ready, generation can be performed like this:

```py
inputs = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt")
outputs = model.generate(input_ids=inputs.to(model.device), max_new_tokens=150)
print(tokenizer.decode(outputs[0]))
```

### Inputs and outputs

* **Input:** Text string, such as a question, a prompt, or a document to be summarized.
* **Output:** Generated English-language text in response to the input, such as an answer to a question, or a summary of a document.

# Uploaded model

- **Developed by:** EpistemeAI2
- **License:** apache-2.0
- **Fine-tuned from model:** EpistemeAI/Athena-gemma-2-9b-it

This gemma2 model was trained 2x faster with [Unsloth](https://github.com/unslothai/unsloth) and Hugging Face's TRL library.

### Notice

Gemma is provided under and subject to the Gemma Terms of Use found at [ai.google.dev/gemma/terms](https://ai.google.dev/gemma/terms).

# [Open LLM Leaderboard Evaluation Results](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard)

Detailed results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/details_EpistemeAI__Athena-gemma-2-2b-it)

| Metric             |Value|
|-------------------|----:|
|Avg.               |14.05|
|IFEval (0-Shot)    |29.22|
|BBH (3-Shot)       |19.07|
|MATH Lvl 5 (4-Shot)| 3.32|
|GPQA (0-shot)      | 2.35|
|MuSR (0-shot)      |14.49|
|MMLU-PRO (5-shot)  |15.86|