|
--- |
|
base_model: google/gemma-2-2b-it |
|
library_name: transformers |
|
license: gemma |
|
pipeline_tag: text-generation |
|
tags: |
|
- conversational |
|
- llama-cpp |
|
- gguf-my-repo |
|
extra_gated_heading: Access Gemma on Hugging Face |
|
extra_gated_prompt: To access Gemma on Hugging Face, you’re required to review and |
|
agree to Google’s usage license. To do this, please ensure you’re logged in to Hugging |
|
Face and click below. Requests are processed immediately. |
|
extra_gated_button_content: Acknowledge license |
|
--- |
|
|
|
<img src='https://github.com/fabiomatricardi/Gemma2-2b-it-chatbot/raw/main/images/gemma2-2b-myGGUF.png' width=900> |
|
<br><br><br> |
|
|
|
# FM-1976/gemma-2-2b-it-Q5_K_M-GGUF |
|
This model was converted to GGUF format from [`google/gemma-2-2b-it`](https://huggingface.co/google/gemma-2-2b-it) using llama.cpp via ggml.ai's [GGUF-my-repo](https://huggingface.co/spaces/ggml-org/gguf-my-repo) space.
|
Refer to the [original model card](https://huggingface.co/google/gemma-2-2b-it) for more details on the model. |
|
|
|
|
|
## Description |
|
Gemma is a family of lightweight, state-of-the-art open models from Google, built from the same research and technology used to create the Gemini models. They are text-to-text, decoder-only large language models, available in English, with open weights for both pre-trained variants and instruction-tuned variants. Gemma models are well-suited for a variety of text generation tasks, including question answering, summarization, and reasoning. Their relatively small size makes it possible to deploy them in environments with limited resources such as a laptop, desktop or your own cloud infrastructure, democratizing access to state of the art AI models and helping foster innovation for everyone. |
|
|
|
## Model Details |
|
Context window: 8192 tokens

System messages are not supported: the chat template raises an exception if a message with the `system` role is passed (see the loader log below).
|
```bash |
|
architecture str = gemma2 |
|
type str = model |
|
name str = Gemma 2 2b It |
|
finetune str = it |
|
basename str = gemma-2 |
|
size_label str = 2B |
|
license str = gemma |
|
count u32 = 1 |
|
model.0.name str = Gemma 2 2b |
|
organization str = Google |
|
format = GGUF V3 (latest) |
|
arch = gemma2 |
|
vocab type = SPM |
|
n_vocab = 256000 |
|
n_merges = 0 |
|
vocab_only = 0 |
|
n_ctx_train = 8192 |
|
n_embd = 2304 |
|
n_layer = 26 |
|
n_head = 8 |
|
n_head_kv = 4 |
|
model type = 2B |
|
model ftype = Q5_K - Medium |
|
model params = 2.61 B |
|
model size = 1.79 GiB (5.87 BPW) |
|
general.name = Gemma 2 2b It |
|
BOS token = 2 '<bos>' |
|
EOS token = 1 '<eos>' |
|
UNK token = 3 '<unk>' |
|
PAD token = 0 '<pad>' |
|
LF token = 227 '<0x0A>' |
|
EOT token = 107 '<end_of_turn>' |
|
EOG token = 1 '<eos>' |
|
EOG token = 107 '<end_of_turn>' |
|
|
|
>>> System role not supported |
|
Available chat formats from metadata: chat_template.default |
|
Using gguf chat template: {{ bos_token }}{% if messages[0]['role'] == 'system' %}{{ raise_exception('System role not supported') }}{% endif %}{% for message in messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if (message['role'] == 'assistant') %}{% set role = 'model' %}{% else %}{% set role = message['role'] %}{% endif %}{{ '<start_of_turn>' + role + ' |
|
' + message['content'] | trim + '<end_of_turn> |
|
' }}{% endfor %}{% if add_generation_prompt %}{{'<start_of_turn>model |
|
'}}{% endif %} |
|
Using chat eos_token: <eos> |
|
Using chat bos_token: <bos> |
|
|
|
``` |
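The metadata above comes from the llama.cpp loader log. If you want to reproduce it yourself, a minimal sketch (assuming the GGUF file has already been downloaded as shown later in this card) is to load the model with `verbose=True`; recent llama-cpp-python releases also expose the GGUF key/value pairs through the `metadata` attribute:

```python
from llama_cpp import Llama

# verbose=True makes llama.cpp print the loader log (architecture, context
# length, vocabulary, special tokens, chat template) to stderr while loading.
llm = Llama(model_path='gemma-2-2b-it-q5_k_m.gguf', n_ctx=8192, verbose=True)

# The parsed GGUF key/value metadata is kept on the Llama instance.
print(llm.metadata.get('tokenizer.chat_template', 'no chat template stored'))
```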
|
|
|
|
|
|
|
### Prompt Format |
|
```
|
<bos><start_of_turn>user
{prompt}<end_of_turn>
<start_of_turn>model
|
``` |
|
|
|
## Chat Template |
|
The instruction-tuned model uses a chat template that must be adhered to for conversational use. With this GGUF file the template is stored in the model metadata (see the loader log above), and `create_chat_completion()` in llama-cpp-python applies it automatically; you only need to pass a list of messages such as:
|
|
|
```python |
|
messages = [ |
|
{"role": "user", "content": "Write me a poem about Machine Learning."}, |
|
] |
|
``` |
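With `create_completion()` you have to build the prompt string yourself. The sketch below shows a hypothetical helper, `format_gemma_prompt` (not part of any library), that mirrors what the stored chat template does: it rejects the `system` role, maps `assistant` to `model`, and optionally appends the generation header. The user/assistant alternation check from the original template is omitted for brevity.

```python
def format_gemma_prompt(messages, add_generation_prompt=True):
    """Render a list of {'role', 'content'} dicts into the Gemma-2 turn format."""
    out = '<bos>'
    for message in messages:
        if message['role'] == 'system':
            raise ValueError('System role not supported')
        # the chat template calls the assistant turn 'model'
        role = 'model' if message['role'] == 'assistant' else message['role']
        out += f"<start_of_turn>{role}\n{message['content'].strip()}<end_of_turn>\n"
    if add_generation_prompt:
        out += '<start_of_turn>model\n'
    return out

print(format_gemma_prompt(messages))
```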
|
## Use with llama-cpp-python |
|
Install the llama-cpp-python bindings with pip (works on macOS, Linux and Windows).
|
|
|
```bash |
|
pip install llama-cpp-python |
|
|
|
``` |
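The default wheel runs on CPU. If you want to offload layers to a GPU, llama-cpp-python has to be built with the corresponding backend enabled; the exact flag names depend on the release (recent versions use the `GGML_*` options shown below, older ones used `LLAMA_CUBLAS`/`LLAMA_METAL`):

```bash
# NVIDIA GPUs (CUDA)
CMAKE_ARGS="-DGGML_CUDA=on" pip install --upgrade --force-reinstall --no-cache-dir llama-cpp-python

# Apple Silicon (Metal)
CMAKE_ARGS="-DGGML_METAL=on" pip install --upgrade --force-reinstall --no-cache-dir llama-cpp-python
```

After a GPU build you can pass `n_gpu_layers=-1` to `Llama(...)` to offload the whole model.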
|
### Download the GGUF file locally
|
```bash |
|
wget https://huggingface.co/FM-1976/gemma-2-2b-it-Q5_K_M-GGUF/resolve/main/gemma-2-2b-it-q5_k_m.gguf -O gemma-2-2b-it-q5_k_m.gguf
|
|
|
``` |
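If `wget` is not available (for example on Windows), the same file can be downloaded with the Hugging Face CLI:

```bash
pip install -U "huggingface_hub[cli]"
huggingface-cli download FM-1976/gemma-2-2b-it-Q5_K_M-GGUF gemma-2-2b-it-q5_k_m.gguf --local-dir .
```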
|
|
|
### Open your Python REPL |
|
|
|
#### Using chat_template |
|
```python |
|
from llama_cpp import Llama |
|
nCTX = 8192 |
|
sTOPS = ['<eos>'] |
|
llm = Llama( |
|
model_path='gemma-2-2b-it-q5_k_m.gguf', |
|
temperature=0.24, |
|
n_ctx=nCTX, |
|
max_tokens=600, |
|
repeat_penalty=1.176, |
|
stop=sTOPS, |
|
verbose=False, |
|
) |
|
messages = [ |
|
{"role": "user", "content": "Write me a poem about Machine Learning."}, |
|
] |
|
response = llm.create_chat_completion( |
|
messages=messages, |
|
temperature=0.15, |
|
repeat_penalty= 1.178, |
|
stop=sTOPS, |
|
max_tokens=500) |
|
print(response['choices'][0]['message']['content']) |
|
``` |
|
|
|
#### Using create_completion |
|
```python |
|
from llama_cpp import Llama |
|
nCTX = 8192 |
|
sTOPS = ['<eos>'] |
|
llm = Llama( |
|
model_path='gemma-2-2b-it-q5_k_m.gguf', |
|
temperature=0.24, |
|
n_ctx=nCTX, |
|
max_tokens=600, |
|
repeat_penalty=1.176, |
|
stop=sTOPS, |
|
verbose=False, |
|
) |
|
prompt = 'Explain Science in one sentence.'

# Gemma-2 prompt format: the generation prompt ends with the model turn header
template = f'''<bos><start_of_turn>user
{prompt}<end_of_turn>
<start_of_turn>model
'''

res = llm.create_completion(template, temperature=0.15, max_tokens=500, repeat_penalty=1.178, stop=['<eos>', '<end_of_turn>'])

print(res['choices'][0]['text'])
|
``` |
|
|
|
|
|
### Streaming text |
|
llama-cpp-python also lets you stream text during inference.<br>

Tokens are decoded and printed as soon as they are generated, so you don't have to wait for the entire completion to finish.
|
<br><br> |
|
You can use both `create_chat_completion()` and `create_completion()` methods. |
|
<br> |
|
|
|
#### Streaming with `create_chat_completion()` method |
|
```python |
|
import datetime |
|
from llama_cpp import Llama |
|
nCTX = 8192 |
|
sTOPS = ['<eos>'] |
|
llm = Llama( |
|
model_path='gemma-2-2b-it-q5_k_m.gguf', |
|
temperature=0.24, |
|
n_ctx=nCTX, |
|
max_tokens=600, |
|
repeat_penalty=1.176, |
|
stop=sTOPS, |
|
verbose=False, |
|
) |
|
firstround = 0
full_response = ''
message = [{'role': 'user', 'content': 'what is science?'}]
start = datetime.datetime.now()
for chunk in llm.create_chat_completion(
    messages=message,
    temperature=0.15,
    repeat_penalty=1.31,
    stop=['<eos>'],
    max_tokens=500,
    stream=True,):
    try:
        if chunk["choices"][0]["delta"]["content"]:
            print(chunk["choices"][0]["delta"]["content"], end="", flush=True)
            full_response += chunk["choices"][0]["delta"]["content"]
            if firstround == 0:
                # record time to first token on the first content chunk
                ttftoken = datetime.datetime.now() - start
                firstround = 1
    except KeyError:
        # the first streamed chunk only carries the assistant role, no content
        pass
first_token_time = ttftoken.total_seconds()
print(f'Time to first token: {first_token_time:.2f} seconds')
|
``` |
|
|
|
#### Streaming with `create_completion()` method |
|
|
|
```python |
|
import datetime |
|
from llama_cpp import Llama |
|
nCTX = 8192 |
|
sTOPS = ['<eos>'] |
|
llm = Llama( |
|
model_path='gemma-2-2b-it-q5_k_m.gguf', |
|
temperature=0.24, |
|
n_ctx=nCTX, |
|
max_tokens=600, |
|
repeat_penalty=1.176, |
|
stop=sTOPS, |
|
verbose=False, |
|
) |
|
firstround = 0
full_response = ''
prompt = 'Explain Science in one sentence.'

# Gemma-2 prompt format: the generation prompt ends with the model turn header
template = f'''<bos><start_of_turn>user
{prompt}<end_of_turn>
<start_of_turn>model
'''

start = datetime.datetime.now()
for chunk in llm.create_completion(
    template,
    temperature=0.15,
    repeat_penalty=1.178,
    stop=['<eos>', '<end_of_turn>'],
    max_tokens=500,
    stream=True,):
    print(chunk["choices"][0]["text"], end="", flush=True)
    full_response += chunk["choices"][0]["text"]
    if firstround == 0:
        # record time to first token on the first streamed chunk
        ttftoken = datetime.datetime.now() - start
        firstround = 1

first_token_time = ttftoken.total_seconds()
print(f'Time to first token: {first_token_time:.2f} seconds')
|
``` |
|
|
|
### Further exploration |
|
You can also serve the model through an OpenAI-compatible API server.<br>

This can be done with either `llama-cpp-python[server]` or `llamafile`.
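A minimal sketch with `llama-cpp-python[server]` (file name, host and port are assumptions, adjust to your setup):

```bash
pip install 'llama-cpp-python[server]'
python -m llama_cpp.server --model gemma-2-2b-it-q5_k_m.gguf --n_ctx 8192 --host 0.0.0.0 --port 8000
```

Any OpenAI-compatible client can then talk to `http://localhost:8000/v1`, for example:

```python
from openai import OpenAI

# No API key is required by default; the exact model id can be read from GET /v1/models.
client = OpenAI(base_url='http://localhost:8000/v1', api_key='sk-no-key-required')
resp = client.chat.completions.create(
    model='gemma-2-2b-it-q5_k_m.gguf',
    messages=[{'role': 'user', 'content': 'Write me a poem about Machine Learning.'}],
    temperature=0.15,
    max_tokens=500,
)
print(resp.choices[0].message.content)
```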
|
|
|
|
|
|
|
|
|
|