|
--- |
|
base_model: google/gemma-2-2b-it |
|
library_name: transformers |
|
license: gemma |
|
pipeline_tag: text-generation |
|
tags: |
|
- conversational |
|
- llama-cpp |
|
- gguf-my-repo |
|
extra_gated_heading: Access Gemma on Hugging Face |
|
extra_gated_prompt: To access Gemma on Hugging Face, you’re required to review and |
|
agree to Google’s usage license. To do this, please ensure you’re logged in to Hugging |
|
Face and click below. Requests are processed immediately. |
|
extra_gated_button_content: Acknowledge license |
|
--- |
|
|
|
<img src='https://github.com/fabiomatricardi/Gemma2-2b-it-chatbot/raw/main/images/gemma2-2b-myGGUF.png' width=900> |
|
<br><br><br> |
|
|
|
# FM-1976/gemma-2-2b-it-Q5_K_M-GGUF |
|
This model was converted to GGUF format from [`google/gemma-2-2b-it`](https://huggingface.co/google/gemma-2-2b-it) using llama.cpp via ggml.ai's [GGUF-my-repo](https://huggingface.co/spaces/ggml-org/gguf-my-repo) space.
|
Refer to the [original model card](https://huggingface.co/google/gemma-2-2b-it) for more details on the model. |
|
|
|
|
|
## Description |
|
Gemma is a family of lightweight, state-of-the-art open models from Google, built from the same research and technology used to create the Gemini models. They are text-to-text, decoder-only large language models, available in English, with open weights for both pre-trained variants and instruction-tuned variants. Gemma models are well-suited for a variety of text generation tasks, including question answering, summarization, and reasoning. Their relatively small size makes it possible to deploy them in environments with limited resources such as a laptop, desktop or your own cloud infrastructure, democratizing access to state of the art AI models and helping foster innovation for everyone. |
|
|
|
## Model Details |
|
Context window: 8192 tokens

System messages are not supported: the chat template raises an exception if a message with the `system` role is passed (see the loader log below).
|
```bash |
|
architecture str = gemma2 |
|
type str = model |
|
name str = Gemma 2 2b It |
|
finetune str = it |
|
basename str = gemma-2 |
|
size_label str = 2B |
|
license str = gemma |
|
count u32 = 1 |
|
model.0.name str = Gemma 2 2b |
|
organization str = Google |
|
format = GGUF V3 (latest) |
|
arch = gemma2 |
|
vocab type = SPM |
|
n_vocab = 256000 |
|
n_merges = 0 |
|
vocab_only = 0 |
|
n_ctx_train = 8192 |
|
n_embd = 2304 |
|
n_layer = 26 |
|
n_head = 8 |
|
n_head_kv = 4 |
|
model type = 2B |
|
model ftype = Q5_K - Medium |
|
model params = 2.61 B |
|
model size = 1.79 GiB (5.87 BPW) |
|
general.name = Gemma 2 2b It |
|
BOS token = 2 '<bos>' |
|
EOS token = 1 '<eos>' |
|
UNK token = 3 '<unk>' |
|
PAD token = 0 '<pad>' |
|
LF token = 227 '<0x0A>' |
|
EOT token = 107 '<end_of_turn>' |
|
EOG token = 1 '<eos>' |
|
EOG token = 107 '<end_of_turn>' |
|
|
|
>>> System role not supported |
|
Available chat formats from metadata: chat_template.default |
|
Using gguf chat template: {{ bos_token }}{% if messages[0]['role'] == 'system' %}{{ raise_exception('System role not supported') }}{% endif %}{% for message in messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if (message['role'] == 'assistant') %}{% set role = 'model' %}{% else %}{% set role = message['role'] %}{% endif %}{{ '<start_of_turn>' + role + ' |
|
' + message['content'] | trim + '<end_of_turn> |
|
' }}{% endfor %}{% if add_generation_prompt %}{{'<start_of_turn>model |
|
'}}{% endif %} |
|
Using chat eos_token: <eos> |
|
Using chat bos_token: <bos> |
|
|
|
``` |
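The metadata above comes from the llama.cpp loader log. If you want to reproduce it yourself, a minimal sketch (assuming the GGUF file has already been downloaded as shown later in this card) is to load the model with `verbose=True`; recent llama-cpp-python releases also expose the GGUF key/value pairs through the `metadata` attribute:

```python
from llama_cpp import Llama

# verbose=True makes llama.cpp print the loader log (architecture, context
# length, vocabulary, special tokens, chat template) to stderr while loading.
llm = Llama(model_path='gemma-2-2b-it-q5_k_m.gguf', n_ctx=8192, verbose=True)

# The parsed GGUF key/value metadata is kept on the Llama instance.
print(llm.metadata.get('tokenizer.chat_template', 'no chat template stored'))
```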
|
|
|
|
|
|
|
### Prompt Format |
|
```
|
<bos><start_of_turn>user
{prompt}<end_of_turn>
<start_of_turn>model
|
``` |
|
|
|
## Chat Template |
|
The instruction-tuned model uses a chat template that must be adhered to for conversational use. With this GGUF file the template is stored in the model metadata (see the loader log above), and `create_chat_completion()` in llama-cpp-python applies it automatically; you only need to pass a list of messages such as:
|
|
|
```python |
|
messages = [ |
|
{"role": "user", "content": "Write me a poem about Machine Learning."}, |
|
] |
|
``` |
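With `create_completion()` you have to build the prompt string yourself. The sketch below shows a hypothetical helper, `format_gemma_prompt` (not part of any library), that mirrors what the stored chat template does: it rejects the `system` role, maps `assistant` to `model`, and optionally appends the generation header. The user/assistant alternation check from the original template is omitted for brevity.

```python
def format_gemma_prompt(messages, add_generation_prompt=True):
    """Render a list of {'role', 'content'} dicts into the Gemma-2 turn format."""
    out = '<bos>'
    for message in messages:
        if message['role'] == 'system':
            raise ValueError('System role not supported')
        # the chat template calls the assistant turn 'model'
        role = 'model' if message['role'] == 'assistant' else message['role']
        out += f"<start_of_turn>{role}\n{message['content'].strip()}<end_of_turn>\n"
    if add_generation_prompt:
        out += '<start_of_turn>model\n'
    return out

print(format_gemma_prompt(messages))
```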
|
## Use with llama-cpp-python |
|
Install the llama-cpp-python bindings with pip (works on macOS, Linux and Windows).
|
|
|
```bash |
|
pip install llama-cpp-python |
|
|
|
``` |
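The default wheel runs on CPU. If you want to offload layers to a GPU, llama-cpp-python has to be built with the corresponding backend enabled; the exact flag names depend on the release (recent versions use the `GGML_*` options shown below, older ones used `LLAMA_CUBLAS`/`LLAMA_METAL`):

```bash
# NVIDIA GPUs (CUDA)
CMAKE_ARGS="-DGGML_CUDA=on" pip install --upgrade --force-reinstall --no-cache-dir llama-cpp-python

# Apple Silicon (Metal)
CMAKE_ARGS="-DGGML_METAL=on" pip install --upgrade --force-reinstall --no-cache-dir llama-cpp-python
```

After a GPU build you can pass `n_gpu_layers=-1` to `Llama(...)` to offload the whole model.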
|
### Download the GGUF file locally
|
```bash |
|
wget https://huggingface.co/FM-1976/gemma-2-2b-it-Q5_K_M-GGUF/resolve/main/gemma-2-2b-it-q5_k_m.gguf -O gemma-2-2b-it-q5_k_m.gguf
|
|
|
``` |
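If `wget` is not available (for example on Windows), the same file can be downloaded with the Hugging Face CLI:

```bash
pip install -U "huggingface_hub[cli]"
huggingface-cli download FM-1976/gemma-2-2b-it-Q5_K_M-GGUF gemma-2-2b-it-q5_k_m.gguf --local-dir .
```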
|
|
|
### Open your Python REPL |
|
|
|
#### Using chat_template |
|
```python |
|
from llama_cpp import Llama |
|
nCTX = 8192 |
|
sTOPS = ['<eos>'] |
|
llm = Llama( |
|
model_path='gemma-2-2b-it-q5_k_m.gguf', |
|
temperature=0.24, |
|
n_ctx=nCTX, |
|
max_tokens=600, |
|
repeat_penalty=1.176, |
|
stop=sTOPS, |
|
verbose=False, |
|
) |
|
messages = [ |
|
{"role": "user", "content": "Write me a poem about Machine Learning."}, |
|
] |
|
response = llm.create_chat_completion( |
|
messages=messages, |
|
temperature=0.15, |
|
repeat_penalty= 1.178, |
|
stop=sTOPS, |
|
max_tokens=500) |
|
print(response['choices'][0]['message']['content']) |
|
``` |
|
|
|
#### Using create_completion |
|
```python |
|
from llama_cpp import Llama |
|
nCTX = 8192 |
|
sTOPS = ['<eos>'] |
|
llm = Llama( |
|
model_path='gemma-2-2b-it-q5_k_m.gguf', |
|
temperature=0.24, |
|
n_ctx=nCTX, |
|
max_tokens=600, |
|
repeat_penalty=1.176, |
|
stop=sTOPS, |
|
verbose=False, |
|
) |
|
prompt = 'Explain Science in one sentence.'

# Gemma-2 prompt format: the generation prompt ends with the model turn header
template = f'''<bos><start_of_turn>user
{prompt}<end_of_turn>
<start_of_turn>model
'''

res = llm.create_completion(template, temperature=0.15, max_tokens=500, repeat_penalty=1.178, stop=['<eos>', '<end_of_turn>'])

print(res['choices'][0]['text'])
|
``` |
|
|
|
|
|
### Streaming text |
|
llama-cpp-python also lets you stream text during inference.<br>

Tokens are decoded and printed as soon as they are generated, so you don't have to wait for the entire completion to finish.
|
<br><br> |
|
You can use both `create_chat_completion()` and `create_completion()` methods. |
|
<br> |
|
|
|
#### Streaming with `create_chat_completion()` method |
|
```python |
|
import datetime |
|
from llama_cpp import Llama |
|
nCTX = 8192 |
|
sTOPS = ['<eos>'] |
|
llm = Llama( |
|
model_path='gemma-2-2b-it-q5_k_m.gguf', |
|
temperature=0.24, |
|
n_ctx=nCTX, |
|
max_tokens=600, |
|
repeat_penalty=1.176, |
|
stop=sTOPS, |
|
verbose=False, |
|
) |
|
firstround = 0
full_response = ''
message = [{'role': 'user', 'content': 'what is science?'}]
start = datetime.datetime.now()
for chunk in llm.create_chat_completion(
    messages=message,
    temperature=0.15,
    repeat_penalty=1.31,
    stop=['<eos>'],
    max_tokens=500,
    stream=True,):
    try:
        if chunk["choices"][0]["delta"]["content"]:
            print(chunk["choices"][0]["delta"]["content"], end="", flush=True)
            full_response += chunk["choices"][0]["delta"]["content"]
            if firstround == 0:
                # record time to first token on the first content chunk
                ttftoken = datetime.datetime.now() - start
                firstround = 1
    except KeyError:
        # the first streamed chunk only carries the assistant role, no content
        pass
first_token_time = ttftoken.total_seconds()
print(f'Time to first token: {first_token_time:.2f} seconds')
|
``` |
|
|
|
#### Streaming with `create_completion()` method |
|
|
|
```python |
|
import datetime |
|
from llama_cpp import Llama |
|
nCTX = 8192 |
|
sTOPS = ['<eos>'] |
|
llm = Llama( |
|
model_path='gemma-2-2b-it-q5_k_m.gguf', |
|
temperature=0.24, |
|
n_ctx=nCTX, |
|
max_tokens=600, |
|
repeat_penalty=1.176, |
|
stop=sTOPS, |
|
verbose=False, |
|
) |
|
firstround = 0
full_response = ''
prompt = 'Explain Science in one sentence.'

# Gemma-2 prompt format: the generation prompt ends with the model turn header
template = f'''<bos><start_of_turn>user
{prompt}<end_of_turn>
<start_of_turn>model
'''

start = datetime.datetime.now()
for chunk in llm.create_completion(
    template,
    temperature=0.15,
    repeat_penalty=1.178,
    stop=['<eos>', '<end_of_turn>'],
    max_tokens=500,
    stream=True,):
    print(chunk["choices"][0]["text"], end="", flush=True)
    full_response += chunk["choices"][0]["text"]
    if firstround == 0:
        # record time to first token on the first streamed chunk
        ttftoken = datetime.datetime.now() - start
        firstround = 1

first_token_time = ttftoken.total_seconds()
print(f'Time to first token: {first_token_time:.2f} seconds')
|
``` |
|
|
|
### Further exploration |
|
You can also serve the model through an OpenAI-compatible API server.<br>

This can be done with either `llama-cpp-python[server]` or `llamafile`.
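A minimal sketch with `llama-cpp-python[server]` (file name, host and port are assumptions, adjust to your setup):

```bash
pip install 'llama-cpp-python[server]'
python -m llama_cpp.server --model gemma-2-2b-it-q5_k_m.gguf --n_ctx 8192 --host 0.0.0.0 --port 8000
```

Any OpenAI-compatible client can then talk to `http://localhost:8000/v1`, for example:

```python
from openai import OpenAI

# No API key is required by default; the exact model id can be read from GET /v1/models.
client = OpenAI(base_url='http://localhost:8000/v1', api_key='sk-no-key-required')
resp = client.chat.completions.create(
    model='gemma-2-2b-it-q5_k_m.gguf',
    messages=[{'role': 'user', 'content': 'Write me a poem about Machine Learning.'}],
    temperature=0.15,
    max_tokens=500,
)
print(resp.choices[0].message.content)
```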
|
|
|
|
|
|
|
|
|
|