README.md · dranger003/zephyr-7b-gemma-v0.1-GGUF at 40e3c2b8b27e815199811941f0c4ac3fb2778694

metadata

license: other
license_name: gemma-terms-of-use
license_link: https://ai.google.dev/gemma/terms

GGUF quants for https://huggingface.co/HuggingFaceH4/zephyr-7b-gemma-v0.1

Zephyr is a series of language models that are trained to act as helpful assistants. Zephyr 7B Gemma is the third model in the series, and is a fine-tuned version of google/gemma-7b that was trained on a mix of publicly available, synthetic datasets using Direct Preference Optimization (DPO). You can reproduce the training of this model via the recipe provided in the Alignment Handbook.

There are a few things to consider when using this model with llama.cpp:

Special tokens <|im_start|> and <|im_end|> are not properly mapped as overrides of <start_of_turn> and <end_of_turn> (issue in the GGUF)
Repeat penalty must be 1.0 (i.e. disabled), just like with the base model
The model was not trained with system instructions (i.e. don't add the system part of the chatml template)
Generation must stop on the special token <end_of_turn> instead of <eos>, otherwise the model goes on forever
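
The points above can be sketched as two small shell helpers. This is only an illustration, not part of llama.cpp: `build_prompt` and `trim_at_stop` are hypothetical names, and the turn markers are the ones passed via `--in-prefix` and `--in-suffix` in the chat command in this README.

```shell
#!/bin/sh
# Hypothetical helper (not part of llama.cpp): wrap one user message in the
# turn markers used by the chat command in this README.
build_prompt() {
    # No system turn is emitted, since the model was not trained with one.
    # The prompt ends with an open assistant turn for the model to complete.
    printf '<start_of_turn>user\n%s<end_of_turn>\n<start_of_turn>assistant\n' "$1"
}

# Hypothetical helper: keep only the text before the first <end_of_turn>,
# which is where generation is supposed to stop.
trim_at_stop() {
    printf '%s' "${1%%<end_of_turn>*}"
}
```

For example, `build_prompt "Why is the sky blue?"` prints a single user turn followed by an open assistant turn, and `trim_at_stop` returns its input unchanged when no `<end_of_turn>` is present.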

Here's a setup that seems to work quite well for chatting with the model. The Q4_K quant is very fast and gives ~90 t/s on a 3090 when fully offloaded:

./main -ins -r "<end_of_turn>" --color -e --in-prefix "<start_of_turn>user\n" --in-suffix "<end_of_turn>\n<start_of_turn>assistant\n" -c 0 --temp 0.7 --repeat-penalty 1.0 -ngl 29 -m ggml-zephyr-7b-gemma-v0.1-q4_k.gguf
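
For readability, the same flags can be laid out one per line in a small wrapper. This is a sketch under assumptions: the function name `chat_zephyr` and the `DRY_RUN` variable are made up for illustration, and the flag comments reflect the llama.cpp `main` options as used in the command above.

```shell
#!/bin/sh
# Hypothetical wrapper around the command above. By default it only prints
# the arguments (one per line); set DRY_RUN=0 to actually launch ./main.
chat_zephyr() {
    model="${1:-ggml-zephyr-7b-gemma-v0.1-q4_k.gguf}"
    set -- -ins                               # instruction (interactive) mode
    set -- "$@" -r "<end_of_turn>"            # reverse prompt: stop at end of turn
    set -- "$@" --color
    set -- "$@" -e                            # process escapes such as \n below
    set -- "$@" --in-prefix "<start_of_turn>user\n"
    set -- "$@" --in-suffix "<end_of_turn>\n<start_of_turn>assistant\n"
    set -- "$@" -c 0                          # 0 takes the context size from the model
    set -- "$@" --temp 0.7
    set -- "$@" --repeat-penalty 1.0          # 1.0 disables the repeat penalty
    set -- "$@" -ngl 29                       # number of layers to offload to the GPU
    set -- "$@" -m "$model"

    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '%s\n' "$@"
    else
        ./main "$@"
    fi
}
```

Calling `chat_zephyr` with no argument uses the Q4_K file name from the command above; pass a different path as the first argument to try another quant.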

| Layers | Context | Template |
| --- | --- | --- |
| 28 | 8192 | <\|im_start\|>user {prompt}<\|im_end\|> <\|im_start\|>assistant {response} |