Chat Template Format

#2
by madhucharan - opened

Hi, I don't understand why the chat template in your README.md looks like the one below.

Simple inference example

output = llm(
    "<|im_start|>system\n{system_message}<|im_end|>\n<|im_start|>user\n{prompt}<|im_end|>\n<|im_start|>assistant",  # Prompt
    max_tokens=512,  # Generate up to 512 tokens
    stop=[""],  # Example stop token - not necessarily correct for this specific model! Please check before using.
    echo=True  # Whether to echo the prompt
)

Shouldn't the format be like the one below?

<start_of_turn>user
{{ .Prompt }}<end_of_turn>
<start_of_turn>model
{{ .Response }}<end_of_turn>

I want to know why the formats differ and which of the two I should use.

Hi,
Sorry, the README uses a generic template shared across all the models, so that snippet is just an example rather than the exact prompt format for this model. Thanks for sharing this here, I'll leave it open for others in case it helps.
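For Gemma specifically, the <start_of_turn>user / <start_of_turn>model format you quoted is the right one. A minimal sketch with llama-cpp-python (the model path and filename here are assumptions, point it at whichever quant you actually downloaded):

# Sketch only: prompting a Gemma GGUF with llama-cpp-python using the
# <start_of_turn>/<end_of_turn> turn format. The filename is an assumption.
from llama_cpp import Llama

llm = Llama(model_path="./gemma-7b-it.Q4_K_M.gguf", n_ctx=4096)

prompt = (
    "<start_of_turn>user\n"
    "Write a Python function that reverses a string.<end_of_turn>\n"
    "<start_of_turn>model\n"
)

output = llm(
    prompt,
    max_tokens=512,
    stop=["<end_of_turn>"],  # stop at the end of the model's turn
    echo=False,              # don't repeat the prompt in the output
)
print(output["choices"][0]["text"])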

Alright. Also, since there is no comparison of the quant models, I'm confused about which one to pick. Can you suggest the best ones, if you have any idea? And are the Gemma-IT models better for prompting questions, or are the base Gemma models better? I believe the -it ones are better and that they understand the prompt and respond accordingly. I mainly want to generate solutions for coding prompts.

https://mlabonne.github.io/blog/posts/Quantize_Llama_2_models_using_ggml.html

q2_k: Uses Q4_K for the attention.wv and feed_forward.w2 tensors, Q2_K for the other tensors.
q3_k_l: Uses Q5_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else Q3_K
q3_k_m: Uses Q4_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else Q3_K
q3_k_s: Uses Q3_K for all tensors
q4_0: Original quant method, 4-bit.
q4_1: Higher accuracy than q4_0 but not as high as q5_0. However has quicker inference than q5 models.
q4_k_m: Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K
q4_k_s: Uses Q4_K for all tensors
q5_0: Higher accuracy, higher resource usage and slower inference.
q5_1: Even higher accuracy, resource usage and slower inference.
q5_k_m: Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K
q5_k_s: Uses Q5_K for all tensors
q6_k: Uses Q8_K for all tensors
q8_0: Almost indistinguishable from float16. High resource use and slow. Not recommended for most users.
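
As a rule of thumb, Q4_K_M and Q5_K_M are the usual balanced picks: close to the larger quants in quality while being noticeably smaller and faster than q6_k or q8_0. A minimal sketch of grabbing a single quant file from the Hub and loading it (the repo id and filename below are assumptions; check the Files tab on the model page for the exact names):

# Sketch only: download one quant file and load it locally.
# Repo id and filename are assumptions - verify them on the model page.
from huggingface_hub import hf_hub_download
from llama_cpp import Llama

model_file = hf_hub_download(
    repo_id="MaziyarPanahi/gemma-7b-it-GGUF",  # assumed repo id
    filename="gemma-7b-it.Q4_K_M.gguf",        # assumed filename
)
llm = Llama(model_path=model_file, n_ctx=4096)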

Gemma is a base model; it's best for fine-tuning on new datasets and tasks. Gemma IT is the instruct-tuned model you can prompt directly. However, I'm not sure how good a 7B model can be at coding solutions. I would also give this model a shot and pick the best one for your use case: https://huggingface.co/MaziyarPanahi/Mistral-7B-Instruct-v0.2-GGUF

I already use Mistral and other models; I want to use Gemma because I'm comparing outputs on hundreds of example coding prompts across different LLMs. So I want to pick a recommended quantized Gemma model for that.
