Better than expected!
For those looking for the sweet spot: after some prompt engineering I was able to get decent responses using Q6_K. Quantizing more aggressively than this results in poor output. It seems like intelligence, or the so-called internal world representation, emerges at around 835 MB.
What is the difference between a cow and a dog?
A cow is a farm animal, while a dog is a member of the canine family.
I'm definitely using it for my future projects.
Would you mind sharing your {system} prompt? I'm getting very inconsistent responses from the model, even though my formatting seems correct here ... It's always hit or miss for me.
from llama_cpp import Llama

# Load the Q6_K quant of TinyLlama-1.1B-Chat
llm = Llama(model_path="models/tinyllama-1.1b-chat-v0.3.Q6_K.gguf", verbose=False)

prompt = "What is the difference between a cow and a dog?"
# Short steering sentence that starts the assistant turn
s_system = "The clear answer for this question would be:"

# ChatML-style prompt; the assistant turn is left open so the model completes it
formatted_prompt = f"<|im_start|>user\n{prompt}<|im_end|>\n<|im_start|>assistant\n{s_system}"

stream = llm(formatted_prompt,
             top_k=50, repeat_penalty=1.1, top_p=0.9, max_tokens=64,
             stop=["</s>", ". ", "<|im_end|>"],  # stop at end-of-sequence, first sentence, or end of turn
             echo=True, stream=True)

for s in stream:
    print(s['choices'][0]['text'], end='')
I can't recall the prompt (I probably didn't use one anyway), but always use a low temperature and play with top_p until you get something close to what you expect.
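For example, something along these lines (a minimal sketch reusing `llm` and `formatted_prompt` from the snippet above; the temperature and top_p values are only illustrative starting points, not what I actually used):

# Sweep top_p at a low temperature and compare the answers.
# Assumes `llm` and `formatted_prompt` are defined as in the snippet above.
for top_p in (0.5, 0.7, 0.9):
    out = llm(formatted_prompt,
              temperature=0.2,   # low temperature = more deterministic output
              top_p=top_p,
              top_k=50, repeat_penalty=1.1, max_tokens=64,
              stop=["</s>", ". ", "<|im_end|>"])
    print(f"top_p={top_p}: {out['choices'][0]['text'].strip()}")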