GGUF Q8 output quality

#1
by opendev - opened

I'm running a quantized GGUF version of aya-35B locally at the highest quant, Q8 (48GB VRAM), but the local output quality is much lower than the quality of the model in this space. The difference is very noticeable. Why is that? I haven't seen such a large gap with other models at their highest quants.

Prompt format I'm using locally:
<|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>
{system_prompt}
<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|USER_TOKEN|>
{prompt}
<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>
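
To make the format above easier to reproduce and compare, here is a minimal Python sketch that fills in the template with plain string formatting. The names `TEMPLATE`, `build_prompt`, and `keep_newlines` are hypothetical and not part of any library. The `keep_newlines` flag is only there to test an assumption: that the line breaks around the placeholders might matter to the model. Nothing in this thread establishes that they do.

```python
# Template copied from the post, with the line breaks written out explicitly.
TEMPLATE = (
    "<|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>\n"
    "{system_prompt}\n"
    "<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|USER_TOKEN|>\n"
    "{prompt}\n"
    "<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>"
)


def build_prompt(system_prompt: str, prompt: str, keep_newlines: bool = True) -> str:
    """Fill in the template (hypothetical helper for illustration).

    With keep_newlines=False, only the template's own line breaks are
    removed; line breaks inside the system prompt or user prompt are kept.
    """
    template = TEMPLATE if keep_newlines else TEMPLATE.replace("\n", "")
    return template.format(system_prompt=system_prompt, prompt=prompt)


if __name__ == "__main__":
    # Print both variants with repr() so the line breaks are visible.
    print(repr(build_prompt("You are a helpful assistant.", "Hello!")))
    print(repr(build_prompt("You are a helpful assistant.", "Hello!", keep_newlines=False)))
```

Running the same question through both variants, with identical sampling settings, is one way to check whether the prompt layout contributes to the quality gap.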
