Quantizations for llama.cpp

#7 opened by rozek

Thank you very much (again) for this marvellous work! Being able to use long contexts for analyzing texts with LLMs is really important!

In order to use your model with llama.cpp, I've (again) generated some quantizations in GGUF format.
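For reference, the workflow looks roughly like the sketch below (not necessarily the exact commands I ran; the script and binary names come with the llama.cpp repository and may differ between versions, and all paths are placeholders):

```python
# Rough sketch of the quantization workflow:
# 1) convert the Hugging Face checkpoint to a full-precision GGUF file,
# 2) quantize that file with llama.cpp's quantization tool.
# Script/binary names and paths are placeholders and may vary with the llama.cpp version.
import subprocess

MODEL_DIR = "path/to/hf-model"   # original Hugging Face checkpoint
F16_GGUF  = "model-f16.gguf"     # intermediate full-precision GGUF
Q8_GGUF   = "model-Q8_0.gguf"    # final quantized file

# HF checkpoint -> GGUF (f16)
subprocess.run(
    ["python", "convert_hf_to_gguf.py", MODEL_DIR,
     "--outtype", "f16", "--outfile", F16_GGUF],
    check=True,
)

# GGUF (f16) -> Q8_0 quantization
subprocess.run(
    ["./llama-quantize", F16_GGUF, Q8_GGUF, "Q8_0"],
    check=True,
)
```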

Assuming that the prompt follows the format described in the model card, the Q8_0 quantization performs quite well; the Q4_0 quantization, on the other hand, hallucinates far too much.

In any case, with 24 GB of RAM, llama.cpp can now handle contexts up to the recommended limit of 32k tokens when using the Q8_0 quantization, which is really cool!
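For anyone who wants to try it, here is a minimal sketch of loading the Q8_0 file with the full 32k context. It uses the llama-cpp-python bindings for illustration (I run llama.cpp directly; the equivalent CLI flag is -c 32768), and the file name and prompt are placeholders:

```python
# Minimal sketch: load the Q8_0 GGUF with a 32k context window.
# Uses the llama-cpp-python bindings; file name and prompt are placeholders.
from llama_cpp import Llama

llm = Llama(
    model_path="model-Q8_0.gguf",  # Q8_0 GGUF produced above
    n_ctx=32768,                   # recommended 32k context limit
)

# The prompt should follow the format described in the model card (not reproduced here).
result = llm("<your prompt in the model-card format>", max_tokens=256)
print(result["choices"][0]["text"])
```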
