
No Q_K quants?

#1
by TheYuriLover - opened

To use k-quants, the tensor dimensions have to be a multiple of the k-quants block size. LLaMA models usually meet that requirement, but the 14B here doesn't. (Technically there is a way to use k-quants anyway, but it requires compiling with a special flag, both to quantize and to load the models, and you lose some of the advantage of k-quants that way.)
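To make the divisibility requirement concrete, here is a minimal sketch of the kind of check involved. It assumes a block size of 256 and that the row length is the first entry of each shape tuple; the tensor names and shapes are hypothetical illustrations, not values read from these GGUF files, and the helper functions are not part of llama.cpp.

```python
# Assumed k-quants block size; treat this value as an assumption of the sketch.
QK_K = 256


def k_quant_compatible(shape, block_size=QK_K):
    """Return True if the row length (first entry of shape) is a multiple of block_size."""
    row_len = shape[0]
    return row_len % block_size == 0


def incompatible_tensors(shapes, block_size=QK_K):
    """Return the names of tensors whose row length does not divide evenly into blocks."""
    return [
        name
        for name, shape in shapes.items()
        if not k_quant_compatible(shape, block_size)
    ]


# Hypothetical tensor names and shapes, for illustration only.
example_shapes = {
    "hypothetical.attn_q.weight": (4096, 4096),
    "hypothetical.ffn_up.weight": (13696, 5120),
}

if __name__ == "__main__":
    print(incompatible_tensors(example_shapes))
    # The same check with a smaller, assumed block size of 64.
    print(incompatible_tensors(example_shapes, block_size=64))
```

A row length such as 13696 is not a multiple of 256 but is a multiple of 64, which is why a build using a smaller block size could still quantize such a tensor.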
There is also a separate issue with the conversion: the BPE merges didn't get added to the GGUF files (both 7B and 14B, as far as I know), so the models can't be loaded. This is not TB's fault, but I suggest waiting for a fixed version before downloading them. Associated GitHub issue: https://github.com/ggerganov/llama.cpp/issues/3732 Edit: Should be fixed now.

There's a PR to change this behavior: https://github.com/ggerganov/llama.cpp/pull/3747
