tobchef/Qwen1.5-14B-Chat-Q2_K-GGUF

This model was converted to GGUF format from Qwen/Qwen1.5-14B-Chat using llama.cpp via ggml.ai's GGUF-my-repo space. Refer to the original model card for more details on the model.

Use with llama.cpp

Install llama.cpp through brew.

brew install ggerganov/ggerganov/llama.cpp

Invoke the llama.cpp server or the CLI. CLI:

llama-cli --hf-repo tobchef/Qwen1.5-14B-Chat-Q2_K-GGUF --model qwen1.5-14b-chat-q2_k.gguf -p "The meaning to life and the universe is"

Server:

llama-server --hf-repo tobchef/Qwen1.5-14B-Chat-Q2_K-GGUF --model qwen1.5-14b-chat-q2_k.gguf -c 2048
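
Once the server is running you can query it over HTTP. A minimal sketch, assuming the default bind address 127.0.0.1:8080 and llama.cpp's built-in /completion endpoint:

curl http://127.0.0.1:8080/completion \
  -H "Content-Type: application/json" \
  -d '{"prompt": "The meaning to life and the universe is", "n_predict": 64}'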

Note: You can also use this checkpoint directly through the usage steps listed in the llama.cpp repo.

git clone https://github.com/ggerganov/llama.cpp && \
cd llama.cpp && \
make && \
./main -m qwen1.5-14b-chat-q2_k.gguf -n 128
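
The steps above assume qwen1.5-14b-chat-q2_k.gguf is already in the working directory. A minimal sketch for fetching it from this repo, assuming the huggingface_hub CLI is installed:

pip install -U "huggingface_hub[cli]"
huggingface-cli download tobchef/Qwen1.5-14B-Chat-Q2_K-GGUF qwen1.5-14b-chat-q2_k.gguf --local-dir .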

Additional notes

Qwen's official GGUF files appear to have been quantized from the AWQ version of the model. I am not sure whether that is better in principle, but their PPL turns out abnormally high (a sketch of how such measurements can be reproduced follows the list):

1. Official qwen1.5-14b-chat-q8_0 PPL:  15.5670
2. Official qwen1.5-14b-chat-IQ2_S PPL: 16.6220
3. My qwen1.5-14b-chat-q2_k PPL:        11.4125
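
The card does not state which text the PPL was computed on, so treat the corpus below as an assumption; a minimal sketch using llama.cpp's perplexity tool (named ./perplexity in older builds) against a local wikitext-2 test file:

llama-perplexity -m qwen1.5-14b-chat-q2_k.gguf -f wiki.test.raw -ngl 99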

In addition, on my NVIDIA GeForce RTX 3060 Ti, anything above IQ3_S no longer fits in the 8 GB of VRAM. It can still run by spilling into shared memory, but generation speed falls off a cliff, so for me Q2_K is the most economical trade-off between generation speed and quality (the llama-bench invocation is sketched after the table):

| model                          |       size |     params | backend    | ngl |         fa |          test |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---------: | ------------: | ---------------: |
| qwen2 13B Q8_0                 |  14.02 GiB |    14.17 B | CUDA       |  99 |          1 |         tg128 |      1.33 ± 0.01 |
| qwen2 13B IQ3_S - 3.4375 bpw   |   6.30 GiB |    14.17 B | CUDA       |  99 |          1 |         tg128 |     39.37 ± 0.33 |
| qwen2 13B Q2_K - Medium        |   5.50 GiB |    14.17 B | CUDA       |  99 |          1 |         tg128 |     44.94 ± 0.17 |
| qwen2 13B IQ2_S - 2.5 bpw      |   5.20 GiB |    14.17 B | CUDA       |  99 |          1 |         tg128 |     42.95 ± 0.31 |
| qwen2 13B IQ2_XS - 2.3125 bpw  |   4.89 GiB |    14.17 B | CUDA       |  99 |          1 |         tg128 |     44.14 ± 0.04 |
| qwen2 13B IQ1_M - 1.75 bpw     |   4.35 GiB |    14.17 B | CUDA       |  99 |          1 |         tg128 |     45.51 ± 0.28 |
| qwen2 13B IQ1_S - 1.5625 bpw   |   4.18 GiB |    14.17 B | CUDA       |  99 |          1 |         tg128 |     48.64 ± 0.42 |
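
A table in this format is what llama-bench prints. A minimal sketch of an invocation that would produce the tg128 rows above, with the model file and flags as my assumptions rather than taken from the card:

llama-bench -m qwen1.5-14b-chat-q2_k.gguf -ngl 99 -fa 1 -p 0 -n 128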

Finally, for an LLM, model quality and parameter scale determine generation quality:

  1. Qwen1.5-14B-Chat outperforms LLaMA-2-70B-Chat on Chinese benchmarks
  2. qwen1.5-14b-chat-IQ1_S outperforms qwen1.5-7b-chat-q8_0

So my selection advice is:

  1. Use evaluation leaderboards (such as OpenCompass) to pick the model that achieves the best scores with the smallest parameter count
  2. Exploit the low hardware requirements of llama.cpp quantization: push the quantization quality down so that you can run the largest model your hardware handles at a usable speed (see the sketch after this list)
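
For point 2, a minimal sketch of producing a low-bit GGUF yourself with llama.cpp, assuming a Hugging Face checkpoint under ./Qwen1.5-14B-Chat; script and binary names vary between llama.cpp versions (older builds use convert-hf-to-gguf.py and ./quantize):

# convert the HF checkpoint to an f16 GGUF
python convert_hf_to_gguf.py ./Qwen1.5-14B-Chat --outtype f16 --outfile qwen1.5-14b-chat-f16.gguf
# requantize to Q2_K
llama-quantize qwen1.5-14b-chat-f16.gguf qwen1.5-14b-chat-q2_k.gguf Q2_K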