# tobchef/Qwen1.5-14B-Chat-Q2_K-GGUF
This model was converted to GGUF format from Qwen/Qwen1.5-14B-Chat using llama.cpp via ggml.ai's GGUF-my-repo space. Refer to the original model card for more details on the model.
## Use with llama.cpp
Install llama.cpp through brew:

```bash
brew install ggerganov/ggerganov/llama.cpp
```
Invoke the llama.cpp server or the CLI.

CLI:

```bash
llama-cli --hf-repo tobchef/Qwen1.5-14B-Chat-Q2_K-GGUF --model qwen1.5-14b-chat-q2_k.gguf -p "The meaning to life and the universe is"
```

Server:

```bash
llama-server --hf-repo tobchef/Qwen1.5-14B-Chat-Q2_K-GGUF --model qwen1.5-14b-chat-q2_k.gguf -c 2048
```
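Once the server is running, you can query it over HTTP. A minimal sketch, assuming llama-server's default port 8080 and its OpenAI-compatible chat endpoint (the prompt is just an illustration):

```bash
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Briefly introduce yourself."}],
    "max_tokens": 128
  }'
```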
Note: You can also use this checkpoint directly through the usage steps listed in the llama.cpp repo.
```bash
git clone https://github.com/ggerganov/llama.cpp && \
cd llama.cpp && \
make && \
./main -m qwen1.5-14b-chat-q2_k.gguf -n 128
```

(In more recent llama.cpp builds, the `main` binary has been renamed to `llama-cli`.)
## Additional notes
The official Qwen GGUF files appear to have been quantized from the AWQ version of the model. I am not sure whether that is better, but it makes the perplexity (PPL) unusually high:

1. Official qwen1.5-14b-chat-q8_0 PPL: 15.5670
2. Official qwen1.5-14b-chat-IQ2_S PPL: 16.6220
3. My qwen1.5-14b-chat-q2_k PPL: 11.4125
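For reference, figures like these can be produced with llama.cpp's perplexity tool. A minimal sketch, assuming the post-rename `llama-perplexity` binary and a local wikitext-2 `wiki.test.raw` file (the original does not state which corpus or binary was used):

```bash
# Compute perplexity over a raw text file; -ngl 99 offloads all layers to the GPU.
./llama-perplexity -m qwen1.5-14b-chat-q2_k.gguf -f wiki.test.raw -ngl 99
```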
Also, on my NVIDIA GeForce RTX 3060 Ti, anything above IQ3_S exceeds the 8 GB of VRAM. Such models can still run via shared memory, but generation speed falls off a cliff, so for me Q2_K is the most economical balance of generation speed and quality:
| model | size | params | backend | ngl | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---------: | ------------: | ---------------: |
| qwen2 13B Q8_0 | 14.02 GiB | 14.17 B | CUDA | 99 | 1 | tg128 | 1.33 ± 0.01 |
| qwen2 13B IQ3_S - 3.4375 bpw | 6.30 GiB | 14.17 B | CUDA | 99 | 1 | tg128 | 39.37 ± 0.33 |
| qwen2 13B Q2_K - Medium | 5.50 GiB | 14.17 B | CUDA | 99 | 1 | tg128 | 44.94 ± 0.17 |
| qwen2 13B IQ2_S - 2.5 bpw | 5.20 GiB | 14.17 B | CUDA | 99 | 1 | tg128 | 42.95 ± 0.31 |
| qwen2 13B IQ2_XS - 2.3125 bpw | 4.89 GiB | 14.17 B | CUDA | 99 | 1 | tg128 | 44.14 ± 0.04 |
| qwen2 13B IQ1_M - 1.75 bpw | 4.35 GiB | 14.17 B | CUDA | 99 | 1 | tg128 | 45.51 ± 0.28 |
| qwen2 13B IQ1_S - 1.5625 bpw | 4.18 GiB | 14.17 B | CUDA | 99 | 1 | tg128 | 48.64 ± 0.42 |
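Tables in this format come from llama.cpp's llama-bench tool. A minimal sketch of an invocation that would reproduce a row of the table above, with the flag values assumed from the ngl/fa columns:

```bash
# tg128 = generate 128 tokens; -p 0 skips the prompt-processing test,
# -ngl 99 offloads all layers to the GPU, -fa 1 enables flash attention.
./llama-bench -m qwen1.5-14b-chat-q2_k.gguf -ngl 99 -fa 1 -n 128 -p 0
```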
Finally, for LLMs, model quality and parameter count determine generation quality:

- Qwen1.5-14B-Chat outperforms LLaMA-2-70B-Chat on Chinese datasets
- qwen1.5-14b-chat-IQ1_S outperforms qwen1.5-7b-chat-q8_0
So my selection advice is:

- Use evaluation leaderboards (such as opencompass) to find the model that achieves the best scores with the smallest parameter count
- Take advantage of llama.cpp's low hardware requirements: by lowering the quantization quality, run the largest model your hardware can still drive at a normal speed (a rough size estimate is sketched below)
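As a quick way to check what fits, a GGUF file's size is roughly parameter count × bits per weight / 8, plus some extra VRAM for the KV cache and compute buffers. A minimal sketch of that arithmetic, using the Q2_K row from the table above (the ~3.33 bpw figure is back-calculated from that row, not stated in the source):

```bash
# 14.17e9 params at ~3.33 bpw -> about 5.5 GiB on disk, leaving headroom
# on an 8 GB card for the KV cache and compute buffers.
awk 'BEGIN { params = 14.17e9; bpw = 3.33; printf "%.2f GiB\n", params * bpw / 8 / 2^30 }'
```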