Me again

#2
by qenme - opened

Hey, me again. I just happened to be looking at your model card and noticed this: "With these settings, you get around 129k context. You can also add --kv-cache-dtype fp8_e4m3 --calculate-kv-scales args to get about 252k tokens." I wasn't aware of --calculate-kv-scales, so I did some research, and it has been deprecated. It silently corrupts on Qwen3.5 (https://github.com/vllm-project/vllm/pull/37565).

Also, as a follow-up to my previous discussion, I was able to recreate this quant in llama.cpp, and it works very well.

/home/user/llm/mtp/llama.cpp/build/bin/llama-quantize \
--tensor-type token_embd=bf16 \
--tensor-type output=bf16 \
--tensor-type output_norm=bf16 \
--tensor-type post_attention_norm=bf16 \
--tensor-type attn_q_norm=bf16 \
--tensor-type attn_k_norm=bf16 \
--tensor-type attn_qkv=bf16 \
--tensor-type attn_gate=bf16 \
--tensor-type ssm_a=bf16 \
--tensor-type ssm_alpha=bf16 \
--tensor-type ssm_beta=bf16 \
--tensor-type ssm_conv1d=bf16 \
--tensor-type ssm_dt.bias=bf16 \
--tensor-type ssm_norm=bf16 \
--tensor-type ssm_out=bf16 \
/home/user/llm/models/Qwen3.6-27B/BF16/Qwen3.6-27B-BF16-MTP.gguf \
/home/user/llm/models/Qwen3.6-27B/BF16/Qwen3.6-27B-Q8-BIGBOY.gguf \
q8_0
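
For anyone who wants to tweak the list of tensors kept at bf16, the command above can also be assembled from an array. This is only a rough sketch, not something from the original post: it assumes bash, reuses the paths and tensor names exactly as posted, and the variable names and the RUN switch are made up for illustration. By default it only prints the command so you can check it before running.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Paths copied from the command above; adjust them for your own setup.
QUANTIZE_BIN="/home/user/llm/mtp/llama.cpp/build/bin/llama-quantize"
INPUT_GGUF="/home/user/llm/models/Qwen3.6-27B/BF16/Qwen3.6-27B-BF16-MTP.gguf"
OUTPUT_GGUF="/home/user/llm/models/Qwen3.6-27B/BF16/Qwen3.6-27B-Q8-BIGBOY.gguf"

# Tensors kept at bf16 (same list as in the post).
KEEP_BF16=(
  token_embd output output_norm post_attention_norm
  attn_q_norm attn_k_norm attn_qkv attn_gate
  ssm_a ssm_alpha ssm_beta ssm_conv1d ssm_dt.bias ssm_norm ssm_out
)

# Expand each name into a "--tensor-type NAME=bf16" argument pair.
TENSOR_ARGS=()
for name in "${KEEP_BF16[@]}"; do
  TENSOR_ARGS+=(--tensor-type "${name}=bf16")
done

# Full command: binary, overrides, input, output, base quant type.
CMD=("$QUANTIZE_BIN" "${TENSOR_ARGS[@]}" "$INPUT_GGUF" "$OUTPUT_GGUF" q8_0)

# RUN is a hypothetical switch for this sketch: dry run unless RUN=1.
if [[ "${RUN:-0}" == "1" ]]; then
  "${CMD[@]}"
else
  printf '%q ' "${CMD[@]}"
  printf '\n'
fi
```

Adding or removing a name in KEEP_BF16 changes the overrides without touching the rest of the command; everything not listed falls through to q8_0 as in the original.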

Updated the readme, thank you!

Also, you might want to consider uploading it to Hugging Face yourself. I am sure others would find it useful. I don't use llama.cpp anymore because it lacked MTP and tensor parallelism support, which didn't suit my use case, but it's still widely used.

I may upload it. 30GB will take an overnight session, lol. llama.cpp now supports tensor parallelism and MTP, although the new MTP feature degrades prompt processing (PP) a bit. Overall, they are improving well. Thanks again for introducing me to this quant. I'll keep a lookout for future quants from you.
