Could you show the benchmark loss for this quantization?

#4
by TomLucidor - opened

I am curious how the perplexity loss might cause some of the agentic tasks to fail more often.

cyankiwi org

Thanks for raising this. The perplexity measurement was made on the wikitext dataset, so it does not cover tool usage or agentic tasks.
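For readers unfamiliar with the metric, here is a minimal sketch of how perplexity and a relative "perplexity loss" can be computed from per-token log-probabilities. The numbers are made up for illustration and are not measurements from this model, and the helper names are hypothetical.

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp of the mean negative log-likelihood per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def relative_increase(ppl_base, ppl_quant):
    """Fractional perplexity increase of the quantized model over the baseline."""
    return (ppl_quant - ppl_base) / ppl_base

# Made-up natural-log probabilities for the same four tokens under two models.
base_logprobs = [-1.0, -2.0, -1.5, -1.5]
quant_logprobs = [-1.1, -2.1, -1.6, -1.6]

ppl_base = perplexity(base_logprobs)
ppl_quant = perplexity(quant_logprobs)
print(f"baseline {ppl_base:.3f}, quantized {ppl_quant:.3f}, "
      f"increase {relative_increase(ppl_base, ppl_quant):.1%}")
```

Because the score averages over whatever text is evaluated, a small increase on wikitext says little about tokens that rarely appear there, such as tool-call syntax.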

I recently became aware that the quantized models struggle with tool usage and agentic tasks, which might result from the lack of tool-calling and agentic data in the calibration set. The model was calibrated using nvidia/Nemotron-Post-Training-Dataset-v2, which does not contain tool-calling or agentic data.

I will look into this in more detail and solve this problem soon.

Ideally, include benchmarks for most 4-bit quants, because that makes it easier to see what might get broken by accident.

cyankiwi org

Yes, I intend to complete full evaluations of my models, but currently I'm limited by my resources.

How many samples of the post-training dataset do you use for the calibration?

cyankiwi org

@whadupapp For calibration, I use 256 samples from the nvidia/Nemotron-Post-Training-Dataset-v2 dataset, with tokens routed to every expert during calibration. Do you also see this problem?
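One possible way to address the missing tool-calling data is to reserve part of the 256-sample budget for it. The sketch below is only an illustration of that idea, not the procedure used for this model: the function name, the 25% share, and the placeholder pools are all assumptions.

```python
import random

def build_calibration_set(general_pool, tool_pool, n_samples=256,
                          tool_fraction=0.25, seed=0):
    """Draw n_samples without replacement, reserving a share for tool-calling data."""
    n_tool = round(n_samples * tool_fraction)
    n_general = n_samples - n_tool
    if n_tool > len(tool_pool) or n_general > len(general_pool):
        raise ValueError("pool too small for the requested mix")
    rng = random.Random(seed)  # fixed seed keeps the selection reproducible
    picked = rng.sample(general_pool, n_general) + rng.sample(tool_pool, n_tool)
    rng.shuffle(picked)
    return picked

# Placeholder records standing in for real conversations (hypothetical data).
general_pool = [{"source": "general", "id": i} for i in range(1000)]
tool_pool = [{"source": "tool_calling", "id": i} for i in range(300)]

calib = build_calibration_set(general_pool, tool_pool)
n_tool = sum(1 for s in calib if s["source"] == "tool_calling")
print(len(calib), "samples,", n_tool, "with tool calls")
```

Fixing the seed makes it possible to compare quants that differ only in the mix, which is one way to test how sample selection affects downstream results.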

@TomLucidor Could you tell me more about the failed agentic tasks? Do the tool-calling and agentic outputs get mixed into the thinking traces and ultimately lead to agentic task failure?

I was just wondering how the number or the selection of samples impacts the downstream performance of the model.

I did notice the mixing of thinking and tool calling for the GLM-4.5 AWQ quant, though.

@cpatonn There are many failure modes, e.g. </think> tag parsing, failing to understand the prompt, thought/action loops, mojibake/gibberish due to quantization/artifacts, etc.
(I am also looking to see whether Qwen3-Coder-Next-REAP and Kimi-Linear have similar issues, since linear models are likely faster.)
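Some of these failure modes can be flagged automatically when comparing a quant against its baseline. The sketch below is a rough heuristic, assuming the model wraps reasoning in `<think>...</think>` and tool calls in `<tool_call>...</tool_call>`; those tag names, the function name, and the issue labels are assumptions and will differ between chat templates.

```python
import re

# A reasoning span runs from <think> to </think>, or to end of text if unclosed.
THINK_SPAN = re.compile(r"<think>(.*?)(?:</think>|\Z)", re.DOTALL)

def find_issues(output):
    """Return a list of heuristic issue labels for one raw model output."""
    issues = []
    opens = output.count("<think>")
    closes = output.count("</think>")
    if opens > closes:
        issues.append("unclosed_think")
    elif closes > opens:
        issues.append("stray_think_close")
    # A tool call emitted inside the reasoning span is usually never executed.
    if any("<tool_call>" in body for body in THINK_SPAN.findall(output)):
        issues.append("tool_call_inside_think")
    # U+FFFD is what invalid byte sequences decode to, a common mojibake sign.
    if "\ufffd" in output:
        issues.append("replacement_character")
    return issues

samples = [
    '<think>plan</think><tool_call>{"name": "ls"}</tool_call>',
    "<think>plan <tool_call>{}</tool_call></think>",
    "<think>plan <tool_call>{}</tool_call>",
    "done \ufffd\ufffd",
]
for text in samples:
    print(find_issues(text))
```

If the chat template already places the opening `<think>` in the prompt, the output will legitimately contain only the closing tag, so the `stray_think_close` check should be dropped in that setup. Counting flagged outputs over the same prompts for the baseline and the quant gives a cheap signal before running a full agentic benchmark.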

cyankiwi org

Thanks for the info. I stumbled across this and I think it might be beneficial in your case.
