PPLX or KLD, or other benchmark

#4
by HenkTenk - opened

Hey, could you maybe post KLD or PPLX or other benchmarks to compare the quant to the original checkpoint?

For GGUFs there is a lot of information available, but for AWQ quants benchmark info is scarce.
https://www.reddit.com/r/LocalLLaMA/comments/1so5nrl/qwen36_gguf_benchmarks/

I also see there is a PR to measure it in vLLM, but it isn't merged yet:
https://www.reddit.com/r/LocalLLaMA/comments/1rkmvo4/i_added_ppl_and_kld_to_vllm_review_rfc_and_pr_and/

For SGLang I found this method, which is quite involved:
https://github.com/voipmonitor/rtx6kpro/blob/master/benchmarks/kld-evaluation.md
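
For anyone unsure what the two numbers mean, here is a minimal sketch of how perplexity and mean KL divergence are usually computed from per-token logits. It is not the method used by any of the tools linked above; the function names and the toy logits are made up for illustration, and a real run would use logits from the original checkpoint and the quant on the same text.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one token's vocabulary logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def perplexity(logits_per_pos, target_ids):
    """exp of the mean negative log-likelihood of the reference tokens."""
    nll = [-log_softmax(l)[t] for l, t in zip(logits_per_pos, target_ids)]
    return math.exp(sum(nll) / len(nll))

def mean_kld(ref_logits_per_pos, quant_logits_per_pos):
    """Mean over positions of KL(P_ref || P_quant), in nats."""
    total = 0.0
    for ref, quant in zip(ref_logits_per_pos, quant_logits_per_pos):
        lp, lq = log_softmax(ref), log_softmax(quant)
        total += sum(math.exp(a) * (a - b) for a, b in zip(lp, lq))
    return total / len(ref_logits_per_pos)

# Toy data (hypothetical): 2 positions, vocabulary of 3 tokens.
ref = [[2.0, 0.5, -1.0], [0.1, 1.5, 0.3]]
quant = [[1.8, 0.6, -0.9], [0.2, 1.4, 0.3]]
targets = [0, 1]

print("ref PPL  :", perplexity(ref, targets))
print("quant PPL:", perplexity(quant, targets))
print("mean KLD :", mean_kld(ref, quant))
```

A KLD close to zero means the quant's next-token distribution stays close to the original's, which is why it is often preferred over comparing two perplexity values.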

QuantTrio org
edited Apr 21

Thanks for the suggestion and for sharing these links.

We do think benchmarks like KLD / PPLX are meaningful. For now, though, we’re trying to stay a bit cautious. Since many of these open-weight models come from commercial companies, we’d prefer not to publish systematic benchmark numbers ourselves, just to avoid possible legal/compliance issues.

Really appreciate the community helping out with benchmarking and feedback though — that’s super valuable for future quantization work.
