You may want to add an "act-order" GPTQ quantization.

#22
by xzuyn - opened

YellowRose#1776 on the Kobold Discord did some testing on Pygmalion 7B to see which GPTQ settings give the best perplexity, and he found that act-order is best.

[Image: actorder.png — perplexity test results]

deleted

I'll start by saying I'm not against having a Triton GPTQ quant.

I'd really like to see --act-order --true-sequential --groupsize 128 versus --true-sequential --groupsize 128, and I'd like to see that tested against 13B and 30B models. It's my understanding that the perplexity gains from --act-order are fairly parameter-count dependent and are around ~0.1 at 13B (please correct me if I'm wrong). Any additional info would be great. I'm also curious how a 0.1 perplexity gain translates into meaningfully different output (I know people are saying act-order helps with rare contraction issues around words like couldn't), so if anyone has good examples of how that 0.1 manifests, I think that would be good to share around.
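For concreteness, the two runs I'm asking to see compared would look roughly like this with a qwopqwop200-style GPTQ-for-LLaMa llama.py (the model path, calibration set, and output filenames here are just placeholders, not the exact commands used for this repo):

```
# with act-order (needs the Triton branch when combined with group size)
python llama.py /path/to/llama-13b c4 --wbits 4 \
    --act-order --true-sequential --groupsize 128 \
    --save llama-13b-4bit-128g-actorder.pt

# without act-order
python llama.py /path/to/llama-13b c4 --wbits 4 \
    --true-sequential --groupsize 128 \
    --save llama-13b-4bit-128g.pt
```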

Maybe @Monero /YellowRose#1776 could chime in and give some more info.

Those results are for the instruct version, Pygmalion Metharme 7B. In my experience the best combination depends on the model.
I'm not familiar with perplexity results beyond that atm, sorry!

deleted

No worries and thanks for compiling the data you did! Hopefully someone will compile meaningful differences in outputs at some point.

I have been watching various development efforts focus on dropping perplexity as low as it'll go, giving back inference speed and RAM/VRAM for the sake of 0.1 perplexity gains, and it's got my dev brain wondering what the specific wins are, so people can make informed decisions about the value of various quants at various parameter counts. Right now it's all pretty loosey-goosey, which is fine, but those sorts of decisions might really start to matter when people are budgeting for non-hobby projects around these models and their quantized versions. If I'm a project lead, I want inference speed as high as possible on the least hardware possible if there's a negligible difference between 6.0 and 5.9 perplexity. That's less applicable for GPTQ, since Triton runs on Linux, where most non-hobby projects will live, but as ggml advances and gets GPU inference figured out, it could come into play with its giant menagerie of quantization formats.

I'll also dream of someone getting --act-order working on CUDA so this entire discussion becomes pointless for GPTQ quants. I messed with it a bit, but I don't have enough free time to dig into it, so I didn't make much progress.

Act-order works without group size on the ooba version of GPTQ. Just don't quantize with the newest CUDA branch, as I think it changes the format yet again (for the third time).

So on that branch the choices are either act-order + true-sequential, or group size. That's why people get nonsense perplexity when they use them together.
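Concretely, on the ooba CUDA code either of these should work (assuming the fork keeps the same llama.py flags as upstream GPTQ-for-LLaMa; model path and save names are placeholders):

```
# act-order + true-sequential, no group size
python llama.py /path/to/model c4 --wbits 4 --act-order --true-sequential --save model-4bit-actorder.pt

# group size only
python llama.py /path/to/model c4 --wbits 4 --groupsize 128 --save model-4bit-128g.pt
```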
