need help to generate the GPTQ version

#1
by yiouyou - opened

I bought this model, but how can I get the GPTQ (gptq-4bit-32g-actorder_True) version of it? Could you help me?

Thanks,
Zack

yiouyou changed discussion title from May to need help to generate the GPTQ version

Howdy, making the GPTQ now, should be up shortly.

Thanks. Please consider using the settings below:

from auto_gptq import BaseQuantizeConfig

quantize_config = BaseQuantizeConfig(
    bits=4,          # quantize the model to 4-bit
    group_size=32,   # 32g grouping, as in gptq-4bit-32g-actorder_True
    desc_act=True,   # enable act-order (desc_act)
)
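For reference, here's a rough sketch of how I'd expect that config to be plugged into AutoGPTQ (the model path and calibration examples below are placeholders, not your actual script):

from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

pretrained_model_dir = "path/to/llamafied-model"  # placeholder path

quantize_config = BaseQuantizeConfig(bits=4, group_size=32, desc_act=True)

tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True)
# A real run needs a proper calibration set (e.g. wikitext2 samples)
examples = [tokenizer("auto-gptq is a quantization library.")]

model = AutoGPTQForCausalLM.from_pretrained(pretrained_model_dir, quantize_config)
model.quantize(examples)
model.save_quantized("path/to/output-gptq-4bit-32g", use_safetensors=True)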

@yiouyou, bad news: I tried twice over 5 hours and failed to get the quant made.

I'll see if I have time on Monday to get to it.

BTW, you have access to the GPTQ quantization script in the ADVANCED-fine-tuning repo (see the quant branch). In principle, it should take about an hour to run on an A100.

Also, what are you using for inference? AWQ is faster and more accurate than GPTQ - and also easier for me to make if you need it. It will work with vLLM or TGI provided you're using an A100 or A6000 or H100.
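If you go with AWQ, loading it in vLLM would look roughly like this (the repo id and branch name are placeholders, not the final names):

from vllm import LLM, SamplingParams

# Sketch: assumes the AWQ weights live on a branch named "awq" (placeholder names)
llm = LLM(
    model="Trelis/this-model",   # placeholder repo id
    revision="awq",              # hypothetical branch holding the AWQ weights
    quantization="awq",
    dtype="half",
)
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)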

I'm using an A100 40GB to run the model. I've tried to quantize the model with both the AWQ and GPTQ scripts on a RunPod A100 80GB, but neither finished within 3 hours. The Jupyter code block seemed to be running with no errors, but nothing happened for a very long time, so I had to stop the pod.

@RonanMcGovern How long should it take to generate an AWQ or GPTQ quant of a 34B model? I hope you can generate them successfully when you have time.

Thanks,

Howdy @yiouyou, the AWQ version is live now on the AWQ branch. Unfortunately, TGI is currently having an issue with AWQ; hopefully it will get resolved soon: https://github.com/huggingface/text-generation-inference/issues/1322

I believe the issue that we faced in making the AWQ was that in tokenizer_config, the AutoTokenizer was set to the original model, not the llamafied version.

I'm running the GPTQ now, let's see if that fix helps there too.

BTW, the issue with GPTQ quanting is slow packing. The fix is here. Hopefully I'll have the GPTQ ready in an hour or two.

OK, I've just pushed a GPTQ model. Unfortunately, it uses the default config:

from auto_gptq import BaseQuantizeConfig

quantize_config = BaseQuantizeConfig(
    bits=4,           # quantize the model to 4-bit
    group_size=128,   # 128 is the commonly recommended value
    desc_act=False,   # desc_act and group_size together only work on the Triton backend
)

FWIW I've updated the GPTQ quantization script in ADVANCED-fine-tuning and it runs in about 1 hour and 10 mins on an A100.

LMK if you absolutely need the 32g, desc_act=True configuration.
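If it's useful, a quick sanity check of the pushed quant with transformers would look roughly like this (the repo id and branch name are assumptions on my part):

from transformers import AutoModelForCausalLM, AutoTokenizer

# Sketch: assumes optimum + auto-gptq are installed so transformers can load GPTQ weights
repo_id = "Trelis/this-model"    # placeholder repo id
revision = "gptq"                # hypothetical branch holding the GPTQ weights

tokenizer = AutoTokenizer.from_pretrained(repo_id, revision=revision)
model = AutoModelForCausalLM.from_pretrained(repo_id, revision=revision, device_map="auto")

inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0], skip_special_tokens=True))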

@RonanMcGovern "traindataset, testenc = get_wikitext2(128, 0, 2048, pretrained_model_dir)" this line taks too much time to run, do you have the same situation?

@RonanMcGovern I've tried the AWQ script; it generated almost all of the required files successfully, but without the 'tokenizer.model' file. Do you know why that is? I opened an issue on the ADVANCED repo.

@RonanMcGovern "traindataset, testenc = get_wikitext2(128, 0, 2048, pretrained_model_dir)" this line taks too much time to run, do you have the same situation?

I don't have this issue. This line seems to execute quite fast for me on an A100 runpod instance.
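For what it's worth, that helper is essentially the standard wikitext2 calibration loader; a sketch of the pattern from memory (not the exact script) is below, so the main cost should just be the dataset download and the one-off tokenization:

import random
import torch
from datasets import load_dataset
from transformers import AutoTokenizer

# Sketch of a typical get_wikitext2 loader (names and signature assumed, not the exact script)
def get_wikitext2(nsamples, seed, seqlen, model_dir):
    traindata = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
    testdata = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
    tokenizer = AutoTokenizer.from_pretrained(model_dir, use_fast=True)

    trainenc = tokenizer("\n\n".join(traindata["text"]), return_tensors="pt")
    testenc = tokenizer("\n\n".join(testdata["text"]), return_tensors="pt")

    random.seed(seed)
    samples = []
    for _ in range(nsamples):
        i = random.randint(0, trainenc.input_ids.shape[1] - seqlen - 1)
        inp = trainenc.input_ids[:, i : i + seqlen]
        samples.append({"input_ids": inp, "attention_mask": torch.ones_like(inp)})
    return samples, testenc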

@RonanMcGovern I've tried the AWQ script; it generated almost all of the required files successfully, but without the 'tokenizer.model' file. Do you know why that is? I opened an issue on the ADVANCED repo.

AWQ (and GPTQ) don't generate the tokenizer.model file. It is added by the repo owner on the base repo and then sometimes needs to be manually copied across. The need for tokenizer.model is causing issues with quantization in general, because GGUF and GPTQ require it but some newer models don't ship it. See here: https://github.com/ggerganov/llama.cpp/pull/3633#issuecomment-1849031538 . Hopefully this will get fixed soon, but the issue has been there for a few weeks. In the meantime, manually copying the file across often works.
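If you want to copy it across programmatically, a rough sketch (repo ids are placeholders; assumes you have write access to the quantized repo):

from huggingface_hub import hf_hub_download, upload_file

# Sketch: pull tokenizer.model from the base repo and push it to the quantized repo
src = hf_hub_download(repo_id="base-org/base-model", filename="tokenizer.model")  # placeholder repo
upload_file(
    path_or_fileobj=src,
    path_in_repo="tokenizer.model",
    repo_id="your-org/quantized-model",  # placeholder repo
)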

@RonanMcGovern When running AWQ, is this message a concern for you: "Token indices sequence length is longer than the specified maximum sequence length for this model (8948 > 4096). Running this sequence through the model will result in indexing errors"? If it is, how can I fix it? Thanks~

@RonanMcGovern When running AWQ, is this message a concern for you: "Token indices sequence length is longer than the specified maximum sequence length for this model (8948 > 4096). Running this sequence through the model will result in indexing errors"? If it is, how can I fix it? Thanks~

I don't believe this is an issue. It didn't stop me quantizing.
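If you want to silence the warning anyway, truncating the calibration samples when tokenizing should do it; a sketch assuming a 4096-token context:

from transformers import AutoTokenizer

# Sketch: cap calibration samples at the model's context length so none exceed it
tokenizer = AutoTokenizer.from_pretrained("path/to/llamafied-model")  # placeholder path
calibration_text = "..."  # placeholder calibration sample
enc = tokenizer(calibration_text, truncation=True, max_length=4096, return_tensors="pt")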

BTW, I spent a few hours today trying to get AWQ and GPTQ to work. I'm puzzled because the current model doesn't work, but this v2 model does work running on TGI! I've given you access; let me know if you see any inconsistencies.


FYI, the SUSChat function calling model now has an AWQ branch and a RunPod vLLM template. SUSChat is a fine-tuned version of Yi.

Reach out by commenting further here if anyone needs an AWQ of this Yi model. (For now, running with TGI and EETQ is the recommended way to run inference on this Yi model.)
