Inference Issues

#1
by qeternity - opened

I saw you uploaded a Marlin packed version shortly after me. Are you running this on vLLM by any chance?

I am having real inference issues. I tried your version as well and I see the same problems. FP16 works fine, though.

Owner

Hey @qeternity, yes, I ran this in vLLM. It seemed reasonable, but I haven't run proper evaluations on it yet.

It seems to work alright at very short contexts, but breaks beyond that (same with my version).
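
To illustrate the short-vs-long-context comparison, here is a minimal sketch using vLLM's offline API; the model path and prompts are placeholders, and `quantization="marlin"` assumes a pre-packed Marlin checkpoint:

```python
# Sketch: compare Marlin-quantized output quality at short vs. longer context.
# Model path and prompts are placeholders, not the actual repro.
from vllm import LLM, SamplingParams

llm = LLM(model="path/to/Meta-Llama-3-8B-Instruct-Marlin", quantization="marlin")
params = SamplingParams(temperature=0.0, max_tokens=64)

short_prompt = "The capital of France is"
long_prompt = ("Background: " + "Paris is a city in Europe. " * 200
               + "\nQuestion: The capital of France is")

for name, prompt in [("short", short_prompt), ("long", long_prompt)]:
    out = llm.generate([prompt], params)[0].outputs[0].text
    print(f"[{name}] {out!r}")
```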

I should say I am running via SGLang (which uses vLLM), and I opened a PR for the prompt templating tonight, so I may have gotten something wrong there (though I don't think so, given FP16 is fine).

Ok, this looks like an issue with the chat template in the tokenizer config being wrong.

The correct one is here: https://github.com/meta-llama/llama3/blob/92a325ec9925557b5fd64202c91024231a428c08/llama/test_tokenizer.py#L67

EDIT: nevermind, after inspecting the tokenizer, it seems the above comments are wrong.
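
For anyone else checking, this is roughly how to inspect what the shipped template renders (a sketch; the messages are just example content):

```python
# Sketch: render the chat template from tokenizer_config.json and compare it
# against the reference Llama 3 format by eye. Example messages only.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
]
rendered = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(rendered)  # should show <|begin_of_text|>, <|start_header_id|>, <|eot_id|> markers
```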

Owner

@qeternity it might still be a tokenization issue; check out this fix that just landed in vLLM last night: https://github.com/vllm-project/vllm/pull/4182

Alright, so I have almost exactly the same issues with your version as I do with my own. I suspect we are quantizing the same way. I also tried casting bf16 -> fp32 before quantizing, but that did not change anything.
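
For context, a sketch of the kind of quant setup being discussed, assuming AutoGPTQ with desc_act=False and a wikitext calibration set; the model id, group size, and sample count are assumptions, not anyone's exact script:

```python
# Sketch of a GPTQ quant with desc_act=False calibrated on plain wikitext.
# Model id, group size, and number of calibration samples are assumptions.
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from datasets import load_dataset
from transformers import AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)

quantize_config = BaseQuantizeConfig(bits=4, group_size=128, desc_act=False)

# Calibration data: raw wikitext passages, no chat template applied.
wiki = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
texts = [t for t in wiki["text"] if len(t) > 200][:256]
examples = [tokenizer(t) for t in texts]

model = AutoGPTQForCausalLM.from_pretrained(model_id, quantize_config)
model.quantize(examples)
model.save_quantized("Meta-Llama-3-8B-Instruct-GPTQ", use_safetensors=True)
```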

Weirdly, if I download another GPTQ model and repack it with Marlin, everything works fine. The first 8B desc_act=False quant I found is this one: https://huggingface.co/MaziyarPanahi/Meta-Llama-3-8B-Instruct-GPTQ

You're not having any issues with this? I'm just trying to figure out why my quant script is not working now.

@qeternity Any news on which Marlin version works correctly?
I see you uploaded a new version on Apr 28.
Does the latest version fix the mentioned issue?

I think the new version I uploaded was simply to handle the ever-changing quant_config formatting.

But no, I was never able to get this issue fixed. The only quants that work are ones which do not use the chat template, which is obviously going to result in a worse quant (how much of a difference that makes is not clear to me). My quant and all of the others, afaict, are simply using wikitext to do the quant.
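
If anyone wants to try calibrating on chat-templated data instead of raw wikitext, this is the general idea (a sketch; the source texts and message construction are hypothetical):

```python
# Sketch: build chat-templated calibration examples instead of raw wikitext.
# The prompts and message construction here are hypothetical placeholders.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

raw_texts = [
    "Explain how GPTQ quantization works.",
    "Summarize the plot of Hamlet.",
]  # placeholder prompts; in practice, use a real instruction dataset

examples = []
for text in raw_texts:
    messages = [{"role": "user", "content": text}]
    rendered = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    examples.append(tokenizer(rendered))  # same format AutoGPTQ's quantize() expects
```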
