Absolute Legend! Any chance we'll see a w3a16g128?

by zackzachzaczak - opened

I've got "only" 64GB VRAM and I planned on running this + 2bit AQLM Llama-3-70B (as a RAG) which'll be a bit tight. With 35 & 17.5GB = 52.5GB.

With a 26.25GB model I'd be light as a feather for whatever stupid mischief I get into! But this itself is a blessing; I can't thank you enough, because I was having issues quantizing. I'm on my MacBook M1 and wanted to use an AWQ Llama as the OmniQuant docs suggest.

That leads to another question I had: did you follow the docs to a T, or did you take some liberties? Besides doing --real-quant, I'm wondering if you did the prep procedure specified with generate_act_scale_shift.py.

Thanks again πŸ₯³


Any chance we'll see a w3a16g128?

Yes, soon! Just about to upload a w3a16g40sym version.

That leads to another question I had: did you follow the docs to a T, or did you take some liberties? Besides doing --real-quant, I'm wondering if you did the prep procedure specified with generate_act_scale_shift.py.

Actually, I didn't do --real-quant; I performed the perplexity evals separately, post-quantization, using the same WikiText and C4 datasets. And yes, I did prepare activation scales and shifts with generate_act_scale_shift.py.
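
For reference, a separate post-quantization perplexity eval along those lines only takes a few lines; here's a minimal sketch (illustrative only, not the exact script behind these numbers; the model path and window size are placeholders):

# Minimal post-quantization perplexity sketch (illustrative; not the exact eval
# script used for this model). Assumes the quantized checkpoint still loads as an
# ordinary Hugging Face causal LM; the path below is a placeholder.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "path/to/quantized-llama-3-70b"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype=torch.float16, device_map="auto"
)
model.eval()

# Concatenate the WikiText-2 test split and score it in fixed-length windows.
test = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
enc = tokenizer("\n\n".join(test["text"]), return_tensors="pt")

seq_len, nlls = 2048, []
for i in range(0, enc.input_ids.size(1) - seq_len, seq_len):
    ids = enc.input_ids[:, i : i + seq_len].to(model.device)
    with torch.no_grad():
        # labels=ids returns the mean next-token NLL over the window
        nlls.append(model(ids, labels=ids).loss)
print("ppl:", torch.exp(torch.stack(nlls).mean()).item())

The same loop, pointed at a slice of C4's validation split, covers the other dataset.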

@zackzachzaczak the w3a16g40sym version of Llama 3 70B has been up for a couple of weeks now.

Solid setup! I'm very glad I stumbled upon your models; you seem to be the only one churning out OmniQuant models, and it's such an underutilized method.

Any reason for the low group size? I would've imagined a w4a16g40 + w3a16g128 would be a more balanced approach, but I can't say I have much experience to back that up. It doesn't even matter terribly for me; I'm being greedy (trying to run an AQLM 2-bit Llama 3 70B as a RAG (~17.5GB) alongside the w3 (26.25GB) as my LLM), which'd leave my 64GB M1 with ~21GB to spare. I figure I'd be a bit squeezed with a w4 at 35GB.
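
(For reference, the sizes quoted above are just weight-only bits-per-weight arithmetic; a quick sketch, ignoring group-quant metadata and runtime overhead like the KV cache:)

# Back-of-the-envelope weight sizes for a 70B-parameter model at various bit widths.
# Ignores per-group scales/zero-points, any layers kept in higher precision, and
# runtime overhead (KV cache, activations), so real memory use is somewhat higher.
params = 70e9

def weight_gb(bits_per_weight: float) -> float:
    return params * bits_per_weight / 8 / 1e9

for bits in (2, 3, 4):
    print(f"w{bits}: {weight_gb(bits):.2f} GB")
# -> w2: 17.50 GB, w3: 26.25 GB, w4: 35.00 GB (the numbers quoted in this thread)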

I primarily use it for code, so who knows if I'll stick with it. I'm also waiting on a 2-bit AQLM Wizard 8x22B.

Any reason for the low group size?

I'd noticed with other models that perplexity seems to degrade badly with larger group sizes, so I stuck with this. The tradeoff is that smaller group sizes carry more compute overhead, so inference performance takes a hit. By the way, I tried w3a16g40 for Mixtral 8x7B and it didn't work out.
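
To make the tradeoff concrete, here's a toy sketch of plain symmetric group quantization (not OmniQuant's learned clipping/transforms): each group of weights shares one scale, so a smaller group size means more scales to store and apply at dequantization time, but each scale fits its group more tightly.

# Toy symmetric group quantization (illustrative only; OmniQuant additionally learns
# clipping/shift parameters). Smaller groups -> more scales (metadata and dequant
# work), but a tighter fit per group.
import numpy as np

def group_quantize_sym(w: np.ndarray, bits: int, group_size: int):
    """Quantize a 1-D weight vector with one shared scale per group."""
    qmax = 2 ** (bits - 1) - 1
    groups = w.reshape(-1, group_size)
    scales = np.abs(groups).max(axis=1, keepdims=True) / qmax  # one scale per group
    q = np.clip(np.round(groups / scales), -qmax - 1, qmax).astype(np.int8)
    return q, scales

rng = np.random.default_rng(0)
w = rng.standard_normal(4096 * 640).astype(np.float32)  # length divisible by 40 and 128
for g in (40, 128):
    q, scales = group_quantize_sym(w, bits=3, group_size=g)
    dequant = q.reshape(-1) * np.repeat(scales, g)
    print(f"g={g}: {scales.size} scales, mean abs error {np.abs(dequant - w).mean():.4f}")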

I'm also waiting on a 2-bit AQLM Wizard 8x22B

I'd quantized Wizard LM2 8x22B to w4a16g128 a while ago. Perhaps worth revisiting.

Hey, was wondering if you could advise me on how to run the model? πŸ˜‚

With the ndarray-cache.json rather than your typical model.safetensors.index.json, etc., I'm unsure how to load it. I know MLC-LLM utilizes the ndarray-cache, but it's a pain to integrate a new quantization method properly. I also have no clue whether the model suffers performance-wise if one just generates an MLC-Chat config + compilation lib under the guise that it is q0f32, i.e. no quantization.

I also tried converting the ndarray-cache.json to a model.safetensors.index.json, but no dice there. Lastly, I tried something Claude conjured up to load it, without luck.
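
(If it helps anyone else poking at the format: ndarray-cache.json is plain JSON, so it's at least easy to inspect. A minimal sketch, assuming the usual MLC layout of a top-level "records" list of weight shards, each pointing at a binary dataPath with its own per-parameter records:)

# Quick look inside an MLC ndarray-cache.json (illustrative; the key names assume
# the usual layout and may need adjusting for other versions of the format).
import json

with open("ndarray-cache.json") as f:
    cache = json.load(f)

print("top-level keys:", list(cache.keys()))
for shard in cache.get("records", [])[:3]:      # first few shard files
    params = shard.get("records", [])
    print(shard.get("dataPath"), f"({len(params)} params)")
    for p in params[:2]:                        # a couple of parameters per shard
        print("   ", p.get("name"), p.get("dtype"), p.get("shape"))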

For whatever reason, perhaps the OmniQuant repo's disorganization, I left it as a last-ditch effort. I don't see any other OmniQuant loader besides what's referenced in https://github.com/OpenGVLab/OmniQuant/blob/main/runing_falcon180b_on_single_a100_80g.ipynb

So I'll be trying that next.

Hey, was wondering if you could advise me on how to run the model? πŸ˜‚

The easiest way would be to use our app 😀.

I know MLC-LLM utilizes the ndarray-cache, but it's a pain to integrate a new quantization method properly. I also have no clue whether the model suffers performance-wise if one just generates an MLC-Chat config + compilation lib.

Not really. Adding extra group-quantization schemes (which is what OmniQuant weight-only quantization is) to mlc-llm is quite easy. Just add a new entry like this to the QUANTIZATION dict in quantization.py:

QUANTIZATION: Dict[str, Quantization] = {
...
    "w4a16g128asym": GroupQuantize(
        name="w4a16g128asym",
        kind="group-quant",
        group_size=128,
        quantize_dtype="int4",
        storage_dtype="uint32",
        model_dtype="float16",
        linear_weight_layout="NK",
    )
}
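
For the w3a16g40sym model in this thread, the analogous entry would presumably look like the following (untested sketch; the field values mirror mlc-llm's built-in 3-bit group-quant preset, which also uses group size 40):

    # hypothetical 3-bit entry for the w3a16g40sym weights discussed here
    "w3a16g40sym": GroupQuantize(
        name="w3a16g40sym",
        kind="group-quant",
        group_size=40,
        quantize_dtype="int3",
        storage_dtype="uint32",
        model_dtype="float16",
        linear_weight_layout="NK",
    )

After that, the usual mlc-llm convert/compile flow should be able to pick the new scheme up by name.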

D: 😩 I always manage to overcomplicate everything in the coding space. I thought I had to integrate the entire pipeline for it to work properly. Granted, I was also trying to quantize the model myself before you uploaded this, so I probably would have needed to do what I was doing anyway.

I really appreciate you providing the answer to that (thoroughly, I might add). With that being said, you'll be happy to know I bought your app (I put 2+2 together that NumenTech = PrivateLLM) 😂. I figure it'll be good for mobile usage in any case, and my parents throw away money on pay-per-token ChatGPT clones, so this'll break 'em free of that money pit.
