I'm struggling to make the Q2 version work on my M1 Max machine.

#2
by LucaColonnello - opened

I'm struggling to make the Q2 version work on my M1 Max machine.
Is 24GB of VRAM too low to make this work? I have a 32GB RAM machine.

It loads fine if I offload 40 layers to the GPU, but then it's very slow: ~2 tok/sec, with ~16s of time to first token.
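(For reference, LM Studio's GPU-layers setting corresponds to llama.cpp's offload flag under the hood. A minimal sketch of the equivalent llama.cpp invocation, with a hypothetical model path:

# -ngl / --n-gpu-layers sets how many transformer layers are offloaded
# to the Metal GPU backend; any remaining layers stay on the CPU.
./llama-cli -m ./model.gguf -ngl 40 -p "Hello"

Lowering -ngl reduces GPU memory pressure at the cost of more CPU work.)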

LM Studio Community org

I-quants run very slowly on Metal, so that's probably why you're getting worse performance. I'll add the regular Q2_K version, which should run better (but without imatrix support)

That's unfortunate, and certainly news to me. Anywhere I can read up on why this is?

LM Studio Community org

Why exactly, I'm not entirely sure, but you can at least find the backend support matrix here:

https://github.com/ggerganov/llama.cpp/wiki/Feature-matrix

Use this on your 32GB Apple M1 Max machine to raise the GPU wired-memory limit:

sudo sysctl iogpu.wired_limit_mb=28672

Just keep in mind this lets the GPU wire up to 28672 MB (28 GB), ~88% of your 32GB of unified memory.
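A quick usage note, assuming a recent macOS version where the iogpu.wired_limit_mb key is available: you can read the value back to verify the change took effect.

# Read back the current GPU wired-memory limit (0 = macOS default,
# roughly two-thirds to three-quarters of unified memory,
# depending on total RAM).
sysctl iogpu.wired_limit_mb

The setting does not persist across reboots, so re-run the sudo command above after restarting.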
