I'm struggling to make the Q2 version work on my M1 Max machine.

#2
by LucaColonnello - opened

I'm struggling to make the Q2 version work on my M1 Max machine.
Is 24GB of VRAM too low to make this work? I have a 32GB RAM machine.

It loads fine if I offload 40 layers to the GPU, but then it's very slow: ~2 tok/sec, with ~16s of time to first token.
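(For reference, LM Studio's GPU-layers setting corresponds to llama.cpp's offload flag under the hood. A minimal sketch of the equivalent llama.cpp invocation, with a hypothetical model path:

# -ngl / --n-gpu-layers sets how many transformer layers are offloaded
# to the Metal GPU backend; any remaining layers stay on the CPU.
./llama-cli -m ./model.gguf -ngl 40 -p "Hello"

Lowering -ngl reduces GPU memory pressure at the cost of more CPU work.)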

LM Studio Community org

I-quants run very slowly on Metal, so that's probably why you're getting worse performance. I'll add the regular Q2_K version, which should run better (but without imatrix support)

That's unfortunate, and certainly news to me. Anywhere I can read up on why this is?

LM Studio Community org

Why exactly, I'm not entirely sure, but you can at least find the backend support matrix here:

https://github.com/ggerganov/llama.cpp/wiki/Feature-matrix

Use this on your 32GB Apple M1 Max machine to raise the GPU wired-memory limit:

sudo sysctl iogpu.wired_limit_mb=28672

Just keep in mind this lets the GPU wire up to 28672 MB (28 GB), ~88% of your 32GB of unified memory.
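A quick usage note, assuming a recent macOS version where the iogpu.wired_limit_mb key is available: you can read the value back to verify the change took effect.

# Read back the current GPU wired-memory limit (0 = macOS default,
# roughly two-thirds to three-quarters of unified memory,
# depending on total RAM).
sysctl iogpu.wired_limit_mb

The setting does not persist across reboots, so re-run the sudo command above after restarting.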
