What kind of hardware environment do you use?

#14
by bobospace - opened

I run the grok-1-IQ3_XS-split-00001-of-00009.gguf model on my M3 Max 128GB MBP with the command line
"./server -m grok-1-IQ3_XS-split-00001-of-00009.gguf --port 8888 --host 0.0.0.0 --ctx-size 1024 --parallel 4 -ngl 999 -n 512"
but it only gives me 0.02 tokens per second.

Owner

I'm on a Threadripper with 256GB RAM and have no Apple experience, but have a look at this.
It might just be that you're out of RAM; 128GB is not much for Grok. I have smaller quants incoming soon.

Thanks. The really weird part is that I compiled llama.cpp with Metal support and ran with -ngl 99, and it's still really slow, but RAM usage is only at 50%.

If I want to merge those split files into one GGUF file, can I use ./gguf-split --merge to do it?

Owner

Yes, gguf-split --merge should merge the files, though that won't change anything about your memory issues.
Maybe look into mmap and how memory gets reported (cache vs. process memory).
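For reference, the merge mode of gguf-split takes the first split and an output path. This is a sketch, not a verified invocation for your build; check ./gguf-split --help, since the tool's flags have changed across llama.cpp versions:

```
# Point it at the first split; it finds the rest automatically.
./gguf-split --merge grok-1-IQ3_XS-split-00001-of-00009.gguf grok-1-IQ3_XS.gguf
```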
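On the memory-reporting point: llama.cpp memory-maps model files by default, so most of the model shows up as file cache rather than as the server process's private memory, which can make RAM usage look low while the machine is actually paging. A minimal Python sketch of the mechanism (using a small stand-in file, not a real GGUF model):

```python
import mmap
import os
import tempfile

# Create a small file standing in for a model file (hypothetical data;
# real GGUF files also start with the b"GGUF" magic bytes).
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"GGUF" + b"\x00" * 4096)

# Memory-map it read-only: pages are faulted in lazily by the OS and
# accounted as file cache, not as the process's private heap, which is
# why tools that show per-process RSS can under-report model memory.
with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    magic = mm[:4]  # touching this range pulls the page in from cache
    mm.close()

os.remove(path)
```

If the mapped file is larger than free RAM, those cached pages get evicted and re-read constantly, which would match the very low tokens-per-second you're seeing.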

To rent on vast.ai or runpod.io (for 2-3 bit quants):
2xH100 or 2xA100 80GB.
