How did you run this on a 512GB Mac Studio?

#2
by jimreynold2nd - opened

I can't seem to run this on mlx-lm head (currently 2c008fd0252b2c569227d12568356ab88ab0560a) with PR 1410 applied on top, installed with pip install --editable . and launched like this:

mlx_lm.server --model ~/models/glm-5.2-mlx-DQ4plus --host 0.0.0.0 --port 8090 --log-level INFO --temp 1.0 --top-p 0.95 --max-tokens 20000 --chat-template-args '{"reasoning_effort": "high"}'

This seemed to load the model, and /v1/models works, but when an actual generation request comes through, the process gets killed (OOM?).

So I'm wondering how you managed to run this on your 512GB Mac. Any special setup?
(Note: I'm relatively new to mlx-lm, but I've used this Mac with llama.cpp a lot, including running the Q5_K_M quant of GLM-5.1 on it for a while, so I know it's not a hardware issue.)

Try closing the apps that are eating your RAM, then try again.

If it still fails, try this command:

mlx_lm.generate --model ~/models/glm-5.2-mlx-DQ4plus --prompt "Hello"

If that still doesn't work, send me the output of this command:

ls -lh ~/models/glm-5.2-mlx-DQ4plus

That way I can verify you have the correct files.

Also note that this repo is not a single file; it contains many more. MLX is different from llama.cpp, where a model is one file.
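
If eyeballing the ls output is tedious, the same check can be scripted. This is only a minimal sketch, assuming the folder uses the usual sharded safetensors layout with a model.safetensors.index.json whose "weight_map" lists the shard files; check_model_dir is a made-up helper name, not part of mlx-lm.

```python
import json
from pathlib import Path


def check_model_dir(model_dir):
    """Return (missing_shards, total_bytes) for a sharded safetensors folder.

    Assumes the folder contains model.safetensors.index.json (an assumption
    about the layout, not something confirmed for this repo).
    """
    root = Path(model_dir).expanduser()
    index_path = root / "model.safetensors.index.json"
    if not index_path.is_file():
        raise FileNotFoundError(f"no index file at {index_path}")

    index = json.loads(index_path.read_text())
    # weight_map maps each tensor name to the shard file that stores it,
    # so its distinct values are the shard files that must be on disk.
    shards = sorted(set(index["weight_map"].values()))

    missing = [name for name in shards if not (root / name).is_file()]
    total_bytes = sum(
        (root / name).stat().st_size for name in shards if name not in missing
    )
    return missing, total_bytes
```

Calling check_model_dir("~/models/glm-5.2-mlx-DQ4plus") returns the list of shard files named in the index but absent on disk, plus the combined size in bytes of the shards that are present. An empty list means every shard the index expects was downloaded; the total gives a rough idea of how much of the 512GB the weights alone will need.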
