Successful 512GB Mac Studio run with Pedro PR branch

#2
by hugfaceme - opened

I got mlx-community/GLM-5.2-mxfp4 running locally on a 512GB Mac Studio.

My original failure path was the same class of issue: missing self_attn.indexer.* / missing indexer parameters. In my case the model download was fine , the key was using the patched mlx-lm runtime for GLM-5.2 DSA/indexer sharing.

Working setup:

512GB Mac Studio
mlx-community/GLM-5.2-mxfp4
clean isolated venv
MLX / mlx-metal 0.31.2
Pedro’s mlx-lm branch: glm-moe-dsa-indexer-sharing
jinja2
sudo sysctl iogpu.wired_limit_mb=480000
tested first with mlx_lm.generate
then ran local mlx_lm.server
Open WebUI connected to http://127.0.0.1:8080/v1
One small practical gotcha: I needed jinja2. Without it, the tokenizer chat template failed before generation.

Here is exact practical steps here:

https://gist.github.com/LocalAiCherry/6c769b529ebaebc6088449f63025676d

One small practical dependency I had to add was jinja2, a Python template library used by the tokenizer chat template. Without it, generation stopped before the model could answer.

what speed to you get ? Do you see much difference with the official unquantized model ?

Sign up or log in to comment