Garbage very soon in the output
I'm seeing broken language in this quant at under 32k of context: broken English, repetition, etc. I can't tell definitively whether it is the quant itself, but it's the first time I'm encountering this. Thanks, Kevin
Yep, I saw the same thing with GLM-5.1 mxfp4-q8 from the same creator after mlx-lm v0.31.2, running TP across 2 M3 Ultra Mac Studios. It ran well on mlx-lm v0.31.1. Now it is happening again with this model.
If there are issues with the quant, I'd like to fix this. Can you please give some more details on how you are running the quant? You mentioned that you are running it across 2 m3u mac studios. How are you doing this, which versions of software are you using, etc.
Are you running this on a single m3u, or also across 2 m3u mac studios? How are you running this?
There might be some subtle bugs left in PR #1410. In that case, any info could help to improve this PR.
Perhaps you could also try https://huggingface.co/mlx-community/GLM-5.2-4bit? If the 4-bit version is working fine, then we know this quant is faulty. If not, then we know it is probably something in mlx-lm.
[Edited for clarity]
@bibproj Thanks for the response. Yes, I have 2 M3 Ultras, a 512GB and a 256GB, and I'm using pipeline distribution over an RDMA connection for this model.
When I look at what EXO is using, I see the following.
I'm seeing that the mlx_lm version is 0.31.3, returned from running python -c "import mlx_lm; print(mlx_lm.__file__)".
Running uv pip freeze | grep mlx in the exo folder gives the following. Hope this is helpful. It does seem like what I am seeing is context related: it's not obvious at first, but it creeps in as the context grows and then gets really bad.
mlx @ git+https://github.com/rltakashige/mlx-jaccl-fix-small-recv.git@cc3f3e60be1289506125f2fa19b73b05aa770df8
mlx-lm @ git+https://github.com/rltakashige/mlx-lm@6a3df6cd6b00a347ee40f12d97a182aaf86ea599
mlx-vlm==0.4.4
I'm going to try GLM-5.2-4bit now and I'll provide an update here.
uv pip show exo
Name: exo
Version: 0.3.70
In exo folder:
git rev-parse --short HEAD
git log -1 --oneline
main, origin/main, origin/HEAD) libp2p -> zenoh (#2132)
--Edited, responded to the wrong person. apologies.
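For anyone else gathering the same details for a report, here is a minimal sketch that prints the installed versions of the relevant packages from inside the environment. The helper name installed_versions and the package list are illustrative assumptions, not part of EXO or mlx-lm. For packages installed from a git URL it reports only the version string; the pinned commit still has to come from uv pip freeze as shown above.

```python
from importlib import metadata


def installed_versions(packages):
    """Return {distribution name: version string, or None if not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return versions


if __name__ == "__main__":
    # Hypothetical package list; adjust to whatever your setup uses.
    for name, version in installed_versions(["mlx", "mlx-lm", "mlx-vlm", "exo"]).items():
        print(f"{name}: {version or 'not installed'}")
```

Run it with the same interpreter that serves the model (for example via uv run python), otherwise it reports a different environment.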
You could also try https://huggingface.co/mlx-community/GLM-5.2-4bit. If that one is working OK, it is this quant.
Yep. Downloading. In the meantime, I can report my findings.
UV project settings:
requires-python = ">=3.14"
dependencies = [
"mlx-lm",
"tiktoken>=0.12.0",
]
[tool.uv.sources]
mlx-lm = { git = "https://github.com/pcuenca/mlx-lm", branch = "glm-moe-dsa-indexer-sharing" }
[tool.uv]
prerelease = "allow"
Here is a curl command that reproduces the issue 100% of the time:
curl -s http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "mlx-community/GLM-5.2-DQ4plus-q8",
"messages": [
{
"role": "user",
"content": "qahft rxcka fnafq ofpva usiey iccwp usnzj ovqwp sbfhc gchqj jfgyq pesej zqorv ufaig fywir kxlgg ogpxk fzncb cqukb jznzw asrng qclly wgnex whqpd touna iaywv hbwyc mbttd mogwl fosfi zqlnd fipff bqfxw bgrfd yomuu ecllm srzck iwgel khgyl wobzv zywem fkbjz gulky zoseh zpotb pnwey ceprg dxgpq kpnyf sgkrh itblz zbfgy wwjev spzra vhryd dcohp sfqgm xvclh auqgt olabw xovpd dixuw xfgcu wkqey wzvwa tiyuw vgucw wfvlh ufafi wzhqk znycz ezgcl sipnk ogsay nstrj briiw shikk hdkyr xqhoa hlprm lfmxu ecnqi vtrff agmwb kqfsm gratu cleyn bgwlu rzpyx psnvo xtmgg qtnqh chhio dgssb kokfk xpswt jajtw yktop fflai rkemd qakoa qdmbj fitjt vgcao zjquq tyfad drofs teptc vzcai ryksb rqcuw pdzuj ljniw vcyqv sltzh vnmlt gwvcw gmpja wefui wshay csmtr qmucf rhpmn ltish gdfdn tefms znhcf feanh zosdw mwoml ebymv kbqbd ohzdt pufnl lwzhq ptyff reeba lphgs"
}
],
"temperature": 1.0,
"max_tokens": 100,
"stream": false
}'
Response on dual macs
The model is served by uv run mlx.launch --hostfile topology.json --env MLX_METAL_FAST_SYNCH=1 --env HF_HUB_OFFLINE=1 -- /Users/imbible/projects/ai/mlx-dist/.venv/bin/mlx_lm.server --model mlx-community/GLM-5.2-DQ4plus-q8 --port 8080 --host 0.0.0.0
{"id": "chatcmpl-9d20dfd7-d425-4add-9856-0201ad7836f7", "system_fingerprint": "0.31.3-0.31.2-macOS-26.5.1-arm64-arm-64bit-Mach-O-applegpu_g15d", "object": "chat.completion", "model": "mlx-community/GLM-5.2-DQ4plus-q8", "created": 1781985842, "choices": [{"index": 0, "finish_reason": "length", "message": {"role": "assistant", "reasoning": "SandersKWKWSandersKWSandersDDSKWSandersKWKWSandersDDSKWSandersKWSandersKWJSandersKWSandersSandersKWSandersKWSandersKWSandersKWSandersKWSandersDDSKWSandersSandersKWSandersDDSKWSandersSandersSandersSandersSandersDDSKWSandersKWSandersDDSDDSKWSandersDDSSandersDDSSandersDDSKWSandersDDSDDSDDSDDSDDSSandersDDSKWSandersSandersDDSDDSDDSDDSSandersDDSDDSDDSSandersDDSDDSDDSDDSSandersDDSSandersDDSDDSDDSDDSDDSSandersDDSDDSDDSDDSDDSDDS"}}], "usage": {"prompt_tokens": 406, "completion_tokens": 100, "total_tokens": 506, "prompt_tokens_details": {"cached_tokens": 405}}}
Response on single mac
The model is served by uv run mlx_lm.server --model mlx-community/GLM-5.2-DQ4plus-q8 --host 0.0.0.0 --port 8080
{"id": "chatcmpl-b22f8d75-7676-42ae-ba2d-33989b113eac", "system_fingerprint": "0.31.3-0.31.2-macOS-26.5.1-arm64-arm-64bit-Mach-O-applegpu_g15d", "object": "chat.completion", "model": "mlx-community/GLM-5.2-DQ4plus-q8", "created": 1781987386, "choices": [{"index": 0, "finish_reason": "length", "message": {"role": "assistant", "reasoning": "The user wants me to process a sequence of 5-letter words.\n\nThere are 168 words in total.\n\nLet's look at the words:\nqahft rxcka fnafq ofpva usiey iccwp usnzj ovqwp sbfhc gchqj jfgyq pesej zqorv ufaig fywir kxlgg ogpxk fzncb cqukb jznzw asrng qclly wgnex wh"}}], "usage": {"prompt_tokens": 406, "completion_tokens": 100, "total_tokens": 506, "prompt_tokens_details": {"cached_tokens": 0}}}
Gemini 3.1 Pro says:
It is entirely caused by mlx-lm's Tensor Parallelism (shard_inplace) improperly slicing the mixed-bitwidth arrays (5-bit and 6-bit quantization groups) in the DQ4plus-q8 format, shearing the group boundaries and poisoning the math on those specific MoE experts.
It is interesting that I also tested mlx-community/GLM-5.2-mxfp4, and it behaves even more weirdly than this model on dual-Mac TP through mlx-lm with JACCL. At least this model with DQ4plus responds well to some short prompts. The mxfp4 one outputs gibberish almost regardless of prompt length.
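The difference between the dual-Mac and single-Mac responses in this report can also be checked mechanically when comparing many prompts or builds. The following is a rough sketch under stated assumptions: the helper names, the 4-character n-gram size, and the 0.5 threshold are all hypothetical choices, and the text is read from the content or reasoning field as it appears in the responses above.

```python
def distinct_ngram_ratio(text, n=4):
    """Fraction of character n-grams in `text` that are distinct (1.0 = no repeats)."""
    if len(text) < n:
        return 1.0
    ngrams = [text[i:i + n] for i in range(len(text) - n + 1)]
    return len(set(ngrams)) / len(ngrams)


def response_text(response):
    """Pull the generated text out of a chat.completion dict (content or reasoning)."""
    message = response["choices"][0]["message"]
    return message.get("content") or message.get("reasoning") or ""


def looks_degenerate(response, n=4, threshold=0.5):
    """Heuristic: heavy n-gram repetition suggests looping, garbage output."""
    return distinct_ngram_ratio(response_text(response), n) < threshold
```

This only catches repetition loops like the "SandersKW..." output; it says nothing about broken but non-repeating language, and the threshold is tuned by eye for short completions (around 100 tokens), so treat it as a triage aid rather than a verdict.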
Thank you for the initial feedback already.
You can also try to add --chat-template-config '{"enable_thinking": false}' to the command-line of mlx. If your prompts work with this, then it means the reasoning causes the issues you see. That might be useful to discover what is happening here.
Same as @labhraighlep : gibberish output, but seen on other quantizations:
- kernelpool/GLM-5.2-8bit
- pipenetwork/GLM-5.2-MLX-4bit
Tested both with TP across 2 Mac Studio (for 8bit version) and without (for 4bit version on 1 Mac).
Always the same: above a certain system prompt size, the output is just garbage (broken language, repetitions, even double </think>).
I will test @bibproj's suggestion later on.
Easy to reproduce with a brand new installation of OpenClaw:
- cloud Full Precision: just works
- any local quantized version (with EXO): never works
@labhraighlep @imbible @cgeekm
I mentioned this thread in PR #1410 (https://github.com/ml-explore/mlx-lm/pull/1410)
I checked my settings. I was running with reasoning off when I first encountered this.
[ 12:04:50.6360PM | INFO ] Executing command: TextGeneration(command_id='76a007ba-eca9-4265-9c09-7326c8fc2fa9' task_params=TextGenerationTaskParams(model='mlx-community/GLM-5.2-DQ4plus-q8', input=[InputMessage(role='user', content=<InputMessageContent: Who are you?>)], instructions=None, max_output_tokens=None, temperature=1.0, top_p=0.95, stream=True, tools=None, bench=False, use_prefix_cache=False, top_k=None, stop=None, seed=None, chat_template_messages=[{'role': 'user', 'content': <InputMessageContent: Who are you?>}],
--> reasoning_effort='none', enable_thinking=False,
logprobs=False, top_logprobs=None, min_p=None, repetition_penalty=None, repetition_context_size=None, presence_penalty=None, frequency_penalty=1.1, images=[], image_hashes={}, prefill_endpoint=None))
@bibproj
The PR #1410 (https://github.com/ml-explore/mlx-lm/pull/1410) fixes everything for me! All the models I mentioned, single Mac or two with TP/RDMA.
I guess mlx-community/GLM-5.2-DQ4plus-q8 will work too (can't test, I did not download this one).
I have applied the PR to the version of MLX-LM used by EXO. But, I won't be able to test this until later in the day.
Seems solved. Can you confirm @labhraighlep and @imbible ?
I can confirm this works correctly after applying 2 fixes. Besides the one you've put in the README (ml-explore/mlx-lm#1410), which I had already applied, the missing one that is essential for multi-GPU TP setups is ml-explore/mlx#3451. It is not in the latest release of mlx, so building from main is required.
The PR #1410 (https://github.com/ml-explore/mlx-lm/pull/1410) fixes everything for me! All the models I mentioned, single Mac or two with TP/RDMA.
I guess mlx-community/GLM-5.2-DQ4plus-q8 will work too (can't test, I did not download this one).
Is the 1410 PR the only change you made? Did you get the quants you are running working in EXO? I got it running on the 512 only in MLX-VLM. I have dissimilar memory configs on the 2 Ultras I have which complicates sharding outside of EXO.
Essentially yes, only this PR. But applied on mlx-lm = { git = "https://github.com/rltakashige/mlx-lm", branch = "leo/deepseek-v4" } which I forked on my own GitLab to apply the PR (leo/deepseek-v4 is EXO's dependency from commit https://github.com/exo-explore/exo/commit/90f24bef30ceef09b45946970f53b43bf44c3206 from which I rebased recently).
I'm running exclusively with EXO, on 2 Mac Studio 512GB with TP through RDMA, with kernelpool/GLM-5.2-8bit. I have been using it a lot since my message (with an OpenClaw fed by a custom Kanban, for several hours at a stretch). Zero problems encountered; it is working like a charm.
Cool, good to know. I had done the same, also starting from the Leo branch to ensure the deepseek-v4 settings were in the baseline before applying PR 1410. No luck: I got this specific model to launch once on the 512, and it failed on sharding every time I attempted to distribute it using pipeline (not TP). I'm not distributing across 2 × 512; I have a 512 and a 256.
IMO, this issue was closed too early.
Question: did you get this model specifically to load/distribute for you? I'll look at the ml-explore/mlx#3451 precursor @imbible mentioned. I suspect he also has 2 × 512s.
Hi, I'm in the same boat as you, with 2 × 512s. I'm going to try building off mlx#3451 and will report on how it goes.
Unfortunately, it did not fix my problem.
Here is my EXO setup: https://github.com/c-geek/exo
It actually integrates two patches:
- mlx-lm with PR #1410 : https://github.com/c-geek/mlx-lm/tree/glm-5.2-dsa
- mlx custom patch for JACCL when we have only 2 studios: https://github.com/c-geek/mlx/tree/address-rdma-gpu-locks
I did not get mlx-community/GLM-5.2-DQ4plus-q8 specifically yet. But I will in the coming days.
Thank you so much for the in-depth response; it was a major boost for me. That JACCL patch was huge, because the model loading and then stalling at prefill was causing memory to not be released, which meant constant reboots to continue testing. Exhausting and time consuming. I baselined from your repo, which was very helpful, and I got the MLX 4-bit model and the Q8 one this discussion is about to load on the single 512. However, once I attempted to redistribute, I ran into an error. Thankfully, the memory was released even with the failure, which sped up my troubleshooting. There is an issue in the auto_parallel.py script where auto_parallel sends a tuple to mx.distributed.send(), which expects an mx.array. After working through a fix for this script with the help of ChatGPT, we got it working and loading on the 512 and 256. It worked for both the 4-bit and the DQ4plus-q8 quants, which both failed to redistribute originally.
Now I can look at long context for a bit and see if the original issue seen with the quant is still present.
Edit: Garbled long context for this model is resolved by these fixes. Formatting is also working correctly now.
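The actual auto_parallel.py patch is not shown in this thread, so the following is only a sketch of one possible shape for that kind of fix: flattening a tuple so that the send function only ever receives single arrays. The helper name send_all is hypothetical, and the send parameter is assumed to be mx.distributed.send (or any callable taking an array and a destination rank), which also keeps the sketch runnable without MLX installed.

```python
def send_all(send, value, dst):
    """Send `value` to rank `dst`, one array at a time.

    `send` is any callable of the form send(array, dst); with MLX this is
    assumed to be mx.distributed.send. Tuples and lists (including nested
    ones) are flattened so `send` never receives a container.
    Returns the list of values returned by `send`, in send order.
    """
    if isinstance(value, (tuple, list)):
        results = []
        for item in value:
            results.extend(send_all(send, item, dst))
        return results
    return [send(value, dst)]


if __name__ == "__main__":
    # Stand-in for mx.distributed.send so the example runs anywhere.
    def fake_send(array, dst):
        print(f"sending {array!r} to rank {dst}")
        return array

    send_all(fake_send, ("hidden_states", ("cache_k", "cache_v")), dst=1)
```

With real MLX arrays, the values returned by mx.distributed.send still need to be evaluated for the send to happen, and the receiving rank has to post matching receives in the same order.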
Perhaps also relevant to the topic?
https://github.com/ml-explore/mlx-lm/pull/1431
Thanks @bibproj for pointing that out! However, according to my tests, GLM 5.2 is not affected by that issue; only DeepSeek 3.2 is.
Glad to see everything works now!
Now that you mention it, I remember struggling with this issue of RAM not being released. I had to reboot a lot too. It drains time and motivation, indeed.