Garbage very soon in the output

#3
by labhraighlep - opened

I'm seeing broken language in this quant at under 32k context: broken English, repetition, etc. I can't tell definitively whether it is the quant itself, but it's the first time I'm encountering this. Thanks, Kevin

Yep, I saw the same thing on glm5.1 mxfp4-q8 from the same creator after upgrading to mlx-lm v0.31.2, running TP across 2 M3 Ultra Mac Studios. It ran well on mlx-lm v0.31.1. Now it is happening again with this model.

MLX Community org
•
edited 5 days ago

@imbible

If there are issues with the quant, I'd like to fix them. Can you please give some more details on how you are running the quant? You mentioned that you are running it across 2 M3 Ultra Mac Studios. How are you doing this, and which versions of the software are you using?

@labhraighlep

Are you running this on a single m3u, or also across 2 m3u mac studios? How are you running this?

There might be some subtle bugs left in PR #1410. In that case, any info could help to improve this PR.

MLX Community org
•
edited 5 days ago

@labhraighlep @imbible

Perhaps you could also try https://huggingface.co/mlx-community/GLM-5.2-4bit? If the 4-bit version is working fine, then we know this quant is faulty. If not, then we know it is probably something in mlx-lm.

[Edited for clarity]

@bibproj Thanks for the response. Yes, I have 2 M3 Ultras, a 512GB and a 256GB, and I'm using pipeline distribution over an RDMA connection for this model.

When I look at what EXO is using, I see the following.
The mlx_lm version is 0.31.3, as returned from running python -c "import mlx_lm; print(mlx_lm.__file__)".
I hope this is helpful. It does seem like what I am seeing is context-related: it's not obvious at first, but it creeps in as the context grows and then gets really bad. Running uv pip freeze | grep mlx in the exo folder gives the following:
mlx @ git+https://github.com/rltakashige/mlx-jaccl-fix-small-recv.git@cc3f3e60be1289506125f2fa19b73b05aa770df8
mlx-lm @ git+https://github.com/rltakashige/mlx-lm@6a3df6cd6b00a347ee40f12d97a182aaf86ea599
mlx-vlm==0.4.4
I'm going to try GLM-5.2-4bit now and I'll provide an update here.
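For a quick cross-check of which MLX packages an environment actually resolves, a small sketch along these lines may help. It uses only the standard library's importlib.metadata; the package list is an assumption based on the names in the freeze output above, and the helper name installed_versions is hypothetical. A Git-installed fork reports its declared version rather than its commit, so uv pip freeze is still needed to identify the exact revision.

```python
from importlib import metadata


def installed_versions(packages):
    """Return {distribution name: version string, or None if not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return versions


if __name__ == "__main__":
    # Assumed package list, taken from the names mentioned in this thread.
    for name, version in installed_versions(["mlx", "mlx-lm", "mlx-vlm", "exo"]).items():
        print(f"{name}: {version or 'not installed'}")
```

Running it inside the same virtual environment that launches the server (for example via uv run) should avoid reading versions from a different interpreter.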

uv pip show exo
Name: exo
Version: 0.3.70

In exo folder:
git rev-parse --short HEAD
git log -1 --oneline
main, origin/main, origin/HEAD) libp2p -> zenoh (#2132)

Edited: I responded to the wrong person. Apologies.

@labhraighlep @imbible

You could also try https://huggingface.co/mlx-community/GLM-5.2-4bit. If that one is working OK, it is this quant.

Yep. Downloading. In the meantime, I can report my findings.
UV project settings:

requires-python = ">=3.14"
dependencies = [
    "mlx-lm",
    "tiktoken>=0.12.0",
]

[tool.uv.sources]
mlx-lm = { git = "https://github.com/pcuenca/mlx-lm", branch = "glm-moe-dsa-indexer-sharing" }

[tool.uv]
prerelease = "allow"

Here is a curl command that reproduces the issue 100% of the time:

curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/GLM-5.2-DQ4plus-q8",
    "messages": [
      {
        "role": "user",
        "content": "qahft rxcka fnafq ofpva usiey iccwp usnzj ovqwp sbfhc gchqj jfgyq pesej zqorv ufaig fywir kxlgg ogpxk fzncb cqukb jznzw asrng qclly wgnex whqpd touna iaywv hbwyc mbttd mogwl fosfi zqlnd fipff bqfxw bgrfd yomuu ecllm srzck iwgel khgyl wobzv zywem fkbjz gulky zoseh zpotb pnwey ceprg dxgpq kpnyf sgkrh itblz zbfgy wwjev spzra vhryd dcohp sfqgm xvclh auqgt olabw xovpd dixuw xfgcu wkqey wzvwa tiyuw vgucw wfvlh ufafi wzhqk znycz ezgcl sipnk ogsay nstrj briiw shikk hdkyr xqhoa hlprm lfmxu ecnqi vtrff agmwb kqfsm gratu cleyn bgwlu rzpyx psnvo xtmgg qtnqh chhio dgssb kokfk xpswt jajtw yktop fflai rkemd qakoa qdmbj fitjt vgcao zjquq tyfad drofs teptc vzcai ryksb rqcuw pdzuj ljniw vcyqv sltzh vnmlt gwvcw gmpja wefui wshay csmtr qmucf rhpmn ltish gdfdn tefms znhcf feanh zosdw mwoml ebymv kbqbd ohzdt pufnl lwzhq ptyff reeba lphgs"
      }
    ],
    "temperature": 1.0,
    "max_tokens": 100,
    "stream": false
  }'
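To see at which prompt length the output starts to break, the same kind of request body can be generated for different sizes. The following is a minimal sketch, not the script used for the test above: the helper names random_word_prompt and build_payload are hypothetical, and the seeded random words only imitate the style of the prompt in the curl command (168 five-letter words) rather than reproducing it.

```python
import json
import random
import string


def random_word_prompt(n_words, word_len=5, seed=0):
    """Build a prompt of n_words random lowercase words, reproducible per seed."""
    rng = random.Random(seed)
    words = [
        "".join(rng.choice(string.ascii_lowercase) for _ in range(word_len))
        for _ in range(n_words)
    ]
    return " ".join(words)


def build_payload(prompt, model="mlx-community/GLM-5.2-DQ4plus-q8", max_tokens=100):
    """Mirror the JSON body of the curl command above for a given prompt."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 1.0,
        "max_tokens": max_tokens,
        "stream": False,
    }


if __name__ == "__main__":
    # Print one request body; change the word count to probe for a threshold.
    print(json.dumps(build_payload(random_word_prompt(168))))
```

The printed JSON can be saved to a file and passed to curl with -d @file in place of the inline body, which makes it easier to compare the single-Mac and dual-Mac servers on identical input.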

Response on dual macs

The model is served by uv run mlx.launch --hostfile topology.json --env MLX_METAL_FAST_SYNCH=1 --env HF_HUB_OFFLINE=1 -- /Users/imbible/projects/ai/mlx-dist/.venv/bin/mlx_lm.server --model mlx-community/GLM-5.2-DQ4plus-q8 --port 8080 --host 0.0.0.0

{"id": "chatcmpl-9d20dfd7-d425-4add-9856-0201ad7836f7", "system_fingerprint": "0.31.3-0.31.2-macOS-26.5.1-arm64-arm-64bit-Mach-O-applegpu_g15d", "object": "chat.completion", "model": "mlx-community/GLM-5.2-DQ4plus-q8", "created": 1781985842, "choices": [{"index": 0, "finish_reason": "length", "message": {"role": "assistant", "reasoning": "SandersKWKWSandersKWSandersDDSKWSandersKWKWSandersDDSKWSandersKWSandersKWJSandersKWSandersSandersKWSandersKWSandersKWSandersKWSandersKWSandersDDSKWSandersSandersKWSandersDDSKWSandersSandersSandersSandersSandersDDSKWSandersKWSandersDDSDDSKWSandersDDSSandersDDSSandersDDSKWSandersDDSDDSDDSDDSDDSSandersDDSKWSandersSandersDDSDDSDDSDDSSandersDDSDDSDDSSandersDDSDDSDDSDDSSandersDDSSandersDDSDDSDDSDDSDDSSandersDDSDDSDDSDDSDDSDDS"}}], "usage": {"prompt_tokens": 406, "completion_tokens": 100, "total_tokens": 506, "prompt_tokens_details": {"cached_tokens": 405}}}

Response on single mac

The model is served by uv run mlx_lm.server --model mlx-community/GLM-5.2-DQ4plus-q8 --host 0.0.0.0 --port 8080

{"id": "chatcmpl-b22f8d75-7676-42ae-ba2d-33989b113eac", "system_fingerprint": "0.31.3-0.31.2-macOS-26.5.1-arm64-arm-64bit-Mach-O-applegpu_g15d", "object": "chat.completion", "model": "mlx-community/GLM-5.2-DQ4plus-q8", "created": 1781987386, "choices": [{"index": 0, "finish_reason": "length", "message": {"role": "assistant", "reasoning": "The user wants me to process a sequence of 5-letter words.\n\nThere are 168 words in total.\n\nLet's look at the words:\nqahft rxcka fnafq ofpva usiey iccwp usnzj ovqwp sbfhc gchqj jfgyq pesej zqorv ufaig fywir kxlgg ogpxk fzncb cqukb jznzw asrng qclly wgnex wh"}}], "usage": {"prompt_tokens": 406, "completion_tokens": 100, "total_tokens": 506, "prompt_tokens_details": {"cached_tokens": 0}}}
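When comparing many responses, the kind of looping seen in the dual-Mac output can be flagged automatically. The sketch below is one rough heuristic and an assumption on my part, not something used in this thread: it measures how many character n-grams in the text are distinct, and the function names, n=4, and the 0.3 threshold are arbitrary illustrative choices.

```python
def distinct_ngram_ratio(text, n=4):
    """Fraction of character n-grams that are distinct (1.0 means no repeats)."""
    if len(text) < n:
        return 1.0
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    return len(set(grams)) / len(grams)


def looks_degenerate(text, n=4, threshold=0.3):
    """Flag text whose n-grams repeat heavily; the threshold is a guess to tune."""
    return distinct_ngram_ratio(text, n) < threshold


if __name__ == "__main__":
    # Short excerpts from the two "reasoning" fields quoted above.
    dual = "SandersKWKWSandersKWSandersDDSKWSandersKWKWSandersDDSKWSandersKWSandersKWJSanders"
    single = "The user wants me to process a sequence of 5-letter words."
    for label, sample in (("dual", dual), ("single", single)):
        print(label, round(distinct_ngram_ratio(sample), 3), looks_degenerate(sample))
```

A low ratio only indicates repetition; it will not catch garbage that is incoherent but non-repetitive, so it complements reading the output rather than replacing it.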

Gemini 3.1 Pro says:
It is entirely caused by mlx-lm's Tensor Parallelism (shard_inplace) improperly slicing the mixed-bitwidth arrays (5-bit and 6-bit quantization groups) in the DQ4plus-q8 format, shearing the group boundaries and poisoning the math on those specific MoE experts.

Interestingly, I also tested mlx-community/GLM-5.2-mxfp4, and it behaves even more strangely than this model on dual-Mac TP through mlx-lm with jaccl. At least this model with DQ4plus responds well to some short prompts. The mxfp4 one outputs gibberish almost regardless of prompt length.

MLX Community org

@labhraighlep @imbible

Thank you for the initial feedback already.

You can also try to add --chat-template-config '{"enable_thinking": false}' to the command-line of mlx. If your prompts work with this, then it means the reasoning causes the issues you see. That might be useful to discover what is happening here.

Same as @labhraighlep : gibberish output, but seen on other quantizations:

  • kernelpool/GLM-5.2-8bit
  • pipenetwork/GLM-5.2-MLX-4bit

Tested both with TP across 2 Mac Studio (for 8bit version) and without (for 4bit version on 1 Mac).

Always the same: above a certain system prompt size, the output is just garbage (broken language, repetitions, even double </think>).

I will test @bibproj's suggestion later on.

Easy to reproduce with a brand new installation of OpenClaw:

  • cloud Full Precision: just works
  • any local quantized version (with EXO): never works

MLX Community org

I checked my settings. I was running with reasoning off when I first encountered this.
[ 12:04:50.6360PM | INFO ] Executing command: TextGeneration(command_id='76a007ba-eca9-4265-9c09-7326c8fc2fa9' task_params=TextGenerationTaskParams(model='mlx-community/GLM-5.2-DQ4plus-q8', input=[InputMessage(role='user', content=<InputMessageContent: Who are you?>)], instructions=None, max_output_tokens=None, temperature=1.0, top_p=0.95, stream=True, tools=None, bench=False, use_prefix_cache=False, top_k=None, stop=None, seed=None, chat_template_messages=[{'role': 'user', 'content': <InputMessageContent: Who are you?>}],

--> reasoning_effort='none', enable_thinking=False,

logprobs=False, top_logprobs=None, min_p=None, repetition_penalty=None, repetition_context_size=None, presence_penalty=None, frequency_penalty=1.1, images=[], image_hashes={}, prefill_endpoint=None))

@bibproj
The PR #1410 (https://github.com/ml-explore/mlx-lm/pull/1410) fixes everything for me! All the models I mentioned, single Mac or two with TP/RDMA.

I guess mlx-community/GLM-5.2-DQ4plus-q8 will work too (can't test, I did not download this one).

MLX Community org

Seems solved. Can you confirm @labhraighlep and @imbible ?

I have applied the PR to the version of MLX-LM used by EXO. But, I won't be able to test this until later in the day.

Seems solved. Can you confirm @labhraighlep and @imbible ?

I can confirm this works correctly after applying 2 fixes. Besides the one you've put in the README (ml-explore/mlx-lm#1410), which I had already applied, the missing one that is essential for multi-GPU TP setups is ml-explore/mlx#3451. It is not in the latest release of mlx, so you need to build mlx from main.

MLX Community org

Thank you @imbible . Good news that this is now working properly!

bibproj changed discussion status to closed

@cgeekm

The PR #1410 (https://github.com/ml-explore/mlx-lm/pull/1410) fixes everything for me! All the models I mentioned, single Mac or two with TP/RDMA.

I guess mlx-community/GLM-5.2-DQ4plus-q8 will work too (can't test, I did not download this one).

Is PR 1410 the only change you made? Did you get the quants you are running working in EXO? I got it running only on the 512, in MLX-VLM. My 2 Ultras have dissimilar memory configs, which complicates sharding outside of EXO.

Essentially yes, only this PR. But applied on mlx-lm = { git = "https://github.com/rltakashige/mlx-lm", branch = "leo/deepseek-v4" } which I forked on my own GitLab to apply the PR (leo/deepseek-v4 is EXO's dependency from commit https://github.com/exo-explore/exo/commit/90f24bef30ceef09b45946970f53b43bf44c3206 from which I rebased recently).

I'm running exclusively with EXO, on 2 Mac Studios with 512GB each, with TP through RDMA and kernelpool/GLM-5.2-8bit. I have been using it a lot since my message (with an OpenClaw fed by a custom Kanban, for several hours at a time). I have encountered zero problems; it works like a charm.

Cool, good to know. I had done the same, also starting from the Leo branch to ensure the deep-v4 settings were in the baseline before applying PR 1410. No luck: I got this specific model to launch once on the 512, and it failed on sharding every time I attempted distributed mode using pipeline (not TP). I'm not distributing on 2 × 512; I have a 512 and a 256.

IMO, this issue was closed too early.

Question: did you get this specific model to load/distribute for you? I'll look at the ml-explore/mlx#3451 precursor @imbible mentioned. I suspect he also has 2 × 512s.

Hi, I'm in the same boat as you, on 2 × 512s. I'm going to try building off mlx#3451 and will report on how it goes.

Unfortunately, that did not fix my problem.

@labhraighlep

Here is my EXO setup: https://github.com/c-geek/exo

It actually integrates two patches:

I did not get mlx-community/GLM-5.2-DQ4plus-q8 specifically yet. But I will in the coming days.

@cgeekm

Thank you so much for the in-depth response; it was a major boost for me. That JACCL patch was huge, because the model loading and then stalling at prefill was causing memory to not be released, so I had to reboot constantly to continue testing. Exhausting and time-consuming. I baselined from your repo, which was very helpful, and I got the MLX 4-bit model and the Q8 one this discussion is about to load on the single 512. However, once I attempted to redistribute, I ran into an error. Thankfully, the memory was released even on failure, which sped up my troubleshooting. There is an issue in the auto_parallel.py script where auto_parallel sends a tuple to mx.distributed.send(), which expects an mx.array. After working through a fix for this script with the help of ChatGPT, I got it working and loading on the 512 and the 256. It worked for both the 4-bit and the DQ4plus-q8 quant, both of which originally failed to redistribute.
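The actual change to auto_parallel.py is not shown in this thread, so the following is only a sketch of the general idea of sending a tuple as a sequence of single-array sends. The names flatten and send_each are hypothetical, and send_fn is a stand-in for a single-array send such as mx.distributed.send; it is passed in as a parameter so that the sketch runs without MLX installed.

```python
def flatten(payload):
    """Yield the leaves of a (possibly nested) tuple or list, in order."""
    if isinstance(payload, (tuple, list)):
        for item in payload:
            yield from flatten(item)
    else:
        yield payload


def send_each(payload, dst, send_fn):
    """Call send_fn(array, dst) once per leaf instead of passing a tuple.

    send_fn is an assumed stand-in for a single-array send; its results
    are returned so the caller can evaluate or wait on them.
    """
    return [send_fn(item, dst) for item in flatten(payload)]


if __name__ == "__main__":
    # Stub send that just records what would be sent to which rank.
    sent = []
    send_each(("hidden", ("cache_k", "cache_v")), 1, lambda a, d: sent.append((a, d)))
    print(sent)
```

With this approach the receiving rank has to issue matching receives in the same order and with the same shapes, so any real fix would need both sides of the pipeline updated together.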

Now I can look at long context for a bit and see if the original issue seen with the quant is still present. 👍

Edit: Garbled long context for this model is resolved by these fixes. Formatting is also working correctly now.

MLX Community org

Perhaps also relevant to the topic?
https://github.com/ml-explore/mlx-lm/pull/1431

Thanks @bibproj for pointing that out! However, according to my tests, GLM 5.2 is not affected by that issue; only DeepSeek 3.2 is.

@labhraighlep

Glad to see everything works now!

Now that you mention it, I remember that I also struggled with this issue of RAM not being released. I had to reboot a lot too. It drains time and motivation, indeed.
