2x RTX PRO 6000 Blackwell produces garbage output
Hi
Trying to run this exactly as configured in the model card for 2x RTX Pro (with the GPU devices set for 2 GPUs), I get garbage output after about 100 tokens. It seems related to https://github.com/sgl-project/sglang/issues/24321
Is this issue being tracked in this project at all?
I can reproduce a similar garbage-token correctness issue on 2× RTX PRO 6000 Blackwell with lukealonso/MiMo-V2.5-NVFP4 and docker.io/lukealonso/sglang-cuda13-b12x.
Unlike the +++++ degeneration reported in that SGL issue, my output is multilingual/random-token gibberish, but the failure mode is similar: deterministic decoding is corrupted.
Tested:
- TP=2
- no EAGLE / no speculative decoding
- no PCIe oneshot allreduce
- temperature=0
- raw /v1/completions, not only chat completions
- short prompt: “Answer in English only. What is 2+2? Reply with exactly one short sentence.”
Example output:
“4โทรศัพ Featuringมั้ sabe彩(init ...”
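For anyone trying to reproduce this, the test above could be scripted roughly as follows. This is a sketch, not the exact script used. The server address and port, the `max_tokens` value, passing the model path as the `model` field, and the `non_ascii_ratio` helper are all assumptions added for illustration. Only the prompt, `temperature=0`, and the raw `/v1/completions` route come from the report.

```python
import json
import urllib.request

# Assumptions (not from the report): server address/port and max_tokens.
BASE_URL = "http://localhost:30000"
MODEL = "lukealonso/MiMo-V2.5-NVFP4"
PROMPT = "Answer in English only. What is 2+2? Reply with exactly one short sentence."


def build_payload(prompt=PROMPT, max_tokens=200):
    """Greedy-decoding request body for the raw /v1/completions route."""
    return {
        "model": MODEL,
        "prompt": prompt,
        "temperature": 0,
        "max_tokens": max_tokens,
    }


def non_ascii_ratio(text):
    """Fraction of non-ASCII characters (hypothetical helper).

    For an English-only prompt this is a crude signal of corrupted output.
    """
    if not text:
        return 0.0
    return sum(ord(ch) > 127 for ch in text) / len(text)


def complete(payload, base_url=BASE_URL, timeout=120):
    """POST the payload and return the generated text of the first choice."""
    req = urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)["choices"][0]["text"]


if __name__ == "__main__":
    text = complete(build_payload())
    print(repr(text))
    print(f"non-ASCII ratio: {non_ascii_ratio(text):.2f}")
```

A healthy reply to this prompt should be plain ASCII, so a high ratio is a quick flag for the corruption described here. It will not catch ASCII-only degeneration such as the +++++ case in the linked issue.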
Model loads successfully:
- type=MiMoV2ForCausalLM
- quant=modelopt_mixed
- detects ModelOpt FP8 and NVFP4
- per-GPU memory ~85.7 GB
- FP8 KV profile boots
BF16/auto KV cache does not fit on 2×96 GB; SGLang reports a negative token budget.
PCIe oneshot allreduce also fails during CUDA graph capture at b12x/distributed/pcie_oneshot.cu:235, but disabling it allows the server to start. Output remains corrupted.