2x RTX P6000BW produces garbage

#11
by dfv157 - opened

Hi

Running this exactly as configured in the model card on 2x RTX Pro (with only the GPU device list adjusted for 2 GPUs), I get garbage output after about 100 tokens. It seems related to https://github.com/sgl-project/sglang/issues/24321

Is this issue being tracked in this project at all?

Same issue with TP=4, @lukealonso

I can reproduce a similar garbage-token correctness issue on 2× RTX PRO 6000 Blackwell with lukealonso/MiMo-V2.5-NVFP4 and docker.io/lukealonso/sglang-cuda13-b12x.

Unlike the +++++ degeneration reported in that SGL issue, my output is multilingual/random-token gibberish, but the failure mode is similar: deterministic decoding is corrupted.

Tested:

  • TP=2
  • no EAGLE / no speculative decoding
  • no PCIe oneshot allreduce
  • temperature=0
  • raw /v1/completions, not only chat completions
  • short prompt: “Answer in English only. What is 2+2? Reply with exactly one short sentence.”
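
For anyone who wants to check the same thing, here is a minimal sketch of the test above: a greedy (temperature=0) request to the raw /v1/completions endpoint with the short prompt, plus a crude check for non-English output. The base URL and port in the usage note, the max_tokens value, the helper names, and the 10% non-ASCII threshold are all my assumptions for illustration, not something taken from the model card.

```python
import json
import urllib.request

# Short prompt from the test list above.
PROMPT = "Answer in English only. What is 2+2? Reply with exactly one short sentence."


def build_payload(model, prompt=PROMPT, max_tokens=64):
    """Build a greedy-decoding request body for /v1/completions.

    max_tokens=64 is an arbitrary choice for this sketch.
    """
    return {
        "model": model,
        "prompt": prompt,
        "temperature": 0,
        "max_tokens": max_tokens,
    }


def non_ascii_ratio(text):
    """Fraction of characters outside the ASCII range."""
    if not text:
        return 0.0
    return sum(ord(c) > 127 for c in text) / len(text)


def looks_corrupted(text, threshold=0.1):
    """Crude heuristic (an assumption): an English-only answer should be
    almost entirely ASCII, so a high non-ASCII share suggests gibberish."""
    return non_ascii_ratio(text) > threshold


def complete(base_url, model):
    """POST the prompt to the OpenAI-compatible completions endpoint."""
    req = urllib.request.Request(
        base_url.rstrip("/") + "/v1/completions",
        data=json.dumps(build_payload(model)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        body = json.load(resp)
    return body["choices"][0]["text"]
```

To use it against a running server, something like `text = complete("http://localhost:30000", "lukealonso/MiMo-V2.5-NVFP4")` followed by `print(repr(text), looks_corrupted(text))` should work, assuming the server listens on that address. The check only flags the multilingual kind of corruption; it would not catch the +++++ degeneration from the SGLang issue, since that is plain ASCII.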

Example output:
“4โทรศัพ Featuringมั้ sabe彩(init ...”

Model loads successfully:

  • type=MiMoV2ForCausalLM
  • quant=modelopt_mixed
  • detects ModelOpt FP8 and NVFP4
  • per-GPU memory ~85.7 GB
  • FP8 KV profile boots

BF16/auto KV does not fit on 2×96GB; SGLang reports negative token budget.
PCIe oneshot allreduce also fails during CUDA graph capture at b12x/distributed/pcie_oneshot.cu:235, but disabling it allows the server to start. Output remains corrupted.
