2x RTX P6000BW produces garbage

#11

by dfv157 - opened 5 days ago

I am running this exactly as configured in the model card for 2x RTX Pro (with the GPU device list adjusted for 2 GPUs), and I get garbage output after about 100 tokens. It seems related to https://github.com/sgl-project/sglang/issues/24321

Is this issue being tracked in this project at all?

ebfio

4 days ago

•

edited 4 days ago

Same issue with TP=4, @lukealonso

PerPartes

2 days ago

I can reproduce a similar garbage-token correctness issue on 2× RTX PRO 6000 Blackwell with lukealonso/MiMo-V2.5-NVFP4 and docker.io/lukealonso/sglang-cuda13-b12x.

Unlike the +++++ degeneration reported in that SGLang issue, my output is multilingual/random-token gibberish, but the failure mode is similar: even deterministic (temperature=0) decoding is corrupted.

Tested:

TP=2
no EAGLE / no speculative decoding
no PCIe oneshot allreduce
temperature=0
raw /v1/completions, not only chat completions
short prompt: “Answer in English only. What is 2+2? Reply with exactly one short sentence.”

Example output:
“4โทรศัพ Featuringมั้ sabe彩(init ...”
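For anyone who wants to check their own setup, here is a minimal sketch of the test described above: a greedy request to the raw /v1/completions endpoint, plus a crude check for non-English output. The server address (`http://localhost:30000`), the served model name, `max_tokens`, and the 10% non-ASCII threshold are all assumptions on my side, not values from this thread; the helper names are hypothetical. Adjust them to your launch command.

```python
import json
import urllib.request

# Assumptions (not from the thread): server address and served model name.
BASE_URL = "http://localhost:30000"
MODEL = "lukealonso/MiMo-V2.5-NVFP4"

PROMPT = "Answer in English only. What is 2+2? Reply with exactly one short sentence."


def build_payload(prompt, max_tokens=64):
    """Build a greedy (temperature=0) request body for /v1/completions."""
    return {
        "model": MODEL,
        "prompt": prompt,
        "temperature": 0,
        "max_tokens": max_tokens,  # assumed value, long enough to show drift
    }


def non_ascii_ratio(text):
    """Fraction of characters outside ASCII; 0.0 for empty text."""
    if not text:
        return 0.0
    return sum(ord(ch) > 127 for ch in text) / len(text)


def looks_corrupted(text, threshold=0.1):
    """Crude heuristic: an English-only answer should be almost pure ASCII."""
    return non_ascii_ratio(text) > threshold


def complete(prompt):
    """POST to the OpenAI-compatible completions endpoint and return the text."""
    req = urllib.request.Request(
        BASE_URL + "/v1/completions",
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        body = json.load(resp)
    return body["choices"][0]["text"]
```

With a server running, `looks_corrupted(complete(PROMPT))` should be `False` on a healthy setup. The heuristic only catches the multilingual gibberish I see here; it would not flag an ASCII-only degeneration such as the +++++ case.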

Model loads successfully:

type=MiMoV2ForCausalLM
quant=modelopt_mixed
detects ModelOpt FP8 and NVFP4
per-GPU memory ~85.7 GB
FP8 KV profile boots

BF16/auto KV cache does not fit on 2×96 GB; SGLang reports a negative token budget.
PCIe oneshot allreduce also fails during CUDA graph capture at b12x/distributed/pcie_oneshot.cu:235. Disabling it allows the server to start, but the output remains corrupted.
