Docker image fails on 4x RTX PRO 6000 Blackwell: "MSA sparse backend v1 is bf16-KV only (got fp8_e4m3)"
Hi, thank you for publishing this quant and the ready-to-run b12x image.
I am trying to run brandonmusic/MiniMax-M3-NVFP4 on a 4x NVIDIA RTX PRO 6000 Blackwell Server Edition machine.
Model path:
/data/models/MiniMax-M3-NVFP4
Docker image:
verdictai/minimax-m3-nvfp4-b12x:v1
Command used:
docker run --rm \
  --name minimax-m3-nvfp4 \
  --gpus '"device=0,1,2,3"' \
  --shm-size 32g \
  -p 8080:9211 \
  -e TP=4 \
  -e MAXLEN=131072 \
  -e GPU_UTIL=0.90 \
  -e MAXSEQS=8 \
  -v /data/models/MiniMax-M3-NVFP4:/model:ro \
  verdictai/minimax-m3-nvfp4-b12x:v1
The model seems to be detected correctly and the JIT pre-warm completes:
=== MiniMax-M3 native b12x image ===
model=/model tp=4 util=0.90 maxlen=131072 maxseqs=8 graphcap=16 port=9211
warm_jit: fused M3 swigluoai compiled
warm_jit: sparse_topk_select compiled
warm_jit: msa_prefill pipeline compiled
warm_jit: device-resident decode indexer compiled
warm_jit: sparse_decode_fused compiled
warm_jit: fused MSA KV insert compiled
warm_jit: native fused MSA block indexer compiled
warm_jit: DONE
Resolved architecture: MiniMaxM3SparseForConditionalGeneration
However, startup fails during worker initialization with:
NotImplementedError: MSA sparse backend v1 is bf16-KV only (got fp8_e4m3); fp8 KV is a follow-up.
The log also shows:
Using fp8_e4m3 data type to store kv cache.
I tried explicitly passing:
-e KV_CACHE_DTYPE=bfloat16
but it did not change the behavior.
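To check which KV cache dtype actually took effect after a change like this, the log line quoted above can be extracted from the container output. This is only a sketch: `kv_dtype_from_log` is a hypothetical helper name, and it assumes the log line keeps the exact wording shown in this thread.

```shell
# kv_dtype_from_log: read server log text on stdin and print the KV cache
# dtype it reports. Matches the line format quoted in this thread:
#   "Using fp8_e4m3 data type to store kv cache."
kv_dtype_from_log() {
  sed -n 's/.*Using \([A-Za-z0-9_]*\) data type to store kv cache.*/\1/p'
}

# Example with the line from the log above; prints: fp8_e4m3
echo "Using fp8_e4m3 data type to store kv cache." | kv_dtype_from_log
```

Against a running container this could be used as `docker logs minimax-m3-nvfp4 2>&1 | kv_dtype_from_log`. Empty output only means no matching line was found; it does not by itself prove which dtype is in use.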
I inspected the container and found that /opt/m3/entrypoint.sh hardcodes:
--block-size 128 --kv-cache-dtype auto
Relevant grep output:
/opt/m3/entrypoint.sh:28: --block-size 128 --kv-cache-dtype auto
/opt/m3/serving/boot_m3.sh:87: --block-size 128 --kv-cache-dtype auto
/opt/m3/serving/BOOT_PLAN.md:81:- --kv-cache-dtype auto — the config's kv_cache_scheme is fp8, but there
/opt/m3/serving/msa_backend.py:755: supported_kv_cache_dtypes = ["auto", "bfloat16"]
/opt/m3/serving/msa_backend.py:797: supported_kv_cache_dtypes = ["auto", "bfloat16"]
/opt/m3/serving/msa_backend.py:1303: if kv_cache_dtype not in ("auto", "bfloat16"):
/opt/m3/serving/msa_backend.py:1305: f"MSA sparse backend v1 is bf16-KV only (got {kv_cache_dtype}); "
It looks like --kv-cache-dtype auto resolves to fp8_e4m3 because the model config has an fp8 KV cache scheme, but the current MSA sparse backend supports only BF16 KV.
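One way to confirm that guess is to look for the `kv_cache_scheme` key (the name mentioned in BOOT_PLAN.md) in the model directory. The sketch below assumes the key lives in one of the top-level JSON files of the model folder; I have not verified which file, and `kv_scheme_files` is a hypothetical helper name.

```shell
# kv_scheme_files DIR: list the JSON files directly under DIR that mention
# "kv_cache_scheme". Prints nothing if no file matches or DIR has no JSON.
kv_scheme_files() {
  grep -l 'kv_cache_scheme' "$1"/*.json 2>/dev/null || true
}
```

For the setup in this thread the call would be `kv_scheme_files /data/models/MiniMax-M3-NVFP4`; running `grep -n -A5 kv_cache_scheme` on any file it lists would then show the declared scheme.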
Could you please clarify the intended way to run this image?
Specific questions:
1. Should --kv-cache-dtype be forced to bfloat16 for the current v1 image?
2. Is there an environment variable supported by the entrypoint to override --kv-cache-dtype?
3. Should the model config be patched to remove/disable the fp8 KV cache scheme?
4. Is there a newer Docker tag than verdictai/minimax-m3-nvfp4-b12x:v1 that fixes this?
5. Could you provide the exact command that you used successfully on 4x RTX PRO 6000 / SM120?
Thank you.
Update: I was able to start the model with this workaround:
sed -i "s/--kv-cache-dtype auto/--kv-cache-dtype bfloat16/g" /opt/m3/entrypoint.sh /opt/m3/serving/boot_m3.sh
The full command works with:
TP=4
MAXLEN=196000
GPU_UTIL=0.93
MAXSEQS=4
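Because the container is started with --rm, an in-place sed inside it does not survive a restart. A possible way to keep the workaround is to patch copies of the two scripts on the host and bind-mount them over the originals. This is an untested sketch: `patch_kv_dtype` and `extract_and_patch` are hypothetical helper names, and it assumes the two script paths from the grep output above are the only places that need the change.

```shell
# patch_kv_dtype FILE...: replace the hardcoded "auto" with "bfloat16",
# using the same sed expression as the workaround above (GNU sed -i).
patch_kv_dtype() {
  sed -i 's/--kv-cache-dtype auto/--kv-cache-dtype bfloat16/g' "$@"
}

# extract_and_patch IMAGE OUTDIR: copy both scripts out of the image
# (without starting it) and patch the host-side copies.
extract_and_patch() {
  image="$1"; outdir="$2"
  cid=$(docker create "$image") || return 1
  for f in /opt/m3/entrypoint.sh /opt/m3/serving/boot_m3.sh; do
    docker cp "$cid:$f" "$outdir/$(basename "$f")" \
      || { docker rm "$cid" >/dev/null; return 1; }
  done
  docker rm "$cid" >/dev/null
  patch_kv_dtype "$outdir/entrypoint.sh" "$outdir/boot_m3.sh"
}
```

After `extract_and_patch verdictai/minimax-m3-nvfp4-b12x:v1 "$PWD"`, the docker run command would gain two extra mounts: `-v "$PWD/entrypoint.sh:/opt/m3/entrypoint.sh:ro"` and `-v "$PWD/boot_m3.sh:/opt/m3/serving/boot_m3.sh:ro"`. This only makes the workaround repeatable; it says nothing about whether bfloat16 KV is the validated configuration.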
However, I’m not sure whether this workaround preserves the intended model quality. Since the image defaults to --kv-cache-dtype auto, and auto resolves to fp8_e4m3, forcing bfloat16 may change the intended runtime behavior.
Could you confirm whether using --kv-cache-dtype bfloat16 is the correct/validated configuration for this quant, or whether it may affect output quality compared to your expected setup?
There are updated suggested images on GitHub under "local inference lab" for the RTX PRO 6000. Or see the other comment where we discussed a similar issue:
services:
  sglang-mm:
    image: scitrera/dgx-spark-sglang-mm:v0
    restart: unless-stopped
    container_name: minimax-m3
    entrypoint: ["/opt/nvidia/nvidia_entrypoint.sh"]
    ports:
      - "9060:8000"
    volumes:
      - /mnt/data/nvme1n1/models/MiniMax-M3-NVFP4:/models/MiniMax-M3-NVFP4:ro
    environment:
      - TORCH_NCCL_ASYNC_ERROR_HANDLING=1
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 4
              capabilities: [gpu]
    shm_size: "32gb"
    command: >
      sglang serve
      --model-path /models/MiniMax-M3-NVFP4
      --served-model-name minimax-m3
      --trust-remote-code
      --mem-fraction-static "0.74"
      --context-length "196608"
      --tp "4"
      --chunked-prefill-size "8192"
      --fp4-gemm-backend cutlass
      --load-format instanttensor
      --kv-cache-dtype auto
      --reasoning-parser minimax-m3
      --tool-call-parser minimax-m3
      --moe-runner-backend triton
      --cuda-graph-max-bs "16"
      --cuda-graph-bs "1" "2" "4" "8" "12" "16"
      --max-running-requests 16
      --allow-auto-truncate
      --host "0.0.0.0"
      --port "8000"
      --watchdog-timeout 600
b12x works with fp8 cache just fine.