Running models with vLLM on the RTX Pro 6000 - SM120

#28
by liku2001 - opened

I plan to follow https://zhuanlan.zhihu.com/p/2031484558114337285 and https://github.com/deepseek-ai/DeepGEMM/pull/318, and I will try to verify this tomorrow.
So does official vLLM support SM120 now? I really don't want to build vLLM from source.

Excited to see your verification results! SM120 support is exactly what I've been waiting for.

Following the article, and after a full day of tough trial and error, I successfully compiled vLLM, launched DeepSeek, and completed interactive conversations. Unfortunately, the performance fell short of what the article states: I only get 5 tokens per second on my setup with 8 RTX Pro 6000 GPUs.

Prior to this, I tried launching with the latest vLLM version, v0.20.2, but it reported that FlashAttention could not be found. Mimo v2.5 also failed to start on the SM120 GPU. I may switch to SGLang for testing tomorrow.

4x RTX PRO 6000 - sglang - 400k total context - 105 peak tok/s decode

https://github.com/0xSero/deepseek-v4-flash-sm120

DeepSeek-V4-Flash Deployment Guide for SM120 (RTX PRO 6000 Blackwell)

Environment Information

  • GPU: 8× NVIDIA RTX PRO 6000 Blackwell Server Edition (SM120, compute capability 12.0)
  • CUDA: 12.8+
  • Python: 3.10
  • Framework: vLLM (jasl/ds4-sm120-preview branch)
  • Model: DeepSeek-V4-Flash (FP8)

1. Environment Preparation

# Navigate to the working directory
cd /home/guest/vllm-sm120-git-dir/vllm

# Activate the virtual environment
source .venv/bin/activate

# Verify the current branch
git branch --show-current
# Expected output: ds4-sm120-preview

Critical Note: You must use the ds4-sm120-preview branch. The sm120-full branch lacks the SM120 Triton fallback code required for deployment.


2. Compile vLLM

# Set the DeepGEMM source path (jasl's fork with SM120 support)
export DEEPGEMM_SRC_DIR=/home/guest/vllm-sm120-git-dir/DeepGEMM

# Compile and install vLLM
MAX_JOBS=64 pip install --no-build-isolation -e . --verbose

3. Start the Service

Recommended Startup Command

source .venv/bin/activate

CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
  VLLM_TRITON_MLA_SPARSE=1 \
  VLLM_TRITON_MLA_SPARSE_TOPK_CHUNK_SIZE=256 \
  VLLM_TRITON_MLA_SPARSE_QUERY_CHUNK_SIZE=128 \
  VLLM_TRITON_MLA_SPARSE_ALLOW_CUDAGRAPH=1 \
  nohup vllm serve /root/.cache/modelscope/hub/models/deepseek-ai/DeepSeek-V4-flash \
   --trust-remote-code \
   --kv-cache-dtype fp8 \
   --block-size 256 \
   --tensor-parallel-size 8 \
   --tokenizer-mode deepseek_v4 \
   --tool-call-parser deepseek_v4 \
   --enable-auto-tool-choice \
   --reasoning-parser deepseek_v4 \
   --host 0.0.0.0 --port 8005 \
   --served-model-name DeepSeek-V4-flash \
   --gpu-memory-utilization 0.93 \
   --max-num-seqs 4 \
   --max-model-len 262144 \
   --async-scheduling \
   --enable-prefix-caching \
   --load-format auto \
   --pipeline-parallel-size 1 \
   --enable-expert-parallel \
   --compilation-config '{"cudagraph_mode":"FULL_AND_PIECEWISE"}' > /tmp/vllm_server.log 2>&1 &

Environment Variable Explanations

| Variable | Function | SM120 Default Value |
| --- | --- | --- |
| VLLM_TRITON_MLA_SPARSE=1 | Enables Triton sparse MLA attention | Auto-detected (enabled by default for SM120) |
| VLLM_TRITON_MLA_SPARSE_TOPK_CHUNK_SIZE=256 | TopK chunk size | 512 |
| VLLM_TRITON_MLA_SPARSE_QUERY_CHUNK_SIZE=128 | Query chunk size | 256 |
| VLLM_TRITON_MLA_SPARSE_ALLOW_CUDAGRAPH=1 | Enables CUDA graphs | 0 (disabled by default) |

These environment variables are INVALID for SM120 – DO NOT use:

  • VLLM_SM120_REFERENCE_DEEPSEEK_V4_ATTENTION — Non-existent
  • VLLM_SM120_REFERENCE_TOPK_CHUNK_SIZE — Non-existent
  • VLLM_SM120_REFERENCE_QUERY_CHUNK_SIZE — Non-existent
  • VLLM_ATTENTION_BACKEND=FLASH_ATTN_2 — Invalid value; SM120 automatically uses TRITON_MLA
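
The two lists above can be turned into a small pre-launch check. This is only a minimal sketch and not part of vLLM: the function name check_sm120_env and the message wording are hypothetical, while the variable names and values are copied from the lists above.

```python
# Hypothetical pre-launch check; the helper name and messages are illustrative only.
# Variable names and values are copied from the guide above.
RECOMMENDED = {
    "VLLM_TRITON_MLA_SPARSE": "1",
    "VLLM_TRITON_MLA_SPARSE_TOPK_CHUNK_SIZE": "256",
    "VLLM_TRITON_MLA_SPARSE_QUERY_CHUNK_SIZE": "128",
    "VLLM_TRITON_MLA_SPARSE_ALLOW_CUDAGRAPH": "1",
}
NONEXISTENT = {
    "VLLM_SM120_REFERENCE_DEEPSEEK_V4_ATTENTION",
    "VLLM_SM120_REFERENCE_TOPK_CHUNK_SIZE",
    "VLLM_SM120_REFERENCE_QUERY_CHUNK_SIZE",
}


def check_sm120_env(env):
    """Return a list of human-readable problems found in an environment mapping."""
    problems = []
    # Variables the guide lists as non-existent
    for name in sorted(NONEXISTENT & set(env)):
        problems.append(f"{name} is listed as non-existent; remove it")
    # The guide lists this backend value as invalid on SM120
    if env.get("VLLM_ATTENTION_BACKEND") == "FLASH_ATTN_2":
        problems.append("VLLM_ATTENTION_BACKEND=FLASH_ATTN_2 is invalid on SM120")
    # Without this opt-in, the guide reports that CUDA graphs stay disabled
    if env.get("VLLM_TRITON_MLA_SPARSE_ALLOW_CUDAGRAPH") != "1":
        problems.append("VLLM_TRITON_MLA_SPARSE_ALLOW_CUDAGRAPH=1 is not set")
    return problems
```

Calling check_sm120_env(os.environ) in the shell that will start the server prints nothing to fix when the recommended variables are exported and none of the invalid ones are present.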

4. Testing

Chat Functionality Test

curl -s http://localhost:8005/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "DeepSeek-V4-flash",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 100,
    "temperature": 0.7
  }' | python3 -m json.tool
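
For scripted checks, the same request can be built from Python with only the standard library. This is a sketch under the assumption that the server exposes the OpenAI-compatible endpoint used in the curl command above; the helper names build_chat_request and send_chat_request are hypothetical.

```python
import json
import urllib.request


def build_chat_request(base_url, model, prompt, max_tokens=100, temperature=0.7):
    """Build (but do not send) a POST request for /v1/chat/completions."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat_request(request, timeout=120):
    """Send the request and return the assistant message text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server from step 3 running, send_chat_request(build_chat_request("http://localhost:8005", "DeepSeek-V4-flash", "Hello!")) should return the reply text.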

Streaming Performance Test

curl -s http://localhost:8005/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "DeepSeek-V4-flash",
    "messages": [{"role": "user", "content": "Write a detailed paragraph about AI."}],
    "max_tokens": 512,
    "temperature": 0.7,
    "stream": true
  }' 2>/dev/null | python3 -c "
import json, sys, time
start = time.time()
tokens = 0
for line in sys.stdin:
    # Server-sent events: payload lines start with 'data: '; 'data: [DONE]' ends the stream
    if line.startswith('data: ') and line.strip() != 'data: [DONE]':
        data = json.loads(line[6:])
        choices = data.get('choices', [])
        if choices and choices[0].get('delta', {}).get('content'):
            # Counts streamed content chunks, which approximates the token count
            tokens += 1
            sys.stdout.write(choices[0]['delta']['content'])
            sys.stdout.flush()
# Elapsed time includes prefill (time to first token), not only decode
elapsed = time.time() - start
print(f'\n\n--- Performance ---')
print(f'Generated {tokens} tokens in {elapsed:.2f}s')
print(f'Throughput: {tokens/elapsed:.2f} tok/s')
"

5. Actual Performance Results

| Configuration | Throughput |
| --- | --- |
| No CUDA graphs + invalid env vars | ~5 tok/s |
| CUDA graphs enabled + correct env vars | 30-35 tok/s |

Key Optimization Points

  1. CUDA graphs are the most critical optimization: VLLM_TRITON_MLA_SPARSE_ALLOW_CUDAGRAPH=1 must be set. Otherwise, sparse_mla_env.py will forcibly disable CUDA graphs on SM120.
  2. DeepGEMM must NOT be enabled on SM120: support_deep_gemm() must only return True for SM90/SM100. While DeepGEMM imports successfully, the transform_sf_into_required_layout function in csrc/apis/layout.hpp only supports arch_major 9 and 10; arch_major 12 triggers a DG_HOST_UNREACHABLE("Unknown SF transformation") error and crashes model loading.
  3. SM120 attention path: implemented via Triton/PyTorch fallbacks in sm12x_deep_gemm_fallbacks.py and sm12x_mqa.py, independent of DeepGEMM-compiled CUDA kernels.

Known Performance Limitations

| Limitation | Root Cause |
| --- | --- |
| Custom allreduce disabled | 8 PCIe GPUs without NVLink (world_size > 2 && !fully_connected) |
| SymmMemCommunicator unavailable | SM120 (12.0) is not in the supported list |
| W8A8 Block FP8 uses default config | No optimized profile for NVIDIA_RTX_PRO_6000_Blackwell_Server_Edition |
| Communication via NCCL over PCIe | Limited cross-GPU communication bandwidth without NVLink |

6. Troubleshooting

CUDA Graphs Disabled

Log warning:

WARNING sparse_mla_env.py:101 Disabling CUDA graphs for the DeepSeek V4 Triton
sparse MLA path by default. Set VLLM_TRITON_MLA_SPARSE_ALLOW_CUDAGRAPH=1 to opt
into the experimental graph-captured path. vLLM compile remains enabled.

Solution: Set VLLM_TRITON_MLA_SPARSE_ALLOW_CUDAGRAPH=1

Unknown SF Transformation Error

Log error:

RuntimeError: Assertion error (csrc/apis/layout.hpp:59): Unknown SF transformation

Cause: support_deep_gemm() returns True for SM120, but DeepGEMM's C++ module does not support SM120's scale factor format conversion.

Solution: Ensure support_deep_gemm() in vllm/platforms/cuda.py excludes SM120:

@classmethod
def support_deep_gemm(cls) -> bool:
    # Enable DeepGEMM only on SM90 and the SM100 family; SM120 is deliberately excluded
    return (cls.is_device_capability(90)
            or cls.is_device_capability_family(100))
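
The effect of that gate can be illustrated on plain (major, minor) capability tuples. This is a standalone sketch, not vLLM code: deep_gemm_allowed is a hypothetical name, and it assumes that is_device_capability(90) means an exact match on 9.0 while is_device_capability_family(100) matches any 10.x device.

```python
def deep_gemm_allowed(major: int, minor: int) -> bool:
    """Hypothetical mirror of the gate above: SM90 exactly, or any SM10x device."""
    return (major, minor) == (9, 0) or major == 10


# Capabilities named in this thread: SM90, SM100 and SM120 (RTX PRO 6000 Blackwell)
for capability in [(9, 0), (10, 0), (12, 0)]:
    status = "DeepGEMM" if deep_gemm_allowed(*capability) else "Triton/PyTorch fallback"
    print(f"SM{capability[0]}{capability[1]}: {status}")
```

On a machine with PyTorch installed, torch.cuda.get_device_capability() returns the (major, minor) tuple for the current device, which can be passed straight to this helper.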

Unreleased GPU Memory

ValueError: Free memory on device cuda:0 (2.21/94.97 GiB) on startup is less
than desired GPU memory utilization (0.93, 88.32 GiB).

Solution: Kill residual processes and retry:

ps aux | grep "vllm serve" | grep -v grep | awk '{print $2}' | xargs kill -9
# Or check with nvidia-smi and wait for memory release
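
The error above compares free memory against gpu-memory-utilization × total memory (0.93 × 94.97 GiB ≈ 88.32 GiB). The same comparison can be run before starting the server. This is a minimal sketch: the helper names are hypothetical, and it assumes nvidia-smi is on PATH and reports memory.free and memory.total in MiB.

```python
import subprocess


def parse_gpu_memory(csv_text):
    """Parse nvidia-smi CSV rows 'free, total' (MiB) into a list of (free, total) ints."""
    rows = []
    for line in csv_text.strip().splitlines():
        free, total = (int(field.strip()) for field in line.split(","))
        rows.append((free, total))
    return rows


def gpus_below_target(rows, utilization):
    """Return row indices of GPUs whose free memory is below utilization * total."""
    return [i for i, (free, total) in enumerate(rows) if free < utilization * total]


def query_gpu_memory():
    """Ask nvidia-smi for free and total memory per GPU (one CSV row per GPU)."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.free,memory.total",
         "--format=csv,noheader,nounits"],
        check=True, capture_output=True, text=True,
    )
    return parse_gpu_memory(result.stdout)
```

For example, gpus_below_target(query_gpu_memory(), 0.93) returns an empty list once every GPU has enough free memory for --gpu-memory-utilization 0.93.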

Working with Claude Code to modify the source code, I deployed DeepSeek using the third-party developer's repository from the article. The throughput has now increased from 5 to 30 tokens per second, close to the normal level.

I will try SGLang next; I am downloading the FP8 version of the DeepSeek weights for SGLang.

Claude Code is good!!

DeepSeek-V4-Flash-FP8 Deployment and Performance Test Report

Environment

| Item | Specification |
| --- | --- |
| GPU | 8× NVIDIA RTX PRO 6000 Blackwell Server Edition (SM120) |
| VRAM | ~96 GB per card |
| Model | sgl-project/DeepSeek-V4-Flash-FP8 (274 GB, 46 shards) |
| Inference Engine | SGLang (lmsysorg/sglang:deepseek-v4-blackwell) |
| Patch | deepseek-v4-flash-sm120 (SM120 FlashMLA sparse-decode patch) |
| Container | Docker 29.4.3 |

Deployment Steps

1. Download Model Weights

# Download via ModelScope (completed in advance)
# Path: Local model cache directory/models/sgl-project/DeepSeek-V4-Flash-FP8/
# Total 46 safetensors shards, 274 GB in total

2. Clone SM120 Patch Repository

git clone https://github.com/0xSero/deepseek-v4-flash-sm120.git
cd deepseek-v4-flash-sm120
git submodule update --init --recursive

3. Pull SGLang Image

docker pull lmsysorg/sglang:deepseek-v4-blackwell

4. Compile SM120 CUDA Extension

# Compile inside SGLang container; outputs saved to build-docker/
scripts/build_in_sglang_docker.sh

Build Artifacts:

  • build-docker/sitecustomize.py
  • build-docker/deepseek_v4_kernel/cuda.cpython-312-x86_64-linux-gnu.so
  • build-docker/deepseek_v4_kernel/_patch.py
  • etc.

5. Launch Service

docker run \
  --name sglang-dsv4 \
  --gpus all \
  --privileged \
  --shm-size=64g \
  --ipc=host \
  --ulimit memlock=-1 --ulimit stack=67108864 \
  --network host \
  -v /custom-model-dir:/workspace/model:ro \
  -v /patch-project-dir/build-docker:/dsv4:ro \
  -e PYTHONPATH=/dsv4 \
  -e CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
  -e SGLANG_ENABLE_THINKING=1 \
  -e SGLANG_REASONING_EFFORT=max \
  lmsysorg/sglang:deepseek-v4-blackwell \
  python3 -m sglang.launch_server \
    --model-path /workspace/model \
    --host 0.0.0.0 \
    --port 8000 \
    --served-model-name deepseek-v4-flash \
    --trust-remote-code \
    --tensor-parallel-size 8 \
    --context-length 393216 \
    --mem-fraction-static 0.85 \
    --max-running-requests 16 \
    --kv-cache-dtype fp8_e4m3 \
    --tool-call-parser deepseekv4 \
    --reasoning-parser deepseek-v4 \
    --attention-backend compressed \
    --fp8-gemm-backend triton \
    --moe-runner-backend triton \
    --chunked-prefill-size 8192 \
    --watchdog-timeout 3600 \
    --page-size 256 \
    --speculative-algorithm EAGLE \
    --speculative-num-steps 1 \
    --speculative-eagle-topk 1 \
    --speculative-num-draft-tokens 2 \
    --speculative-attention-mode decode \
    --cuda-graph-max-bs 32 \
    --enable-return-routed-experts

6. Service Validation

curl -s http://127.0.0.1:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"deepseek-v4-flash","temperature":0,"max_tokens":32,
       "messages":[{"role":"user","content":"Say OK only."}]}'

Key Parameter Description

| Parameter | Value | Description |
| --- | --- | --- |
| tensor-parallel-size | 8 | 8-card tensor parallelism |
| context-length | 393216 | Maximum context length |
| kv-cache-dtype | fp8_e4m3 | FP8 precision for KV cache |
| speculative-algorithm | EAGLE | Speculative decoding acceleration |
| speculative-num-draft-tokens | 2 | 2 draft tokens per step |
| cuda-graph-max-bs | 32 | Max batch size for CUDA Graph |
| page-size | 256 | Page size of KV cache |
| mem-fraction-static | 0.85 | Static memory allocation ratio |
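
A rough per-GPU memory budget follows from these parameters and the environment figures. This is a back-of-the-envelope sketch only: it assumes the 274 GB of weights shard evenly across the 8 tensor-parallel ranks and that the static fraction covers both the weights and the KV cache pool; it ignores CUDA graph buffers, activations, and the speculative-decoding draft module. The function name is hypothetical.

```python
def per_gpu_budget_gb(vram_gb, mem_fraction_static, model_size_gb, tp_size):
    """Estimate (static pool, weight shard, remainder for KV cache) per GPU in GB."""
    static_pool = vram_gb * mem_fraction_static   # memory reserved for static allocation
    weights = model_size_gb / tp_size             # assumes an even split across TP ranks
    return static_pool, weights, static_pool - weights


# Figures from the environment and parameter tables above
static_pool, weights, remainder = per_gpu_budget_gb(96, 0.85, 274, 8)
print(f"static pool: {static_pool:.1f} GB")
print(f"weight shard: {weights:.2f} GB")
print(f"left for KV cache: {remainder:.2f} GB")
```

Under these assumptions roughly 47 GB per GPU remains for the KV cache pool, which is the headroom behind the large context-length setting.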

Performance Test Results

Chat Scenario Benchmark

| Scenario | Input Tok | Output Tok | Latency | Decode Speed |
| --- | --- | --- | --- | --- |
| Simple Q&A | 11 | 55 | 1.47s | 37.4 tok/s |
| Chinese Writing | 13 | 199 | 4.26s | 46.7 tok/s |
| Code Generation | 16 | 256 | 5.58s | 45.8 tok/s |
| Logical Reasoning | 60 | 256 | 6.60s | 38.8 tok/s |
| Multi-turn Dialogue | 35 | 256 | 5.19s | 49.4 tok/s |

In short-context chat scenarios, the decoding speed ranges roughly from 37 to 49 tok/s with good output quality.

Context Length vs Throughput

| Target Context | Actual Input Tok | Output Tok | Total Latency | Prefill Speed (Overall) |
| --- | --- | --- | --- | --- |
| 4K | 4,014 | 60 | 10.68s | 376 tok/s |
| 8K | 8,014 | 57 | 5.65s | 1,419 tok/s |
| 16K | 16,014 | 52 | 7.99s | 2,003 tok/s |
| 32K | 32,014 | 57 | 16.52s | 1,938 tok/s |

The optimal Prefill speed reaches ~2000 tok/s at 16K–32K context length.
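
The "Prefill Speed (Overall)" column appears to be input tokens divided by total latency, so it also includes the time spent decoding the ~55 output tokens. That reading is inferred from the numbers and is not stated in the report; the sketch below simply recomputes the column under that assumption.

```python
# (target context, input tokens, total latency in seconds, reported tok/s) from the table above
ROWS = [
    ("4K", 4014, 10.68, 376),
    ("8K", 8014, 5.65, 1419),
    ("16K", 16014, 7.99, 2003),
    ("32K", 32014, 16.52, 1938),
]


def overall_prefill_speed(input_tokens, total_latency_s):
    """Assumed definition: input tokens processed per second of end-to-end latency."""
    return input_tokens / total_latency_s


for label, input_tokens, latency, reported in ROWS:
    computed = overall_prefill_speed(input_tokens, latency)
    print(f"{label}: computed {computed:.0f} tok/s, reported {reported} tok/s")
```

Each recomputed value lands within about 1% of the reported figure, which is consistent with rounding of the latency values.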

EAGLE Speculative Decoding

| Round | Speed |
| --- | --- |
| 1 | 49.9 tok/s |
| 2 | 44.8 tok/s |
| 3 | 41.1 tok/s |
| 4 | 45.7 tok/s |
| 5 | 46.6 tok/s |
| Average | 45.6 tok/s |

With 2 draft tokens enabled, EAGLE speculative decoding delivers a stable average speed of ~45 tok/s in short dialogue scenarios.
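
The average can be rechecked from the five rounds, along with how far the rounds spread around it. A small sketch using only the numbers in the table above:

```python
from statistics import mean

# Decode speed per round in tok/s, copied from the table above
rounds = [49.9, 44.8, 41.1, 45.7, 46.6]

average = mean(rounds)
spread = (max(rounds) - min(rounds)) / average  # min-to-max range relative to the mean

print(f"average: {average:.1f} tok/s")
print(f"min-max spread: {spread:.0%} of the average")
```

The fastest and slowest rounds differ by 8.8 tok/s, so a single run is only a rough indicator; averaging several rounds, as done here, gives the more reliable figure.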

Issues & Notes

SM120 Patch Mounting

The built-in FlashMLA in official SGLang images only supports SM90/SM100. The SM120 patch must be injected via:

  • Read-only mount of compiled artifacts: -v ./build-docker:/dsv4:ro
  • Add patch to Python path: -e PYTHONPATH=/dsv4

Successful validation log:
deepseek_v4_kernel.patch_flash_mla installed (device SM 12.0)

W8A8 / MoE Performance Warning

The following warning repeatedly appears in startup logs:

Using default W8A8 Block FP8 kernel config. Performance might be sub-optimal!
Using default MoE kernel config. Performance might be sub-optimal!

RTX PRO 6000 Blackwell is new hardware with no official optimized preset config integrated into SGLang yet. It does not affect inference correctness, only peak token throughput in batch decoding. Custom optimized configs can be generated via benchmarking for further tuning.

Reasoning Output Format

When SGLANG_ENABLE_THINKING=1 is enabled, model reasoning content is wrapped in specific tags. Currently the reasoning_content field is not parsed separately, and reasoning text is mixed into the standard content field. Further updates to SGLang tokenizer are required to fix this.
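
Until the server separates the two fields, a client can split them itself. The sketch below is a workaround under a loud assumption: the report does not name the tags, so <think>...</think> is used here purely as a placeholder and must be replaced with whatever tags actually appear in the output. The function name split_reasoning is hypothetical.

```python
import re

# ASSUMPTION: reasoning is wrapped in <think>...</think>; the report does not name the tags.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(content):
    """Return (reasoning, answer) from a content string with inline reasoning tags."""
    # Collect every tagged span as reasoning text
    reasoning = "\n".join(part.strip() for part in THINK_RE.findall(content))
    # Whatever remains after removing the tagged spans is the visible answer
    answer = THINK_RE.sub("", content).strip()
    return reasoning, answer
```

Applied to the content field of a chat completion, this yields the reasoning text and the final answer as separate strings; content without tags comes back unchanged as the answer.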

Long Context Decode Performance

Decode speed drops significantly to 3–5 tok/s at 64K+ context length. The main cause and the optimizations still outstanding:

  1. The current SM120 sparse-decode patch prioritizes functional correctness over raw performance.
  2. Split-KV / multi-CTA sparse decode kernels are not implemented yet and are left as a future optimization.
  3. Dedicated tile configuration tuning for W8A8/MoE on RTX PRO 6000 Blackwell has not been done yet.

Conclusion

DeepSeek-V4-Flash-FP8 is successfully deployed and operational on 8× RTX PRO 6000 Blackwell. It delivers 40–50 tok/s decoding speed for daily chat workloads and up to 2000 tok/s Prefill throughput for long contexts. The main bottleneck lies in long-context decoding performance, which requires further low-level kernel optimization.

Where is the branch?

As described in the article https://zhuanlan.zhihu.com/p/2031484558114337285 and in https://github.com/deepseek-ai/DeepGEMM/pull/318, the branch comes from:

git clone https://github.com/jasl/vllm.git
cd vllm

git remote add jasl https://github.com/jasl/vllm.git

git fetch jasl

git checkout -b ds4-sm120 jasl/ds4-sm120

However, I recommend using SGLang; it is easier.
