Instructions to use deepseek-ai/DeepSeek-V4-Flash with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use deepseek-ai/DeepSeek-V4-Flash with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="deepseek-ai/DeepSeek-V4-Flash")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-V4-Flash")
model = AutoModelForCausalLM.from_pretrained("deepseek-ai/DeepSeek-V4-Flash")

Local Apps

vLLM

How to use deepseek-ai/DeepSeek-V4-Flash with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "deepseek-ai/DeepSeek-V4-Flash"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "deepseek-ai/DeepSeek-V4-Flash",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/deepseek-ai/DeepSeek-V4-Flash

SGLang

How to use deepseek-ai/DeepSeek-V4-Flash with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "deepseek-ai/DeepSeek-V4-Flash" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "deepseek-ai/DeepSeek-V4-Flash",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "deepseek-ai/DeepSeek-V4-Flash" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "deepseek-ai/DeepSeek-V4-Flash",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use deepseek-ai/DeepSeek-V4-Flash with Docker Model Runner:
```
docker model run hf.co/deepseek-ai/DeepSeek-V4-Flash
```

Running models with vLLM on the RTX Pro 6000 - SM120

#28

by liku2001 - opened 9 days ago

Discussion

liku2001

9 days ago

I plan to follow https://zhuanlan.zhihu.com/p/2031484558114337285 and https://github.com/deepseek-ai/DeepGEMM/pull/318, and I will try to verify this tomorrow.
So does the official vLLM release support SM120 now? I really don't want to build vLLM from source.

fakecoder

8 days ago

Excited to see your verification results! SM120 support is exactly what I've been waiting for.

liku2001

7 days ago

Following the article, after a full day of tough trial and error, I successfully compiled vLLM, launched DeepSeek, and completed interactive conversations. Unfortunately, the performance fell short of what the article states: I only get 5 tokens per second on my setup with 8 RTX Pro 6000 GPUs.

liku2001

7 days ago

Before this, I tried launching with the latest vLLM version, v0.20.2, but it reported that FlashAttention could not be found. Mimo v2.5 also failed to start on the SM120 GPUs. I may switch to SGLang for testing tomorrow.

0xSero

7 days ago

4x RTX PRO 6000 - sglang - 400k total context - 105 peak tok/s decode

https://github.com/0xSero/deepseek-v4-flash-sm120

liku2001

6 days ago

edited 6 days ago

DeepSeek-V4-Flash Deployment Guide for SM120 (RTX PRO 6000 Blackwell)

Environment Information

GPU: 8× NVIDIA RTX PRO 6000 Blackwell Server Edition (SM120, compute capability 12.0)
CUDA: 12.8+
Python: 3.10
Framework: vLLM (jasl/ds4-sm120-preview branch)
Model: DeepSeek-V4-Flash (FP8)

1. Environment Preparation

# Navigate to the working directory
cd /home/guest/vllm-sm120-git-dir/vllm

# Activate the virtual environment
source .venv/bin/activate

# Verify the current branch
git branch --show-current
# Expected output: ds4-sm120-preview

Critical Note: You must use the ds4-sm120-preview branch. The sm120-full branch lacks the SM120 Triton fallback code required for deployment.

2. Compile vLLM

# Set the DeepGEMM source path (jasl's fork with SM120 support)
export DEEPGEMM_SRC_DIR=/home/guest/vllm-sm120-git-dir/DeepGEMM

# Compile and install vLLM
MAX_JOBS=64 pip install --no-build-isolation -e . --verbose

3. Start the Service

Recommended Startup Command

source .venv/bin/activate

CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
  VLLM_TRITON_MLA_SPARSE=1 \
  VLLM_TRITON_MLA_SPARSE_TOPK_CHUNK_SIZE=256 \
  VLLM_TRITON_MLA_SPARSE_QUERY_CHUNK_SIZE=128 \
  VLLM_TRITON_MLA_SPARSE_ALLOW_CUDAGRAPH=1 \
  nohup vllm serve /root/.cache/modelscope/hub/models/deepseek-ai/DeepSeek-V4-flash \
   --trust-remote-code \
   --kv-cache-dtype fp8 \
   --block-size 256 \
   --tensor-parallel-size 8 \
   --tokenizer-mode deepseek_v4 \
   --tool-call-parser deepseek_v4 \
   --enable-auto-tool-choice \
   --reasoning-parser deepseek_v4 \
   --host 0.0.0.0 --port 8005 \
   --served-model-name DeepSeek-V4-flash \
   --gpu-memory-utilization 0.93 \
   --max-num-seqs 4 \
   --max-model-len 262144 \
   --async-scheduling \
   --enable-prefix-caching \
   --load-format auto \
   --pipeline-parallel-size 1 \
   --enable-expert-parallel \
   --compilation-config '{"cudagraph_mode":"FULL_AND_PIECEWISE"}' > /tmp/vllm_server.log 2>&1 &

Environment Variable Explanations

Variable	Function	SM120 Default Value
`VLLM_TRITON_MLA_SPARSE=1`	Enables Triton sparse MLA attention	Auto-detected (enabled by default for SM120)
`VLLM_TRITON_MLA_SPARSE_TOPK_CHUNK_SIZE=256`	TopK chunk size	512
`VLLM_TRITON_MLA_SPARSE_QUERY_CHUNK_SIZE=128`	Query chunk size	256
`VLLM_TRITON_MLA_SPARSE_ALLOW_CUDAGRAPH=1`	Enables CUDA graphs	0 (disabled by default)

These environment variables are INVALID for SM120 – DO NOT use:

VLLM_SM120_REFERENCE_DEEPSEEK_V4_ATTENTION — Non-existent

VLLM_SM120_REFERENCE_TOPK_CHUNK_SIZE — Non-existent

VLLM_SM120_REFERENCE_QUERY_CHUNK_SIZE — Non-existent

VLLM_ATTENTION_BACKEND=FLASH_ATTN_2 — Invalid value; SM120 automatically uses TRITON_MLA
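
As a rough illustration of the two lists above, the sketch below assembles the four Triton sparse-MLA variables into an environment mapping and flags any setting the guide marks as invalid. The helper names `build_sm120_env` and `find_invalid_vars` are hypothetical and not part of vLLM; the values are simply those from the startup command.

```python
import os

# Values taken from the recommended startup command in this guide.
SM120_ENV = {
    "VLLM_TRITON_MLA_SPARSE": "1",
    "VLLM_TRITON_MLA_SPARSE_TOPK_CHUNK_SIZE": "256",
    "VLLM_TRITON_MLA_SPARSE_QUERY_CHUNK_SIZE": "128",
    "VLLM_TRITON_MLA_SPARSE_ALLOW_CUDAGRAPH": "1",
}

# Names the guide lists as non-existent on this branch.
NONEXISTENT_VARS = {
    "VLLM_SM120_REFERENCE_DEEPSEEK_V4_ATTENTION",
    "VLLM_SM120_REFERENCE_TOPK_CHUNK_SIZE",
    "VLLM_SM120_REFERENCE_QUERY_CHUNK_SIZE",
}


def build_sm120_env(base=None):
    """Return a copy of `base` (default: os.environ) with the SM120 settings applied."""
    env = dict(os.environ if base is None else base)
    env.update(SM120_ENV)
    return env


def find_invalid_vars(env):
    """Return the sorted names in `env` that the guide marks as invalid for SM120."""
    bad = [name for name in env if name in NONEXISTENT_VARS]
    # The guide also rejects this specific backend value on SM120.
    if env.get("VLLM_ATTENTION_BACKEND") == "FLASH_ATTN_2":
        bad.append("VLLM_ATTENTION_BACKEND")
    return sorted(bad)
```

The returned mapping could be passed as `env=` to `subprocess.Popen` when launching `vllm serve`, after checking that `find_invalid_vars` returns an empty list.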

4. Testing

Chat Functionality Test

curl -s http://localhost:8005/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "DeepSeek-V4-flash",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 100,
    "temperature": 0.7
  }' | python3 -m json.tool
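
For scripted checks, the same request can be issued from Python with only the standard library. This is a minimal sketch: `build_chat_request` and `send_chat` are hypothetical helper names, and the base URL and model name are the ones used in the curl command above.

```python
import json
import urllib.request


def build_chat_request(prompt, base_url="http://localhost:8005",
                       model="DeepSeek-V4-flash", max_tokens=100, temperature=0.7):
    """Build (but do not send) a POST request for /v1/chat/completions."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat(request, timeout=60):
    """Send the request and return the first choice's message content."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(send_chat(build_chat_request("Hello!")))` should print the model's reply.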

Streaming Performance Test

curl -s http://localhost:8005/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "DeepSeek-V4-flash",
    "messages": [{"role": "user", "content": "Write a detailed paragraph about AI."}],
    "max_tokens": 512,
    "temperature": 0.7,
    "stream": true
  }' 2>/dev/null | python3 -c "
import sys, time
start = time.time()
tokens = 0
for line in sys.stdin:
    if line.startswith('data: ') and line.strip() != 'data: [DONE]':
        import json
        data = json.loads(line[6:])
        choices = data.get('choices', [])
        if choices and choices[0].get('delta', {}).get('content'):
            tokens += 1
            sys.stdout.write(choices[0]['delta']['content'])
            sys.stdout.flush()
elapsed = time.time() - start
print(f'\n\n--- Performance ---')
print(f'Generated {tokens} tokens in {elapsed:.2f}s')
print(f'Throughput: {tokens/elapsed:.2f} tok/s')
"

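Two caveats about the inline script above: it counts streamed chunks rather than true tokens, and the elapsed time includes the wait for the first chunk, so the figure is an end-to-end approximation. The parsing step can be isolated as below so that it can be checked without a server. This is a sketch only, and the function names are hypothetical.

```python
import json


def parse_sse_chunks(lines):
    """Yield content strings from OpenAI-style 'data: {...}' stream lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data: "):
            continue  # skip blank keep-alive lines and anything that is not a data event
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break
        choices = json.loads(payload).get("choices", [])
        if choices:
            text = choices[0].get("delta", {}).get("content")
            if text:
                yield text


def throughput(chunk_count, elapsed_seconds):
    """Chunks per second; approximates tok/s only if each chunk carries one token."""
    return chunk_count / elapsed_seconds if elapsed_seconds > 0 else 0.0
```

Feeding `sys.stdin` to `parse_sse_chunks` and timing the loop reproduces the measurement, with a guard against division by zero when nothing is received.
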
5. Actual Performance Results

Configuration	Throughput
No CUDA graphs + invalid env vars	~5 tok/s
CUDA graphs enabled + correct env vars	30-35 tok/s

Key Optimization Points

CUDA graphs are the most critical optimization — VLLM_TRITON_MLA_SPARSE_ALLOW_CUDAGRAPH=1 must be set. Otherwise, sparse_mla_env.py will forcibly disable CUDA graphs on SM120.
DeepGEMM must NOT be enabled on SM120 — support_deep_gemm() must only return True for SM90/SM100. While DeepGEMM imports successfully, the transform_sf_into_required_layout function in csrc/apis/layout.hpp only supports arch_major 9 and 10; arch_major 12 triggers a DG_HOST_UNREACHABLE("Unknown SF transformation") error and crashes model loading.
SM120 attention path — Implemented via Triton/PyTorch fallbacks in sm12x_deep_gemm_fallbacks.py and sm12x_mqa.py, independent of DeepGEMM-compiled CUDA kernels.

Known Performance Limitations

Limitation	Root Cause
Custom allreduce disabled	8 PCIe GPUs without NVLink (`world_size > 2 && !fully_connected`)
SymmMemCommunicator unavailable	SM120 (12.0) is not in the supported list
W8A8 Block FP8 uses default config	No optimized profile for `NVIDIA_RTX_PRO_6000_Blackwell_Server_Edition`
Communication via NCCL over PCIe	Limited cross-GPU communication bandwidth without NVLink
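
The first row's condition can be read as a simple predicate. The sketch below only mirrors the expression quoted in the table (`world_size > 2 && !fully_connected`); it is not vLLM's actual code, and the function name is hypothetical.

```python
def custom_allreduce_disabled(world_size: int, fully_connected: bool) -> bool:
    """Mirror the quoted condition: more than two GPUs that are not all directly linked."""
    return world_size > 2 and not fully_connected


# Eight PCIe-attached GPUs without NVLink, as in this setup.
print(custom_allreduce_disabled(8, fully_connected=False))
```

Under this reading, a two-GPU setup, or a fully connected NVLink topology, would not trip the condition.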

6. Troubleshooting

CUDA Graphs Disabled

Log warning:

WARNING sparse_mla_env.py:101 Disabling CUDA graphs for the DeepSeek V4 Triton
sparse MLA path by default. Set VLLM_TRITON_MLA_SPARSE_ALLOW_CUDAGRAPH=1 to opt
into the experimental graph-captured path. vLLM compile remains enabled.

Solution: Set VLLM_TRITON_MLA_SPARSE_ALLOW_CUDAGRAPH=1

Unknown SF Transformation Error

Log error:

RuntimeError: Assertion error (csrc/apis/layout.hpp:59): Unknown SF transformation

Cause: support_deep_gemm() returns True for SM120, but DeepGEMM's C++ module does not support SM120's scale factor format conversion.

Solution: Ensure support_deep_gemm() in vllm/platforms/cuda.py excludes SM120:

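# In the CUDA platform class (vllm/platforms/cuda.py): report DeepGEMM support
# only for SM90 and the SM100 family, so SM120 uses the Triton/PyTorch fallbacks.
@classmethod
def support_deep_gemm(cls) -> bool:
    return (cls.is_device_capability(90)
            or cls.is_device_capability_family(100))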
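
The gating logic can also be illustrated outside vLLM. The sketch below is a standalone stand-in that follows the `arch_major` check described for `layout.hpp` (majors 9 and 10 only); `deep_gemm_supported` is a hypothetical name, not a vLLM or DeepGEMM function.

```python
def deep_gemm_supported(capability):
    """Return True only for compute capability majors 9 (SM90) and 10 (SM100).

    `capability` is a (major, minor) tuple such as (12, 0) for SM120.
    """
    major, _minor = capability
    return major in (9, 10)


for cap in [(9, 0), (10, 0), (12, 0)]:
    print(cap, deep_gemm_supported(cap))
```

On a machine with PyTorch installed, the tuple could come from `torch.cuda.get_device_capability()`.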

Unreleased GPU Memory

ValueError: Free memory on device cuda:0 (2.21/94.97 GiB) on startup is less
than desired GPU memory utilization (0.93, 88.32 GiB).

Solution: Kill residual processes and retry:

# -r (GNU xargs) skips kill when no matching process is found
ps aux | grep "vllm serve" | grep -v grep | awk '{print $2}' | xargs -r kill -9
# Or check with nvidia-smi and wait for memory release
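
The numbers in the error message follow from multiplying total memory by the requested utilization. A small sketch of that arithmetic, assuming the check is a plain comparison as the message suggests (the function names are hypothetical):

```python
def required_free_gib(total_gib: float, utilization: float) -> float:
    """Memory that must be free at startup for a given --gpu-memory-utilization."""
    return total_gib * utilization


def has_enough_free(free_gib: float, total_gib: float, utilization: float) -> bool:
    """True when the free memory covers the requested share of the device."""
    return free_gib >= required_free_gib(total_gib, utilization)


# Values from the error above: 2.21 GiB free of 94.97 GiB, utilization 0.93.
print(f"{required_free_gib(94.97, 0.93):.2f} GiB required")
print(has_enough_free(2.21, 94.97, 0.93))
```

With only 2.21 GiB free the check fails, which is why leftover server processes have to be stopped first.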

liku2001

6 days ago

Working with Claude Code, I modified the source code and deployed DeepSeek using the third-party developer's repository from the article. Throughput has now increased from 5 to 30 tokens per second, close to the normal level.

I will try SGLang next; I am downloading the FP8 version of the DeepSeek weights for SGLang.

liku2001

6 days ago

Claude Code is good!!

DeepSeek-V4-Flash-FP8 Deployment and Performance Test Report

Environment

Item	Specification
GPU	8× NVIDIA RTX PRO 6000 Blackwell Server Edition (SM120)
VRAM	~96 GB per card
Model	sgl-project/DeepSeek-V4-Flash-FP8 (274 GB, 46 shards)
Inference Engine	SGLang (lmsysorg/sglang:deepseek-v4-blackwell)
Patch	deepseek-v4-flash-sm120 (SM120 FlashMLA sparse-decode patch)
Container	Docker 29.4.3

Deployment Steps

1. Download Model Weights

# Downloaded via ModelScope (completed in advance)
# Path: <local model cache directory>/models/sgl-project/DeepSeek-V4-Flash-FP8/
# 46 safetensors shards, 274 GB in total

2. Clone SM120 Patch Repository

git clone https://github.com/0xSero/deepseek-v4-flash-sm120.git
cd deepseek-v4-flash-sm120
git submodule update --init --recursive

3. Pull SGLang Image

docker pull lmsysorg/sglang:deepseek-v4-blackwell

4. Compile SM120 CUDA Extension

# Compile inside SGLang container; outputs saved to build-docker/
scripts/build_in_sglang_docker.sh

Build Artifacts:

build-docker/sitecustomize.py
build-docker/deepseek_v4_kernel/cuda.cpython-312-x86_64-linux-gnu.so
build-docker/deepseek_v4_kernel/_patch.py
etc.

5. Launch Service

docker run \
  --name sglang-dsv4 \
  --gpus all \
  --privileged \
  --shm-size=64g \
  --ipc=host \
  --ulimit memlock=-1 --ulimit stack=67108864 \
  --network host \
  -v /custom-model-dir:/workspace/model:ro \
  -v /patch-project-dir/build-docker:/dsv4:ro \
  -e PYTHONPATH=/dsv4 \
  -e CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
  -e SGLANG_ENABLE_THINKING=1 \
  -e SGLANG_REASONING_EFFORT=max \
  lmsysorg/sglang:deepseek-v4-blackwell \
  python3 -m sglang.launch_server \
    --model-path /workspace/model \
    --host 0.0.0.0 \
    --port 8000 \
    --served-model-name deepseek-v4-flash \
    --trust-remote-code \
    --tensor-parallel-size 8 \
    --context-length 393216 \
    --mem-fraction-static 0.85 \
    --max-running-requests 16 \
    --kv-cache-dtype fp8_e4m3 \
    --tool-call-parser deepseekv4 \
    --reasoning-parser deepseek-v4 \
    --attention-backend compressed \
    --fp8-gemm-backend triton \
    --moe-runner-backend triton \
    --chunked-prefill-size 8192 \
    --watchdog-timeout 3600 \
    --page-size 256 \
    --speculative-algorithm EAGLE \
    --speculative-num-steps 1 \
    --speculative-eagle-topk 1 \
    --speculative-num-draft-tokens 2 \
    --speculative-attention-mode decode \
    --cuda-graph-max-bs 32 \
    --enable-return-routed-experts

6. Service Validation

curl -s http://127.0.0.1:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"deepseek-v4-flash","temperature":0,"max_tokens":32,
       "messages":[{"role":"user","content":"Say OK only."}]}'

Key Parameter Description

Parameter	Value	Description
tensor-parallel-size	8	8-card tensor parallelism
context-length	393216	Maximum context length
kv-cache-dtype	fp8_e4m3	FP8 precision for KV cache
speculative-algorithm	EAGLE	Speculative decoding acceleration
speculative-num-draft-tokens	2	2 draft tokens per step
cuda-graph-max-bs	32	Max batch size for CUDA Graph
page-size	256	Page size of KV cache
mem-fraction-static	0.85	Static memory allocation ratio
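
To relate `context-length` and `page-size`, the sketch below computes how many KV-cache pages a single maximum-length sequence would span. It is plain ceiling division and makes no claim about how SGLang actually allocates memory for this model; `pages_per_sequence` is a hypothetical helper.

```python
def pages_per_sequence(context_length: int, page_size: int) -> int:
    """Number of fixed-size pages needed to hold `context_length` tokens."""
    return -(-context_length // page_size)  # ceiling division


# Values from the launch command: 393216-token context, 256-token pages.
print(pages_per_sequence(393216, 256))  # 1536
```

393216 is 384 × 1024, so the configured context divides evenly into 256-token pages.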

Performance Test Results

Chat Scenario Benchmark

Scenario	Input Tok	Output Tok	Latency	Decode Speed
Simple Q&A	11	55	1.47s	37.4 tok/s
Chinese Writing	13	199	4.26s	46.7 tok/s
Code Generation	16	256	5.58s	45.8 tok/s
Logical Reasoning	60	256	6.60s	38.8 tok/s
Multi-turn Dialogue	35	256	5.19s	49.4 tok/s

In short-context chat scenarios, the decoding speed ranges roughly from 37 to 49 tok/s with good output quality.
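
The decode speeds in the table appear to be output tokens divided by total latency; the post does not state the formula, so treat that as an inference. The sketch below recomputes each row under that assumption, and the results agree with the reported values to within about 0.1 tok/s.

```python
# (scenario, output tokens, latency in seconds, reported tok/s), copied from the table above
CHAT_ROWS = [
    ("Simple Q&A", 55, 1.47, 37.4),
    ("Chinese Writing", 199, 4.26, 46.7),
    ("Code Generation", 256, 5.58, 45.8),
    ("Logical Reasoning", 256, 6.60, 38.8),
    ("Multi-turn Dialogue", 256, 5.19, 49.4),
]


def decode_speed(output_tokens: int, latency_s: float) -> float:
    """Output tokens divided by end-to-end latency, in tok/s."""
    return output_tokens / latency_s


for name, out_tok, latency, reported in CHAT_ROWS:
    print(f"{name}: {decode_speed(out_tok, latency):.1f} tok/s (reported {reported})")
```

Because the latency includes prefill, these figures slightly understate pure decode speed, although the effect is small at such short inputs.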

Context Length vs Throughput

Target Context	Actual Input Tok	Output Tok	Total Latency	Prefill Speed (input tok / total latency)
4K	4,014	60	10.68s	376 tok/s
8K	8,014	57	5.65s	1,419 tok/s
16K	16,014	52	7.99s	2,003 tok/s
32K	32,014	57	16.52s	1,938 tok/s

Prefill speed peaks at ~2,000 tok/s at 16K–32K context length.
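
The speed column matches input tokens divided by total latency, decode time included, to within a couple of tok/s. That reading is inferred from the numbers rather than stated in the post. A short sketch that recomputes the column and picks the fastest row:

```python
# (target, input tokens, total latency in seconds, reported tok/s), copied from the table above
CONTEXT_ROWS = [
    ("4K", 4014, 10.68, 376),
    ("8K", 8014, 5.65, 1419),
    ("16K", 16014, 7.99, 2003),
    ("32K", 32014, 16.52, 1938),
]


def input_rate(input_tokens: int, total_latency_s: float) -> float:
    """Input tokens divided by total request latency, in tok/s."""
    return input_tokens / total_latency_s


best = max(CONTEXT_ROWS, key=lambda row: input_rate(row[1], row[2]))
print(f"fastest row: {best[0]} at {input_rate(best[1], best[2]):.0f} tok/s")
```

Since the denominator also covers the 52 to 60 generated tokens, the true prefill rate is somewhat higher than these values.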

EAGLE Speculative Decoding

Round	Speed
1	49.9 tok/s
2	44.8 tok/s
3	41.1 tok/s
4	45.7 tok/s
5	46.6 tok/s
Average	45.6 tok/s

With 2 draft tokens enabled, EAGLE speculative decoding delivers a stable average speed of ~45 tok/s in short dialogue scenarios.
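
The average can be checked directly from the five rounds, and the spread gives a sense of how stable the runs are. A minimal sketch using the standard library:

```python
from statistics import mean, pstdev

# Per-round speeds in tok/s, copied from the table above.
speeds = [49.9, 44.8, 41.1, 45.7, 46.6]

average = mean(speeds)
spread = pstdev(speeds)  # population standard deviation across the five rounds

print(f"average {average:.1f} tok/s, std dev {spread:.1f} tok/s, "
      f"range {min(speeds)}-{max(speeds)} tok/s")
```

The mean comes out at about 45.6 tok/s, matching the table, with a standard deviation of roughly 2.8 tok/s.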

Issues & Notes

SM120 Patch Mounting

The built-in FlashMLA in official SGLang images only supports SM90/SM100. The SM120 patch must be injected via:

Read-only mount of compiled artifacts: -v ./build-docker:/dsv4:ro
Add patch to Python path: -e PYTHONPATH=/dsv4

Successful validation log:
deepseek_v4_kernel.patch_flash_mla installed (device SM 12.0)
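
The mounting approach presumably relies on standard Python behavior: at startup the interpreter imports a module named `sitecustomize` if one is found on `sys.path`, and `PYTHONPATH` entries are searched early. The sketch below demonstrates that mechanism with a stand-in patch that only sets a marker. It does not reproduce the real `build-docker/sitecustomize.py`, and the function name is hypothetical.

```python
import os
import subprocess
import sys
import tempfile


def run_with_sitecustomize(patch_source: str, child_code: str) -> str:
    """Run `child_code` in a fresh interpreter whose PYTHONPATH holds a sitecustomize.py."""
    with tempfile.TemporaryDirectory() as patch_dir:
        with open(os.path.join(patch_dir, "sitecustomize.py"), "w") as fh:
            fh.write(patch_source)
        env = dict(os.environ, PYTHONPATH=patch_dir)
        result = subprocess.run(
            [sys.executable, "-c", child_code],
            env=env, capture_output=True, text=True, check=True,
        )
    return result.stdout.strip()


# Stand-in for the real patch: it only sets a marker instead of patching FlashMLA.
PATCH = "import os\nos.environ['DEMO_PATCH_INSTALLED'] = '1'\n"
CHILD = "import os; print(os.environ.get('DEMO_PATCH_INSTALLED'))"

if __name__ == "__main__":
    print(run_with_sitecustomize(PATCH, CHILD))
```

The child process never imports the patch explicitly, which is the property that lets the container run an unmodified `sglang.launch_server` command.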

W8A8 / MoE Performance Warning

The following warning repeatedly appears in startup logs:

Using default W8A8 Block FP8 kernel config. Performance might be sub-optimal!
Using default MoE kernel config. Performance might be sub-optimal!

RTX PRO 6000 Blackwell is new hardware with no official optimized preset config integrated into SGLang yet. It does not affect inference correctness, only peak token throughput in batch decoding. Custom optimized configs can be generated via benchmarking for further tuning.

Reasoning Output Format

When SGLANG_ENABLE_THINKING=1 is enabled, model reasoning content is wrapped in specific tags. Currently the reasoning_content field is not parsed separately, and reasoning text is mixed into the standard content field. Further updates to SGLang tokenizer are required to fix this.
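
Until the server separates the two fields, a client can split them after the fact. The sketch below assumes the reasoning is wrapped in `<think>...</think>` tags; the post does not name the tags, so this is a guess that should be adjusted to whatever the server actually emits. `split_reasoning` is a hypothetical helper.

```python
import re

# Assumption: reasoning is wrapped in <think>...</think>; change the pattern if the tags differ.
_THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(content: str):
    """Return (reasoning, answer) from a content string with inline reasoning tags."""
    reasoning = "\n".join(part.strip() for part in _THINK_RE.findall(content))
    answer = _THINK_RE.sub("", content).strip()
    return reasoning, answer


reasoning, answer = split_reasoning("<think>2 + 2 is 4.</think>The answer is 4.")
print(answer)
```

Content without any tags passes through unchanged, with an empty reasoning string.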

Long Context Decode Performance

Decode speed drops significantly to 3–5 tok/s at 64K+ context length, mainly because the current SM120 sparse-decode patch prioritizes functional correctness over raw performance. Possible future optimizations:

Split-KV / multi-CTA sparse decode kernels
Dedicated tile configuration tuning for W8A8/MoE on RTX PRO 6000 Blackwell

Conclusion

DeepSeek-V4-Flash-FP8 is successfully deployed and operational on 8× RTX PRO 6000 Blackwell. It delivers roughly 37–50 tok/s decoding speed for short-context chat workloads and up to ~2,000 tok/s prefill throughput at 16K–32K context. The main bottleneck is long-context decoding (3–5 tok/s at 64K+), which requires further low-level kernel optimization.

swhua

6 days ago

Where is the ds4-sm120-preview branch?

liku2001

4 days ago

edited 4 days ago

Re: "Where is the branch?"

Following the article https://zhuanlan.zhihu.com/p/2031484558114337285 and https://github.com/deepseek-ai/DeepGEMM/pull/318, the branch comes from:

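# Clone jasl's vLLM fork
git clone https://github.com/jasl/vllm.git
cd vllm
# Register the fork under the remote name "jasl" and fetch its branches
git remote add jasl https://github.com/jasl/vllm.git
git fetch jasl
# Check out the SM120 branch (the guide above refers to it as ds4-sm120-preview)
git checkout -b ds4-sm120 jasl/ds4-sm120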

However, I recommend using SGLang; it is easier.
