Running models with vLLM on the RTX Pro 6000 - SM120
I plan to follow https://zhuanlan.zhihu.com/p/2031484558114337285 and https://github.com/deepseek-ai/DeepGEMM/pull/318, and I will try to verify this tomorrow.
So does official vLLM support SM120 now? I really don't want to build vLLM from source.
Excited to see your verification results! SM120 support is exactly what I've been waiting for.
I followed the article. After a full day of tough trial and error, I successfully compiled vLLM, launched DeepSeek, and completed interactive conversations. Unfortunately, the performance fell short of what the article reported: I only get 5 tokens per second on my setup with 8 RTX Pro 6000 GPUs.
Before this, I tried launching with the latest vLLM release, v0.20.2, but it reported that FlashAttention could not be found. Mimo v2.5 also failed to start on the SM120 GPUs. I may switch to SGLang for testing tomorrow.
4x RTX PRO 6000 - sglang - 400k total context - 105 peak tok/s decode
DeepSeek-V4-Flash Deployment Guide for SM120 (RTX PRO 6000 Blackwell)
Environment Information
- GPU: 8× NVIDIA RTX PRO 6000 Blackwell Server Edition (SM120, compute capability 12.0)
- CUDA: 12.8+
- Python: 3.10
- Framework: vLLM (jasl/ds4-sm120-preview branch)
- Model: DeepSeek-V4-Flash (FP8)
1. Environment Preparation
# Navigate to the working directory
cd /home/guest/vllm-sm120-git-dir/vllm
# Activate the virtual environment
source .venv/bin/activate
# Verify the current branch
git branch --show-current
# Expected output: ds4-sm120-preview
Critical Note: You must use the `ds4-sm120-preview` branch. The `sm120-full` branch lacks the SM120 Triton fallback code required for deployment.
2. Compile vLLM
# Set the DeepGEMM source path (jasl's fork with SM120 support)
export DEEPGEMM_SRC_DIR=/home/guest/vllm-sm120-git-dir/DeepGEMM
# Compile and install vLLM
MAX_JOBS=64 pip install --no-build-isolation -e . --verbose
3. Start the Service
Recommended Startup Command
source .venv/bin/activate
CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
VLLM_TRITON_MLA_SPARSE=1 \
VLLM_TRITON_MLA_SPARSE_TOPK_CHUNK_SIZE=256 \
VLLM_TRITON_MLA_SPARSE_QUERY_CHUNK_SIZE=128 \
VLLM_TRITON_MLA_SPARSE_ALLOW_CUDAGRAPH=1 \
nohup vllm serve /root/.cache/modelscope/hub/models/deepseek-ai/DeepSeek-V4-flash \
--trust-remote-code \
--kv-cache-dtype fp8 \
--block-size 256 \
--tensor-parallel-size 8 \
--tokenizer-mode deepseek_v4 \
--tool-call-parser deepseek_v4 \
--enable-auto-tool-choice \
--reasoning-parser deepseek_v4 \
--host 0.0.0.0 --port 8005 \
--served-model-name DeepSeek-V4-flash \
--gpu-memory-utilization 0.93 \
--max-num-seqs 4 \
--max-model-len 262144 \
--async-scheduling \
--enable-prefix-caching \
--load-format auto \
--pipeline-parallel-size 1 \
--enable-expert-parallel \
--compilation-config '{"cudagraph_mode":"FULL_AND_PIECEWISE"}' > /tmp/vllm_server.log 2>&1 &
Environment Variable Explanations
| Variable | Function | SM120 Default Value |
|---|---|---|
| `VLLM_TRITON_MLA_SPARSE=1` | Enables Triton sparse MLA attention | Auto-detected (enabled by default for SM120) |
| `VLLM_TRITON_MLA_SPARSE_TOPK_CHUNK_SIZE=256` | TopK chunk size | 512 |
| `VLLM_TRITON_MLA_SPARSE_QUERY_CHUNK_SIZE=128` | Query chunk size | 256 |
| `VLLM_TRITON_MLA_SPARSE_ALLOW_CUDAGRAPH=1` | Enables CUDA graphs | 0 (disabled by default) |
These environment variables are INVALID for SM120. DO NOT use them:
- `VLLM_SM120_REFERENCE_DEEPSEEK_V4_ATTENTION`: non-existent
- `VLLM_SM120_REFERENCE_TOPK_CHUNK_SIZE`: non-existent
- `VLLM_SM120_REFERENCE_QUERY_CHUNK_SIZE`: non-existent
- `VLLM_ATTENTION_BACKEND=FLASH_ATTN_2`: invalid value; SM120 automatically uses `TRITON_MLA`
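Before launching, it can help to audit the shell environment against the two lists above. The following is a minimal sketch, not part of vLLM: the helper `audit_env` is hypothetical, and the variable names are copied from this guide rather than from any official list.

```python
import os

# Variable names copied from the guide above; audit_env is a hypothetical helper.
NONEXISTENT = {
    "VLLM_SM120_REFERENCE_DEEPSEEK_V4_ATTENTION",
    "VLLM_SM120_REFERENCE_TOPK_CHUNK_SIZE",
    "VLLM_SM120_REFERENCE_QUERY_CHUNK_SIZE",
}


def audit_env(env):
    """Return a list of human-readable problems found in an env mapping."""
    problems = []
    for name in sorted(NONEXISTENT):
        if name in env:
            problems.append(f"{name} is set but does not exist; remove it")
    if env.get("VLLM_ATTENTION_BACKEND") == "FLASH_ATTN_2":
        problems.append("VLLM_ATTENTION_BACKEND=FLASH_ATTN_2 is invalid on SM120")
    if env.get("VLLM_TRITON_MLA_SPARSE_ALLOW_CUDAGRAPH") != "1":
        problems.append("VLLM_TRITON_MLA_SPARSE_ALLOW_CUDAGRAPH=1 is not set")
    return problems


if __name__ == "__main__":
    for problem in audit_env(os.environ):
        print("WARNING:", problem)
```

Running it in the same shell that will start `vllm serve` prints one warning per problem and nothing when the environment matches the recommended setup.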
4. Testing
Chat Functionality Test
curl -s http://localhost:8005/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "DeepSeek-V4-flash",
"messages": [{"role": "user", "content": "Hello!"}],
"max_tokens": 100,
"temperature": 0.7
}' | python3 -m json.tool
Streaming Performance Test
curl -s http://localhost:8005/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "DeepSeek-V4-flash",
"messages": [{"role": "user", "content": "Write a detailed paragraph about AI."}],
"max_tokens": 512,
"temperature": 0.7,
"stream": true
}' 2>/dev/null | python3 -c "
import sys, time, json
start = time.time()
tokens = 0
for line in sys.stdin:
    if line.startswith('data: ') and line.strip() != 'data: [DONE]':
        data = json.loads(line[6:])
        choices = data.get('choices', [])
        if choices and choices[0].get('delta', {}).get('content'):
            # Each content chunk is counted as one token (an approximation)
            tokens += 1
            sys.stdout.write(choices[0]['delta']['content'])
            sys.stdout.flush()
elapsed = time.time() - start
print(f'\n\n--- Performance ---')
print(f'Generated {tokens} tokens in {elapsed:.2f}s')
print(f'Throughput: {tokens/elapsed:.2f} tok/s')
"
5. Actual Performance Results
| Configuration | Throughput |
|---|---|
| No CUDA graphs + invalid env vars | ~5 tok/s |
| CUDA graphs enabled + correct env vars | 30-35 tok/s |
Key Optimization Points
- CUDA graphs are the most critical optimization: `VLLM_TRITON_MLA_SPARSE_ALLOW_CUDAGRAPH=1` must be set. Otherwise, `sparse_mla_env.py` will forcibly disable CUDA graphs on SM120.
- DeepGEMM must NOT be enabled on SM120: `support_deep_gemm()` must only return `True` for SM90/SM100. While DeepGEMM imports successfully, the `transform_sf_into_required_layout` function in `csrc/apis/layout.hpp` only supports arch_major 9 and 10; arch_major 12 triggers a `DG_HOST_UNREACHABLE("Unknown SF transformation")` error and crashes model loading.
- SM120 attention path: implemented via Triton/PyTorch fallbacks in `sm12x_deep_gemm_fallbacks.py` and `sm12x_mqa.py`, independent of DeepGEMM-compiled CUDA kernels.
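The DeepGEMM gate described above can be illustrated as a plain function over the compute capability. This is only a sketch: `deep_gemm_supported` is a hypothetical stand-in, and it assumes that "SM90" means exactly capability 9.0 and that the "100 family" means any capability with major version 10.

```python
def deep_gemm_supported(major: int, minor: int = 0) -> bool:
    """Hypothetical stand-in for the capability gate described in this guide.

    Assumption: SM90 is exactly (9, 0); the "100 family" is any major == 10.
    """
    return (major, minor) == (9, 0) or major == 10


# (major, minor) pairs for Hopper, Blackwell data-center, and RTX PRO 6000 Blackwell
for capability in [(9, 0), (10, 0), (12, 0)]:
    status = "enabled" if deep_gemm_supported(*capability) else "disabled"
    print(f"SM{capability[0]}{capability[1]}: DeepGEMM {status}")
```

With this rule SM120 falls through to the Triton/PyTorch fallback path instead of reaching the unsupported scale-factor layout transformation.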
Known Performance Limitations
| Limitation | Root Cause |
|---|---|
| Custom allreduce disabled | 8 PCIe GPUs without NVLink (world_size > 2 && !fully_connected) |
| SymmMemCommunicator unavailable | SM120 (12.0) is not in the supported list |
| W8A8 Block FP8 uses default config | No optimized profile for NVIDIA_RTX_PRO_6000_Blackwell_Server_Edition |
| Communication via NCCL over PCIe | Limited cross-GPU communication bandwidth without NVLink |
6. Troubleshooting
CUDA Graphs Disabled
Log warning:
WARNING sparse_mla_env.py:101 Disabling CUDA graphs for the DeepSeek V4 Triton
sparse MLA path by default. Set VLLM_TRITON_MLA_SPARSE_ALLOW_CUDAGRAPH=1 to opt
into the experimental graph-captured path. vLLM compile remains enabled.
Solution: Set VLLM_TRITON_MLA_SPARSE_ALLOW_CUDAGRAPH=1
Unknown SF Transformation Error
Log error:
RuntimeError: Assertion error (csrc/apis/layout.hpp:59): Unknown SF transformation
Cause: support_deep_gemm() returns True for SM120, but DeepGEMM's C++ module does not support SM120's scale factor format conversion.
Solution: Ensure support_deep_gemm() in vllm/platforms/cuda.py excludes SM120:
    def support_deep_gemm(cls) -> bool:
        return (cls.is_device_capability(90)
                or cls.is_device_capability_family(100))
Unreleased GPU Memory
ValueError: Free memory on device cuda:0 (2.21/94.97 GiB) on startup is less
than desired GPU memory utilization (0.93, 88.32 GiB).
Solution: Kill residual processes and retry:
ps aux | grep "vllm serve" | grep -v grep | awk '{print $2}' | xargs kill -9
# Or check with nvidia-smi and wait for memory release
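The numbers in the error message follow from a simple product: the desired memory is the utilization fraction times the total device memory. The sketch below reproduces that arithmetic with the values from the log; `startup_memory_ok` is a hypothetical helper and a simplification, not vLLM's actual check.

```python
def startup_memory_ok(free_gib: float, total_gib: float, utilization: float):
    """Return (ok, required_gib) for a simplified startup memory check."""
    required_gib = utilization * total_gib
    return free_gib >= required_gib, required_gib


# Values taken from the error message above
ok, required = startup_memory_ok(free_gib=2.21, total_gib=94.97, utilization=0.93)
print(f"required {required:.2f} GiB, ok={ok}")  # required 88.32 GiB, ok=False
```

With only 2.21 GiB free, a leftover process is still holding almost the whole card, which is why killing residual `vllm serve` processes resolves the error.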
I worked with Claude Code to modify the source code and deployed DeepSeek using the third-party developer's repository from the article. Throughput has now increased from 5 to 30 tokens per second, close to the normal level.
I will try SGLang next; I am downloading the FP8 version of the DeepSeek weights for SGLang.
Claude Code is good!!
DeepSeek-V4-Flash-FP8 Deployment and Performance Test Report
Environment
| Item | Specification |
|---|---|
| GPU | 8× NVIDIA RTX PRO 6000 Blackwell Server Edition (SM120) |
| VRAM | ~96 GB per card |
| Model | sgl-project/DeepSeek-V4-Flash-FP8 (274 GB, 46 shards) |
| Inference Engine | SGLang (lmsysorg/sglang:deepseek-v4-blackwell) |
| Patch | deepseek-v4-flash-sm120 (SM120 FlashMLA sparse-decode patch) |
| Container | Docker 29.4.3 |
Deployment Steps
1. Download Model Weights
# Download via ModelScope (completed in advance)
# Path: Local model cache directory/models/sgl-project/DeepSeek-V4-Flash-FP8/
# Total 46 safetensors shards, 274 GB in total
2. Clone SM120 Patch Repository
git clone https://github.com/0xSero/deepseek-v4-flash-sm120.git
cd deepseek-v4-flash-sm120
git submodule update --init --recursive
3. Pull SGLang Image
docker pull lmsysorg/sglang:deepseek-v4-blackwell
4. Compile SM120 CUDA Extension
# Compile inside SGLang container; outputs saved to build-docker/
scripts/build_in_sglang_docker.sh
Build Artifacts:
- `build-docker/sitecustomize.py`
- `build-docker/deepseek_v4_kernel/cuda.cpython-312-x86_64-linux-gnu.so`
- `build-docker/deepseek_v4_kernel/_patch.py`
- etc.
5. Launch Service
docker run \
--name sglang-dsv4 \
--gpus all \
--privileged \
--shm-size=64g \
--ipc=host \
--ulimit memlock=-1 --ulimit stack=67108864 \
--network host \
-v /custom-model-dir:/workspace/model:ro \
-v /patch-project-dir/build-docker:/dsv4:ro \
-e PYTHONPATH=/dsv4 \
-e CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
-e SGLANG_ENABLE_THINKING=1 \
-e SGLANG_REASONING_EFFORT=max \
lmsysorg/sglang:deepseek-v4-blackwell \
python3 -m sglang.launch_server \
--model-path /workspace/model \
--host 0.0.0.0 \
--port 8000 \
--served-model-name deepseek-v4-flash \
--trust-remote-code \
--tensor-parallel-size 8 \
--context-length 393216 \
--mem-fraction-static 0.85 \
--max-running-requests 16 \
--kv-cache-dtype fp8_e4m3 \
--tool-call-parser deepseekv4 \
--reasoning-parser deepseek-v4 \
--attention-backend compressed \
--fp8-gemm-backend triton \
--moe-runner-backend triton \
--chunked-prefill-size 8192 \
--watchdog-timeout 3600 \
--page-size 256 \
--speculative-algorithm EAGLE \
--speculative-num-steps 1 \
--speculative-eagle-topk 1 \
--speculative-num-draft-tokens 2 \
--speculative-attention-mode decode \
--cuda-graph-max-bs 32 \
--enable-return-routed-experts
6. Service Validation
curl -s http://127.0.0.1:8000/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{"model":"deepseek-v4-flash","temperature":0,"max_tokens":32,
"messages":[{"role":"user","content":"Say OK only."}]}'
Key Parameter Description
| Parameter | Value | Description |
|---|---|---|
| tensor-parallel-size | 8 | 8-card tensor parallelism |
| context-length | 393216 | Maximum context length |
| kv-cache-dtype | fp8_e4m3 | FP8 precision for KV cache |
| speculative-algorithm | EAGLE | Speculative decoding acceleration |
| speculative-num-draft-tokens | 2 | 2 draft tokens per step |
| cuda-graph-max-bs | 32 | Max batch size for CUDA Graph |
| page-size | 256 | Page size of KV cache |
| mem-fraction-static | 0.85 | Static memory allocation ratio |
Performance Test Results
Chat Scenario Benchmark
| Scenario | Input Tok | Output Tok | Latency | Decode Speed |
|---|---|---|---|---|
| Simple Q&A | 11 | 55 | 1.47s | 37.4 tok/s |
| Chinese Writing | 13 | 199 | 4.26s | 46.7 tok/s |
| Code Generation | 16 | 256 | 5.58s | 45.8 tok/s |
| Logical Reasoning | 60 | 256 | 6.60s | 38.8 tok/s |
| Multi-turn Dialogue | 35 | 256 | 5.19s | 49.4 tok/s |
In short-context chat scenarios, the decoding speed ranges roughly from 37 to 49 tok/s with good output quality.
Context Length vs Throughput
| Target Context | Actual Input Tok | Output Tok | Total Latency | Prefill Speed (Overall) |
|---|---|---|---|---|
| 4K | 4,014 | 60 | 10.68s | 376 tok/s |
| 8K | 8,014 | 57 | 5.65s | 1,419 tok/s |
| 16K | 16,014 | 52 | 7.99s | 2,003 tok/s |
| 32K | 32,014 | 57 | 16.52s | 1,938 tok/s |
The optimal Prefill speed reaches ~2000 tok/s at 16K–32K context length.
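The "Prefill Speed (Overall)" column appears to be the input token count divided by the total latency, so it also includes the short decode phase. The sketch below recomputes it from the table; the function name `overall_speed` is made up for illustration, and the results match the reported values to within a few tok/s of rounding.

```python
# (target context, actual input tokens, total latency in seconds) from the table above
rows = [
    ("4K", 4014, 10.68),
    ("8K", 8014, 5.65),
    ("16K", 16014, 7.99),
    ("32K", 32014, 16.52),
]


def overall_speed(input_tokens: int, latency_s: float) -> float:
    """Input tokens per second over the whole request (prefill plus decode)."""
    return input_tokens / latency_s


for label, tokens, latency in rows:
    print(f"{label}: {overall_speed(tokens, latency):.0f} tok/s")
```

Because the denominator is the full request latency, these figures are a lower bound on the pure prefill rate.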
EAGLE Speculative Decoding
| Round | Speed |
|---|---|
| 1 | 49.9 tok/s |
| 2 | 44.8 tok/s |
| 3 | 41.1 tok/s |
| 4 | 45.7 tok/s |
| 5 | 46.6 tok/s |
| Average | 45.6 tok/s |
With 2 draft tokens enabled, EAGLE speculative decoding delivers a stable average speed of ~45 tok/s in short dialogue scenarios.
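The reported average can be checked directly from the five rounds, and the spread shows how much individual runs vary. A small sketch using only the numbers in the table:

```python
from statistics import mean

# Decode speeds (tok/s) for rounds 1-5 from the table above
speeds = [49.9, 44.8, 41.1, 45.7, 46.6]

average = mean(speeds)
spread = max(speeds) - min(speeds)

print(f"average: {average:.1f} tok/s")
print(f"range: {min(speeds)} to {max(speeds)} tok/s (spread {spread:.1f})")
```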
Issues & Notes
SM120 Patch Mounting
The built-in FlashMLA in official SGLang images only supports SM90/SM100. The SM120 patch must be injected via:
- Read-only mount of compiled artifacts: `-v ./build-docker:/dsv4:ro`
- Add patch to Python path: `-e PYTHONPATH=/dsv4`

Successful validation log: `deepseek_v4_kernel.patch_flash_mla installed (device SM 12.0)`
W8A8 / MoE Performance Warning
The following warning repeatedly appears in startup logs:
Using default W8A8 Block FP8 kernel config. Performance might be sub-optimal!
Using default MoE kernel config. Performance might be sub-optimal!
RTX PRO 6000 Blackwell is new hardware with no official optimized preset config integrated into SGLang yet. It does not affect inference correctness, only peak token throughput in batch decoding. Custom optimized configs can be generated via benchmarking for further tuning.
Reasoning Output Format
When SGLANG_ENABLE_THINKING=1 is enabled, model reasoning content is wrapped in specific tags. Currently the reasoning_content field is not parsed separately, and reasoning text is mixed into the standard content field. Further updates to SGLang tokenizer are required to fix this.
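Until the server separates the two fields, a client can split them itself. The sketch below assumes the reasoning is wrapped in `<think>...</think>` tags; the report does not name the tags, so adjust the pattern to whatever the server actually emits. `split_reasoning` is a hypothetical client-side helper, not an SGLang API.

```python
import re

# Assumption: reasoning is wrapped in <think>...</think>; the report does not name the tags.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(content: str):
    """Return (reasoning, answer) extracted from a mixed content string."""
    reasoning = "\n".join(part.strip() for part in THINK_RE.findall(content))
    answer = THINK_RE.sub("", content).strip()
    return reasoning, answer


reasoning, answer = split_reasoning("<think>2 + 2 = 4</think>The answer is 4.")
print("reasoning:", reasoning)
print("answer:", answer)
```

Content without any tags is returned unchanged as the answer, with an empty reasoning string.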
Long Context Decode Performance
Decode speed drops significantly to 3–5 tok/s at 64K+ context length, mainly because the current SM120 sparse-decode patch prioritizes functional correctness over raw performance. Possible future optimizations:
- Split-KV / multi-CTA sparse decode kernels
- Dedicated tile configuration tuning for W8A8/MoE on RTX PRO 6000 Blackwell
Conclusion
DeepSeek-V4-Flash-FP8 is successfully deployed and operational on 8× RTX PRO 6000 Blackwell. It delivers roughly 37–49 tok/s decoding speed for short-context chat workloads and up to ~2000 tok/s prefill throughput at 16K–32K context. The main bottleneck is long-context decoding performance, which requires further low-level kernel optimization.
Where is the branch?
Following the article https://zhuanlan.zhihu.com/p/2031484558114337285 and https://github.com/deepseek-ai/DeepGEMM/pull/318, the branch comes from:
git clone https://github.com/jasl/vllm.git
cd vllm
git remote add jasl https://github.com/jasl/vllm.git
git fetch jasl
git checkout -b ds4-sm120 jasl/ds4-sm120
However, I recommend using SGLang; it is easier.
