# Step3 Model Deployment Guide

This document provides deployment guidance for the Step3 model.

Currently, our open-source deployment guide only covers TP and DP+TP deployment methods. The AFD (Attn-FFN Disaggregated) approach described in our [paper](https://arxiv.org/abs/2507.19427) is still under joint development with the open-source community to achieve optimal performance. Please stay tuned for updates on our open-source progress.
## Overview

Step3 is a 321B-parameter VLM with hardware-aware model-system co-design, optimized to minimize decoding costs.

For our FP8 version, about 326 GB of memory is required. The smallest deployment unit for this version is 8xH20, using either Tensor Parallelism (TP) or Data Parallelism + Tensor Parallelism (DP+TP).

For our BF16 version, about 642 GB of memory is required. The smallest deployment unit for this version is 16xH20, using either Tensor Parallelism (TP) or Data Parallelism + Tensor Parallelism (DP+TP).
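As a rough sanity check on these figures, the required memory tracks the parameter count times bytes per parameter, plus overhead for the vision encoder, activations, and KV cache. A back-of-envelope sketch, not an exact accounting:

```python
# Back-of-envelope weight memory for a 321B-parameter model.
params = 321e9
print(f"BF16 weights: ~{params * 2 / 1e9:.0f} GB")  # 2 bytes/param -> ~642 GB
print(f"FP8 weights:  ~{params * 1 / 1e9:.0f} GB")  # 1 byte/param -> ~321 GB (reported ~326 GB incl. overhead)
```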
## Deployment Options

### vLLM Deployment

Please make sure to use a nightly version of vLLM built after this [PR](https://github.com/vllm-project/vllm/pull/21998) was merged. For details, please refer to the [vLLM nightly installation doc](https://docs.vllm.ai/en/latest/getting_started/installation/gpu.html#pre-built-wheels).
```bash
uv pip install -U vllm \
    --torch-backend=auto \
    --extra-index-url https://wheels.vllm.ai/nightly
```
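After installation, you can optionally confirm which build was picked up (a quick sanity check; any nightly build published after the PR merge should work):

```bash
python -c "import vllm; print(vllm.__version__)"
```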
We recommend using the following commands to deploy the model.

**`max_num_batched_tokens` should be larger than 4096. If not set, the default value is 8192.**

#### BF16 Model

##### Tensor Parallelism (Serving on 16xH20):
```bash
# start ray on node 0 and node 1

# node 0:
vllm serve /path/to/step3 \
    --tensor-parallel-size 16 \
    --reasoning-parser step3 \
    --enable-auto-tool-choice \
    --tool-call-parser step3 \
    --trust-remote-code \
    --max-num-batched-tokens 4096 \
    --port $PORT_SERVING
```
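The "start ray" comment above refers to joining the two H20 nodes into one Ray cluster before launching `vllm serve`. A minimal sketch, assuming node 0 is reachable from node 1 at `$NODE0_IP` and the default Ray port is free:

```bash
# node 0: start the Ray head
ray start --head --port=6379

# node 1: join the cluster started on node 0
ray start --address=$NODE0_IP:6379
```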
##### Data Parallelism + Tensor Parallelism (Serving on 16xH20):

Step3 has only a single KV head, so attention data parallelism can be adopted to reduce KV cache memory usage.
```bash
# start ray on node 0 and node 1

# node 0:
vllm serve /path/to/step3 \
    --data-parallel-size 16 \
    --tensor-parallel-size 1 \
    --reasoning-parser step3 \
    --enable-auto-tool-choice \
    --tool-call-parser step3 \
    --max-num-batched-tokens 4096 \
    --trust-remote-code
```
#### FP8 Model

##### Tensor Parallelism (Serving on 8xH20):
```bash
vllm serve /path/to/step3-fp8 \
    --tensor-parallel-size 8 \
    --reasoning-parser step3 \
    --enable-auto-tool-choice \
    --tool-call-parser step3 \
    --gpu-memory-utilization 0.85 \
    --max-num-batched-tokens 4096 \
    --trust-remote-code
```
##### Data Parallelism + Tensor Parallelism (Serving on 8xH20):
```bash
vllm serve /path/to/step3-fp8 \
    --data-parallel-size 8 \
    --tensor-parallel-size 1 \
    --reasoning-parser step3 \
    --enable-auto-tool-choice \
    --tool-call-parser step3 \
    --max-num-batched-tokens 4096 \
    --trust-remote-code
```
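Once any of the above servers is up, a quick way to confirm it is serving is to query the OpenAI-compatible model list (assuming the default port 8000, or `$PORT_SERVING` if you set `--port`):

```bash
curl http://localhost:8000/v1/models
```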
##### Key parameter notes:

* `reasoning-parser`: If enabled, reasoning content in the response will be parsed into a structured format.
* `tool-call-parser`: If enabled, tool call content in the response will be parsed into a structured format (see the sketch below for how to read the parsed fields).
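The sketch below shows how those parsed fields typically surface through the OpenAI client when serving with vLLM. The exact field name for the parsed reasoning output (`reasoning_content` here) is an assumption and may vary across server versions:

```python
# Minimal sketch: reading parsed reasoning and tool-call fields from a response
# of the server started above (localhost:8000 assumed).
from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

response = client.chat.completions.create(
    model="step3",
    messages=[{"role": "user", "content": "What is 17 * 24?"}],
)

message = response.choices[0].message
print("Answer:", message.content)
# Populated when --reasoning-parser is enabled (field name may differ by version).
print("Reasoning:", getattr(message, "reasoning_content", None))
# Populated when the model emits a tool call and --tool-call-parser is enabled.
print("Tool calls:", message.tool_calls)
```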
### SGLang Deployment

SGLang 0.4.10 or later is required:

```bash
pip3 install "sglang[all]>=0.4.10"
```
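You can optionally verify that the installed version meets the requirement (a quick check, assuming the package exposes `__version__` as usual):

```bash
python3 -c "import sglang; print(sglang.__version__)"
```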
#### BF16 Model

##### Tensor Parallelism (Serving on 16xH20):
```bash
# node 0
python -m sglang.launch_server \
    --model-path stepfun-ai/step3 \
    --dist-init-addr master_ip:5000 \
    --trust-remote-code \
    --tool-call-parser step3 \
    --reasoning-parser step3 \
    --tp 16 \
    --nnodes 2 \
    --node-rank 0

# node 1
python -m sglang.launch_server \
    --model-path stepfun-ai/step3 \
    --dist-init-addr master_ip:5000 \
    --trust-remote-code \
    --tool-call-parser step3 \
    --reasoning-parser step3 \
    --tp 16 \
    --nnodes 2 \
    --node-rank 1
```
#### FP8 Model

##### Tensor Parallelism (Serving on 8xH20):
```bash
python -m sglang.launch_server \
    --model-path /path/to/step3-fp8 \
    --trust-remote-code \
    --tool-call-parser step3 \
    --reasoning-parser step3 \
    --tp 8
```
### TensorRT-LLM Deployment

[Coming soon...]
## Client Request Examples

Once a server is running, you can use the chat API as below:
```python
from openai import OpenAI

# Set OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

chat_response = client.chat.completions.create(
    model="step3",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {"url": "https://xxxxx.png"},
                },
                {"type": "text", "text": "Please describe the image."},
            ],
        },
    ],
)
print("Chat response:", chat_response)
```
You can also upload base64-encoded local images:
```python
import base64
from openai import OpenAI

# Set OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

image_path = "/path/to/local/image.png"
with open(image_path, "rb") as f:
    encoded_image = base64.b64encode(f.read())
encoded_image_text = encoded_image.decode("utf-8")
base64_step = f"data:image/png;base64,{encoded_image_text}"

chat_response = client.chat.completions.create(
    model="step3",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {"url": base64_step},
                },
                {"type": "text", "text": "Please describe the image."},
            ],
        },
    ],
)
print("Chat response:", chat_response)
```
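The same chat request can also be sent directly over HTTP, e.g. with curl. A sketch assuming the default port 8000; replace the image URL with a reachable one:

```bash
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "step3",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": [
                {"type": "image_url", "image_url": {"url": "https://xxxxx.png"}},
                {"type": "text", "text": "Please describe the image."}
            ]}
        ]
    }'
```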
Note: Our image preprocessing pipeline implements a multi-patch mechanism to handle large images. If an input image exceeds 728x728 pixels, the system automatically applies cropping logic to split it into patches.
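If you want to anticipate whether a local image will be split into multiple patches before uploading it, here is a small hypothetical helper using Pillow; the 728x728 threshold comes from the note above:

```python
from PIL import Image

MAX_SIDE = 728  # threshold mentioned in the note above

def will_be_multi_patched(path: str) -> bool:
    """Return True if either side exceeds the threshold, so cropping into patches is expected."""
    with Image.open(path) as img:
        width, height = img.size
    return width > MAX_SIDE or height > MAX_SIDE

print(will_be_multi_patched("/path/to/local/image.png"))
```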