# Step3 Model Deployment Guide

This document provides deployment guidance for the Step3 model.

Currently, our open-source deployment guide only covers TP and DP+TP deployment methods. The AFD (Attn-FFN Disaggregated) approach described in our [paper](https://arxiv.org/abs/2507.19427) is still under joint development with the open-source community to achieve optimal performance. Please stay tuned for updates on our open-source progress.
## Overview

Step3 is a 321B-parameter VLM with hardware-aware model-system co-design, optimized to minimize decoding costs.

For our FP8 version, about 326 GB of memory is required. The smallest deployment unit for this version is 8xH20, using either Tensor Parallelism (TP) or Data Parallelism + Tensor Parallelism (DP+TP).

For our BF16 version, about 642 GB of memory is required. The smallest deployment unit for this version is 16xH20, using either Tensor Parallelism (TP) or Data Parallelism + Tensor Parallelism (DP+TP).
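As a rough sanity check on these figures, the required memory tracks the parameter count times bytes per parameter, plus overhead for the vision encoder, activations, and KV cache. A back-of-envelope sketch, not an exact accounting:

```python
# Back-of-envelope weight memory for a 321B-parameter model.
params = 321e9
print(f"BF16 weights: ~{params * 2 / 1e9:.0f} GB")  # 2 bytes/param -> ~642 GB
print(f"FP8 weights:  ~{params * 1 / 1e9:.0f} GB")  # 1 byte/param -> ~321 GB (reported ~326 GB incl. overhead)
```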
## Deployment Options

### vLLM Deployment

Please make sure to use a nightly version of vLLM built after this [PR](https://github.com/vllm-project/vllm/pull/21998) was merged. For details, please refer to the [vLLM nightly installation doc](https://docs.vllm.ai/en/latest/getting_started/installation/gpu.html#pre-built-wheels).
```bash
uv pip install -U vllm \
    --torch-backend=auto \
    --extra-index-url https://wheels.vllm.ai/nightly
```
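After installation, you can optionally confirm which build was picked up (a quick sanity check; any nightly build published after the PR merge should work):

```bash
python -c "import vllm; print(vllm.__version__)"
```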
We recommend using the following commands to deploy the model.

**`max_num_batched_tokens` should be larger than 4096. If not set, the default value is 8192.**

#### BF16 Model

##### Tensor Parallelism (Serving on 16xH20):
```bash
# start ray on node 0 and node 1

# node 0:
vllm serve /path/to/step3 \
    --tensor-parallel-size 16 \
    --reasoning-parser step3 \
    --enable-auto-tool-choice \
    --tool-call-parser step3 \
    --trust-remote-code \
    --max-num-batched-tokens 4096 \
    --port $PORT_SERVING
```
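The "start ray" comment above refers to joining the two H20 nodes into one Ray cluster before launching `vllm serve`. A minimal sketch, assuming node 0 is reachable from node 1 at `$NODE0_IP` and the default Ray port is free:

```bash
# node 0: start the Ray head
ray start --head --port=6379

# node 1: join the cluster started on node 0
ray start --address=$NODE0_IP:6379
```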
##### Data Parallelism + Tensor Parallelism (Serving on 16xH20):

Step3 has only a single KV head, so attention data parallelism can be adopted to reduce KV cache memory usage.
```bash
# start ray on node 0 and node 1

# node 0:
vllm serve /path/to/step3 \
    --data-parallel-size 16 \
    --tensor-parallel-size 1 \
    --reasoning-parser step3 \
    --enable-auto-tool-choice \
    --tool-call-parser step3 \
    --max-num-batched-tokens 4096 \
    --trust-remote-code
```
#### FP8 Model

##### Tensor Parallelism (Serving on 8xH20):
```bash
vllm serve /path/to/step3-fp8 \
    --tensor-parallel-size 8 \
    --reasoning-parser step3 \
    --enable-auto-tool-choice \
    --tool-call-parser step3 \
    --gpu-memory-utilization 0.85 \
    --max-num-batched-tokens 4096 \
    --trust-remote-code
```
##### Data Parallelism + Tensor Parallelism (Serving on 8xH20):
```bash
vllm serve /path/to/step3-fp8 \
    --data-parallel-size 8 \
    --tensor-parallel-size 1 \
    --reasoning-parser step3 \
    --enable-auto-tool-choice \
    --tool-call-parser step3 \
    --max-num-batched-tokens 4096 \
    --trust-remote-code
```
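Once any of the above servers is up, a quick way to confirm it is serving is to query the OpenAI-compatible model list (assuming the default port 8000, or `$PORT_SERVING` if you set `--port`):

```bash
curl http://localhost:8000/v1/models
```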
##### Key parameter notes:

* `reasoning-parser`: If enabled, reasoning content in the response will be parsed into a structured format.
* `tool-call-parser`: If enabled, tool call content in the response will be parsed into a structured format (see the sketch below for how to read the parsed fields).
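The sketch below shows how those parsed fields typically surface through the OpenAI client when serving with vLLM. The exact field name for the parsed reasoning output (`reasoning_content` here) is an assumption and may vary across server versions:

```python
# Minimal sketch: reading parsed reasoning and tool-call fields from a response
# of the server started above (localhost:8000 assumed).
from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

response = client.chat.completions.create(
    model="step3",
    messages=[{"role": "user", "content": "What is 17 * 24?"}],
)

message = response.choices[0].message
print("Answer:", message.content)
# Populated when --reasoning-parser is enabled (field name may differ by version).
print("Reasoning:", getattr(message, "reasoning_content", None))
# Populated when the model emits a tool call and --tool-call-parser is enabled.
print("Tool calls:", message.tool_calls)
```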
### SGLang Deployment

SGLang 0.4.10 or later is required:

```bash
pip3 install "sglang[all]>=0.4.10"
```
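You can optionally verify that the installed version meets the requirement (a quick check, assuming the package exposes `__version__` as usual):

```bash
python3 -c "import sglang; print(sglang.__version__)"
```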
#### BF16 Model

##### Tensor Parallelism (Serving on 16xH20):
```bash
# node 0
python -m sglang.launch_server \
    --model-path stepfun-ai/step3 \
    --dist-init-addr master_ip:5000 \
    --trust-remote-code \
    --tool-call-parser step3 \
    --reasoning-parser step3 \
    --tp 16 \
    --nnodes 2 \
    --node-rank 0

# node 1
python -m sglang.launch_server \
    --model-path stepfun-ai/step3 \
    --dist-init-addr master_ip:5000 \
    --trust-remote-code \
    --tool-call-parser step3 \
    --reasoning-parser step3 \
    --tp 16 \
    --nnodes 2 \
    --node-rank 1
```
#### FP8 Model

##### Tensor Parallelism (Serving on 8xH20):
```bash
python -m sglang.launch_server \
    --model-path /path/to/step3-fp8 \
    --trust-remote-code \
    --tool-call-parser step3 \
    --reasoning-parser step3 \
    --tp 8
```
### TensorRT-LLM Deployment

[Coming soon...]
## Client Request Examples

Once a server is running, you can use the chat API as below:
```python
from openai import OpenAI

# Set OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

chat_response = client.chat.completions.create(
    model="step3",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {"url": "https://xxxxx.png"},
                },
                {"type": "text", "text": "Please describe the image."},
            ],
        },
    ],
)
print("Chat response:", chat_response)
```
You can also upload base64-encoded local images:
```python
import base64
from openai import OpenAI

# Set OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

image_path = "/path/to/local/image.png"
with open(image_path, "rb") as f:
    encoded_image = base64.b64encode(f.read())
encoded_image_text = encoded_image.decode("utf-8")
base64_step = f"data:image/png;base64,{encoded_image_text}"

chat_response = client.chat.completions.create(
    model="step3",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {"url": base64_step},
                },
                {"type": "text", "text": "Please describe the image."},
            ],
        },
    ],
)
print("Chat response:", chat_response)
```
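The same chat request can also be sent directly over HTTP, e.g. with curl. A sketch assuming the default port 8000; replace the image URL with a reachable one:

```bash
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "step3",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": [
                {"type": "image_url", "image_url": {"url": "https://xxxxx.png"}},
                {"type": "text", "text": "Please describe the image."}
            ]}
        ]
    }'
```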
Note: Our image preprocessing pipeline implements a multi-patch mechanism to handle large images. If an input image exceeds 728x728 pixels, the system automatically applies cropping logic to split it into patches.
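If you want to anticipate whether a local image will be split into multiple patches before uploading it, here is a small hypothetical helper using Pillow; the 728x728 threshold comes from the note above:

```python
from PIL import Image

MAX_SIDE = 728  # threshold mentioned in the note above

def will_be_multi_patched(path: str) -> bool:
    """Return True if either side exceeds the threshold, so cropping into patches is expected."""
    with Image.open(path) as img:
        width, height = img.size
    return width > MAX_SIDE or height > MAX_SIDE

print(will_be_multi_patched("/path/to/local/image.png"))
```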