Lekr0 committed on
Commit 6268841 · verified · Parent(s): 5513247

Add files using upload-large-folder tool

This view is limited to 50 files because it contains too many changes.

Files changed (50)
  1. sglang/.github/workflows/open-pr-copy-from-oss.yml +28 -0
  2. sglang/.github/workflows/release-branch-cut.yml +213 -0
  3. sglang/.github/workflows/rerun-ut.yml +71 -0
  4. sglang/docs/_static/css/custom_log.css +29 -0
  5. sglang/docs/_static/css/readthedocs.css +9 -0
  6. sglang/docs/_static/image/logo.ico +0 -0
  7. sglang/docs/advanced_features/checkpoint_engine.md +254 -0
  8. sglang/docs/advanced_features/structured_outputs.ipynb +997 -0
  9. sglang/docs/advanced_features/tool_parser.ipynb +856 -0
  10. sglang/docs/advanced_features/vlm_query.ipynb +388 -0
  11. sglang/docs/basic_usage/deepseek_ocr.md +54 -0
  12. sglang/docs/basic_usage/deepseek_v32.md +459 -0
  13. sglang/docs/basic_usage/glm45.md +70 -0
  14. sglang/docs/basic_usage/glmv.md +136 -0
  15. sglang/docs/basic_usage/gpt_oss.md +147 -0
  16. sglang/docs/basic_usage/llama4.md +92 -0
  17. sglang/docs/basic_usage/minimax_m2.md +85 -0
  18. sglang/docs/basic_usage/native_api.ipynb +667 -0
  19. sglang/docs/basic_usage/offline_engine_api.ipynb +235 -0
  20. sglang/docs/basic_usage/ollama_api.md +91 -0
  21. sglang/docs/basic_usage/openai_api.rst +9 -0
  22. sglang/docs/basic_usage/openai_api_completions.ipynb +552 -0
  23. sglang/docs/basic_usage/openai_api_embeddings.ipynb +193 -0
  24. sglang/docs/basic_usage/openai_api_vision.ipynb +252 -0
  25. sglang/docs/basic_usage/popular_model_usage.rst +19 -0
  26. sglang/docs/basic_usage/qwen3.md +39 -0
  27. sglang/docs/basic_usage/qwen3_vl.md +130 -0
  28. sglang/docs/basic_usage/sampling_params.md +347 -0
  29. sglang/docs/basic_usage/send_request.ipynb +251 -0
  30. sglang/docs/developer_guide/bench_serving.md +355 -0
  31. sglang/docs/developer_guide/benchmark_and_profiling.md +467 -0
  32. sglang/docs/developer_guide/contribution_guide.md +147 -0
  33. sglang/docs/developer_guide/development_guide_using_docker.md +108 -0
  34. sglang/docs/developer_guide/development_jit_kernel_guide.md +259 -0
  35. sglang/docs/developer_guide/evaluating_new_models.md +146 -0
  36. sglang/docs/developer_guide/release_process.md +18 -0
  37. sglang/docs/developer_guide/setup_github_runner.md +51 -0
  38. sglang/docs/diffusion/api/cli.md +332 -0
  39. sglang/docs/diffusion/api/openai_api.md +420 -0
  40. sglang/docs/diffusion/ci_perf.md +29 -0
  41. sglang/docs/diffusion/compatibility_matrix.md +78 -0
  42. sglang/docs/diffusion/contributing.md +67 -0
  43. sglang/docs/diffusion/environment_variables.md +36 -0
  44. sglang/docs/diffusion/index.md +98 -0
  45. sglang/docs/diffusion/installation.md +95 -0
  46. sglang/docs/diffusion/performance/attention_backends.md +131 -0
  47. sglang/docs/diffusion/performance/cache/cache_dit.md +273 -0
  48. sglang/docs/diffusion/performance/cache/index.md +60 -0
  49. sglang/docs/diffusion/performance/cache/teacache.md +84 -0
  50. sglang/docs/diffusion/performance/index.md +72 -0
sglang/.github/workflows/open-pr-copy-from-oss.yml ADDED
@@ -0,0 +1,28 @@
+ name: Open A PR to Copy Code From OSS
+
+ on:
+   workflow_dispatch:
+   # schedule:
+   #   - cron: '0 10 * * *'
+
+ permissions:
+   contents: write
+
+ jobs:
+   copy:
+     runs-on: ubuntu-latest
+     steps:
+       - name: Checkout repository
+         uses: actions/checkout@v4
+         with:
+           ref: 'main'
+
+       - name: Install GitHub CLI (if not present)
+         run: |
+           bash scripts/code_sync/install_github_cli.sh
+
+       - name: Copy from OSS code
+         env:
+           GH_TOKEN: ${{ secrets.GH_PAT_FOR_OPEN_PR_TO_PRIVATE }}
+         run: |
+           python3 scripts/code_sync/copy_from_oss.py
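The "Install GitHub CLI (if not present)" step above delegates to `scripts/code_sync/install_github_cli.sh`, whose contents are not shown in this diff. A minimal sketch of the guard pattern that the step name implies (the real script may differ):

```shell
# Hypothetical install-if-missing guard; the actual
# scripts/code_sync/install_github_cli.sh is not part of this diff.
install_if_missing() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "$1 already installed"
  else
    echo "installing $1"   # a real script would download/apt-get here
  fi
}

install_if_missing sh   # prints "sh already installed" on any POSIX system
```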
sglang/.github/workflows/release-branch-cut.yml ADDED
@@ -0,0 +1,213 @@
+ name: Release Branch Cut
+
+ on:
+   workflow_dispatch:
+     inputs:
+       branch_name:
+         description: 'Branch name to create (e.g., release/v0.5.7)'
+         required: true
+         type: string
+       commit_sha:
+         description: 'Commit SHA from main to cut the release branch from (defaults to latest main)'
+         required: false
+         type: string
+         default: ''
+
+ permissions:
+   actions: write
+   contents: write
+   pull-requests: read
+
+ jobs:
+   cut-release-branch:
+     if: github.repository == 'sgl-project/sglang'
+     runs-on: ubuntu-latest
+     environment: 'prod'
+     outputs:
+       branch_name: ${{ steps.set_output.outputs.branch_name }}
+     steps:
+       - name: Checkout repository
+         uses: actions/checkout@v4
+         with:
+           ref: main
+           fetch-depth: 0
+           token: ${{ secrets.GITHUB_TOKEN }}
+
+       - name: Validate branch name
+         run: |
+           BRANCH_NAME="${{ github.event.inputs.branch_name }}"
+
+           if [ -z "$BRANCH_NAME" ]; then
+             echo "::error::Branch name is required"
+             exit 1
+           fi
+
+           # Validate branch name format (should start with release/)
+           if [[ ! "$BRANCH_NAME" =~ ^release/ ]]; then
+             echo "::warning::Branch name '$BRANCH_NAME' does not follow convention 'release/vX.Y.Z'"
+           fi
+
+           echo "Branch name: $BRANCH_NAME"
+
+       - name: Validate commit SHA
+         id: validate
+         run: |
+           COMMIT_SHA="${{ github.event.inputs.commit_sha }}"
+
+           # If no commit SHA provided, use latest main
+           if [ -z "$COMMIT_SHA" ]; then
+             COMMIT_SHA=$(git rev-parse HEAD)
+             echo "No commit SHA provided, using latest main: $COMMIT_SHA"
+           fi
+
+           # Verify the commit exists and is on main
+           if ! git cat-file -t "$COMMIT_SHA" > /dev/null 2>&1; then
+             echo "::error::Commit SHA '$COMMIT_SHA' does not exist"
+             exit 1
+           fi
+
+           # Check if commit is an ancestor of main (i.e., is on main branch)
+           if ! git merge-base --is-ancestor "$COMMIT_SHA" main; then
+             echo "::error::Commit SHA '$COMMIT_SHA' is not on the main branch"
+             exit 1
+           fi
+
+           echo "COMMIT_SHA=$COMMIT_SHA" >> $GITHUB_OUTPUT
+           echo "Validated commit SHA: $COMMIT_SHA"
+
+       - name: Check if branch already exists
+         run: |
+           BRANCH_NAME="${{ github.event.inputs.branch_name }}"
+
+           if git ls-remote --heads origin "$BRANCH_NAME" | grep -q "$BRANCH_NAME"; then
+             echo "::error::Branch '$BRANCH_NAME' already exists"
+             exit 1
+           fi
+
+           echo "Branch '$BRANCH_NAME' does not exist, proceeding with creation"
+
+       - name: Create release branch
+         id: set_output
+         run: |
+           COMMIT_SHA="${{ steps.validate.outputs.COMMIT_SHA }}"
+           BRANCH_NAME="${{ github.event.inputs.branch_name }}"
+
+           git config user.name "sglang-bot"
+           git config user.email "sglang-bot@users.noreply.github.com"
+
+           # Create branch from the specified commit
+           git checkout -b "$BRANCH_NAME" "$COMMIT_SHA"
+
+           echo "branch_name=$BRANCH_NAME" >> $GITHUB_OUTPUT
+           echo "Successfully created branch '$BRANCH_NAME' from commit '$COMMIT_SHA'"
+
+       - name: Update version references in documentation
+         run: |
+           BRANCH_NAME="${{ github.event.inputs.branch_name }}"
+           # Extract version from branch name (e.g., release/v0.5.8 -> v0.5.8)
+           VERSION=$(echo "$BRANCH_NAME" | sed 's/release\///')
+
+           # Update git clone version references in docs
+           sed -i "s/git clone -b v[0-9]\+\.[0-9]\+\.[0-9]\+\.\?post\?[0-9]*/git clone -b $VERSION/" docs/get_started/install.md
+           sed -i "s/git clone -b v[0-9]\+\.[0-9]\+\.[0-9]\+\.\?post\?[0-9]*/git clone -b $VERSION/" docs/platforms/amd_gpu.md
+
+           # Check if any changes were made
+           if git diff --quiet; then
+             echo "No version references needed updating"
+           else
+             git add docs/get_started/install.md docs/platforms/amd_gpu.md
+             git commit -m "docs: update version references to $VERSION"
+             echo "Updated version references to $VERSION"
+           fi
+
+       - name: Push release branch
+         run: |
+           BRANCH_NAME="${{ steps.set_output.outputs.branch_name }}"
+           git push origin "$BRANCH_NAME"
+           echo "Successfully pushed branch '$BRANCH_NAME'"
+
+       - name: Summary
+         run: |
+           COMMIT_SHA="${{ steps.validate.outputs.COMMIT_SHA }}"
+           BRANCH_NAME="${{ github.event.inputs.branch_name }}"
+
+           echo "## Release Branch Cut Summary" >> $GITHUB_STEP_SUMMARY
+           echo "" >> $GITHUB_STEP_SUMMARY
+           echo "| Property | Value |" >> $GITHUB_STEP_SUMMARY
+           echo "|----------|-------|" >> $GITHUB_STEP_SUMMARY
+           echo "| Branch | \`$BRANCH_NAME\` |" >> $GITHUB_STEP_SUMMARY
+           echo "| Commit | \`$COMMIT_SHA\` |" >> $GITHUB_STEP_SUMMARY
+           echo "| Triggered by | @${{ github.actor }} |" >> $GITHUB_STEP_SUMMARY
+           echo "" >> $GITHUB_STEP_SUMMARY
+           echo "### Next Steps" >> $GITHUB_STEP_SUMMARY
+           echo "1. Tests are automatically triggered on the release branch" >> $GITHUB_STEP_SUMMARY
+           echo "2. Apply any hotfixes if needed" >> $GITHUB_STEP_SUMMARY
+           echo "3. Create a tag to trigger release: \`gh workflow run release-tag.yml -f version=X.Y.Z -f ref=$BRANCH_NAME\`" >> $GITHUB_STEP_SUMMARY
+
+   run-pr-tests-nvidia:
+     needs: cut-release-branch
+     uses: ./.github/workflows/pr-test.yml
+     with:
+       ref: ${{ needs.cut-release-branch.outputs.branch_name }}
+       run_all_tests: true
+     secrets: inherit
+
+   run-pr-tests-amd:
+     needs: cut-release-branch
+     uses: ./.github/workflows/pr-test-amd.yml
+     with:
+       ref: ${{ needs.cut-release-branch.outputs.branch_name }}
+       run_all_tests: true
+     secrets: inherit
+
+   run-pr-test-npu:
+     needs: cut-release-branch
+     uses: ./.github/workflows/pr-test-npu.yml
+     with:
+       ref: ${{ needs.cut-release-branch.outputs.branch_name }}
+       run_all_tests: true
+     secrets: inherit
+
+   run-pr-tests-xeon:
+     needs: cut-release-branch
+     uses: ./.github/workflows/pr-test-xeon.yml
+     with:
+       ref: ${{ needs.cut-release-branch.outputs.branch_name }}
+       run_all_tests: true
+     secrets: inherit
+
+   run-pr-tests-xpu:
+     needs: cut-release-branch
+     uses: ./.github/workflows/pr-test-xpu.yml
+     with:
+       ref: ${{ needs.cut-release-branch.outputs.branch_name }}
+       run_all_tests: true
+     secrets: inherit
+
+   run-nightly-tests-nvidia:
+     needs: cut-release-branch
+     uses: ./.github/workflows/nightly-test-nvidia.yml
+     with:
+       ref: ${{ needs.cut-release-branch.outputs.branch_name }}
+     secrets: inherit
+
+   run-nightly-tests-amd:
+     needs: cut-release-branch
+     uses: ./.github/workflows/nightly-test-amd.yml
+     with:
+       ref: ${{ needs.cut-release-branch.outputs.branch_name }}
+     secrets: inherit
+
+   run-nightly-tests-npu:
+     needs: cut-release-branch
+     uses: ./.github/workflows/nightly-test-npu.yml
+     with:
+       ref: ${{ needs.cut-release-branch.outputs.branch_name }}
+     secrets: inherit
+
+   run-nightly-tests-intel:
+     needs: cut-release-branch
+     uses: ./.github/workflows/nightly-test-intel.yml
+     with:
+       ref: ${{ needs.cut-release-branch.outputs.branch_name }}
+     secrets: inherit
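The "Update version references in documentation" step in this workflow rewrites pinned `git clone -b vX.Y.Z` lines in the docs via `sed`. A quick local sanity check of that same substitution pattern (GNU sed assumed; `VERSION` hard-coded to `v0.5.7` for illustration):

```shell
# Run one sample doc line through the workflow's substitution, with the
# replacement version fixed to v0.5.7 instead of being derived from the branch.
echo "git clone -b v0.5.6.post1 https://github.com/sgl-project/sglang.git" \
  | sed "s/git clone -b v[0-9]\+\.[0-9]\+\.[0-9]\+\.\?post\?[0-9]*/git clone -b v0.5.7/"
# prints: git clone -b v0.5.7 https://github.com/sgl-project/sglang.git
```

Note the pattern also swallows optional `.postN` suffixes, so post-release pins like `v0.5.6.post1` are bumped cleanly.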
sglang/.github/workflows/rerun-ut.yml ADDED
@@ -0,0 +1,71 @@
+ name: Rerun UT
+ run-name: ${{ inputs.pr_head_sha && format('[rerun-ut] {0}', inputs.pr_head_sha) || '[rerun-ut]' }}
+
+ on:
+   workflow_dispatch:
+     inputs:
+       test_command:
+         description: "Test command to run (e.g. 'registered/core/test_srt_endpoint.py TestSRTEndpoint.test_simple_decode')"
+         required: true
+         type: string
+       runner_label:
+         description: "Runner label (e.g. '1-gpu-runner', '1-gpu-5090', '4-gpu-h100')"
+         required: true
+         type: string
+       pr_head_sha:
+         description: "PR head SHA to checkout (for /rerun-ut on fork PRs)"
+         required: false
+         type: string
+         default: ""
+       use_deepep:
+         description: "Use ci_install_deepep.sh instead of ci_install_dependency.sh"
+         required: false
+         type: string
+         default: "false"
+
+ env:
+   SGLANG_IS_IN_CI: true
+   SGLANG_CUDA_COREDUMP: "1"
+   SGLANG_JIT_DEEPGEMM_FAST_WARMUP: true
+
+ permissions:
+   actions: write
+   contents: read
+
+ jobs:
+   rerun-ut-cuda:
+     runs-on: ${{ inputs.runner_label }}
+     timeout-minutes: 120
+     env:
+       RUNNER_LABELS: ${{ inputs.runner_label }}
+       IS_BLACKWELL: ${{ (inputs.runner_label == '1-gpu-5090' || contains(inputs.runner_label, 'b200')) && '1' || '' }}
+       SGLANG_CI_RDMA_ALL_DEVICES: ${{ inputs.runner_label == '8-gpu-h20' && 'mlx5_1,mlx5_2,mlx5_3,mlx5_4' || '' }}
+     steps:
+       - name: Checkout code
+         uses: actions/checkout@v4
+         with:
+           ref: ${{ inputs.pr_head_sha || github.sha }}
+
+       - name: Install dependencies
+         timeout-minutes: 20
+         run: |
+           if [[ "${{ inputs.runner_label }}" == "1-gpu-5090" ]]; then
+             source /etc/profile.d/sglang-ci.sh
+           fi
+           if [[ "${{ inputs.use_deepep }}" == "true" ]]; then
+             bash scripts/ci/cuda/ci_install_deepep.sh
+           else
+             bash scripts/ci/cuda/ci_install_dependency.sh
+           fi
+
+       - name: Run test
+         timeout-minutes: 60
+         run: |
+           if [[ "${{ inputs.runner_label }}" == "1-gpu-5090" ]]; then
+             source /etc/profile.d/sglang-ci.sh
+           fi
+           cd test/
+           python3 ${{ inputs.test_command }}
+
+       - uses: ./.github/actions/upload-cuda-coredumps
+         if: always()
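The `IS_BLACKWELL` env var above is set by a GitHub Actions expression: it is `'1'` when the runner label is `1-gpu-5090` or contains `b200`, and empty otherwise. A bash mirror of that gating logic, shown only to illustrate how labels map to the flag (in CI it is evaluated by the Actions expression engine, not bash):

```shell
# Mirrors the workflow's IS_BLACKWELL expression in plain bash.
is_blackwell() {
  local label="$1"
  if [[ "$label" == "1-gpu-5090" || "$label" == *b200* ]]; then
    echo "1"
  fi
}

is_blackwell "1-gpu-5090"   # prints 1
is_blackwell "4-gpu-h100"   # prints nothing
```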
sglang/docs/_static/css/custom_log.css ADDED
@@ -0,0 +1,29 @@
+ .output_area {
+   color: #615656;
+ }
+
+ table.autosummary td {
+   width: 50%
+ }
+
+ img.align-center {
+   display: block;
+   margin-left: auto;
+   margin-right: auto;
+ }
+
+ .output_area.stderr {
+   color: #d3d3d3 !important;
+ }
+
+ .output_area.stdout {
+   color: #d3d3d3 !important;
+ }
+
+ div.output_area.stderr {
+   color: #d3d3d3 !important;
+ }
+
+ div.output_area.stdout {
+   color: #d3d3d3 !important;
+ }
sglang/docs/_static/css/readthedocs.css ADDED
@@ -0,0 +1,9 @@
+ table.autosummary td {
+   width: 50%
+ }
+
+ img.align-center {
+   display: block;
+   margin-left: auto;
+   margin-right: auto;
+ }
sglang/docs/_static/image/logo.ico ADDED
sglang/docs/advanced_features/checkpoint_engine.md ADDED
@@ -0,0 +1,254 @@
+ # Checkpoint Engine Integration
+
+ The SGLang checkpoint engine integration provides an efficient way to load model weights using a distributed checkpoint loading system. This feature significantly reduces model loading time, especially for large models and multi-node setups, by parallelizing the weight loading process across multiple processes and nodes.
+
+ ## Overview
+
+ The checkpoint engine integration allows SGLang to:
+ - Load model weights in parallel using multiple processes
+ - Distribute weight loading across multiple nodes to increase effective disk bandwidth
+ - Overlap weight loading with other initialization tasks like CUDA graph capture
+ - Support both single-node and multi-node deployments
+
+ ## Installation
+
+ First, install the checkpoint engine package:
+
+ ```bash
+ pip install 'checkpoint-engine[p2p]'
+ ```
+
+ ## Architecture
+
+ The system consists of two main components:
+
+ 1. **SGLang Server**: Runs with the `--wait-for-initial-weights` flag to wait for weights before becoming ready
+ 2. **Checkpoint Engine Workers**: Separate processes (managed by torchrun) that load and distribute model weights
+
+ The checkpoint engine uses a parameter server architecture with support for:
+ - **Broadcast mode**: Weights are broadcast from loading processes to inference processes
+ - **P2P mode**: Direct peer-to-peer weight transfer between processes
+ - **All mode**: Combination of both broadcast and P2P methods
+
+ ## Usage Examples
+
+ ### Single Node Setup
+
+ **Terminal 1 - Launch SGLang Server:**
+ ```bash
+ python -m sglang.launch_server \
+     --model-path Qwen/Qwen3-8B \
+     --tp 8 \
+     --load-format dummy \
+     --wait-for-initial-weights
+ ```
+
+ **Terminal 2 - Run Checkpoint Engine:**
+
+ Using the sglang entrypoint:
+ ```bash
+ python -m sglang.srt.checkpoint_engine.update \
+     --update-method broadcast \
+     --checkpoint-path /path/to/Qwen/Qwen3-8B/ \
+     --inference-parallel-size 8
+ ```
+
+ Using torchrun directly:
+ ```bash
+ torchrun --nproc-per-node 8 \
+     examples/checkpoint_engine/update.py \
+     --update-method broadcast \
+     --checkpoint-path /path/to/Qwen/Qwen3-8B/ \
+     --inference-parallel-size 8
+ ```
+
+ ### Multi-Node Setup (2 Nodes)
+
+ **Node 0:**
+
+ Launch SGLang server:
+ ```bash
+ python -m sglang.launch_server \
+     --model-path Qwen/Qwen3-8B \
+     --tp 8 \
+     --load-format dummy \
+     --wait-for-initial-weights \
+     --host [IP]
+ ```
+
+ Run checkpoint engine:
+
+ Using the sglang entrypoint (recommended):
+ ```bash
+ python -m sglang.srt.checkpoint_engine.update \
+     --update-method broadcast \
+     --checkpoint-path /path/to/Qwen/Qwen3-8B/ \
+     --inference-parallel-size 8
+ ```
+
+ Using torchrun directly:
+ ```bash
+ torchrun --nproc-per-node 8 \
+     --nnodes 2 \
+     --node-rank 0 \
+     --master-addr [IP] \
+     --master-port 29500 \
+     examples/checkpoint_engine/update.py \
+     --update-method broadcast \
+     --checkpoint-path /path/to/Qwen/Qwen3-8B/ \
+     --inference-parallel-size 8
+ ```
+
+ **Node 1:**
+
+ Launch SGLang server:
+ ```bash
+ python -m sglang.launch_server \
+     --model-path Qwen/Qwen3-8B \
+     --tp 8 \
+     --load-format dummy \
+     --wait-for-initial-weights \
+     --host [IP]
+ ```
+
+ Run checkpoint engine:
+
+ Using the sglang entrypoint (recommended):
+ ```bash
+ python -m sglang.srt.checkpoint_engine.update \
+     --update-method broadcast \
+     --checkpoint-path /path/to/Qwen/Qwen3-8B/ \
+     --inference-parallel-size 8
+ ```
+
+ Using torchrun directly:
+ ```bash
+ torchrun --nproc-per-node 8 \
+     --nnodes 2 \
+     --node-rank 1 \
+     --master-addr [IP] \
+     --master-port 29500 \
+     examples/checkpoint_engine/update.py \
+     --update-method broadcast \
+     --checkpoint-path /path/to/Qwen/Qwen3-8B/ \
+     --inference-parallel-size 8
+ ```
+
+ ### Multi-Node Setup with Tensor Parallelism (TP=16)
+
+ **Node 0:**
+
+ Launch SGLang server:
+ ```bash
+ python -m sglang.launch_server \
+     --model-path Qwen/Qwen3-8B \
+     --tp 8 \
+     --load-format dummy \
+     --wait-for-initial-weights \
+     --host [IP] \
+     --dist-init-addr [IP]:9120 \
+     --nnodes 2 \
+     --node-rank 0
+ ```
+
+ Run checkpoint engine:
+
+ Using the sglang entrypoint (recommended):
+ ```bash
+ python -m sglang.srt.checkpoint_engine.update \
+     --update-method broadcast \
+     --checkpoint-path /path/to/Qwen/Qwen3-8B/ \
+     --inference-parallel-size 16
+ ```
+
+ Using torchrun directly:
+ ```bash
+ torchrun --nproc-per-node 8 \
+     --nnodes 2 \
+     --node-rank 0 \
+     --master-addr [IP] \
+     --master-port 29500 \
+     examples/checkpoint_engine/update.py \
+     --update-method broadcast \
+     --checkpoint-path /path/to/Qwen/Qwen3-8B/ \
+     --inference-parallel-size 16
+ ```
+
+ **Node 1:**
+
+ Launch SGLang server:
+ ```bash
+ python -m sglang.launch_server \
+     --model-path Qwen/Qwen3-8B \
+     --tp 8 \
+     --load-format dummy \
+     --wait-for-initial-weights \
+     --host [IP] \
+     --dist-init-addr [IP]:9120 \
+     --nnodes 2 \
+     --node-rank 1
+ ```
+
+ Run checkpoint engine:
+
+ Using the sglang entrypoint (recommended):
+ ```bash
+ python -m sglang.srt.checkpoint_engine.update \
+     --update-method broadcast \
+     --checkpoint-path /path/to/Qwen/Qwen3-8B/ \
+     --inference-parallel-size 16
+ ```
+
+ Using torchrun directly:
+ ```bash
+ torchrun --nproc-per-node 8 \
+     --nnodes 2 \
+     --node-rank 1 \
+     --master-addr [IP] \
+     --master-port 29500 \
+     examples/checkpoint_engine/update.py \
+     --update-method broadcast \
+     --checkpoint-path /path/to/Qwen/Qwen3-8B/ \
+     --inference-parallel-size 16
+ ```
+
+ ## Configuration Options
+
+ ### SGLang Server Options
+
+ - `--load-format dummy`: Use dummy format for initial loading (allows overlapping with other tasks)
+ - `--wait-for-initial-weights`: Wait for the checkpoint engine to provide weights before becoming ready
+ - `--host`: Host address for multi-node setups
+ - `--dist-init-addr`: Distributed initialization address for tensor parallelism
+
+ ### Checkpoint Engine Options
+
+ - `--update-method`: Weight update method (`broadcast`, `p2p`, or `all`)
+ - `--checkpoint-path`: Path to the model checkpoint directory
+ - `--inference-parallel-size`: Number of inference parallel processes
+ - `--endpoint`: SGLang server endpoint (default: `http://localhost:19730`)
+ - `--checkpoint-name`: Name for the checkpoint (default: `my-checkpoint-iter-0`)
+ - `--save-metas-file`: File to save checkpoint metadata to
+ - `--load-metas-file`: File to load checkpoint metadata from
+ - `--uds`: Unix domain socket path for communication
+ - `--weight-version`: Version identifier for weights
+
+ ## Performance Benefits
+
+ The checkpoint engine provides significant time savings in two main aspects:
+
+ 1. **Multi-node Loading**: Each node only loads a portion of the weights from disk, effectively increasing disk bandwidth. More participating nodes provide greater acceleration. Preliminary tests show a 20-second acceleration when loading DeepSeek-R1 on H20-3e with two nodes.
+
+ 2. **Single Process Optimization**: Using the dummy format allows overlapping the disk-to-CPU transfer with CUDA graph capture and other initialization tasks, providing additional time savings.
+
+ ## Troubleshooting
+
+ - Ensure the checkpoint engine package is installed: `pip install 'checkpoint-engine[p2p]'`
+ - Verify network connectivity between nodes in multi-node setups
+ - Check that the checkpoint path contains valid model files
+ - Monitor logs for connection errors between the SGLang server and the checkpoint engine
+ - Use the `--sleep-time` parameter to add delays if needed for debugging
+
+ ## References
+
+ - [Checkpoint Engine Repository](https://github.com/MoonshotAI/checkpoint-engine)
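In the multi-node examples above, the torchrun topology is fully determined by `--inference-parallel-size` and the per-node GPU count: `--nnodes` is their quotient, and each node passes its own `--node-rank`. A small hypothetical helper (not part of SGLang or checkpoint-engine) that derives those flags:

```python
# Hypothetical helper that computes the torchrun flags used in the doc's
# multi-node examples; torchrun_args and its signature are illustrative only.
def torchrun_args(inference_parallel_size, nproc_per_node, node_rank, master_addr, master_port=29500):
    if inference_parallel_size % nproc_per_node != 0:
        raise ValueError("inference_parallel_size must be a multiple of nproc_per_node")
    nnodes = inference_parallel_size // nproc_per_node
    return [
        "torchrun",
        "--nproc-per-node", str(nproc_per_node),
        "--nnodes", str(nnodes),
        "--node-rank", str(node_rank),
        "--master-addr", master_addr,
        "--master-port", str(master_port),
    ]

# TP=16 spread over 2 nodes of 8 GPUs, node 0 (cf. the last example above):
print(" ".join(torchrun_args(16, 8, 0, "10.0.0.1")))
# prints: torchrun --nproc-per-node 8 --nnodes 2 --node-rank 0 --master-addr 10.0.0.1 --master-port 29500
```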
sglang/docs/advanced_features/structured_outputs.ipynb ADDED
@@ -0,0 +1,997 @@
+ {
+   "cells": [
+     {
+       "cell_type": "markdown",
+       "metadata": {},
+       "source": [
+         "# Structured Outputs"
+       ]
+     },
+     {
+       "cell_type": "markdown",
+       "metadata": {},
+       "source": [
+         "You can specify a JSON schema, [regular expression](https://en.wikipedia.org/wiki/Regular_expression) or [EBNF](https://en.wikipedia.org/wiki/Extended_Backus%E2%80%93Naur_form) to constrain the model output. The model output will be guaranteed to follow the given constraints. Only one constraint parameter (`json_schema`, `regex`, or `ebnf`) can be specified for a request.\n",
+         "\n",
+         "SGLang supports three grammar backends:\n",
+         "\n",
+         "- [XGrammar](https://github.com/mlc-ai/xgrammar) (default): Supports JSON schema, regular expression, and EBNF constraints.\n",
+         "- [Outlines](https://github.com/dottxt-ai/outlines): Supports JSON schema and regular expression constraints.\n",
+         "- [Llguidance](https://github.com/guidance-ai/llguidance): Supports JSON schema, regular expression, and EBNF constraints.\n",
+         "\n",
+         "We suggest using XGrammar for its better performance and utility. XGrammar currently uses the [GGML BNF format](https://github.com/ggerganov/llama.cpp/blob/master/grammars/README.md). For more details, see the [XGrammar technical overview](https://blog.mlc.ai/2024/11/22/achieving-efficient-flexible-portable-structured-generation-with-xgrammar).\n",
+         "\n",
+         "To use Outlines, simply add `--grammar-backend outlines` when launching the server.\n",
+         "To use llguidance, add `--grammar-backend llguidance` when launching the server.\n",
+         "If no backend is specified, XGrammar will be used as the default.\n",
+         "\n",
+         "For better output quality, **it is advisable to explicitly include instructions in the prompt to guide the model to generate the desired format.** For example, you can specify: 'Please generate the output in the following JSON format: ...'.\n"
+       ]
+     },
+     {
+       "cell_type": "markdown",
+       "metadata": {},
+       "source": [
+         "## OpenAI Compatible API"
+       ]
+     },
+     {
+       "cell_type": "code",
+       "execution_count": null,
+       "metadata": {},
+       "outputs": [],
+       "source": [
+         "import openai\n",
+         "import os\n",
+         "\n",
+         "from sglang.test.doc_patch import launch_server_cmd\n",
+         "from sglang.utils import wait_for_server, print_highlight, terminate_process\n",
+         "\n",
+         "os.environ[\"TOKENIZERS_PARALLELISM\"] = \"false\"\n",
+         "\n",
+         "\n",
+         "server_process, port = launch_server_cmd(\n",
+         "    \"python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --host 0.0.0.0 --log-level warning\"\n",
+         ")\n",
+         "\n",
+         "wait_for_server(f\"http://localhost:{port}\", process=server_process)\n",
+         "client = openai.Client(base_url=f\"http://127.0.0.1:{port}/v1\", api_key=\"None\")"
+       ]
+     },
+     {
+       "cell_type": "markdown",
+       "metadata": {},
+       "source": [
+         "### JSON\n",
+         "\n",
+         "You can directly define a JSON schema or use [Pydantic](https://docs.pydantic.dev/latest/) to define and validate the response."
+       ]
+     },
+     {
+       "cell_type": "markdown",
+       "metadata": {},
+       "source": [
+         "**Using Pydantic**"
+       ]
+     },
+     {
+       "cell_type": "code",
+       "execution_count": null,
+       "metadata": {},
+       "outputs": [],
+       "source": [
+         "from pydantic import BaseModel, Field\n",
+         "\n",
+         "\n",
+         "# Define the schema using Pydantic\n",
+         "class CapitalInfo(BaseModel):\n",
+         "    name: str = Field(..., pattern=r\"^\\w+$\", description=\"Name of the capital city\")\n",
+         "    population: int = Field(..., description=\"Population of the capital city\")\n",
+         "\n",
+         "\n",
+         "response = client.chat.completions.create(\n",
+         "    model=\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n",
+         "    messages=[\n",
+         "        {\n",
+         "            \"role\": \"user\",\n",
+         "            \"content\": \"Please generate the information of the capital of France in the JSON format.\",\n",
+         "        },\n",
+         "    ],\n",
+         "    temperature=0,\n",
+         "    max_tokens=128,\n",
+         "    response_format={\n",
+         "        \"type\": \"json_schema\",\n",
+         "        \"json_schema\": {\n",
+         "            \"name\": \"foo\",\n",
+         "            # convert the pydantic model to json schema\n",
+         "            \"schema\": CapitalInfo.model_json_schema(),\n",
+         "        },\n",
+         "    },\n",
+         ")\n",
+         "\n",
+         "response_content = response.choices[0].message.content\n",
+         "# validate the JSON response by the pydantic model\n",
+         "capital_info = CapitalInfo.model_validate_json(response_content)\n",
+         "print_highlight(f\"Validated response: {capital_info.model_dump_json()}\")"
+       ]
+     },
+     {
+       "cell_type": "markdown",
+       "metadata": {},
+       "source": [
+         "**JSON Schema Directly**\n"
+       ]
+     },
+     {
+       "cell_type": "code",
+       "execution_count": null,
+       "metadata": {},
+       "outputs": [],
+       "source": [
+         "import json\n",
+         "\n",
+         "json_schema = json.dumps(\n",
+         "    {\n",
+         "        \"type\": \"object\",\n",
+         "        \"properties\": {\n",
+         "            \"name\": {\"type\": \"string\", \"pattern\": \"^[\\\\w]+$\"},\n",
+         "            \"population\": {\"type\": \"integer\"},\n",
+         "        },\n",
+         "        \"required\": [\"name\", \"population\"],\n",
+         "    }\n",
+         ")\n",
+         "\n",
+         "response = client.chat.completions.create(\n",
+         "    model=\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n",
+         "    messages=[\n",
+         "        {\n",
+         "            \"role\": \"user\",\n",
+         "            \"content\": \"Give me the information of the capital of France in the JSON format.\",\n",
+         "        },\n",
+         "    ],\n",
+         "    temperature=0,\n",
+         "    max_tokens=128,\n",
+         "    response_format={\n",
+         "        \"type\": \"json_schema\",\n",
+         "        \"json_schema\": {\"name\": \"foo\", \"schema\": json.loads(json_schema)},\n",
+         "    },\n",
+         ")\n",
+         "\n",
+         "print_highlight(response.choices[0].message.content)"
+       ]
+     },
+     {
+       "cell_type": "markdown",
+       "metadata": {},
+       "source": [
+         "### EBNF"
+       ]
+     },
+     {
+       "cell_type": "code",
+       "execution_count": null,
+       "metadata": {},
+       "outputs": [],
+       "source": [
+         "ebnf_grammar = \"\"\"\n",
+         "root ::= city | description\n",
+         "city ::= \"London\" | \"Paris\" | \"Berlin\" | \"Rome\"\n",
+         "description ::= city \" is \" status\n",
+         "status ::= \"the capital of \" country\n",
+         "country ::= \"England\" | \"France\" | \"Germany\" | \"Italy\"\n",
+         "\"\"\"\n",
+         "\n",
+         "response = client.chat.completions.create(\n",
+         "    model=\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n",
+         "    messages=[\n",
+         "        {\"role\": \"system\", \"content\": \"You are a helpful geography bot.\"},\n",
+         "        {\n",
+         "            \"role\": \"user\",\n",
+         "            \"content\": \"Give me the information of the capital of France.\",\n",
+         "        },\n",
+         "    ],\n",
+         "    temperature=0,\n",
+         "    max_tokens=32,\n",
+         "    extra_body={\"ebnf\": ebnf_grammar},\n",
+         ")\n",
+         "\n",
+         "print_highlight(response.choices[0].message.content)"
+       ]
+     },
+     {
+       "cell_type": "markdown",
+       "metadata": {},
+       "source": [
+         "### Regular expression"
+       ]
+     },
+     {
+       "cell_type": "code",
+       "execution_count": null,
+       "metadata": {},
+       "outputs": [],
+       "source": [
+         "response = client.chat.completions.create(\n",
+         "    model=\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n",
+         "    messages=[\n",
+         "        {\"role\": \"user\", \"content\": \"What is the capital of France?\"},\n",
+         "    ],\n",
+         "    temperature=0,\n",
+         "    max_tokens=128,\n",
+         "    extra_body={\"regex\": \"(Paris|London)\"},\n",
+         ")\n",
+         "\n",
+         "print_highlight(response.choices[0].message.content)"
+       ]
+     },
+     {
+       "cell_type": "markdown",
+       "metadata": {},
+       "source": [
+         "### Structural Tag"
+       ]
+     },
+     {
+       "cell_type": "code",
+       "execution_count": null,
+       "metadata": {},
+       "outputs": [],
239
+ "source": [
240
+ "tool_get_current_weather = {\n",
241
+ " \"type\": \"function\",\n",
242
+ " \"function\": {\n",
243
+ " \"name\": \"get_current_weather\",\n",
244
+ " \"description\": \"Get the current weather in a given location\",\n",
245
+ " \"parameters\": {\n",
246
+ " \"type\": \"object\",\n",
247
+ " \"properties\": {\n",
248
+ " \"city\": {\n",
249
+ " \"type\": \"string\",\n",
250
+ " \"description\": \"The city to find the weather for, e.g. 'San Francisco'\",\n",
251
+ " },\n",
252
+ " \"state\": {\n",
253
+ " \"type\": \"string\",\n",
254
+ "                    \"description\": \"The two-letter abbreviation for the state that the city is\"\n",
255
+ " \" in, e.g. 'CA' which would mean 'California'\",\n",
256
+ " },\n",
257
+ " \"unit\": {\n",
258
+ " \"type\": \"string\",\n",
259
+ " \"description\": \"The unit to fetch the temperature in\",\n",
260
+ " \"enum\": [\"celsius\", \"fahrenheit\"],\n",
261
+ " },\n",
262
+ " },\n",
263
+ " \"required\": [\"city\", \"state\", \"unit\"],\n",
264
+ " },\n",
265
+ " },\n",
266
+ "}\n",
267
+ "\n",
268
+ "tool_get_current_date = {\n",
269
+ " \"type\": \"function\",\n",
270
+ " \"function\": {\n",
271
+ " \"name\": \"get_current_date\",\n",
272
+ " \"description\": \"Get the current date and time for a given timezone\",\n",
273
+ " \"parameters\": {\n",
274
+ " \"type\": \"object\",\n",
275
+ " \"properties\": {\n",
276
+ " \"timezone\": {\n",
277
+ " \"type\": \"string\",\n",
278
+ " \"description\": \"The timezone to fetch the current date and time for, e.g. 'America/New_York'\",\n",
279
+ " }\n",
280
+ " },\n",
281
+ " \"required\": [\"timezone\"],\n",
282
+ " },\n",
283
+ " },\n",
284
+ "}\n",
285
+ "\n",
286
+ "schema_get_current_weather = tool_get_current_weather[\"function\"][\"parameters\"]\n",
287
+ "schema_get_current_date = tool_get_current_date[\"function\"][\"parameters\"]\n",
288
+ "\n",
289
+ "\n",
290
+ "def get_messages():\n",
291
+ " return [\n",
292
+ " {\n",
293
+ " \"role\": \"system\",\n",
294
+ " \"content\": f\"\"\"\n",
295
+ "# Tool Instructions\n",
296
+ "- Always execute python code in messages that you share.\n",
297
+ "- When looking for real-time information, use relevant functions if available; otherwise fall back to brave_search\n",
298
+ "You have access to the following functions:\n",
299
+ "Use the function 'get_current_weather' to: Get the current weather in a given location\n",
300
+ "{tool_get_current_weather[\"function\"]}\n",
301
+ "Use the function 'get_current_date' to: Get the current date and time for a given timezone\n",
302
+ "{tool_get_current_date[\"function\"]}\n",
303
+ "If you choose to call a function, ONLY reply in the following format:\n",
304
+ "<{{start_tag}}={{function_name}}>{{parameters}}{{end_tag}}\n",
305
+ "where\n",
306
+ "start_tag => `<function`\n",
307
+ "parameters => a JSON dict with the function argument name as key and function argument value as value.\n",
308
+ "end_tag => `</function>`\n",
309
+ "Here is an example:\n",
310
+ "<function=example_function_name>{{\"example_name\": \"example_value\"}}</function>\n",
311
+ "Reminder:\n",
312
+ "- Function calls MUST follow the specified format\n",
313
+ "- Required parameters MUST be specified\n",
314
+ "- Only call one function at a time\n",
315
+ "- Put the entire function call reply on one line\n",
316
+ "- Always add your sources when using search results to answer the user query\n",
317
+ "You are a helpful assistant.\"\"\",\n",
318
+ " },\n",
319
+ " {\n",
320
+ " \"role\": \"user\",\n",
321
+ " \"content\": \"You are in New York. Please get the current date and time, and the weather.\",\n",
322
+ " },\n",
323
+ " ]\n",
324
+ "\n",
325
+ "\n",
326
+ "messages = get_messages()\n",
327
+ "\n",
328
+ "response = client.chat.completions.create(\n",
329
+ " model=\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n",
330
+ " messages=messages,\n",
331
+ " response_format={\n",
332
+ " \"type\": \"structural_tag\",\n",
333
+ " \"structures\": [\n",
334
+ " {\n",
335
+ " \"begin\": \"<function=get_current_weather>\",\n",
336
+ " \"schema\": schema_get_current_weather,\n",
337
+ " \"end\": \"</function>\",\n",
338
+ " },\n",
339
+ " {\n",
340
+ " \"begin\": \"<function=get_current_date>\",\n",
341
+ " \"schema\": schema_get_current_date,\n",
342
+ " \"end\": \"</function>\",\n",
343
+ " },\n",
344
+ " ],\n",
345
+ " \"triggers\": [\"<function=\"],\n",
346
+ " },\n",
347
+ ")\n",
348
+ "\n",
349
+ "print_highlight(response.choices[0].message.content)"
350
+ ]
351
+ },
352
+ {
353
+ "cell_type": "code",
354
+ "execution_count": null,
355
+ "metadata": {},
356
+ "outputs": [],
357
+ "source": [
358
+ "# Support for XGrammar's latest structural tag format\n",
359
+ "# https://xgrammar.mlc.ai/docs/tutorials/structural_tag.html\n",
360
+ "\n",
361
+ "response = client.chat.completions.create(\n",
362
+ " model=\"meta-llama/Meta-Llama-3.1-8B-Instruct\",\n",
363
+ " messages=messages,\n",
364
+ " response_format={\n",
365
+ " \"type\": \"structural_tag\",\n",
366
+ " \"format\": {\n",
367
+ " \"type\": \"triggered_tags\",\n",
368
+ " \"triggers\": [\"<function=\"],\n",
369
+ " \"tags\": [\n",
370
+ " {\n",
371
+ " \"begin\": \"<function=get_current_weather>\",\n",
372
+ " \"content\": {\n",
373
+ " \"type\": \"json_schema\",\n",
374
+ " \"json_schema\": schema_get_current_weather,\n",
375
+ " },\n",
376
+ " \"end\": \"</function>\",\n",
377
+ " },\n",
378
+ " {\n",
379
+ " \"begin\": \"<function=get_current_date>\",\n",
380
+ " \"content\": {\n",
381
+ " \"type\": \"json_schema\",\n",
382
+ " \"json_schema\": schema_get_current_date,\n",
383
+ " },\n",
384
+ " \"end\": \"</function>\",\n",
385
+ " },\n",
386
+ " ],\n",
387
+ " \"at_least_one\": False,\n",
388
+ " \"stop_after_first\": False,\n",
389
+ " },\n",
390
+ " },\n",
391
+ ")\n",
392
+ "\n",
393
+ "print_highlight(response.choices[0].message.content)"
394
+ ]
395
+ },
396
+ {
397
+ "cell_type": "markdown",
398
+ "metadata": {},
399
+ "source": [
400
+ "## Native API and SGLang Runtime (SRT)"
401
+ ]
402
+ },
403
+ {
404
+ "cell_type": "markdown",
405
+ "metadata": {},
406
+ "source": [
407
+ "### JSON"
408
+ ]
409
+ },
410
+ {
411
+ "cell_type": "markdown",
412
+ "metadata": {},
413
+ "source": [
414
+ "**Using Pydantic**"
415
+ ]
416
+ },
417
+ {
418
+ "cell_type": "code",
419
+ "execution_count": null,
420
+ "metadata": {},
421
+ "outputs": [],
422
+ "source": [
423
+ "import requests\n",
424
+ "import json\n",
425
+ "from pydantic import BaseModel, Field\n",
426
+ "\n",
427
+ "from transformers import AutoTokenizer\n",
428
+ "\n",
429
+ "tokenizer = AutoTokenizer.from_pretrained(\"meta-llama/Meta-Llama-3.1-8B-Instruct\")\n",
430
+ "\n",
431
+ "\n",
432
+ "# Define the schema using Pydantic\n",
433
+ "class CapitalInfo(BaseModel):\n",
434
+ " name: str = Field(..., pattern=r\"^\\w+$\", description=\"Name of the capital city\")\n",
435
+ " population: int = Field(..., description=\"Population of the capital city\")\n",
436
+ "\n",
437
+ "\n",
438
+ "# Make API request\n",
439
+ "messages = [\n",
440
+ " {\n",
441
+ " \"role\": \"user\",\n",
442
+ " \"content\": \"Here is the information of the capital of France in the JSON format.\\n\",\n",
443
+ " }\n",
444
+ "]\n",
445
+ "text = tokenizer.apply_chat_template(\n",
446
+ " messages, tokenize=False, add_generation_prompt=True, return_dict=False\n",
447
+ ")\n",
448
+ "response = requests.post(\n",
449
+ " f\"http://localhost:{port}/generate\",\n",
450
+ " json={\n",
451
+ " \"text\": text,\n",
452
+ " \"sampling_params\": {\n",
453
+ " \"temperature\": 0,\n",
454
+ " \"max_new_tokens\": 64,\n",
455
+ " \"json_schema\": json.dumps(CapitalInfo.model_json_schema()),\n",
456
+ " },\n",
457
+ " },\n",
458
+ ")\n",
459
+ "print_highlight(response.json())\n",
460
+ "\n",
461
+ "\n",
462
+ "response_data = json.loads(response.json()[\"text\"])\n",
463
+ "# validate the response by the pydantic model\n",
464
+ "capital_info = CapitalInfo.model_validate(response_data)\n",
465
+ "print_highlight(f\"Validated response: {capital_info.model_dump_json()}\")"
466
+ ]
467
+ },
468
+ {
469
+ "cell_type": "markdown",
470
+ "metadata": {},
471
+ "source": [
472
+ "**JSON Schema Directly**"
473
+ ]
474
+ },
475
+ {
476
+ "cell_type": "code",
477
+ "execution_count": null,
478
+ "metadata": {},
479
+ "outputs": [],
480
+ "source": [
481
+ "json_schema = json.dumps(\n",
482
+ " {\n",
483
+ " \"type\": \"object\",\n",
484
+ " \"properties\": {\n",
485
+ " \"name\": {\"type\": \"string\", \"pattern\": \"^[\\\\w]+$\"},\n",
486
+ " \"population\": {\"type\": \"integer\"},\n",
487
+ " },\n",
488
+ " \"required\": [\"name\", \"population\"],\n",
489
+ " }\n",
490
+ ")\n",
491
+ "\n",
492
+ "# JSON\n",
493
+ "response = requests.post(\n",
494
+ " f\"http://localhost:{port}/generate\",\n",
495
+ " json={\n",
496
+ " \"text\": text,\n",
497
+ " \"sampling_params\": {\n",
498
+ " \"temperature\": 0,\n",
499
+ " \"max_new_tokens\": 64,\n",
500
+ " \"json_schema\": json_schema,\n",
501
+ " },\n",
502
+ " },\n",
503
+ ")\n",
504
+ "\n",
505
+ "print_highlight(response.json())"
506
+ ]
507
+ },
508
+ {
509
+ "cell_type": "markdown",
510
+ "metadata": {},
511
+ "source": [
512
+ "### EBNF"
513
+ ]
514
+ },
515
+ {
516
+ "cell_type": "code",
517
+ "execution_count": null,
518
+ "metadata": {},
519
+ "outputs": [],
520
+ "source": [
521
+ "messages = [\n",
522
+ " {\n",
523
+ " \"role\": \"user\",\n",
524
+ " \"content\": \"Give me the information of the capital of France.\",\n",
525
+ " }\n",
526
+ "]\n",
527
+ "text = tokenizer.apply_chat_template(\n",
528
+ " messages, tokenize=False, add_generation_prompt=True, return_dict=False\n",
529
+ ")\n",
530
+ "response = requests.post(\n",
531
+ " f\"http://localhost:{port}/generate\",\n",
532
+ " json={\n",
533
+ " \"text\": text,\n",
534
+ " \"sampling_params\": {\n",
535
+ " \"max_new_tokens\": 128,\n",
536
+ " \"temperature\": 0,\n",
537
+ " \"n\": 3,\n",
538
+ " \"ebnf\": (\n",
539
+ " \"root ::= city | description\\n\"\n",
540
+ " 'city ::= \"London\" | \"Paris\" | \"Berlin\" | \"Rome\"\\n'\n",
541
+ " 'description ::= city \" is \" status\\n'\n",
542
+ " 'status ::= \"the capital of \" country\\n'\n",
543
+ " 'country ::= \"England\" | \"France\" | \"Germany\" | \"Italy\"'\n",
544
+ " ),\n",
545
+ " },\n",
546
+ " \"stream\": False,\n",
547
+ " \"return_logprob\": False,\n",
548
+ " },\n",
549
+ ")\n",
550
+ "\n",
551
+ "print_highlight(response.json())"
552
+ ]
553
+ },
554
+ {
555
+ "cell_type": "markdown",
556
+ "metadata": {},
557
+ "source": [
558
+ "### Regular expression"
559
+ ]
560
+ },
561
+ {
562
+ "cell_type": "code",
563
+ "execution_count": null,
564
+ "metadata": {},
565
+ "outputs": [],
566
+ "source": [
567
+ "messages = [\n",
568
+ " {\n",
569
+ " \"role\": \"user\",\n",
570
+ " \"content\": \"Paris is the capital of\",\n",
571
+ " }\n",
572
+ "]\n",
573
+ "text = tokenizer.apply_chat_template(\n",
574
+ " messages, tokenize=False, add_generation_prompt=True, return_dict=False\n",
575
+ ")\n",
576
+ "response = requests.post(\n",
577
+ " f\"http://localhost:{port}/generate\",\n",
578
+ " json={\n",
579
+ " \"text\": text,\n",
580
+ " \"sampling_params\": {\n",
581
+ " \"temperature\": 0,\n",
582
+ " \"max_new_tokens\": 64,\n",
583
+ " \"regex\": \"(France|England)\",\n",
584
+ " },\n",
585
+ " },\n",
586
+ ")\n",
587
+ "print_highlight(response.json())"
588
+ ]
589
+ },
590
+ {
591
+ "cell_type": "markdown",
592
+ "metadata": {},
593
+ "source": [
594
+ "### Structural Tag"
595
+ ]
596
+ },
597
+ {
598
+ "cell_type": "code",
599
+ "execution_count": null,
600
+ "metadata": {},
601
+ "outputs": [],
602
+ "source": [
603
+ "from transformers import AutoTokenizer\n",
604
+ "\n",
605
+ "# build the prompt using the chat template\n",
606
+ "tokenizer = AutoTokenizer.from_pretrained(\"meta-llama/Meta-Llama-3.1-8B-Instruct\")\n",
607
+ "\n",
608
+ "text = tokenizer.apply_chat_template(\n",
609
+ " messages, tokenize=False, add_generation_prompt=True, return_dict=False\n",
610
+ ")\n",
611
+ "payload = {\n",
612
+ " \"text\": text,\n",
613
+ " \"sampling_params\": {\n",
614
+ " \"structural_tag\": json.dumps(\n",
615
+ " {\n",
616
+ " \"type\": \"structural_tag\",\n",
617
+ " \"structures\": [\n",
618
+ " {\n",
619
+ " \"begin\": \"<function=get_current_weather>\",\n",
620
+ " \"schema\": schema_get_current_weather,\n",
621
+ " \"end\": \"</function>\",\n",
622
+ " },\n",
623
+ " {\n",
624
+ " \"begin\": \"<function=get_current_date>\",\n",
625
+ " \"schema\": schema_get_current_date,\n",
626
+ " \"end\": \"</function>\",\n",
627
+ " },\n",
628
+ " ],\n",
629
+ " \"triggers\": [\"<function=\"],\n",
630
+ " }\n",
631
+ " )\n",
632
+ " },\n",
633
+ "}\n",
634
+ "\n",
635
+ "\n",
636
+ "# Send POST request to the API endpoint\n",
637
+ "response = requests.post(f\"http://localhost:{port}/generate\", json=payload)\n",
638
+ "print_highlight(response.json())"
639
+ ]
640
+ },
641
+ {
642
+ "cell_type": "code",
643
+ "execution_count": null,
644
+ "metadata": {},
645
+ "outputs": [],
646
+ "source": [
647
+ "# Support for XGrammar's latest structural tag format\n",
648
+ "# https://xgrammar.mlc.ai/docs/tutorials/structural_tag.html\n",
649
+ "\n",
650
+ "payload = {\n",
651
+ " \"text\": text,\n",
652
+ " \"sampling_params\": {\n",
653
+ " \"structural_tag\": json.dumps(\n",
654
+ " {\n",
655
+ " \"type\": \"structural_tag\",\n",
656
+ " \"format\": {\n",
657
+ " \"type\": \"triggered_tags\",\n",
658
+ " \"triggers\": [\"<function=\"],\n",
659
+ " \"tags\": [\n",
660
+ " {\n",
661
+ " \"begin\": \"<function=get_current_weather>\",\n",
662
+ " \"content\": {\n",
663
+ " \"type\": \"json_schema\",\n",
664
+ " \"json_schema\": schema_get_current_weather,\n",
665
+ " },\n",
666
+ " \"end\": \"</function>\",\n",
667
+ " },\n",
668
+ " {\n",
669
+ " \"begin\": \"<function=get_current_date>\",\n",
670
+ " \"content\": {\n",
671
+ " \"type\": \"json_schema\",\n",
672
+ " \"json_schema\": schema_get_current_date,\n",
673
+ " },\n",
674
+ " \"end\": \"</function>\",\n",
675
+ " },\n",
676
+ " ],\n",
677
+ " \"at_least_one\": False,\n",
678
+ " \"stop_after_first\": False,\n",
679
+ " },\n",
680
+ " }\n",
681
+ " )\n",
682
+ " },\n",
683
+ "}\n",
684
+ "\n",
685
+ "\n",
686
+ "# Send POST request to the API endpoint\n",
687
+ "response = requests.post(f\"http://localhost:{port}/generate\", json=payload)\n",
688
+ "print_highlight(response.json())"
689
+ ]
690
+ },
691
+ {
692
+ "cell_type": "code",
693
+ "execution_count": null,
694
+ "metadata": {},
695
+ "outputs": [],
696
+ "source": [
697
+ "terminate_process(server_process)"
698
+ ]
699
+ },
700
+ {
701
+ "cell_type": "markdown",
702
+ "metadata": {},
703
+ "source": [
704
+ "## Offline Engine API"
705
+ ]
706
+ },
707
+ {
708
+ "cell_type": "code",
709
+ "execution_count": null,
710
+ "metadata": {},
711
+ "outputs": [],
712
+ "source": [
713
+ "import sglang as sgl\n",
714
+ "\n",
715
+ "llm = sgl.Engine(\n",
716
+ " model_path=\"meta-llama/Meta-Llama-3.1-8B-Instruct\", grammar_backend=\"xgrammar\"\n",
717
+ ")"
718
+ ]
719
+ },
720
+ {
721
+ "cell_type": "markdown",
722
+ "metadata": {},
723
+ "source": [
724
+ "### JSON"
725
+ ]
726
+ },
727
+ {
728
+ "cell_type": "markdown",
729
+ "metadata": {},
730
+ "source": [
731
+ "**Using Pydantic**"
732
+ ]
733
+ },
734
+ {
735
+ "cell_type": "code",
736
+ "execution_count": null,
737
+ "metadata": {},
738
+ "outputs": [],
739
+ "source": [
740
+ "import json\n",
741
+ "from pydantic import BaseModel, Field\n",
742
+ "\n",
743
+ "prompts = [\n",
744
+ " \"Give me the information of the capital of China in the JSON format.\",\n",
745
+ " \"Give me the information of the capital of France in the JSON format.\",\n",
746
+ " \"Give me the information of the capital of Ireland in the JSON format.\",\n",
747
+ "]\n",
748
+ "\n",
749
+ "\n",
750
+ "# Define the schema using Pydantic\n",
751
+ "class CapitalInfo(BaseModel):\n",
752
+ " name: str = Field(..., pattern=r\"^\\w+$\", description=\"Name of the capital city\")\n",
753
+ " population: int = Field(..., description=\"Population of the capital city\")\n",
754
+ "\n",
755
+ "\n",
756
+ "sampling_params = {\n",
757
+ " \"temperature\": 0.1,\n",
758
+ " \"top_p\": 0.95,\n",
759
+ " \"json_schema\": json.dumps(CapitalInfo.model_json_schema()),\n",
760
+ "}\n",
761
+ "\n",
762
+ "outputs = llm.generate(prompts, sampling_params)\n",
763
+ "for prompt, output in zip(prompts, outputs):\n",
764
+ " print_highlight(\"===============================\")\n",
765
+ "    print_highlight(f\"Prompt: {prompt}\")\n",
+ "    # validate the output by the pydantic model\n",
766
+ " capital_info = CapitalInfo.model_validate_json(output[\"text\"])\n",
767
+ " print_highlight(f\"Validated output: {capital_info.model_dump_json()}\")"
768
+ ]
769
+ },
770
+ {
771
+ "cell_type": "markdown",
772
+ "metadata": {},
773
+ "source": [
774
+ "**JSON Schema Directly**"
775
+ ]
776
+ },
777
+ {
778
+ "cell_type": "code",
779
+ "execution_count": null,
780
+ "metadata": {},
781
+ "outputs": [],
782
+ "source": [
783
+ "prompts = [\n",
784
+ " \"Give me the information of the capital of China in the JSON format.\",\n",
785
+ " \"Give me the information of the capital of France in the JSON format.\",\n",
786
+ " \"Give me the information of the capital of Ireland in the JSON format.\",\n",
787
+ "]\n",
788
+ "\n",
789
+ "json_schema = json.dumps(\n",
790
+ " {\n",
791
+ " \"type\": \"object\",\n",
792
+ " \"properties\": {\n",
793
+ " \"name\": {\"type\": \"string\", \"pattern\": \"^[\\\\w]+$\"},\n",
794
+ " \"population\": {\"type\": \"integer\"},\n",
795
+ " },\n",
796
+ " \"required\": [\"name\", \"population\"],\n",
797
+ " }\n",
798
+ ")\n",
799
+ "\n",
800
+ "sampling_params = {\"temperature\": 0.1, \"top_p\": 0.95, \"json_schema\": json_schema}\n",
801
+ "\n",
802
+ "outputs = llm.generate(prompts, sampling_params)\n",
803
+ "for prompt, output in zip(prompts, outputs):\n",
804
+ " print_highlight(\"===============================\")\n",
805
+ " print_highlight(f\"Prompt: {prompt}\\nGenerated text: {output['text']}\")"
806
+ ]
807
+ },
808
+ {
809
+ "cell_type": "markdown",
810
+ "metadata": {},
811
+ "source": [
812
+ "### EBNF\n"
813
+ ]
814
+ },
815
+ {
816
+ "cell_type": "code",
817
+ "execution_count": null,
818
+ "metadata": {},
819
+ "outputs": [],
820
+ "source": [
821
+ "prompts = [\n",
822
+ " \"Give me the information of the capital of France.\",\n",
823
+ " \"Give me the information of the capital of Germany.\",\n",
824
+ " \"Give me the information of the capital of Italy.\",\n",
825
+ "]\n",
826
+ "\n",
827
+ "sampling_params = {\n",
828
+ " \"temperature\": 0.8,\n",
829
+ " \"top_p\": 0.95,\n",
830
+ " \"ebnf\": (\n",
831
+ " \"root ::= city | description\\n\"\n",
832
+ " 'city ::= \"London\" | \"Paris\" | \"Berlin\" | \"Rome\"\\n'\n",
833
+ " 'description ::= city \" is \" status\\n'\n",
834
+ " 'status ::= \"the capital of \" country\\n'\n",
835
+ " 'country ::= \"England\" | \"France\" | \"Germany\" | \"Italy\"'\n",
836
+ " ),\n",
837
+ "}\n",
838
+ "\n",
839
+ "outputs = llm.generate(prompts, sampling_params)\n",
840
+ "for prompt, output in zip(prompts, outputs):\n",
841
+ " print_highlight(\"===============================\")\n",
842
+ " print_highlight(f\"Prompt: {prompt}\\nGenerated text: {output['text']}\")"
843
+ ]
844
+ },
845
+ {
846
+ "cell_type": "markdown",
847
+ "metadata": {},
848
+ "source": [
849
+ "### Regular expression"
850
+ ]
851
+ },
852
+ {
853
+ "cell_type": "code",
854
+ "execution_count": null,
855
+ "metadata": {},
856
+ "outputs": [],
857
+ "source": [
858
+ "prompts = [\n",
859
+ " \"Please provide information about London as a major global city:\",\n",
860
+ " \"Please provide information about Paris as a major global city:\",\n",
861
+ "]\n",
862
+ "\n",
863
+ "sampling_params = {\"temperature\": 0.8, \"top_p\": 0.95, \"regex\": \"(France|England)\"}\n",
864
+ "\n",
865
+ "outputs = llm.generate(prompts, sampling_params)\n",
866
+ "for prompt, output in zip(prompts, outputs):\n",
867
+ " print_highlight(\"===============================\")\n",
868
+ " print_highlight(f\"Prompt: {prompt}\\nGenerated text: {output['text']}\")"
869
+ ]
870
+ },
871
+ {
872
+ "cell_type": "markdown",
873
+ "metadata": {},
874
+ "source": [
875
+ "### Structural Tag"
876
+ ]
877
+ },
878
+ {
879
+ "cell_type": "code",
880
+ "execution_count": null,
881
+ "metadata": {},
882
+ "outputs": [],
883
+ "source": [
884
+ "text = tokenizer.apply_chat_template(\n",
885
+ " messages, tokenize=False, add_generation_prompt=True, return_dict=False\n",
886
+ ")\n",
887
+ "prompts = [text]\n",
888
+ "\n",
889
+ "\n",
890
+ "sampling_params = {\n",
891
+ " \"temperature\": 0.8,\n",
892
+ " \"top_p\": 0.95,\n",
893
+ " \"structural_tag\": json.dumps(\n",
894
+ " {\n",
895
+ " \"type\": \"structural_tag\",\n",
896
+ " \"structures\": [\n",
897
+ " {\n",
898
+ " \"begin\": \"<function=get_current_weather>\",\n",
899
+ " \"schema\": schema_get_current_weather,\n",
900
+ " \"end\": \"</function>\",\n",
901
+ " },\n",
902
+ " {\n",
903
+ " \"begin\": \"<function=get_current_date>\",\n",
904
+ " \"schema\": schema_get_current_date,\n",
905
+ " \"end\": \"</function>\",\n",
906
+ " },\n",
907
+ " ],\n",
908
+ " \"triggers\": [\"<function=\"],\n",
909
+ " }\n",
910
+ " ),\n",
911
+ "}\n",
912
+ "\n",
913
+ "\n",
914
+ "# Generate with the offline engine\n",
915
+ "outputs = llm.generate(prompts, sampling_params)\n",
916
+ "for prompt, output in zip(prompts, outputs):\n",
917
+ " print_highlight(\"===============================\")\n",
918
+ " print_highlight(f\"Prompt: {prompt}\\nGenerated text: {output['text']}\")"
919
+ ]
920
+ },
921
+ {
922
+ "cell_type": "code",
923
+ "execution_count": null,
924
+ "metadata": {},
925
+ "outputs": [],
926
+ "source": [
927
+ "# Support for XGrammar's latest structural tag format\n",
928
+ "# https://xgrammar.mlc.ai/docs/tutorials/structural_tag.html\n",
929
+ "\n",
930
+ "sampling_params = {\n",
931
+ " \"temperature\": 0.8,\n",
932
+ " \"top_p\": 0.95,\n",
933
+ " \"structural_tag\": json.dumps(\n",
934
+ " {\n",
935
+ " \"type\": \"structural_tag\",\n",
936
+ " \"format\": {\n",
937
+ " \"type\": \"triggered_tags\",\n",
938
+ " \"triggers\": [\"<function=\"],\n",
939
+ " \"tags\": [\n",
940
+ " {\n",
941
+ " \"begin\": \"<function=get_current_weather>\",\n",
942
+ " \"content\": {\n",
943
+ " \"type\": \"json_schema\",\n",
944
+ " \"json_schema\": schema_get_current_weather,\n",
945
+ " },\n",
946
+ " \"end\": \"</function>\",\n",
947
+ " },\n",
948
+ " {\n",
949
+ " \"begin\": \"<function=get_current_date>\",\n",
950
+ " \"content\": {\n",
951
+ " \"type\": \"json_schema\",\n",
952
+ " \"json_schema\": schema_get_current_date,\n",
953
+ " },\n",
954
+ " \"end\": \"</function>\",\n",
955
+ " },\n",
956
+ " ],\n",
957
+ " \"at_least_one\": False,\n",
958
+ " \"stop_after_first\": False,\n",
959
+ " },\n",
960
+ " }\n",
961
+ " ),\n",
962
+ "}\n",
963
+ "\n",
964
+ "\n",
965
+ "# Generate with the offline engine\n",
966
+ "outputs = llm.generate(prompts, sampling_params)\n",
967
+ "for prompt, output in zip(prompts, outputs):\n",
968
+ " print_highlight(\"===============================\")\n",
969
+ " print_highlight(f\"Prompt: {prompt}\\nGenerated text: {output['text']}\")"
970
+ ]
971
+ },
972
+ {
973
+ "cell_type": "code",
974
+ "execution_count": null,
975
+ "metadata": {},
976
+ "outputs": [],
977
+ "source": [
978
+ "llm.shutdown()"
979
+ ]
980
+ }
981
+ ],
982
+ "metadata": {
983
+ "language_info": {
984
+ "codemirror_mode": {
985
+ "name": "ipython",
986
+ "version": 3
987
+ },
988
+ "file_extension": ".py",
989
+ "mimetype": "text/x-python",
990
+ "name": "python",
991
+ "nbconvert_exporter": "python",
992
+ "pygments_lexer": "ipython3"
993
+ }
994
+ },
995
+ "nbformat": 4,
996
+ "nbformat_minor": 2
997
+ }
sglang/docs/advanced_features/tool_parser.ipynb ADDED
@@ -0,0 +1,856 @@
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "markdown",
5
+ "metadata": {},
6
+ "source": [
7
+ "# Tool Parser\n",
8
+ "\n",
9
+ "This guide demonstrates how to use SGLang’s [Function calling](https://platform.openai.com/docs/guides/function-calling) functionality."
10
+ ]
11
+ },
12
+ {
13
+ "cell_type": "markdown",
14
+ "metadata": {},
15
+ "source": [
16
+ "## Currently supported parsers:\n",
17
+ "\n",
18
+ "| Parser | Supported Models | Notes |\n",
19
+ "|---|---|---|\n",
20
+ "| `deepseekv3` | DeepSeek-v3 (e.g., `deepseek-ai/DeepSeek-V3-0324`) | Recommended: add `--chat-template ./examples/chat_template/tool_chat_template_deepseekv3.jinja` to the launch command. |\n",
21
+ "| `deepseekv31` | DeepSeek-V3.1 and DeepSeek-V3.2-Exp (e.g. `deepseek-ai/DeepSeek-V3.1`, `deepseek-ai/DeepSeek-V3.2-Exp`) | Recommended: add `--chat-template ./examples/chat_template/tool_chat_template_deepseekv31.jinja` (or ..deepseekv32.jinja for DeepSeek-V3.2) to the launch command. |\n",
22
+ "| `deepseekv32` | DeepSeek-V3.2 (`deepseek-ai/DeepSeek-V3.2`) | |\n",
23
+ "| `glm` | GLM series (e.g. `zai-org/GLM-4.6`) | |\n",
24
+ "| `gpt-oss` | GPT-OSS (e.g., `openai/gpt-oss-120b`, `openai/gpt-oss-20b`, `lmsys/gpt-oss-120b-bf16`, `lmsys/gpt-oss-20b-bf16`) | The gpt-oss tool parser filters out analysis channel events and only preserves normal text. This can cause the content to be empty when explanations are in the analysis channel. To work around this, complete the tool round by returning tool results as `role=\"tool\"` messages, which enables the model to generate the final content. |\n",
25
+ "| `kimi_k2` | `moonshotai/Kimi-K2-Instruct` | |\n",
26
+ "| `llama3` | Llama 3.1 / 3.2 / 3.3 (e.g. `meta-llama/Llama-3.1-8B-Instruct`, `meta-llama/Llama-3.2-1B-Instruct`, `meta-llama/Llama-3.3-70B-Instruct`) | |\n",
27
+ "| `llama4` | Llama 4 (e.g. `meta-llama/Llama-4-Scout-17B-16E-Instruct`) | |\n",
28
+ "| `mistral` | Mistral (e.g. `mistralai/Mistral-7B-Instruct-v0.3`, `mistralai/Mistral-Nemo-Instruct-2407`, `mistralai/Mistral-7B-v0.3`) | |\n",
29
+ "| `pythonic` | Llama-3.2 / Llama-3.3 / Llama-4 | Model outputs function calls as Python code. Requires `--tool-call-parser pythonic` and is recommended to use with a specific chat template. |\n",
30
+ "| `qwen` | Qwen series (e.g. `Qwen/Qwen3-Next-80B-A3B-Instruct`, `Qwen/Qwen3-VL-30B-A3B-Thinking`) except Qwen3-Coder | |\n",
31
+ "| `qwen3_coder` | Qwen3-Coder (e.g. `Qwen/Qwen3-Coder-30B-A3B-Instruct`) | |\n",
32
+ "| `step3` | Step-3 | |\n"
33
+ ]
34
+ },
35
+ {
36
+ "cell_type": "markdown",
37
+ "metadata": {},
38
+ "source": [
39
+ "## OpenAI Compatible API"
40
+ ]
41
+ },
42
+ {
43
+ "cell_type": "markdown",
44
+ "metadata": {},
45
+ "source": [
46
+ "### Launching the Server"
47
+ ]
48
+ },
49
+ {
50
+ "cell_type": "code",
51
+ "execution_count": null,
52
+ "metadata": {},
53
+ "outputs": [],
54
+ "source": [
55
+ "import json\n",
56
+ "from sglang.test.doc_patch import launch_server_cmd\n",
57
+ "from sglang.utils import wait_for_server, print_highlight, terminate_process\n",
58
+ "from openai import OpenAI\n",
59
+ "\n",
60
+ "server_process, port = launch_server_cmd(\n",
61
+ " \"python3 -m sglang.launch_server --model-path Qwen/Qwen2.5-7B-Instruct --tool-call-parser qwen25 --host 0.0.0.0 --log-level warning\" # qwen25\n",
62
+ ")\n",
63
+ "wait_for_server(f\"http://localhost:{port}\", process=server_process)"
64
+ ]
65
+ },
66
+ {
67
+ "cell_type": "markdown",
68
+ "metadata": {},
69
+ "source": [
70
+ "Note that `--tool-call-parser` defines the parser used to interpret responses."
71
+ ]
72
+ },
73
+ {
74
+ "cell_type": "markdown",
75
+ "metadata": {},
76
+ "source": [
77
+ "### Define Tools for Function Call\n",
78
+ "Below is a Python snippet that shows how to define a tool as a dictionary. The dictionary includes the tool's name, a description, and its parameters, defined as JSON Schema properties."
79
+ ]
80
+ },
81
+ {
82
+ "cell_type": "code",
83
+ "execution_count": null,
84
+ "metadata": {},
85
+ "outputs": [],
86
+ "source": [
87
+ "# Define tools\n",
88
+ "tools = [\n",
89
+ " {\n",
90
+ " \"type\": \"function\",\n",
91
+ " \"function\": {\n",
92
+ " \"name\": \"get_current_weather\",\n",
93
+ " \"description\": \"Get the current weather in a given location\",\n",
94
+ " \"parameters\": {\n",
95
+ " \"type\": \"object\",\n",
96
+ " \"properties\": {\n",
97
+ " \"city\": {\n",
98
+ " \"type\": \"string\",\n",
99
+ " \"description\": \"The city to find the weather for, e.g. 'San Francisco'\",\n",
100
+ " },\n",
101
+ " \"state\": {\n",
102
+ " \"type\": \"string\",\n",
103
+ " \"description\": \"the two-letter abbreviation for the state that the city is\"\n",
104
+ " \" in, e.g. 'CA' which would mean 'California'\",\n",
105
+ " },\n",
106
+ " \"unit\": {\n",
107
+ " \"type\": \"string\",\n",
108
+ " \"description\": \"The unit to fetch the temperature in\",\n",
109
+ " \"enum\": [\"celsius\", \"fahrenheit\"],\n",
110
+ " },\n",
111
+ " },\n",
112
+ " \"required\": [\"city\", \"state\", \"unit\"],\n",
113
+ " },\n",
114
+ " },\n",
115
+ " }\n",
116
+ "]"
117
+ ]
118
+ },
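Once the model returns a tool call, its `arguments` string should match this schema, but models occasionally drop a required key or invent an enum value. A minimal pure-Python check, sketched here with a hypothetical `validate_arguments` helper (not an SGLang API), can catch that before the tool is invoked:

```python
import json

# Hypothetical helper (not part of SGLang): minimal validation of the
# arguments a model returns for the get_current_weather tool above,
# checking required keys and enum constraints before invoking the tool.
def validate_arguments(arguments_json: str, parameters_schema: dict) -> dict:
    args = json.loads(arguments_json)
    props = parameters_schema.get("properties", {})
    for key in parameters_schema.get("required", []):
        if key not in args:
            raise ValueError(f"missing required argument: {key}")
    for key, value in args.items():
        enum = props.get(key, {}).get("enum")
        if enum is not None and value not in enum:
            raise ValueError(f"{key}={value!r} not in allowed values {enum}")
    return args


schema = {
    "type": "object",
    "properties": {
        "city": {"type": "string"},
        "state": {"type": "string"},
        "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
    },
    "required": ["city", "state", "unit"],
}
args = validate_arguments(
    '{"city": "Boston", "state": "MA", "unit": "fahrenheit"}', schema
)
print(args["city"])  # Boston
```

For production use, a full JSON Schema validator such as the `jsonschema` package covers type checking as well.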
119
+ {
120
+ "cell_type": "markdown",
121
+ "metadata": {},
122
+ "source": [
123
+ "### Define Messages"
124
+ ]
125
+ },
126
+ {
127
+ "cell_type": "code",
128
+ "execution_count": null,
129
+ "metadata": {},
130
+ "outputs": [],
131
+ "source": [
132
+ "def get_messages():\n",
133
+ " return [\n",
134
+ " {\n",
135
+ " \"role\": \"user\",\n",
136
+ " \"content\": \"What's the weather like in Boston today? Output a reasoning before act, then use the tools to help you.\",\n",
137
+ " }\n",
138
+ " ]\n",
139
+ "\n",
140
+ "\n",
141
+ "messages = get_messages()"
142
+ ]
143
+ },
144
+ {
145
+ "cell_type": "markdown",
146
+ "metadata": {},
147
+ "source": [
148
+ "### Initialize the Client"
149
+ ]
150
+ },
151
+ {
152
+ "cell_type": "code",
153
+ "execution_count": null,
154
+ "metadata": {},
155
+ "outputs": [],
156
+ "source": [
157
+ "# Initialize OpenAI-like client\n",
158
+ "client = OpenAI(api_key=\"None\", base_url=f\"http://0.0.0.0:{port}/v1\")\n",
159
+ "model_name = client.models.list().data[0].id"
160
+ ]
161
+ },
162
+ {
163
+ "cell_type": "markdown",
164
+ "metadata": {},
165
+ "source": [
166
+ "### Non-Streaming Request"
167
+ ]
168
+ },
169
+ {
170
+ "cell_type": "code",
171
+ "execution_count": null,
172
+ "metadata": {},
173
+ "outputs": [],
174
+ "source": [
175
+ "# Non-streaming mode test\n",
176
+ "response_non_stream = client.chat.completions.create(\n",
177
+ " model=model_name,\n",
178
+ " messages=messages,\n",
179
+ " temperature=0,\n",
180
+ " top_p=0.95,\n",
181
+ " max_tokens=1024,\n",
182
+ " stream=False, # Non-streaming\n",
183
+ " tools=tools,\n",
184
+ ")\n",
185
+ "print_highlight(\"Non-stream response:\")\n",
186
+ "print_highlight(response_non_stream)\n",
187
+ "print_highlight(\"==== content ====\")\n",
188
+ "print_highlight(response_non_stream.choices[0].message.content)\n",
189
+ "print_highlight(\"==== tool_calls ====\")\n",
190
+ "print_highlight(response_non_stream.choices[0].message.tool_calls)"
191
+ ]
192
+ },
193
+ {
194
+ "cell_type": "markdown",
195
+ "metadata": {},
196
+ "source": [
197
+ "#### Handle Tools\n",
198
+ "When the engine determines it should call a particular tool, it will return arguments or partial arguments through the response. You can parse these arguments and later invoke the tool accordingly."
199
+ ]
200
+ },
201
+ {
202
+ "cell_type": "code",
203
+ "execution_count": null,
204
+ "metadata": {},
205
+ "outputs": [],
206
+ "source": [
207
+ "name_non_stream = response_non_stream.choices[0].message.tool_calls[0].function.name\n",
208
+ "arguments_non_stream = (\n",
209
+ " response_non_stream.choices[0].message.tool_calls[0].function.arguments\n",
210
+ ")\n",
211
+ "\n",
212
+ "print_highlight(f\"Final streamed function call name: {name_non_stream}\")\n",
213
+ "print_highlight(f\"Final streamed function call arguments: {arguments_non_stream}\")"
214
+ ]
215
+ },
216
+ {
217
+ "cell_type": "markdown",
218
+ "metadata": {},
219
+ "source": [
220
+ "### Streaming Request"
221
+ ]
222
+ },
223
+ {
224
+ "cell_type": "code",
225
+ "execution_count": null,
226
+ "metadata": {},
227
+ "outputs": [],
228
+ "source": [
229
+ "# Streaming mode test\n",
230
+ "print_highlight(\"Streaming response:\")\n",
231
+ "response_stream = client.chat.completions.create(\n",
232
+ " model=model_name,\n",
233
+ " messages=messages,\n",
234
+ " temperature=0,\n",
235
+ " top_p=0.95,\n",
236
+ " max_tokens=1024,\n",
237
+ " stream=True, # Enable streaming\n",
238
+ " tools=tools,\n",
239
+ ")\n",
240
+ "\n",
241
+ "texts = \"\"\n",
242
+ "tool_calls = []\n",
243
+ "name = \"\"\n",
244
+ "arguments = \"\"\n",
245
+ "for chunk in response_stream:\n",
246
+ " if chunk.choices[0].delta.content:\n",
247
+ " texts += chunk.choices[0].delta.content\n",
248
+ " if chunk.choices[0].delta.tool_calls:\n",
249
+ " tool_calls.append(chunk.choices[0].delta.tool_calls[0])\n",
250
+ "print_highlight(\"==== Text ====\")\n",
251
+ "print_highlight(texts)\n",
252
+ "\n",
253
+ "print_highlight(\"==== Tool Call ====\")\n",
254
+ "for tool_call in tool_calls:\n",
255
+ " print_highlight(tool_call)"
256
+ ]
257
+ },
258
+ {
259
+ "cell_type": "markdown",
260
+ "metadata": {},
261
+ "source": [
262
+ "#### Handle Tools\n",
263
+ "When the engine determines it should call a particular tool, it will return arguments or partial arguments through the response. You can parse these arguments and later invoke the tool accordingly."
264
+ ]
265
+ },
266
+ {
267
+ "cell_type": "code",
268
+ "execution_count": null,
269
+ "metadata": {},
270
+ "outputs": [],
271
+ "source": [
272
+ "# Parse and combine function call arguments\n",
273
+ "arguments = []\n",
274
+ "for tool_call in tool_calls:\n",
275
+ " if tool_call.function.name:\n",
276
+ " print_highlight(f\"Streamed function call name: {tool_call.function.name}\")\n",
277
+ "\n",
278
+ " if tool_call.function.arguments:\n",
279
+ " arguments.append(tool_call.function.arguments)\n",
280
+ "\n",
281
+ "# Combine all fragments into a single JSON string\n",
282
+ "full_arguments = \"\".join(arguments)\n",
283
+ "print_highlight(f\"streamed function call arguments: {full_arguments}\")"
284
+ ]
285
+ },
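The deltas collected above arrive as fragments: the first chunk for a call usually carries the function name, and later chunks carry slices of the argument string. When the model emits parallel tool calls, each delta also has an `index` identifying which call it belongs to. A sketch of reassembly (plain dicts stand in for the OpenAI delta objects; this is illustrative, not an SGLang API):

```python
# Merge streamed tool-call deltas into complete calls, grouping by `index`
# so parallel calls are reassembled separately. The first delta for an index
# carries the function name; later deltas append argument fragments.
def merge_tool_call_deltas(deltas):
    calls = {}
    for delta in deltas:
        call = calls.setdefault(delta["index"], {"name": "", "arguments": ""})
        if delta.get("name"):
            call["name"] = delta["name"]
        if delta.get("arguments"):
            call["arguments"] += delta["arguments"]
    return [calls[i] for i in sorted(calls)]


deltas = [
    {"index": 0, "name": "get_current_weather", "arguments": ""},
    {"index": 0, "arguments": '{"city": "Bos'},
    {"index": 0, "arguments": 'ton", "state": "MA", "unit": "celsius"}'},
]
merged = merge_tool_call_deltas(deltas)
print(merged[0]["name"])  # get_current_weather
```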
286
+ {
287
+ "cell_type": "markdown",
288
+ "metadata": {},
289
+ "source": [
290
+ "### Define a Tool Function"
291
+ ]
292
+ },
293
+ {
294
+ "cell_type": "code",
295
+ "execution_count": null,
296
+ "metadata": {},
297
+ "outputs": [],
298
+ "source": [
299
+ "# This is a demonstration, define real function according to your usage.\n",
300
+ "def get_current_weather(city: str, state: str, unit: \"str\"):\n",
301
+ " return (\n",
302
+ " f\"The weather in {city}, {state} is 85 degrees {unit}. It is \"\n",
303
+ " \"partly cloudly, with highs in the 90's.\"\n",
304
+ " )\n",
305
+ "\n",
306
+ "\n",
307
+ "available_tools = {\"get_current_weather\": get_current_weather}"
308
+ ]
309
+ },
310
+ {
311
+ "cell_type": "markdown",
312
+ "metadata": {},
313
+ "source": [
314
+ "\n",
315
+ "### Execute the Tool"
316
+ ]
317
+ },
318
+ {
319
+ "cell_type": "code",
320
+ "execution_count": null,
321
+ "metadata": {},
322
+ "outputs": [],
323
+ "source": [
324
+ "messages.append(response_non_stream.choices[0].message)\n",
325
+ "\n",
326
+ "# Call the corresponding tool function\n",
327
+ "tool_call = messages[-1].tool_calls[0]\n",
328
+ "tool_name = tool_call.function.name\n",
329
+ "tool_to_call = available_tools[tool_name]\n",
330
+ "result = tool_to_call(**(json.loads(tool_call.function.arguments)))\n",
331
+ "print_highlight(f\"Function call result: {result}\")\n",
332
+ "# messages.append({\"role\": \"tool\", \"content\": result, \"name\": tool_name})\n",
333
+ "messages.append(\n",
334
+ " {\n",
335
+ " \"role\": \"tool\",\n",
336
+ " \"tool_call_id\": tool_call.id,\n",
337
+ " \"content\": str(result),\n",
338
+ " \"name\": tool_name,\n",
339
+ " }\n",
340
+ ")\n",
341
+ "\n",
342
+ "print_highlight(f\"Updated message history: {messages}\")"
343
+ ]
344
+ },
345
+ {
346
+ "cell_type": "markdown",
347
+ "metadata": {},
348
+ "source": [
349
+ "### Send Results Back to Model"
350
+ ]
351
+ },
352
+ {
353
+ "cell_type": "code",
354
+ "execution_count": null,
355
+ "metadata": {},
356
+ "outputs": [],
357
+ "source": [
358
+ "final_response = client.chat.completions.create(\n",
359
+ " model=model_name,\n",
360
+ " messages=messages,\n",
361
+ " temperature=0,\n",
362
+ " top_p=0.95,\n",
363
+ " stream=False,\n",
364
+ " tools=tools,\n",
365
+ ")\n",
366
+ "print_highlight(\"Non-stream response:\")\n",
367
+ "print_highlight(final_response)\n",
368
+ "\n",
369
+ "print_highlight(\"==== Text ====\")\n",
370
+ "print_highlight(final_response.choices[0].message.content)"
371
+ ]
372
+ },
373
+ {
374
+ "cell_type": "markdown",
375
+ "metadata": {},
376
+ "source": [
377
+ "## Native API and SGLang Runtime (SRT)"
378
+ ]
379
+ },
380
+ {
381
+ "cell_type": "code",
382
+ "execution_count": null,
383
+ "metadata": {},
384
+ "outputs": [],
385
+ "source": [
386
+ "from transformers import AutoTokenizer\n",
387
+ "import requests\n",
388
+ "\n",
389
+ "# generate an answer\n",
390
+ "tokenizer = AutoTokenizer.from_pretrained(\"Qwen/Qwen2.5-7B-Instruct\")\n",
391
+ "\n",
392
+ "messages = get_messages()\n",
393
+ "\n",
394
+ "input = tokenizer.apply_chat_template(\n",
395
+ " messages, tokenize=False, add_generation_prompt=True, tools=tools, return_dict=False\n",
396
+ ")\n",
397
+ "\n",
398
+ "gen_url = f\"http://localhost:{port}/generate\"\n",
399
+ "gen_data = {\n",
400
+ " \"text\": input,\n",
401
+ " \"sampling_params\": {\n",
402
+ " \"skip_special_tokens\": False,\n",
403
+ " \"max_new_tokens\": 1024,\n",
404
+ " \"temperature\": 0,\n",
405
+ " \"top_p\": 0.95,\n",
406
+ " },\n",
407
+ "}\n",
408
+ "gen_response = requests.post(gen_url, json=gen_data).json()[\"text\"]\n",
409
+ "print_highlight(\"==== Response ====\")\n",
410
+ "print_highlight(gen_response)\n",
411
+ "\n",
412
+ "# parse the response\n",
413
+ "parse_url = f\"http://localhost:{port}/parse_function_call\"\n",
414
+ "\n",
415
+ "function_call_input = {\n",
416
+ " \"text\": gen_response,\n",
417
+ " \"tool_call_parser\": \"qwen25\",\n",
418
+ " \"tools\": tools,\n",
419
+ "}\n",
420
+ "\n",
421
+ "function_call_response = requests.post(parse_url, json=function_call_input)\n",
422
+ "function_call_response_json = function_call_response.json()\n",
423
+ "\n",
424
+ "print_highlight(\"==== Text ====\")\n",
425
+ "print(function_call_response_json[\"normal_text\"])\n",
426
+ "print_highlight(\"==== Calls ====\")\n",
427
+ "print(\"function name: \", function_call_response_json[\"calls\"][0][\"name\"])\n",
428
+ "print(\"function arguments: \", function_call_response_json[\"calls\"][0][\"parameters\"])"
429
+ ]
430
+ },
431
+ {
432
+ "cell_type": "code",
433
+ "execution_count": null,
434
+ "metadata": {},
435
+ "outputs": [],
436
+ "source": [
437
+ "terminate_process(server_process)"
438
+ ]
439
+ },
440
+ {
441
+ "cell_type": "markdown",
442
+ "metadata": {},
443
+ "source": [
444
+ "## Offline Engine API"
445
+ ]
446
+ },
447
+ {
448
+ "cell_type": "code",
449
+ "execution_count": null,
450
+ "metadata": {},
451
+ "outputs": [],
452
+ "source": [
453
+ "import sglang as sgl\n",
454
+ "from sglang.srt.function_call.function_call_parser import FunctionCallParser\n",
455
+ "from sglang.srt.managers.io_struct import Tool, Function\n",
456
+ "\n",
457
+ "llm = sgl.Engine(model_path=\"Qwen/Qwen2.5-7B-Instruct\")\n",
458
+ "tokenizer = llm.tokenizer_manager.tokenizer\n",
459
+ "input_ids = tokenizer.apply_chat_template(\n",
460
+ " messages, tokenize=True, add_generation_prompt=True, tools=tools, return_dict=False\n",
461
+ ")\n",
462
+ "\n",
463
+ "# Note that for gpt-oss tool parser, adding \"no_stop_trim\": True\n",
464
+ "# to make sure the tool call token <call> is not trimmed.\n",
465
+ "\n",
466
+ "sampling_params = {\n",
467
+ " \"max_new_tokens\": 1024,\n",
468
+ " \"temperature\": 0,\n",
469
+ " \"top_p\": 0.95,\n",
470
+ " \"skip_special_tokens\": False,\n",
471
+ "}\n",
472
+ "\n",
473
+ "# 1) Offline generation\n",
474
+ "result = llm.generate(input_ids=input_ids, sampling_params=sampling_params)\n",
475
+ "generated_text = result[\"text\"] # Assume there is only one prompt\n",
476
+ "\n",
477
+ "print_highlight(\"=== Offline Engine Output Text ===\")\n",
478
+ "print_highlight(generated_text)\n",
479
+ "\n",
480
+ "\n",
481
+ "# 2) Parse using FunctionCallParser\n",
482
+ "def convert_dict_to_tool(tool_dict: dict) -> Tool:\n",
483
+ " function_dict = tool_dict.get(\"function\", {})\n",
484
+ " return Tool(\n",
485
+ " type=tool_dict.get(\"type\", \"function\"),\n",
486
+ " function=Function(\n",
487
+ " name=function_dict.get(\"name\"),\n",
488
+ " description=function_dict.get(\"description\"),\n",
489
+ " parameters=function_dict.get(\"parameters\"),\n",
490
+ " ),\n",
491
+ " )\n",
492
+ "\n",
493
+ "\n",
494
+ "tools = [convert_dict_to_tool(raw_tool) for raw_tool in tools]\n",
495
+ "\n",
496
+ "parser = FunctionCallParser(tools=tools, tool_call_parser=\"qwen25\")\n",
497
+ "normal_text, calls = parser.parse_non_stream(generated_text)\n",
498
+ "\n",
499
+ "print_highlight(\"=== Parsing Result ===\")\n",
500
+ "print(\"Normal text portion:\", normal_text)\n",
501
+ "print_highlight(\"Function call portion:\")\n",
502
+ "for call in calls:\n",
503
+ " # call: ToolCallItem\n",
504
+ " print_highlight(f\" - tool name: {call.name}\")\n",
505
+ " print_highlight(f\" parameters: {call.parameters}\")\n",
506
+ "\n",
507
+ "# 3) If needed, perform additional logic on the parsed functions, such as automatically calling the corresponding function to obtain a return value, etc."
508
+ ]
509
+ },
510
+ {
511
+ "cell_type": "code",
512
+ "execution_count": null,
513
+ "metadata": {},
514
+ "outputs": [],
515
+ "source": [
516
+ "llm.shutdown()"
517
+ ]
518
+ },
519
+ {
520
+ "cell_type": "markdown",
521
+ "metadata": {},
522
+ "source": [
523
+ "## Tool Choice Mode\n",
524
+ "\n",
525
+ "SGLang supports OpenAI's `tool_choice` parameter to control when and which tools the model should call. This feature is implemented using EBNF (Extended Backus-Naur Form) grammar to ensure reliable tool calling behavior.\n",
526
+ "\n",
527
+ "### Supported Tool Choice Options\n",
528
+ "\n",
529
+ "- **`tool_choice=\"required\"`**: Forces the model to call at least one tool\n",
530
+ "- **`tool_choice={\"type\": \"function\", \"function\": {\"name\": \"specific_function\"}}`**: Forces the model to call a specific function\n",
531
+ "\n",
532
+ "### Backend Compatibility\n",
533
+ "\n",
534
+ "Tool choice is fully supported with the **Xgrammar backend**, which is the default grammar backend (`--grammar-backend xgrammar`). However, it may not be fully supported with other backends such as `outlines`.\n",
535
+ "\n",
536
+ "### Example: Required Tool Choice"
537
+ ]
538
+ },
539
+ {
540
+ "cell_type": "code",
541
+ "execution_count": null,
542
+ "metadata": {},
543
+ "outputs": [],
544
+ "source": [
545
+ "from openai import OpenAI\n",
546
+ "from sglang.utils import wait_for_server, print_highlight, terminate_process\n",
547
+ "from sglang.test.doc_patch import launch_server_cmd\n",
548
+ "\n",
549
+ "# Start a new server session for tool choice examples\n",
550
+ "server_process_tool_choice, port_tool_choice = launch_server_cmd(\n",
551
+ " \"python3 -m sglang.launch_server --model-path Qwen/Qwen2.5-7B-Instruct --tool-call-parser qwen25 --host 0.0.0.0 --log-level warning\"\n",
552
+ ")\n",
553
+ "wait_for_server(\n",
554
+ " f\"http://localhost:{port_tool_choice}\", process=server_process_tool_choice\n",
555
+ ")\n",
556
+ "\n",
557
+ "# Initialize client for tool choice examples\n",
558
+ "client_tool_choice = OpenAI(\n",
559
+ " api_key=\"None\", base_url=f\"http://0.0.0.0:{port_tool_choice}/v1\"\n",
560
+ ")\n",
561
+ "model_name_tool_choice = client_tool_choice.models.list().data[0].id\n",
562
+ "\n",
563
+ "# Example with tool_choice=\"required\" - forces the model to call a tool\n",
564
+ "messages_required = [\n",
565
+ " {\"role\": \"user\", \"content\": \"Hello, what is the capital of France?\"}\n",
566
+ "]\n",
567
+ "\n",
568
+ "# Define tools\n",
569
+ "tools = [\n",
570
+ " {\n",
571
+ " \"type\": \"function\",\n",
572
+ " \"function\": {\n",
573
+ " \"name\": \"get_current_weather\",\n",
574
+ " \"description\": \"Get the current weather in a given location\",\n",
575
+ " \"parameters\": {\n",
576
+ " \"type\": \"object\",\n",
577
+ " \"properties\": {\n",
578
+ " \"city\": {\n",
579
+ " \"type\": \"string\",\n",
580
+ " \"description\": \"The city to find the weather for, e.g. 'San Francisco'\",\n",
581
+ " },\n",
582
+ " \"unit\": {\n",
583
+ " \"type\": \"string\",\n",
584
+ " \"description\": \"The unit to fetch the temperature in\",\n",
585
+ " \"enum\": [\"celsius\", \"fahrenheit\"],\n",
586
+ " },\n",
587
+ " },\n",
588
+ " \"required\": [\"city\", \"unit\"],\n",
589
+ " },\n",
590
+ " },\n",
591
+ " }\n",
592
+ "]\n",
593
+ "\n",
594
+ "response_required = client_tool_choice.chat.completions.create(\n",
595
+ " model=model_name_tool_choice,\n",
596
+ " messages=messages_required,\n",
597
+ " temperature=0,\n",
598
+ " max_tokens=1024,\n",
599
+ " tools=tools,\n",
600
+ " tool_choice=\"required\", # Force the model to call a tool\n",
601
+ ")\n",
602
+ "\n",
603
+ "print_highlight(\"Response with tool_choice='required':\")\n",
604
+ "print(\"Content:\", response_required.choices[0].message.content)\n",
605
+ "print(\"Tool calls:\", response_required.choices[0].message.tool_calls)"
606
+ ]
607
+ },
608
+ {
609
+ "cell_type": "markdown",
610
+ "metadata": {},
611
+ "source": [
612
+ "### Example: Specific Function Choice\n"
613
+ ]
614
+ },
615
+ {
616
+ "cell_type": "code",
617
+ "execution_count": null,
618
+ "metadata": {},
619
+ "outputs": [],
620
+ "source": [
621
+ "# Example with specific function choice - forces the model to call a specific function\n",
622
+ "messages_specific = [\n",
623
+ " {\"role\": \"user\", \"content\": \"What are the most attactive places in France?\"}\n",
624
+ "]\n",
625
+ "\n",
626
+ "response_specific = client_tool_choice.chat.completions.create(\n",
627
+ " model=model_name_tool_choice,\n",
628
+ " messages=messages_specific,\n",
629
+ " temperature=0,\n",
630
+ " max_tokens=1024,\n",
631
+ " tools=tools,\n",
632
+ " tool_choice={\n",
633
+ " \"type\": \"function\",\n",
634
+ " \"function\": {\"name\": \"get_current_weather\"},\n",
635
+ " }, # Force the model to call the specific get_current_weather function\n",
636
+ ")\n",
637
+ "\n",
638
+ "print_highlight(\"Response with specific function choice:\")\n",
639
+ "print(\"Content:\", response_specific.choices[0].message.content)\n",
640
+ "print(\"Tool calls:\", response_specific.choices[0].message.tool_calls)\n",
641
+ "\n",
642
+ "if response_specific.choices[0].message.tool_calls:\n",
643
+ " tool_call = response_specific.choices[0].message.tool_calls[0]\n",
644
+ " print_highlight(f\"Called function: {tool_call.function.name}\")\n",
645
+ " print_highlight(f\"Arguments: {tool_call.function.arguments}\")"
646
+ ]
647
+ },
648
+ {
649
+ "cell_type": "code",
650
+ "execution_count": null,
651
+ "metadata": {},
652
+ "outputs": [],
653
+ "source": [
654
+ "terminate_process(server_process_tool_choice)"
655
+ ]
656
+ },
657
+ {
658
+ "cell_type": "markdown",
659
+ "metadata": {},
660
+ "source": [
661
+ "## Pythonic Tool Call Format (Llama-3.2 / Llama-3.3 / Llama-4)\n",
662
+ "\n",
663
+ "Some Llama models (such as Llama-3.2-1B, Llama-3.2-3B, Llama-3.3-70B, and Llama-4) support a \"pythonic\" tool call format, where the model outputs function calls as Python code, e.g.:\n",
664
+ "\n",
665
+ "```python\n",
666
+ "[get_current_weather(city=\"San Francisco\", state=\"CA\", unit=\"celsius\")]\n",
667
+ "```\n",
668
+ "\n",
669
+ "- The output is a Python list of function calls, with arguments as Python literals (not JSON).\n",
670
+ "- Multiple tool calls can be returned in the same list:\n",
671
+ "```python\n",
672
+ "[get_current_weather(city=\"San Francisco\", state=\"CA\", unit=\"celsius\"),\n",
673
+ " get_current_weather(city=\"New York\", state=\"NY\", unit=\"fahrenheit\")]\n",
674
+ "```\n",
675
+ "\n",
676
+ "For more information, refer to Meta’s documentation on [Zero shot function calling](https://github.com/meta-llama/llama-models/blob/main/models/llama4/prompt_format.md#zero-shot-function-calling---system-message).\n",
677
+ "\n",
678
+ "Note that this feature is still under development on Blackwell.\n",
679
+ "\n",
680
+ "### How to enable\n",
681
+ "- Launch the server with `--tool-call-parser pythonic`\n",
682
+ "- You may also specify --chat-template with the improved template for the model (e.g., `--chat-template=examples/chat_template/tool_chat_template_llama4_pythonic.jinja`).\n",
683
+ "This is recommended because the model expects a special prompt format to reliably produce valid pythonic tool call outputs. The template ensures that the prompt structure (e.g., special tokens, message boundaries like `<|eom|>`, and function call delimiters) matches what the model was trained or fine-tuned on. If you do not use the correct chat template, tool calling may fail or produce inconsistent results.\n",
684
+ "\n",
685
+ "#### Forcing Pythonic Tool Call Output Without a Chat Template\n",
686
+ "If you don't want to specify a chat template, you must give the model extremely explicit instructions in your messages to enforce pythonic output. For example, for `Llama-3.2-1B-Instruct`, you need:"
687
+ ]
688
+ },
689
+ {
690
+ "cell_type": "code",
691
+ "execution_count": null,
692
+ "metadata": {},
693
+ "outputs": [],
694
+ "source": [
695
+ "import openai\n",
696
+ "\n",
697
+ "server_process, port = launch_server_cmd(\n",
698
+ " \" python3 -m sglang.launch_server --model-path meta-llama/Llama-3.2-1B-Instruct --tool-call-parser pythonic --tp 1 --log-level warning\" # llama-3.2-1b-instruct\n",
699
+ ")\n",
700
+ "wait_for_server(f\"http://localhost:{port}\", process=server_process)\n",
701
+ "\n",
702
+ "tools = [\n",
703
+ " {\n",
704
+ " \"type\": \"function\",\n",
705
+ " \"function\": {\n",
706
+ " \"name\": \"get_weather\",\n",
707
+ " \"description\": \"Get the current weather for a given location.\",\n",
708
+ " \"parameters\": {\n",
709
+ " \"type\": \"object\",\n",
710
+ " \"properties\": {\n",
711
+ " \"location\": {\n",
712
+ " \"type\": \"string\",\n",
713
+ " \"description\": \"The name of the city or location.\",\n",
714
+ " }\n",
715
+ " },\n",
716
+ " \"required\": [\"location\"],\n",
717
+ " },\n",
718
+ " },\n",
719
+ " },\n",
720
+ " {\n",
721
+ " \"type\": \"function\",\n",
722
+ " \"function\": {\n",
723
+ " \"name\": \"get_tourist_attractions\",\n",
724
+ " \"description\": \"Get a list of top tourist attractions for a given city.\",\n",
725
+ " \"parameters\": {\n",
726
+ " \"type\": \"object\",\n",
727
+ " \"properties\": {\n",
728
+ " \"city\": {\n",
729
+ " \"type\": \"string\",\n",
730
+ " \"description\": \"The name of the city to find attractions for.\",\n",
731
+ " }\n",
732
+ " },\n",
733
+ " \"required\": [\"city\"],\n",
734
+ " },\n",
735
+ " },\n",
736
+ " },\n",
737
+ "]\n",
738
+ "\n",
739
+ "\n",
740
+ "def get_messages():\n",
741
+ " return [\n",
742
+ " {\n",
743
+ " \"role\": \"system\",\n",
744
+ " \"content\": (\n",
745
+ " \"You are a travel assistant. \"\n",
746
+ " \"When asked to call functions, ALWAYS respond ONLY with a python list of function calls, \"\n",
747
+ " \"using this format: [func_name1(param1=value1, param2=value2), func_name2(param=value)]. \"\n",
748
+ " \"Do NOT use JSON, do NOT use variables, do NOT use any other format. \"\n",
749
+ " \"Here is an example:\\n\"\n",
750
+ " '[get_weather(location=\"Paris\"), get_tourist_attractions(city=\"Paris\")]'\n",
751
+ " ),\n",
752
+ " },\n",
753
+ " {\n",
754
+ " \"role\": \"user\",\n",
755
+ " \"content\": (\n",
756
+ " \"I'm planning a trip to Tokyo next week. What's the weather like and what are some top tourist attractions? \"\n",
757
+ " \"Propose parallel tool calls at once, using the python list of function calls format as shown above.\"\n",
758
+ " ),\n",
759
+ " },\n",
760
+ " ]\n",
761
+ "\n",
762
+ "\n",
763
+ "messages = get_messages()\n",
764
+ "\n",
765
+ "client = openai.Client(base_url=f\"http://localhost:{port}/v1\", api_key=\"xxxxxx\")\n",
766
+ "model_name = client.models.list().data[0].id\n",
767
+ "\n",
768
+ "\n",
769
+ "response_non_stream = client.chat.completions.create(\n",
770
+ " model=model_name,\n",
771
+ " messages=messages,\n",
772
+ " temperature=0,\n",
773
+ " top_p=0.9,\n",
774
+ " stream=False, # Non-streaming\n",
775
+ " tools=tools,\n",
776
+ ")\n",
777
+ "print_highlight(\"Non-stream response:\")\n",
778
+ "print_highlight(response_non_stream)\n",
779
+ "\n",
780
+ "response_stream = client.chat.completions.create(\n",
781
+ " model=model_name,\n",
782
+ " messages=messages,\n",
783
+ " temperature=0,\n",
784
+ " top_p=0.9,\n",
785
+ " stream=True,\n",
786
+ " tools=tools,\n",
787
+ ")\n",
788
+ "texts = \"\"\n",
789
+ "tool_calls = []\n",
790
+ "name = \"\"\n",
791
+ "arguments = \"\"\n",
792
+ "\n",
793
+ "for chunk in response_stream:\n",
794
+ " if chunk.choices[0].delta.content:\n",
795
+ " texts += chunk.choices[0].delta.content\n",
796
+ " if chunk.choices[0].delta.tool_calls:\n",
797
+ " tool_calls.append(chunk.choices[0].delta.tool_calls[0])\n",
798
+ "\n",
799
+ "print_highlight(\"Streaming Response:\")\n",
800
+ "print_highlight(\"==== Text ====\")\n",
801
+ "print_highlight(texts)\n",
802
+ "\n",
803
+ "print_highlight(\"==== Tool Call ====\")\n",
804
+ "for tool_call in tool_calls:\n",
805
+ " print_highlight(tool_call)\n",
806
+ "\n",
807
+ "terminate_process(server_process)"
808
+ ]
809
+ },
810
+ {
811
+ "cell_type": "markdown",
812
+ "metadata": {},
813
+ "source": [
814
+ "> **Note:** \n",
815
+ "> The model may still default to JSON if it was heavily finetuned on that format. Prompt engineering (including examples) is the only way to increase the chance of pythonic output if you are not using a chat template."
816
+ ]
817
+ },
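Because the pythonic output is literally a Python list of calls, it can be parsed with the standard `ast` module. The sketch below is a simplified stand-in for SGLang's pythonic parser and only handles keyword arguments with literal values:

```python
import ast

# Parse a pythonic tool-call string such as
#   '[get_weather(location="Tokyo"), get_tourist_attractions(city="Tokyo")]'
# into (name, kwargs) pairs. Only keyword arguments with literal values
# are handled; anything else raises.
def parse_pythonic_calls(text: str):
    tree = ast.parse(text.strip(), mode="eval")
    if not isinstance(tree.body, ast.List):
        raise ValueError("expected a Python list of function calls")
    calls = []
    for node in tree.body.elts:
        if not isinstance(node, ast.Call):
            raise ValueError("expected a function call")
        kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
        calls.append((node.func.id, kwargs))
    return calls


calls = parse_pythonic_calls(
    '[get_weather(location="Tokyo"), get_tourist_attractions(city="Tokyo")]'
)
print(calls[0])  # ('get_weather', {'location': 'Tokyo'})
```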
818
+ {
819
+ "cell_type": "markdown",
820
+ "metadata": {},
821
+ "source": [
822
+ "## How to support a new model?\n",
823
+ "1. Update the TOOLS_TAG_LIST in sglang/srt/function_call_parser.py with the model’s tool tags. Currently supported tags include:\n",
824
+ "```\n",
825
+ "\tTOOLS_TAG_LIST = [\n",
826
+ "\t “<|plugin|>“,\n",
827
+ "\t “<function=“,\n",
828
+ "\t “<tool_call>“,\n",
829
+ "\t “<|python_tag|>“,\n",
830
+ "\t “[TOOL_CALLS]”\n",
831
+ "\t]\n",
832
+ "```\n",
833
+ "2. Create a new detector class in sglang/srt/function_call_parser.py that inherits from BaseFormatDetector. The detector should handle the model’s specific function call format. For example:\n",
834
+ "```\n",
835
+ " class NewModelDetector(BaseFormatDetector):\n",
836
+ "```\n",
837
+ "3. Add the new detector to the MultiFormatParser class that manages all the format detectors."
838
+ ]
839
+ }
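To make steps 2 and 3 concrete, here is a skeleton detector. The real `BaseFormatDetector` interface in `sglang/srt/function_call_parser.py` may differ, so the method names and the `<new_tool>` tag are illustrative assumptions; a stub base class keeps the example self-contained:

```python
import json


class BaseFormatDetector:  # stand-in for the SGLang base class
    def detect_and_parse(self, text, tools):
        raise NotImplementedError


class NewModelDetector(BaseFormatDetector):
    # Suppose the new model wraps each call as
    # <new_tool>{"name": ..., "arguments": {...}}</new_tool>
    bot_token = "<new_tool>"
    eot_token = "</new_tool>"

    def has_tool_call(self, text: str) -> bool:
        return self.bot_token in text

    def detect_and_parse(self, text: str, tools):
        calls = []
        start = 0
        # Scan for every <new_tool>...</new_tool> span and decode its JSON payload.
        while (begin := text.find(self.bot_token, start)) != -1:
            end = text.find(self.eot_token, begin)
            payload = text[begin + len(self.bot_token) : end]
            calls.append(json.loads(payload))
            start = end + len(self.eot_token)
        return calls


det = NewModelDetector()
out = det.detect_and_parse(
    'Reasoning first. <new_tool>{"name": "f", "arguments": {"x": 1}}</new_tool>', []
)
print(out[0]["name"])  # f
```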
840
+ ],
841
+ "metadata": {
842
+ "language_info": {
843
+ "codemirror_mode": {
844
+ "name": "ipython",
845
+ "version": 3
846
+ },
847
+ "file_extension": ".py",
848
+ "mimetype": "text/x-python",
849
+ "name": "python",
850
+ "nbconvert_exporter": "python",
851
+ "pygments_lexer": "ipython3"
852
+ }
853
+ },
854
+ "nbformat": 4,
855
+ "nbformat_minor": 4
856
+ }
sglang/docs/advanced_features/vlm_query.ipynb ADDED
@@ -0,0 +1,388 @@
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "markdown",
5
+ "id": "0",
6
+ "metadata": {},
7
+ "source": [
8
+ "# Query VLM with Offline Engine\n",
9
+ "\n",
10
+ "This tutorial demonstrates how to use SGLang's **offline Engine API** to query VLMs. We will demonstrate usage with Qwen2.5-VL and Llama 4. This section demonstrates three different calling approaches:\n",
11
+ "\n",
12
+ "1. **Basic Call**: Directly pass images and text.\n",
13
+ "2. **Processor Output**: Use HuggingFace processor for data preprocessing.\n",
14
+ "3. **Precomputed Embeddings**: Pre-calculate image features to improve inference efficiency."
15
+ ]
16
+ },
17
+ {
18
+ "cell_type": "markdown",
19
+ "id": "1",
20
+ "metadata": {},
21
+ "source": [
22
+ "## Understanding the Three Input Formats\n",
23
+ "\n",
24
+ "SGLang supports three ways to pass visual data, each optimized for different scenarios:\n",
25
+ "\n",
26
+ "### 1. **Raw Images** - Simplest approach\n",
27
+ "- Pass PIL Images, file paths, URLs, or base64 strings directly\n",
28
+ "- SGLang handles all preprocessing automatically\n",
29
+ "- Best for: Quick prototyping, simple applications\n",
30
+ "\n",
31
+ "### 2. **Processor Output** - For custom preprocessing\n",
32
+ "- Pre-process images with HuggingFace processor\n",
33
+ "- Pass the complete processor output dict with `format: \"processor_output\"`\n",
34
+ "- Best for: Custom image transformations, integration with existing pipelines\n",
35
+ "- Requirement: Must use `input_ids` instead of text prompt\n",
36
+ "\n",
37
+ "### 3. **Precomputed Embeddings** - For maximum performance\n",
38
+ "- Pre-calculate visual embeddings using the vision encoder\n",
39
+ "- Pass embeddings with `format: \"precomputed_embedding\"`\n",
40
+ "- Best for: Repeated queries on same images, caching, high-throughput serving\n",
41
+ "- Performance gain: Avoids redundant vision encoder computation (30-50% speedup)\n",
42
+ "\n",
43
+ "**Key Rule**: Within a single request, use only one format for all images. Don't mix formats.\n",
44
+ "\n",
45
+ "The examples below demonstrate all three approaches with both Qwen2.5-VL and Llama 4 models."
46
+ ]
47
+ },
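As a quick reference, the three `image_data` shapes look roughly like this. Only the `format` marker values come from this notebook; the other field names are placeholders that depend on the model's processor:

```python
# Illustrative shapes only; the "format" values are the markers used in this
# notebook, while the remaining keys sketch what each entry carries.
raw_image_data = ["https://example.com/cat.png"]  # or a PIL Image, path, or base64

processor_image_data = [
    # the full HuggingFace processor output dict, plus the format marker
    {"input_ids": [101, 102], "pixel_values": "<tensor>", "format": "processor_output"}
]

precomputed_image_data = [
    # "embedding" is a placeholder key for the precomputed visual features
    {"format": "precomputed_embedding", "embedding": "<tensor>"}
]

print(precomputed_image_data[0]["format"])  # precomputed_embedding
```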
48
+ {
49
+ "cell_type": "markdown",
50
+ "id": "2",
51
+ "metadata": {},
52
+ "source": [
53
+ "## Querying Qwen2.5-VL Model"
54
+ ]
55
+ },
56
+ {
57
+ "cell_type": "code",
58
+ "execution_count": null,
59
+ "id": "3",
60
+ "metadata": {},
61
+ "outputs": [],
62
+ "source": [
63
+ "import nest_asyncio\n",
64
+ "\n",
65
+ "nest_asyncio.apply()\n",
66
+ "\n",
67
+ "model_path = \"Qwen/Qwen2.5-VL-3B-Instruct\"\n",
68
+ "chat_template = \"qwen2-vl\""
69
+ ]
70
+ },
71
+ {
72
+ "cell_type": "code",
73
+ "execution_count": null,
74
+ "id": "4",
75
+ "metadata": {},
76
+ "outputs": [],
77
+ "source": [
78
+ "from io import BytesIO\n",
79
+ "import requests\n",
80
+ "from PIL import Image\n",
81
+ "\n",
82
+ "from sglang.srt.parser.conversation import chat_templates\n",
83
+ "\n",
84
+ "image = Image.open(\n",
85
+ " BytesIO(\n",
86
+ " requests.get(\n",
87
+ " \"https://github.com/sgl-project/sglang/blob/main/examples/assets/example_image.png?raw=true\"\n",
88
+ " ).content\n",
89
+ " )\n",
90
+ ")\n",
91
+ "\n",
92
+ "conv = chat_templates[chat_template].copy()\n",
93
+ "conv.append_message(conv.roles[0], f\"What's shown here: {conv.image_token}?\")\n",
94
+ "conv.append_message(conv.roles[1], \"\")\n",
95
+ "conv.image_data = [image]\n",
96
+ "\n",
97
+ "print(\"Generated prompt text:\")\n",
98
+ "print(conv.get_prompt())\n",
99
+ "print(f\"\\nImage size: {image.size}\")\n",
100
+ "image"
101
+ ]
102
+ },
103
+ {
104
+ "cell_type": "markdown",
105
+ "id": "5",
106
+ "metadata": {},
107
+ "source": [
108
+ "### Basic Offline Engine API Call"
109
+ ]
110
+ },
111
+ {
112
+ "cell_type": "code",
113
+ "execution_count": null,
114
+ "id": "6",
115
+ "metadata": {},
116
+ "outputs": [],
117
+ "source": [
118
+ "from sglang import Engine\n",
119
+ "\n",
120
+ "llm = Engine(model_path=model_path, chat_template=chat_template, log_level=\"warning\")"
121
+ ]
122
+ },
123
+ {
124
+ "cell_type": "code",
125
+ "execution_count": null,
126
+ "id": "7",
127
+ "metadata": {},
128
+ "outputs": [],
129
+ "source": [
130
+ "out = llm.generate(prompt=conv.get_prompt(), image_data=[image])\n",
131
+ "print(\"Model response:\")\n",
132
+ "print(out[\"text\"])"
133
+ ]
134
+ },
135
+ {
136
+ "cell_type": "markdown",
137
+ "id": "8",
138
+ "metadata": {},
139
+ "source": [
140
+ "### Call with Processor Output\n",
141
+ "\n",
142
+ "Use a HuggingFace processor to preprocess text and images, then pass the `processor_output` directly into `Engine.generate`."
143
+ ]
144
+ },
145
+ {
146
+ "cell_type": "code",
147
+ "execution_count": null,
148
+ "id": "9",
149
+ "metadata": {},
150
+ "outputs": [],
151
+ "source": [
152
+ "from transformers import AutoProcessor\n",
153
+ "\n",
154
+ "processor = AutoProcessor.from_pretrained(model_path, use_fast=True)\n",
155
+ "processor_output = processor(\n",
156
+ " images=[image], text=conv.get_prompt(), return_tensors=\"pt\"\n",
157
+ ")\n",
158
+ "\n",
159
+ "out = llm.generate(\n",
160
+ " input_ids=processor_output[\"input_ids\"][0].detach().cpu().tolist(),\n",
161
+ " image_data=[dict(processor_output, format=\"processor_output\")],\n",
162
+ ")\n",
163
+ "print(\"Response using processor output:\")\n",
164
+ "print(out[\"text\"])"
165
+ ]
166
+ },
167
+ {
168
+ "cell_type": "markdown",
169
+ "id": "10",
170
+ "metadata": {},
171
+ "source": [
172
+ "### Call with Precomputed Embeddings\n",
173
+ "\n",
174
+ "You can pre-calculate image features to avoid repeated visual encoding processes."
175
+ ]
176
+ },
177
+ {
178
+ "cell_type": "code",
179
+ "execution_count": null,
180
+ "id": "11",
181
+ "metadata": {},
182
+ "outputs": [],
183
+ "source": [
184
+ "from transformers import AutoProcessor\n",
185
+ "from transformers import Qwen2_5_VLForConditionalGeneration\n",
186
+ "\n",
187
+ "processor = AutoProcessor.from_pretrained(model_path, use_fast=True)\n",
188
+ "vision = (\n",
189
+ " Qwen2_5_VLForConditionalGeneration.from_pretrained(model_path).eval().visual.cuda()\n",
190
+ ")"
191
+ ]
192
+ },
193
+ {
194
+ "cell_type": "code",
195
+ "execution_count": null,
196
+ "id": "12",
197
+ "metadata": {},
198
+ "outputs": [],
199
+ "source": [
200
+ "processor_output = processor(\n",
201
+ " images=[image], text=conv.get_prompt(), return_tensors=\"pt\"\n",
202
+ ")\n",
203
+ "\n",
204
+ "input_ids = processor_output[\"input_ids\"][0].detach().cpu().tolist()\n",
205
+ "\n",
206
+ "precomputed_embeddings = vision(\n",
207
+ " processor_output[\"pixel_values\"].cuda(), processor_output[\"image_grid_thw\"].cuda()\n",
208
+ ")\n",
209
+ "\n",
210
+ "multi_modal_item = dict(\n",
211
+ " processor_output,\n",
212
+ " format=\"precomputed_embedding\",\n",
213
+ " feature=precomputed_embeddings,\n",
214
+ ")\n",
215
+ "\n",
216
+ "out = llm.generate(input_ids=input_ids, image_data=[multi_modal_item])\n",
217
+ "print(\"Response using precomputed embeddings:\")\n",
218
+ "print(out[\"text\"])\n",
219
+ "\n",
220
+ "llm.shutdown()"
221
+ ]
222
+ },
223
+ {
224
+ "cell_type": "markdown",
225
+ "id": "13",
226
+ "metadata": {},
227
+ "source": [
228
+ "## Querying Llama 4 Vision Model\n",
229
+ "\n",
230
+ "```python\n",
231
+ "model_path = \"meta-llama/Llama-4-Scout-17B-16E-Instruct\"\n",
232
+ "chat_template = \"llama-4\"\n",
233
+ "\n",
234
+ "from io import BytesIO\n",
235
+ "import requests\n",
236
+ "from PIL import Image\n",
237
+ "\n",
238
+ "from sglang.srt.parser.conversation import chat_templates\n",
239
+ "\n",
240
+ "# Download the same example image\n",
241
+ "image = Image.open(\n",
242
+ " BytesIO(\n",
243
+ " requests.get(\n",
244
+ " \"https://github.com/sgl-project/sglang/blob/main/examples/assets/example_image.png?raw=true\"\n",
245
+ " ).content\n",
246
+ " )\n",
247
+ ")\n",
248
+ "\n",
249
+ "conv = chat_templates[chat_template].copy()\n",
250
+ "conv.append_message(conv.roles[0], f\"What's shown here: {conv.image_token}?\")\n",
251
+ "conv.append_message(conv.roles[1], \"\")\n",
252
+ "conv.image_data = [image]\n",
253
+ "\n",
254
+ "print(\"Llama 4 generated prompt text:\")\n",
255
+ "print(conv.get_prompt())\n",
256
+ "print(f\"Image size: {image.size}\")\n",
257
+ "\n",
258
+ "image\n",
259
+ "```"
260
+ ]
261
+ },
262
+ {
263
+ "cell_type": "markdown",
264
+ "id": "14",
265
+ "metadata": {},
266
+ "source": [
267
+ "### Llama 4 Basic Call\n",
268
+ "\n",
269
+ "Llama 4 requires more computational resources, so it's configured with multi-GPU parallelism (tp_size=4) and a larger context length.\n",
270
+ "\n",
271
+ "```python\n",
272
+ "llm = Engine(\n",
273
+ " model_path=model_path,\n",
274
+ " enable_multimodal=True,\n",
275
+ " attention_backend=\"fa3\",\n",
276
+ " tp_size=4,\n",
277
+ " context_length=65536,\n",
278
+ ")\n",
279
+ "\n",
280
+ "out = llm.generate(prompt=conv.get_prompt(), image_data=[image])\n",
281
+ "print(\"Llama 4 response:\")\n",
282
+ "print(out[\"text\"])\n",
283
+ "```"
284
+ ]
285
+ },
286
+ {
287
+ "cell_type": "markdown",
288
+ "id": "15",
289
+ "metadata": {},
290
+ "source": [
291
+ "### Call with Processor Output\n",
292
+ "\n",
293
+ "Using a HuggingFace processor to preprocess data can reduce computational overhead during inference.\n",
294
+ "\n",
295
+ "```python\n",
296
+ "from transformers import AutoProcessor\n",
297
+ "\n",
298
+ "processor = AutoProcessor.from_pretrained(model_path, use_fast=True)\n",
299
+ "processor_output = processor(\n",
300
+ " images=[image], text=conv.get_prompt(), return_tensors=\"pt\"\n",
301
+ ")\n",
302
+ "\n",
303
+ "out = llm.generate(\n",
304
+ " input_ids=processor_output[\"input_ids\"][0].detach().cpu().tolist(),\n",
305
+ " image_data=[dict(processor_output, format=\"processor_output\")],\n",
306
+ ")\n",
307
+ "print(\"Response using processor output:\")\n",
308
+ "print(out)\n",
309
+ "```"
310
+ ]
311
+ },
312
+ {
313
+ "cell_type": "markdown",
314
+ "id": "16",
315
+ "metadata": {},
316
+ "source": [
317
+ "### Call with Precomputed Embeddings\n",
318
+ "\n",
319
+ "```python\n",
320
+ "from transformers import AutoProcessor\n",
321
+ "from transformers import Llama4ForConditionalGeneration\n",
322
+ "\n",
323
+ "processor = AutoProcessor.from_pretrained(model_path, use_fast=True)\n",
324
+ "model = Llama4ForConditionalGeneration.from_pretrained(\n",
325
+ " model_path, torch_dtype=\"auto\"\n",
326
+ ").eval()\n",
327
+ "\n",
328
+ "vision = model.vision_model.cuda()\n",
329
+ "multi_modal_projector = model.multi_modal_projector.cuda()\n",
330
+ "\n",
331
+ "print(f'Image pixel values shape: {processor_output[\"pixel_values\"].shape}')\n",
332
+ "input_ids = processor_output[\"input_ids\"][0].detach().cpu().tolist()\n",
333
+ "\n",
334
+ "# Process image through vision encoder\n",
335
+ "image_outputs = vision(\n",
336
+ " processor_output[\"pixel_values\"].to(\"cuda\"), \n",
337
+ " aspect_ratio_ids=processor_output[\"aspect_ratio_ids\"].to(\"cuda\"),\n",
338
+ " aspect_ratio_mask=processor_output[\"aspect_ratio_mask\"].to(\"cuda\"),\n",
339
+ " output_hidden_states=False\n",
340
+ ")\n",
341
+ "image_features = image_outputs.last_hidden_state\n",
342
+ "\n",
343
+ "# Flatten image features and pass through multimodal projector\n",
344
+ "vision_flat = image_features.view(-1, image_features.size(-1))\n",
345
+ "precomputed_embeddings = multi_modal_projector(vision_flat)\n",
346
+ "\n",
347
+ "# Build precomputed embedding data item\n",
348
+ "mm_item = dict(\n",
349
+ " processor_output, \n",
350
+ " format=\"precomputed_embedding\", \n",
351
+ " feature=precomputed_embeddings\n",
352
+ ")\n",
353
+ "\n",
354
+ "# Use precomputed embeddings for efficient inference\n",
355
+ "out = llm.generate(input_ids=input_ids, image_data=[mm_item])\n",
356
+ "print(\"Llama 4 precomputed embedding response:\")\n",
357
+ "print(out[\"text\"])\n",
358
+ "```"
359
+ ]
360
+ }
361
+ ],
362
+ "metadata": {
363
+ "jupytext": {
364
+ "cell_metadata_filter": "-all",
365
+ "custom_cell_magics": "kql",
366
+ "encoding": "# -*- coding: utf-8 -*-",
367
+ "text_representation": {
368
+ "extension": ".py",
369
+ "format_name": "light",
370
+ "format_version": "1.5",
371
+ "jupytext_version": "1.16.1"
372
+ }
373
+ },
374
+ "language_info": {
375
+ "codemirror_mode": {
376
+ "name": "ipython",
377
+ "version": 3
378
+ },
379
+ "file_extension": ".py",
380
+ "mimetype": "text/x-python",
381
+ "name": "python",
382
+ "nbconvert_exporter": "python",
383
+ "pygments_lexer": "ipython3"
384
+ }
385
+ },
386
+ "nbformat": 4,
387
+ "nbformat_minor": 5
388
+ }
sglang/docs/basic_usage/deepseek_ocr.md ADDED
@@ -0,0 +1,54 @@
1
+ # DeepSeek OCR (OCR-1 / OCR-2)
2
+
3
+ DeepSeek OCR models are multimodal (image + text) models for OCR and document understanding.
4
+
5
+ ## Launch server
6
+
7
+ ```shell
8
+ python -m sglang.launch_server \
9
+ --model-path deepseek-ai/DeepSeek-OCR-2 \
10
+ --trust-remote-code \
11
+ --host 0.0.0.0 \
12
+ --port 30000
13
+ ```
14
+
15
+ > You can replace `deepseek-ai/DeepSeek-OCR-2` with `deepseek-ai/DeepSeek-OCR`.
16
+
17
+ ## Prompt examples
18
+
19
+ Recommended prompts from the model card:
20
+
21
+ ```
22
+ <image>
23
+ <|grounding|>Convert the document to markdown.
24
+ ```
25
+
26
+ ```
27
+ <image>
28
+ Free OCR.
29
+ ```
30
+
31
+ ## OpenAI-compatible request example
32
+
33
+ ```python
34
+ import requests
35
+
36
+ url = "http://localhost:30000/v1/chat/completions"
37
+
38
+ data = {
39
+ "model": "deepseek-ai/DeepSeek-OCR-2",
40
+ "messages": [
41
+ {
42
+ "role": "user",
43
+ "content": [
44
+ {"type": "text", "text": "<image>\n<|grounding|>Convert the document to markdown."},
45
+ {"type": "image_url", "image_url": {"url": "https://example.com/your_image.jpg"}},
46
+ ],
47
+ }
48
+ ],
49
+ "max_tokens": 512,
50
+ }
51
+
52
+ response = requests.post(url, json=data)
53
+ print(response.text)
54
+ ```
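
The request above uses a remote image URL. The same endpoint also accepts images embedded as base64 data URLs, which is convenient for local files. The helper below is a sketch that only builds such a payload (the file path and prompt are placeholders); POST it to `/v1/chat/completions` as in the example above.

```python
import base64


def build_ocr_payload(image_path: str, prompt: str = "<image>\nFree OCR.") -> dict:
    """Build an OpenAI-compatible chat payload with a local image as a base64 data URL."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    return {
        "model": "deepseek-ai/DeepSeek-OCR-2",
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{b64}"},
                    },
                ],
            }
        ],
        "max_tokens": 512,
    }


# Send with, e.g.:
# requests.post("http://localhost:30000/v1/chat/completions", json=build_ocr_payload("page.jpg"))
```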
sglang/docs/basic_usage/deepseek_v32.md ADDED
@@ -0,0 +1,459 @@
1
+ # DeepSeek V3.2 Usage
2
+
3
+ The DeepSeek-V3.2 model family equips DeepSeek-V3.1-Terminus with DeepSeek Sparse Attention (DSA) through continued training. With DSA, a fine-grained sparse attention mechanism powered by a lightning indexer, DeepSeek-V3.2 achieves efficiency improvements in long-context scenarios.
4
+
5
+ For reporting issues or tracking upcoming features, please refer to this [Roadmap](https://github.com/sgl-project/sglang/issues/11060).
6
+
7
+ Note: This document is originally written for the usage of [DeepSeek-V3.2-Exp](https://huggingface.co/deepseek-ai/DeepSeek-V3.2-Exp) model. The usage of [DeepSeek-V3.2](https://huggingface.co/deepseek-ai/DeepSeek-V3.2) or [DeepSeek-V3.2-Speciale](https://huggingface.co/deepseek-ai/DeepSeek-V3.2-Speciale) is the same as DeepSeek-V3.2-Exp except for the tool call parser.
8
+
9
+
10
+ ## Installation
11
+
12
+ ### Docker
13
+
14
+ ```bash
15
+ # H200/B200
16
+ docker pull lmsysorg/sglang:latest
17
+
18
+ # MI350/MI355
19
+ docker pull lmsysorg/sglang:v0.5.8-rocm700-mi35x
20
+
21
+ # MI300
22
+ # v0.5.8-rocm700-mi30x does not include PR #17504. Prefer the newest MI30x ROCm
23
+ # image tag from Docker Hub when available, or build from source (below).
24
+ docker pull lmsysorg/sglang:v0.5.8-rocm700-mi30x
25
+
26
+
27
+ # NPUs
28
+ docker pull lmsysorg/sglang:dsv32-a2
29
+ docker pull lmsysorg/sglang:dsv32-a3
30
+ ```
31
+
32
+ ### Build From Source
33
+
34
+ ```bash
35
+ # Install SGLang
36
+ git clone https://github.com/sgl-project/sglang
37
+ cd sglang
38
+ pip3 install pip --upgrade
39
+ pip3 install -e "python"
40
+ ```
41
+ ## Launch DeepSeek V3.2 with SGLang
42
+
43
+ To serve [DeepSeek-V3.2-Exp](https://huggingface.co/deepseek-ai/DeepSeek-V3.2-Exp) on 8xH200/B200 GPUs:
44
+
45
+ ```bash
46
+ # Launch with TP + DP (Recommended)
47
+ python -m sglang.launch_server --model deepseek-ai/DeepSeek-V3.2-Exp --tp 8 --dp 8 --enable-dp-attention
48
+
49
+ # Launch with EP + DP
50
+ python -m sglang.launch_server --model deepseek-ai/DeepSeek-V3.2-Exp --tp 8 --ep 8 --dp 8 --enable-dp-attention
51
+
52
+ # Launch with Pure TP
53
+ python -m sglang.launch_server --model deepseek-ai/DeepSeek-V3.2-Exp --tp 8
54
+
55
+ # Launch with TP on MI30x/MI35x
56
+ python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-V3.2-Exp --tp 8 --nsa-prefill-backend tilelang --nsa-decode-backend tilelang
57
+ ```
58
+
59
+ ### Configuration Tips
60
+ - **DP Attention (Recommended)**: For the DeepSeek V3.2 model, the kernels are customized for the `dp_size=8` use case, so DP attention (`--dp 8 --enable-dp-attention`) is the recommended configuration for better stability and performance. All test cases use this configuration by default.
61
+ - **Pure TP Mode**: Launching with pure TP (without `--dp` and `--enable-dp-attention`) is also supported. Note that this mode has not been fully validated in PD disaggregation scenarios.
62
+ - **Short-sequence MHA prefill (adaptive)**: For short prefill sequences (default threshold: **2048 tokens**), the NSA backend uses standard MHA automatically (no extra flags). On H200 (SM90) this path uses the FlashAttention variable-length kernel; on B200 (SM100) it uses TRT-LLM ragged MHA. MHA uses `MHA_ONE_SHOT` for best performance. `MHA_ONE_SHOT` computes multi-head attention over all tokens (both cached prefix and newly extended tokens) in a single kernel invocation, avoiding the overhead of chunked KV cache processing. This achieves optimal throughput for short sequences where total sequence length fits within the chunk capacity limit.
63
+ - **Choices of Attention Kernels**: The attention backend is automatically set to `nsa` for the DeepSeek V3.2 model. This backend implements different kernels for sparse prefilling/decoding, which can be selected via the `--nsa-prefill-backend` and `--nsa-decode-backend` server arguments. The available NSA prefill/decode attention kernels are:
64
+ - `flashmla_sparse`: `flash_mla_sparse_fwd` kernel from `flash_mla` library. Can run on both Hopper and Blackwell GPUs. It requires bf16 q, kv inputs.
65
+ - `flashmla_kv`: `flash_mla_with_kvcache` kernel from `flash_mla` library. Can run on both Hopper and Blackwell GPUs. It requires bf16 q, fp8 k_cache inputs.
66
+ - `fa3`: `flash_attn_with_kvcache` kernel from `flash_attn` library. Can only run on Hopper GPUs. It requires bf16 q, kv inputs.
67
+ - `tilelang`: `tilelang` implementation that can run on GPU, HPU and NPU.
68
+ - `aiter`: Aiter kernel on AMD GPUs. Can only be used as a decode kernel.
69
+ - `trtllm`: `trtllm-mla` sparse kernel from the flashinfer library. Can only run on Blackwell GPUs. It requires QKV in bf16 or fp8.
70
+ - Based on performance benchmarks, the default configurations on H200 and B200 are set as follows:
71
+ - H200: `flashmla_sparse` prefill attention (short-seq prefill uses MHA via FlashAttention varlen), `fa3` decode attention, `bf16` kv cache dtype.
72
+ - B200: `flashmla_auto` prefill attention (short-seq prefill uses MHA via TRT-LLM ragged), `flashmla_kv` decode attention, `fp8_e4m3` kv cache dtype. `flashmla_auto` enables automatic selection of either `flashmla_sparse` or `flashmla_kv` kernel for prefill based on KV cache dtype, hardware, and heuristics. When FP8 KV cache is enabled and `total_kv_tokens < total_q_tokens * 512`, it uses the `flashmla_sparse` kernel; otherwise, it falls back to the `flashmla_kv` kernel. The heuristics may need to be tuned if the performance of either the `flashmla_sparse` or `flashmla_kv` kernel changes significantly.
73
+ - On the Blackwell platform, the following setting can boost performance by up to 3x-5x at the cost of a slight accuracy drop:
74
+ - B200: choosing `trtllm` for both `--nsa-prefill-backend` and `--nsa-decode-backend` makes prefill attention use MHA via TRT-LLM ragged for both short and long sequences (**accuracy impact**). Combining `trtllm` with an `fp8_e4m3` KV cache gives a KV cache dim of `576` (kv_lora_rank + qk_rope_head_dim) (**accuracy impact**), compared to `656` (kv_lora_rank + scale storage (kv_lora_rank // quant_block_size * 4 bytes) + rope dimension storage) for the combination of `flashmla_auto` and an `fp8_e4m3` KV cache.
75
+
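Putting the B200 `trtllm` bullet above into a concrete command, a launch that trades some accuracy for prefill/decode throughput might look like the sketch below (the `--kv-cache-dtype` value follows the FP8 KV cache discussion above; verify the flags against your SGLang version):

```shell
python -m sglang.launch_server \
  --model deepseek-ai/DeepSeek-V3.2-Exp \
  --tp 8 --dp 8 --enable-dp-attention \
  --nsa-prefill-backend trtllm \
  --nsa-decode-backend trtllm \
  --kv-cache-dtype fp8_e4m3
```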
76
+
77
+ ## Multi-token Prediction
78
+ SGLang implements Multi-Token Prediction (MTP) for DeepSeek V3.2 based on [EAGLE speculative decoding](https://docs.sglang.io/advanced_features/speculative_decoding.html#EAGLE-Decoding). With this optimization, the decoding speed can be improved significantly on small batch sizes. Please look at [this PR](https://github.com/sgl-project/sglang/pull/11652) for more information.
79
+
80
+ Example usage with DP Attention:
81
+ ```bash
82
+ python -m sglang.launch_server --model deepseek-ai/DeepSeek-V3.2-Exp --tp 8 --dp 8 --enable-dp-attention --speculative-algorithm EAGLE --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4
83
+ ```
84
+
85
+ Example usage with Pure TP:
86
+ ```bash
87
+ python -m sglang.launch_server --model deepseek-ai/DeepSeek-V3.2-Exp --tp 8 --speculative-algorithm EAGLE --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4
88
+ ```
89
+
90
+ - The best configuration for `--speculative-num-steps`, `--speculative-eagle-topk` and `--speculative-num-draft-tokens` can be searched with [bench_speculative.py](https://github.com/sgl-project/sglang/blob/main/scripts/playground/bench_speculative.py) script for given batch size. The minimum configuration is `--speculative-num-steps 1 --speculative-eagle-topk 1 --speculative-num-draft-tokens 2`, which can achieve speedup for larger batch sizes.
91
+ - The default value of `--max-running-requests` is set to `48` for MTP. For larger batch sizes, this value should be increased beyond the default value.
92
+
93
+ ```{tip}
94
+ To enable the experimental overlap scheduler for EAGLE speculative decoding, set the environment variable `SGLANG_ENABLE_SPEC_V2=1`. This can improve performance by enabling overlap scheduling between draft and verification stages.
95
+ ```
96
+
97
+
98
+ ## Function Calling and Reasoning Parser
99
+ The usage of function calling and reasoning parser is the same as DeepSeek V3.1. Please refer to [Reasoning Parser](https://docs.sglang.io/advanced_features/separate_reasoning.html) and [Tool Parser](https://docs.sglang.io/advanced_features/tool_parser.html) documents.
100
+
101
+ To launch `DeepSeek-V3.2-Exp` with function calling and reasoning parser:
102
+ > Note: It is recommended to specify the chat-template, ensuring that you are within the sglang's root directory.
103
+ ```bash
104
+ python3 -m sglang.launch_server \
105
+ --model-path deepseek-ai/DeepSeek-V3.2-Exp \
106
+ --trust-remote-code \
107
+ --tp-size 8 --dp-size 8 --enable-dp-attention \
108
+ --tool-call-parser deepseekv31 \
109
+ --reasoning-parser deepseek-v3 \
110
+ --chat-template ./examples/chat_template/tool_chat_template_deepseekv32.jinja
111
+ ```
112
+
113
+ To launch `DeepSeek-V3.2` with function calling and reasoning parser:
114
+ ```bash
115
+ python3 -m sglang.launch_server \
116
+ --model-path deepseek-ai/DeepSeek-V3.2 \
117
+ --trust-remote-code \
118
+ --tp-size 8 --dp-size 8 --enable-dp-attention \
119
+ --tool-call-parser deepseekv32 \
120
+ --reasoning-parser deepseek-v3
121
+ ```
122
+
123
+ `DeepSeek-V3.2-Speciale` doesn't support tool calling, so can only be launched with reasoning parser:
124
+ ```bash
125
+ python3 -m sglang.launch_server \
126
+ --model-path deepseek-ai/DeepSeek-V3.2-Speciale \
127
+ --trust-remote-code \
128
+ --tp-size 8 --dp-size 8 --enable-dp-attention \
129
+ --reasoning-parser deepseek-v3
130
+ ```
131
+
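With one of the servers above running, function-calling requests follow the standard OpenAI chat-completions schema. The sketch below only builds the request payload (the `get_weather` tool is a made-up example); POST it to `http://localhost:30000/v1/chat/completions` to exercise the configured tool call parser.

```python
def build_tool_call_request(model: str = "deepseek-ai/DeepSeek-V3.2") -> dict:
    """Build an OpenAI-compatible chat payload with a hypothetical example tool."""
    tools = [
        {
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Get the current weather for a city.",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }
    ]
    return {
        "model": model,
        "messages": [{"role": "user", "content": "What's the weather in Paris?"}],
        "tools": tools,
        "tool_choice": "auto",
    }


# Send with, e.g.:
# requests.post("http://localhost:30000/v1/chat/completions", json=build_tool_call_request())
# Tool calls appear under choices[0].message.tool_calls in the response.
```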
132
+ ## NVFP4 Checkpoint
133
+
134
+ To launch the DeepSeek V3.2 [NVFP4 checkpoint](https://huggingface.co/nvidia/DeepSeek-V3.2-NVFP4) on Blackwell devices, specify the quantization method as `modelopt_fp4` and the MoE runner backend as one of `flashinfer_trtllm` (recommended), `flashinfer_cutlass`, and `flashinfer_cutedsl`. All other usage (parallelism, reasoning parser, ...) is the same as for the FP8 checkpoint.
135
+
136
+ An example launching command can be:
137
+ ```bash
138
+ python -m sglang.launch_server --model nvidia/DeepSeek-V3.2-NVFP4 --tp 4 --quantization modelopt_fp4 --moe-runner-backend flashinfer_trtllm --tool-call-parser deepseekv32 --reasoning-parser deepseek-v3
139
+ ```
140
+
141
+ ## PD Disaggregation
142
+
143
+ Prefill Command:
144
+ ```bash
145
+ python -m sglang.launch_server \
146
+ --model-path deepseek-ai/DeepSeek-V3.2-Exp \
147
+ --disaggregation-mode prefill \
148
+ --host $LOCAL_IP \
149
+ --port $PORT \
150
+ --tp 8 \
151
+ --dp 8 \
152
+ --enable-dp-attention \
153
+ --dist-init-addr ${HOST}:${DIST_PORT} \
154
+ --trust-remote-code \
155
+ --disaggregation-bootstrap-port 8998 \
156
+ --mem-fraction-static 0.9
157
+ ```
158
+
159
+ Decode command:
160
+ ```bash
161
+ python -m sglang.launch_server \
162
+ --model-path deepseek-ai/DeepSeek-V3.2-Exp \
163
+ --disaggregation-mode decode \
164
+ --host $LOCAL_IP \
165
+ --port $PORT \
166
+ --tp 8 \
167
+ --dp 8 \
168
+ --enable-dp-attention \
169
+ --dist-init-addr ${HOST}:${DIST_PORT} \
170
+ --trust-remote-code \
171
+ --mem-fraction-static 0.9
172
+ ```
173
+
174
+ Router command:
175
+ ```bash
176
+ python -m sglang_router.launch_router --pd-disaggregation \
177
+ --prefill $PREFILL_ADDR 8998 \
178
+ --decode $DECODE_ADDR \
179
+ --host 127.0.0.1 \
180
+ --port 8000
181
+ ```
182
+
183
+ If you need more advanced deployment methods or production-ready deployment methods, such as RBG or LWS-based deployment, please refer to [references/multi_node_deployment/rbg_pd/deepseekv32_pd.md](../references/multi_node_deployment/rbg_pd/deepseekv32_pd.md). Additionally, you can also find startup commands for DeepEP-based EP parallelism in the aforementioned documentation.
184
+
185
+
186
+ ## Benchmarking Results
187
+
188
+ ### Accuracy Test with `gsm8k`
189
+ A simple accuracy benchmark can be tested with `gsm8k` dataset:
190
+ ```bash
191
+ python3 benchmark/gsm8k/bench_sglang.py --num-shots 8 --num-questions 1319 --parallel 1319
192
+ ```
193
+
194
+ The result is 0.956, which matches our expectation:
195
+ ```bash
196
+ Accuracy: 0.956
197
+ Invalid: 0.000
198
+ Latency: 25.109 s
199
+ Output throughput: 5226.235 token/s
200
+ ```
201
+
202
+ To test long-context accuracy, run gsm8k with `--num-shots 20`. The results are very close to the 8-shot results:
203
+ ```
204
+ Accuracy: 0.956
205
+ Invalid: 0.000
206
+ Latency: 29.545 s
207
+ Output throughput: 4418.617 token/s
208
+ ```
209
+
210
+
211
+ ### Accuracy Test with `gpqa-diamond`
212
+
213
+ A long-context accuracy benchmark can be run on the GPQA-Diamond dataset with long output tokens and thinking enabled:
214
+ ```bash
215
+ python3 -m sglang.test.run_eval --port 30000 --eval-name gpqa --num-examples 198 --max-tokens 128000 --repeat 8 --thinking-mode deepseek-v3
216
+ ```
217
+
218
+ The mean accuracy over 8 runs is 0.797, which matches the 0.799 reported in the official tech report.
219
+ ```bash
220
+ Repeat: 8, mean: 0.797
221
+ Scores: ['0.808', '0.798', '0.808', '0.798', '0.783', '0.788', '0.803', '0.793']
222
+ ```
223
+
224
+ For DeepSeek V3.2, DeepSeek recommends setting the sampling parameters to temperature = 1.0, top_p = 0.95:
225
+
226
+ ```bash
227
+ python3 -m sglang.test.run_eval --port 30000 --eval-name gpqa --num-examples 198 --max-tokens 128000 --repeat 8 --top-p 0.95 --temperature 1.0 --thinking-mode deepseek-v3
228
+
229
+ Repeat: 8, mean: 0.840
230
+ Scores: ['0.848', '0.808', '0.848', '0.838', '0.879', '0.813', '0.838', '0.848']
231
+ ```
232
+ which matches the official score, 0.824, as reported in the [Deepseek-V3.2 technical report](https://huggingface.co/deepseek-ai/DeepSeek-V3.2/blob/main/assets/paper.pdf).
233
+
234
+ ### Accuracy Test with `aime 2025`
235
+
236
+ Prepare the environment by installing NeMo-Skills in the docker or your own virtual environment:
237
+
238
+ ```
239
+ pip install git+https://github.com/NVIDIA/NeMo-Skills.git --ignore-installed blinker
240
+ ```
241
+
242
+ Then launch the SGLang server:
243
+ ```
244
+ python -m sglang.launch_server --model deepseek-ai/DeepSeek-V3.2-Exp --tp 8 --dp 8 --enable-dp-attention
245
+ ```
246
+
247
+ **For `DeepSeek-V3.2` and `DeepSeek-V3.2-Speciale`**:
248
+
249
+ ```
250
+ python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-V3.2 --trust-remote-code --tp-size 8 --dp-size 8 --enable-dp-attention --tool-call-parser deepseekv32 --reasoning-parser deepseek-v3
251
+ ```
252
+
253
+ Run the following script to evaluate AIME 2025:
254
+ ```
255
+ #! /bin/bash
256
+ export NEMO_SKILLS_DISABLE_UNCOMMITTED_CHANGES_CHECK=1
257
+
258
+ ns prepare_data aime25
259
+
260
+ PORT=30000
261
+ BACKEND=sglang
262
+ MODEL="deepseek-ai/DeepSeek-V3.2-Exp" # Should be changed to the model name
263
+ MODEL_NAME="dsv32-fp8"
264
+
265
+ echo "Starting AIME25 evaluation with model $MODEL on port $PORT using backend $BACKEND..."
266
+ ns eval \
267
+ --benchmarks=aime25:4 \
268
+ --server_type=$BACKEND \
269
+ --model=$MODEL \
270
+ --server_address=http://localhost:${PORT}/v1 \
271
+ --output_dir=nemo_skills_aime25_${MODEL_NAME}_output_${BACKEND}_$(date +%Y%m%d_%H%M%S) \
272
+ ++chat_template_kwargs.thinking=true \
273
+ ++inference.temperature=1.0 \
274
+ ++inference.top_p=0.95 \
275
+ ++inference.tokens_to_generate=64000
276
+ # ++inference.tokens_to_generate=120000 for Speciale model
277
+ ```
278
+
279
+ Test results (8*B200):
280
+
281
+ DeepSeek-V3.2-Exp:
282
+
283
+ | evaluation_mode | num_entries | avg_tokens | gen_seconds | symbolic_correct | no_answer |
284
+ |--------------------|-------------|------------|-------------|-----------------------|-----------|
285
+ | pass@1[avg-of-4] | 30 | 15040 | 1673 | 87.50% ± 1.67% | 0.00% |
286
+ | majority@4 | 30 | 15040 | 1673 | 90.00% | 0.00% |
287
+ | pass@4 | 30 | 15040 | 1673 | 90.00% | 0.00% |
288
+
289
+
290
+ DeepSeek-V3.2:
291
+ | evaluation_mode | num_entries | avg_tokens | gen_seconds | symbolic_correct | no_answer |
292
+ |--------------------|-------------|------------|-------------|-----------------------|-----------|
293
+ | pass@1[avg-of-4] | 30 | 13550 | 1632 | 92.50% ± 1.67% | 0.00% |
294
+ | majority@4 | 30 | 13550 | 1632 | 94.71% | 0.00% |
295
+ | pass@4 | 30 | 13550 | 1632 | 96.67% | 0.00% |
296
+
297
+
298
+ DeepSeek-V3.2-Speciale:
299
+ | evaluation_mode | num_entries | avg_tokens | gen_seconds | symbolic_correct | no_answer |
300
+ |--------------------|-------------|------------|-------------|-----------------------|-----------|
301
+ | pass@1[avg-of-4] | 30 | 24155 | 3583 | 95.00% ± 1.92% | 0.00% |
302
+ | majority@4 | 30 | 24155 | 3583 | 95.83% | 0.00% |
303
+ | pass@4 | 30 | 24155 | 3583 | 100.00% | 0.00% |
304
+
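For reference, the evaluation modes reported in these tables (pass@1 averaged over runs, majority@k, pass@k) can be sketched as follows. This is an illustrative reimplementation, not the NeMo-Skills code:

```python
from collections import Counter


def pass_at_1_avg(correct_matrix):
    """correct_matrix[i][j]: whether run j solved problem i.
    Per-run accuracy, averaged over runs."""
    runs = len(correct_matrix[0])
    per_run = [
        sum(row[j] for row in correct_matrix) / len(correct_matrix)
        for j in range(runs)
    ]
    return sum(per_run) / runs


def pass_at_k(correct_matrix):
    """Fraction of problems solved by at least one of the k runs."""
    return sum(any(row) for row in correct_matrix) / len(correct_matrix)


def majority_at_k(answer_matrix, reference):
    """Fraction of problems where the most common answer across runs is correct."""
    ok = 0
    for answers, ref in zip(answer_matrix, reference):
        if Counter(answers).most_common(1)[0][0] == ref:
            ok += 1
    return ok / len(reference)
```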
305
+
306
+
307
+ ## DSA long sequence context parallel optimization (experimental)
308
+
309
+ **Note: This feature is only verified on Hopper machines**
310
+
311
+ For context parallel in DeepSeek V3.2 model, we provide two different modes of splitting tokens, which can be controlled with argument `--nsa-prefill-cp-mode`.
312
+
313
+ ### In sequence splitting
314
+
315
+ The first mode can be enabled with `--nsa-prefill-cp-mode in-seq-split`. It implements context parallelism for DSA by splitting the sequence uniformly across context-parallel ranks. At the attention stage, each CP rank computes the indexer results for its sequence shard and collects the whole KV cache through an all-gather operator. Use `--attn-cp-size` to set the size of the context-parallel communication group.
316
+
317
+ Note that the in-sequence splitting mode has the following restrictions:
318
+ - The batch size is restricted to 1 for prefill batches
319
+ - `moe_dense_tp_size=1`, `moe_a2a_backend = "deepep"`
320
+ - To ensure `cp_size > 1`, the passed in `tp_size` must be larger than `dp_size`
321
+
322
+ For more details, please refer to PR https://github.com/sgl-project/sglang/pull/12065.
323
+
324
+ Example:
325
+ ```bash
326
+ # In-seq splitting mode launched with EP + DP
327
+ python -m sglang.launch_server --model deepseek-ai/DeepSeek-V3.2-Exp --tp 8 --ep 8 --dp 2 --enable-dp-attention --enable-nsa-prefill-context-parallel --attn-cp-size 4 --nsa-prefill-cp-mode in-seq-split --max-running-requests 32
328
+ ```
329
+
330
+ ### Round robin splitting (default setting)
331
+
332
+ This mode can be enabled by specifying the parameter `--nsa-prefill-cp-mode round-robin-split`, which distributes tokens across ranks based on `token_idx % cp_size`.
333
+
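The difference between the two token-splitting schemes can be illustrated with a small standalone sketch (illustrative only; the real sharding happens inside the NSA attention backend):

```python
def in_seq_split(num_tokens: int, cp_size: int) -> list[list[int]]:
    # in-seq-split: contiguous, uniform chunks, one per CP rank
    chunk = (num_tokens + cp_size - 1) // cp_size
    return [
        list(range(r * chunk, min((r + 1) * chunk, num_tokens)))
        for r in range(cp_size)
    ]


def round_robin_split(num_tokens: int, cp_size: int) -> list[list[int]]:
    # round-robin-split: token_idx % cp_size decides the owning rank
    return [[t for t in range(num_tokens) if t % cp_size == r] for r in range(cp_size)]


print(in_seq_split(8, 4))       # [[0, 1], [2, 3], [4, 5], [6, 7]]
print(round_robin_split(8, 4))  # [[0, 4], [1, 5], [2, 6], [3, 7]]
```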
334
+ Compared with the in-sequence splitting mode, this mode additionally supports the fused MoE backend (which may deliver better performance than DeepEP in single-machine scenarios), FP8 KV cache, and multi-batch prefill inference. However, it cannot be enabled together with DP attention.
335
+
336
+ For more details, please refer to PR https://github.com/sgl-project/sglang/pull/13959.
337
+
338
+ Example usage:
339
+ ```bash
340
+ # Launch with FusedMoe + CP8
341
+ python -m sglang.launch_server --model deepseek-ai/DeepSeek-V3.2-Exp --tp 8 --enable-nsa-prefill-context-parallel --attn-cp-size 8 --nsa-prefill-cp-mode round-robin-split --max-running-requests 32
342
+ ```
343
+ ### Pipeline Parallel + Context Parallel (PP + CP)
344
+
345
+ This mode combines Pipeline Parallelism (PP) and Context Parallelism (CP) to scale across multiple nodes, which can achieve better throughput and Time To First Token (TTFT). Note that this method has only been tested on H20 96G.
346
+
347
+ #### Standard Usage
348
+
349
+ To launch with PP=2 and CP (via `round-robin-split` mode) on 2 nodes, use the commands below. This configuration uses the fused MoE kernel by default, which generally provides better performance.
350
+
351
+ For related development details, please refer to:
352
+ - Fused MoE + CP support: [PR #13959](https://github.com/sgl-project/sglang/pull/13959)
353
+ - PP + CP support: [Issue #15358](https://github.com/sgl-project/sglang/issues/15358) and [PR #16380](https://github.com/sgl-project/sglang/pull/16380)
354
+
355
+ Node 0:
356
+ ```bash
357
+ export SGLANG_PP_LAYER_PARTITION=30,31
358
+ python3 -m sglang.launch_server \
359
+ --model-path deepseek-ai/DeepSeek-V3.2-Exp \
360
+ --nnodes 2 --node-rank 0 \
361
+ --dist-init-addr <HEAD_NODE_IP>:62001 \
362
+ --tp 8 --pp-size 2 \
363
+ --dp-size 1 --moe-dense-tp-size 1 \
364
+ --enable-nsa-prefill-context-parallel \
365
+ --attn-cp-size 8 \
366
+ --nsa-prefill-cp-mode round-robin-split \
367
+ --trust-remote-code \
368
+ --disable-radix-cache \
369
+ --mem-fraction-static 0.8 \
370
+ --max-running-requests 128 \
371
+ --chunked-prefill-size 16384 \
372
+ --cuda-graph-max-bs 8 \
373
+ --page-size 64 \
374
+ --watchdog-timeout 3600 \
375
+ --host 0.0.0.0 --port 8000 \
376
+ --tool-call-parser deepseekv32
377
+ ```
378
+
379
+ Node 1:
380
+ ```bash
381
+ export SGLANG_PP_LAYER_PARTITION=30,31
382
+ python3 -m sglang.launch_server \
383
+ --model-path deepseek-ai/DeepSeek-V3.2-Exp \
384
+ --nnodes 2 --node-rank 1 \
385
+ --dist-init-addr <HEAD_NODE_IP>:62001 \
386
+ --tp 8 --pp-size 2 \
387
+ --dp-size 1 --moe-dense-tp-size 1 \
388
+ --enable-nsa-prefill-context-parallel \
389
+ --attn-cp-size 8 \
390
+ --nsa-prefill-cp-mode round-robin-split \
391
+ --trust-remote-code \
392
+ --disable-radix-cache \
393
+ --mem-fraction-static 0.8 \
394
+ --max-running-requests 128 \
395
+ --chunked-prefill-size 16384 \
396
+ --cuda-graph-max-bs 8 \
397
+ --page-size 64 \
398
+ --watchdog-timeout 3600 \
399
+ --host 0.0.0.0 --port 8000 \
400
+ --tool-call-parser deepseekv32
401
+ ```
402
+
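The `SGLANG_PP_LAYER_PARTITION=30,31` variable above assigns the model's 61 decoder layers unevenly to the two pipeline stages (30 on stage 0, 31 on stage 1). A minimal sketch of how such a partition string maps to per-stage layer ranges (the helper below is illustrative, not SGLang's internal code):

```python
def partition_to_ranges(partition: str):
    """Map a comma-separated layer partition, e.g. "30,31",
    to half-open (start, end) layer-index ranges per pipeline stage."""
    sizes = [int(s) for s in partition.split(",")]
    ranges, start = [], 0
    for size in sizes:
        ranges.append((start, start + size))
        start += size
    return ranges

# "30,31" puts layers 0-29 on stage 0 and layers 30-60 on stage 1.
print(partition_to_ranges("30,31"))  # [(0, 30), (30, 61)]
```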
403
+ #### PD Disaggregation with PP + CP
404
+
405
+ If using PD (Prefill-Decode) Disaggregation, the Prefill nodes can be configured with PP + CP as follows.
406
+
407
+ Prefill Node 0:
408
+ ```bash
409
+ python -m sglang.launch_server \
410
+ --model-path deepseek-ai/DeepSeek-V3.2-Exp \
411
+ --served-model-name deepseek-v32 \
412
+ --nnodes 2 --node-rank 0 \
413
+ --dist-init-addr <PREFILL_HEAD_IP>:20102 \
414
+ --tp 8 --pp-size 2 \
415
+ --dp-size 1 --moe-dense-tp-size 1 \
416
+ --enable-nsa-prefill-context-parallel \
417
+ --attn-cp-size 8 \
418
+ --nsa-prefill-cp-mode round-robin-split \
419
+ --disaggregation-ib-device mlx5_bond_0,mlx5_bond_1,mlx5_bond_2,mlx5_bond_3 \
420
+ --trust-remote-code \
421
+ --disable-radix-cache \
422
+ --max-running-requests 512 \
423
+ --chunked-prefill-size 4096 \
424
+ --context-length 131072 \
425
+ --mem-fraction-static 0.9 \
426
+ --page-size 64 \
427
+ --enable-metrics \
428
+ --collect-tokens-histogram \
429
+ --tokenizer-worker-num 8 \
430
+ --host 0.0.0.0 --port 30000
431
+ ```
432
+
433
+ Prefill Node 1:
434
+ ```bash
435
+ python -m sglang.launch_server \
436
+ --model-path deepseek-ai/DeepSeek-V3.2-Exp \
437
+ --served-model-name deepseek-v32-prefill \
438
+ --nnodes 2 --node-rank 1 \
439
+ --dist-init-addr <PREFILL_HEAD_IP>:20102 \
440
+ --tp 8 --pp-size 2 \
441
+ --dp-size 1 --moe-dense-tp-size 1 \
442
+ --enable-nsa-prefill-context-parallel \
443
+ --attn-cp-size 8 \
444
+ --nsa-prefill-cp-mode round-robin-split \
445
+ --disaggregation-ib-device mlx5_bond_0,mlx5_bond_1,mlx5_bond_2,mlx5_bond_3 \
446
+ --trust-remote-code \
447
+ --disable-radix-cache \
448
+ --max-running-requests 512 \
449
+ --chunked-prefill-size 4096 \
450
+ --context-length 131072 \
451
+ --mem-fraction-static 0.9 \
452
+ --page-size 64 \
453
+ --enable-metrics \
454
+ --collect-tokens-histogram \
455
+ --tokenizer-worker-num 8 \
456
+ --host 0.0.0.0 --port 30000
457
+ ```
458
+
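As a toy illustration of the `round-robin-split` idea used above, a long prefill sequence can be dealt out to the `--attn-cp-size` context-parallel ranks in round-robin order so each rank holds a near-even share (sketch only; SGLang's actual splitting granularity and load balancing may differ):

```python
def round_robin_split(tokens, cp_size):
    """Assign token i to rank i % cp_size, so each rank gets a strided shard."""
    return [tokens[rank::cp_size] for rank in range(cp_size)]

shards = round_robin_split(list(range(10)), cp_size=4)
print(shards)  # [[0, 4, 8], [1, 5, 9], [2, 6], [3, 7]]
```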
459
+ For the Decode nodes, it is recommended to use the **EP mode**.
sglang/docs/basic_usage/glm45.md ADDED
@@ -0,0 +1,70 @@
1
+ ## Launch GLM-4.5 / GLM-4.6 / GLM-4.7 with SGLang
2
+
3
+ To serve GLM-4.5 / GLM-4.6 FP8 models on 8xH100/H200 GPUs:
4
+
5
+ ```bash
6
+ python3 -m sglang.launch_server --model zai-org/GLM-4.6-FP8 --tp 8
7
+ ```
8
+
9
+ ### EAGLE Speculative Decoding
10
+
11
+ **Description**: SGLang supports GLM-4.5 / GLM-4.6 models
12
+ with [EAGLE speculative decoding](https://docs.sglang.io/advanced_features/speculative_decoding.html#EAGLE-Decoding).
13
+
14
+ **Usage**:
15
+ Add arguments `--speculative-algorithm`, `--speculative-num-steps`, `--speculative-eagle-topk` and
16
+ `--speculative-num-draft-tokens` to enable this feature. For example:
17
+
18
+ ``` bash
19
+ python3 -m sglang.launch_server \
20
+ --model-path zai-org/GLM-4.6-FP8 \
21
+ --tp-size 8 \
22
+ --tool-call-parser glm45 \
23
+ --reasoning-parser glm45 \
24
+ --speculative-algorithm EAGLE \
25
+ --speculative-num-steps 3 \
26
+ --speculative-eagle-topk 1 \
27
+ --speculative-num-draft-tokens 4 \
28
+ --mem-fraction-static 0.9 \
29
+ --served-model-name glm-4.6-fp8 \
30
+ --enable-custom-logit-processor
31
+ ```
32
+
33
+ ```{tip}
34
+ To enable the experimental overlap scheduler for EAGLE speculative decoding, set the environment variable `SGLANG_ENABLE_SPEC_V2=1`. This can improve performance by enabling overlap scheduling between draft and verification stages.
35
+ ```
36
+
37
+ ### Thinking Budget for GLM-4.5 / GLM-4.6
38
+ **Note**: For GLM-4.7, `--tool-call-parser` should be set to `glm47`; for GLM-4.5 and GLM-4.6, it should be set to `glm45`.
39
+
40
+ In SGLang, we can implement a thinking budget with a `CustomLogitProcessor`.
41
+
42
+ Launch a server with the `--enable-custom-logit-processor` flag enabled.
43
+
44
+ Sample Request:
45
+
46
+ ```python
47
+ import openai
48
+ from rich.pretty import pprint
49
+ from sglang.srt.sampling.custom_logit_processor import Glm4MoeThinkingBudgetLogitProcessor
50
+
51
+
52
+ client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="*")
53
+ response = client.chat.completions.create(
54
+ model="zai-org/GLM-4.6",
55
+ messages=[
56
+ {
57
+ "role": "user",
58
+ "content": "Question: Is Paris the Capital of France?",
59
+ }
60
+ ],
61
+ max_tokens=1024,
62
+ extra_body={
63
+ "custom_logit_processor": Glm4MoeThinkingBudgetLogitProcessor().to_str(),
64
+ "custom_params": {
65
+ "thinking_budget": 512,
66
+ },
67
+ },
68
+ )
69
+ pprint(response)
70
+ ```
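Conceptually, a thinking-budget logit processor counts the tokens generated in the thinking phase and, once the budget is spent, masks every logit except the end-of-thinking token so the model is forced to close its reasoning block. A framework-free toy sketch of that mechanism (the token id and helper below are illustrative; the real `Glm4MoeThinkingBudgetLogitProcessor` in SGLang differs in detail):

```python
END_THINK = 3  # hypothetical token id for the end-of-thinking marker

def apply_thinking_budget(logits, tokens_in_thinking, budget):
    """Leave logits untouched while under budget; afterwards force END_THINK
    by masking every other logit to -inf."""
    if tokens_in_thinking < budget:
        return logits
    return [0.0 if i == END_THINK else float("-inf") for i in range(len(logits))]

logits = [1.0, 2.0, 0.5, 0.1]
print(apply_thinking_budget(logits, tokens_in_thinking=5, budget=512))
# [1.0, 2.0, 0.5, 0.1]  (budget not yet reached)
print(apply_thinking_budget(logits, tokens_in_thinking=512, budget=512))
# [-inf, -inf, -inf, 0.0]  (only END_THINK can be sampled)
```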
sglang/docs/basic_usage/glmv.md ADDED
@@ -0,0 +1,136 @@
1
+ # GLM-4.6V / GLM-4.5V Usage
2
+
3
+ ## Launch commands for SGLang
4
+
5
+ Below are suggested launch commands tailored for different hardware and precision modes:
6
+
7
+ ### FP8 (quantized) mode
8
+
9
+ For memory-efficient, latency-optimized deployments (e.g., on H100 or H200) where the FP8 checkpoint is supported:
10
+
11
+ ```bash
12
+ python3 -m sglang.launch_server \
13
+ --model-path zai-org/GLM-4.6V-FP8 \
14
+ --tp 2 \
15
+ --ep 2 \
16
+ --host 0.0.0.0 \
17
+ --port 30000 \
18
+ --keep-mm-feature-on-device
19
+ ```
20
+
21
+ ### Non-FP8 (BF16 / full precision) mode
22
+ For deployments on A100/H100 where BF16 is used (or the FP8 checkpoint is not used):
23
+ ```bash
24
+ python3 -m sglang.launch_server \
25
+ --model-path zai-org/GLM-4.6V \
26
+ --tp 4 \
27
+ --ep 4 \
28
+ --host 0.0.0.0 \
29
+ --port 30000
30
+ ```
31
+
32
+ ## Hardware-specific notes / recommendations
33
+
34
+ - On H100 with FP8: Use the FP8 checkpoint for best memory efficiency.
35
+ - On A100 / H100 with BF16 (non-FP8): It’s recommended to use `--mm-max-concurrent-calls` to control parallel throughput and GPU memory usage during image/video inference.
36
+ - On H200 & B200: The model can be run “out of the box”, supporting full context length plus concurrent image + video processing.
37
+
38
+ ## Sending Image/Video Requests
39
+
40
+ ### Image input:
41
+
42
+ ```python
43
+ import requests
44
+
45
+ url = f"http://localhost:30000/v1/chat/completions"
46
+
47
+ data = {
48
+ "model": "zai-org/GLM-4.6V",
49
+ "messages": [
50
+ {
51
+ "role": "user",
52
+ "content": [
53
+ {"type": "text", "text": "What’s in this image?"},
54
+ {
55
+ "type": "image_url",
56
+ "image_url": {
57
+ "url": "https://github.com/sgl-project/sglang/blob/main/examples/assets/example_image.png?raw=true"
58
+ },
59
+ },
60
+ ],
61
+ }
62
+ ],
63
+ "max_tokens": 300,
64
+ }
65
+
66
+ response = requests.post(url, json=data)
67
+ print(response.text)
68
+ ```
69
+
70
+ ### Video Input:
71
+
72
+ ```python
73
+ import requests
74
+
75
+ url = f"http://localhost:30000/v1/chat/completions"
76
+
77
+ data = {
78
+ "model": "zai-org/GLM-4.6V",
79
+ "messages": [
80
+ {
81
+ "role": "user",
82
+ "content": [
83
+ {"type": "text", "text": "What’s happening in this video?"},
84
+ {
85
+ "type": "video_url",
86
+ "video_url": {
87
+ "url": "https://github.com/sgl-project/sgl-test-files/raw/refs/heads/main/videos/jobs_presenting_ipod.mp4"
88
+ },
89
+ },
90
+ ],
91
+ }
92
+ ],
93
+ "max_tokens": 300,
94
+ }
95
+
96
+ response = requests.post(url, json=data)
97
+ print(response.text)
98
+ ```
99
+
100
+ ## Important Server Parameters and Flags
101
+
102
+ When launching the model server for **multimodal support**, you can use the following command-line arguments to fine-tune performance and behavior:
103
+
104
+ - `--mm-attention-backend`: Specifies the multimodal attention backend, e.g. `fa3` (FlashAttention 3).
105
+ - `--mm-max-concurrent-calls <value>`: Specifies the **maximum number of concurrent asynchronous multimodal data processing calls** allowed on the server. Use this to control parallel throughput and GPU memory usage during image/video inference.
106
+ - `--mm-per-request-timeout <seconds>`: Defines the **timeout duration (in seconds)** for each multimodal request. If a request exceeds this time limit (e.g., for very large video inputs), it will be automatically terminated.
107
+ - `--keep-mm-feature-on-device`: Instructs the server to **retain multimodal feature tensors on the GPU** after processing. This avoids device-to-host (D2H) memory copies and improves performance for repeated or high-frequency inference workloads.
108
+ - `--mm-enable-dp-encoder`: Places the ViT encoder in data parallel while keeping the LLM in tensor parallel, which consistently lowers TTFT and boosts end-to-end throughput.
109
+ - `SGLANG_USE_CUDA_IPC_TRANSPORT=1`: Enables a shared-memory-pool-based CUDA IPC transport for multimodal data, which can significantly improve end-to-end latency.
110
+
111
+ ### Example usage with the above optimizations:
112
+ ```bash
113
+ SGLANG_USE_CUDA_IPC_TRANSPORT=1 \
114
+ SGLANG_VLM_CACHE_SIZE_MB=0 \
115
+ python -m sglang.launch_server \
116
+ --model-path zai-org/GLM-4.6V \
117
+ --host 0.0.0.0 \
118
+ --port 30000 \
119
+ --trust-remote-code \
120
+ --tp-size 8 \
121
+ --enable-cache-report \
122
+ --log-level info \
123
+ --max-running-requests 64 \
124
+ --mem-fraction-static 0.65 \
125
+ --chunked-prefill-size 8192 \
126
+ --attention-backend fa3 \
127
+ --mm-attention-backend fa3 \
128
+ --mm-enable-dp-encoder \
129
+ --enable-metrics
130
+ ```
131
+
132
+ ### Thinking Budget for GLM-4.5V / GLM-4.6V
133
+
134
+ In SGLang, we can implement a thinking budget with a `CustomLogitProcessor`.
135
+
136
+ Launch a server with the `--enable-custom-logit-processor` flag. Then, use `Glm4MoeThinkingBudgetLogitProcessor` in the request, similar to the `GLM-4.6` example in [glm45.md](./glm45.md).
sglang/docs/basic_usage/gpt_oss.md ADDED
@@ -0,0 +1,147 @@
1
+ # GPT OSS Usage
2
+
3
+ Please refer to [https://github.com/sgl-project/sglang/issues/8833](https://github.com/sgl-project/sglang/issues/8833).
4
+
5
+ ## Responses API & Built-in Tools
6
+
7
+ ### Responses API
8
+
9
+ GPT‑OSS is compatible with the OpenAI Responses API. Use `client.responses.create(...)` with `model`, `instructions`, `input`, and optional `tools` to enable built‑in tool use. You can set the reasoning level via `instructions`, e.g., "Reasoning: high"; the supported levels are low (fast), medium (balanced), and high (deep).
10
+
11
+ ### Built-in Tools
12
+
13
+ GPT‑OSS can call built‑in tools for web search and Python execution. You can use the demo tool server or connect to external MCP tool servers.
14
+
15
+ #### Python Tool
16
+
17
+ - Executes short Python snippets for calculations, parsing, and quick scripts.
18
+ - By default runs in a Docker-based sandbox. To run on the host, set `PYTHON_EXECUTION_BACKEND=UV` (this executes model-generated code locally; use with care).
19
+ - Ensure Docker is available if you are not using the UV backend. It is recommended to run `docker pull python:3.11` in advance.
20
+
21
+ #### Web Search Tool
22
+
23
+ - Uses the Exa backend for web search.
24
+ - Requires an Exa API key; set `EXA_API_KEY` in your environment. Create a key at `https://exa.ai`.
25
+
26
+ ### Tool & Reasoning Parser
27
+
28
+ - We support the OpenAI reasoning and tool-call parsers, as well as the SGLang native API for tool calls and reasoning. Refer to [reasoning parser](../advanced_features/separate_reasoning.ipynb) and [tool call parser](../advanced_features/function_calling.ipynb) for more details.
29
+
30
+
31
+ ## Notes
32
+
33
+ - Use **Python 3.12** for the demo tools, and install the required `gpt-oss` packages.
34
+ - The default demo integrates the web search tool (Exa backend) and a demo Python interpreter via Docker.
35
+ - For search, set `EXA_API_KEY`. For Python execution, either have Docker available or set `PYTHON_EXECUTION_BACKEND=UV`.
36
+
37
+ Examples:
38
+ ```bash
39
+ export EXA_API_KEY=YOUR_EXA_KEY
40
+ # Optional: run Python tool locally instead of Docker (use with care)
41
+ export PYTHON_EXECUTION_BACKEND=UV
42
+ ```
43
+
44
+ Launch the server with the demo tool server:
45
+
46
+ ```bash
47
+ python3 -m sglang.launch_server \
48
+ --model-path openai/gpt-oss-120b \
49
+ --tool-server demo \
50
+ --tp 2
51
+ ```
52
+
53
+ For production usage, SGLang can act as an MCP client for multiple services. An [example tool server](https://github.com/openai/gpt-oss/tree/main/gpt-oss-mcp-server) is provided. Start the servers and point SGLang to them:
54
+ ```bash
55
+ mcp run -t sse browser_server.py:mcp
56
+ mcp run -t sse python_server.py:mcp
57
+
58
+ python -m sglang.launch_server ... --tool-server ip-1:port-1,ip-2:port-2
59
+ ```
60
+ The URLs should be MCP SSE servers that expose server information and well-documented tools. These tools are added to the system prompt so the model can use them.
61
+
62
+ ## Speculative Decoding
63
+
64
+ SGLang supports speculative decoding for GPT-OSS models using the EAGLE3 algorithm. This can significantly improve decoding speed, especially for small batch sizes.
65
+
66
+ **Usage**:
67
+ Add `--speculative-algorithm EAGLE3` along with the draft model path.
68
+ ```bash
69
+ python3 -m sglang.launch_server \
70
+ --model-path openai/gpt-oss-120b \
71
+ --speculative-algorithm EAGLE3 \
72
+ --speculative-draft-model-path lmsys/EAGLE3-gpt-oss-120b-bf16 \
73
+ --tp 2
74
+ ```
75
+
76
+ ```{tip}
77
+ To enable the experimental overlap scheduler for EAGLE3 speculative decoding, set the environment variable `SGLANG_ENABLE_SPEC_V2=1`. This can improve performance by enabling overlap scheduling between draft and verification stages.
78
+ ```
79
+
80
+ ### Quick Demo
81
+
82
+ ```python
83
+ from openai import OpenAI
84
+
85
+ client = OpenAI(
86
+ base_url="http://localhost:30000/v1",
87
+ api_key="sk-123456"
88
+ )
89
+
90
+ tools = [
91
+ {"type": "code_interpreter"},
92
+ {"type": "web_search_preview"},
93
+ ]
94
+
95
+ # Reasoning level example
96
+ response = client.responses.create(
97
+ model="openai/gpt-oss-120b",
98
+ instructions="You are a helpful assistant."
99
+ reasoning_effort="high" # Supports high, medium, or low
100
+ input="In one sentence, explain the transformer architecture.",
101
+ )
102
+ print("====== reasoning: high ======")
103
+ print(response.output_text)
104
+
105
+ # Test python tool
106
+ response = client.responses.create(
107
+ model="openai/gpt-oss-120b",
108
+ instructions="You are a helpful assistant, you could use python tool to execute code.",
109
+ input="Use python tool to calculate the sum of 29138749187 and 29138749187", # 58,277,498,374
110
+ tools=tools
111
+ )
112
+ print("====== test python tool ======")
113
+ print(response.output_text)
114
+
115
+ # Test browser tool
116
+ response = client.responses.create(
117
+ model="openai/gpt-oss-120b",
118
+ instructions="You are a helpful assistant, you could use browser to search the web",
119
+ input="Search the web for the latest news about Nvidia stock price",
120
+ tools=tools
121
+ )
122
+ print("====== test browser tool ======")
123
+ print(response.output_text)
124
+ ```
125
+
126
+ Example output:
127
+ ```
128
+ ====== test python tool ======
129
+ The sum of 29,138,749,187 and 29,138,749,187 is **58,277,498,374**.
130
+ ====== test browser tool ======
131
+ **Recent headlines on Nvidia (NVDA) stock**
132
+
133
+ | Date (2025) | Source | Key news points | Stock‑price detail |
134
+ |-------------|--------|----------------|--------------------|
135
+ | **May 13** | Reuters | The market data page shows Nvidia trading “higher” at **$116.61** with no change from the previous close. | **$116.61** – latest trade (delayed ≈ 15 min)【14†L34-L38】 |
136
+ | **Aug 18** | CNBC | Morgan Stanley kept an **overweight** rating and lifted its price target to **$206** (up from $200), implying a 14 % upside from the Friday close. The firm notes Nvidia shares have already **jumped 34 % this year**. | No exact price quoted, but the article signals strong upside expectations【9†L27-L31】 |
137
+ | **Aug 20** | The Motley Fool | Nvidia is set to release its Q2 earnings on Aug 27. The article lists the **current price of $175.36**, down 0.16 % on the day (as of 3:58 p.m. ET). | **$175.36** – current price on Aug 20【10†L12-L15】【10†L53-L57】 |
138
+
139
+ **What the news tells us**
140
+
141
+ * Nvidia’s share price has risen sharply this year – up roughly a third according to Morgan Stanley – and analysts are still raising targets (now $206).
142
+ * The most recent market quote (Reuters, May 13) was **$116.61**, but the stock has surged since then, reaching **$175.36** by mid‑August.
143
+ * Upcoming earnings on **Aug 27** are a focal point; both the Motley Fool and Morgan Stanley expect the results could keep the rally going.
144
+
145
+ **Bottom line:** Nvidia’s stock is on a strong upward trajectory in 2025, with price targets climbing toward $200‑$210 and the market price already near $175 as of late August.
146
+
147
+ ```
sglang/docs/basic_usage/llama4.md ADDED
@@ -0,0 +1,92 @@
1
+ # Llama4 Usage
2
+
3
+ [Llama 4](https://github.com/meta-llama/llama-models/blob/main/models/llama4/MODEL_CARD.md) is Meta's latest generation of open-source LLMs with industry-leading performance.
4
+
5
+ SGLang has supported Llama 4 Scout (109B) and Llama 4 Maverick (400B) since [v0.4.5](https://github.com/sgl-project/sglang/releases/tag/v0.4.5).
6
+
7
+ Ongoing optimizations are tracked in the [Roadmap](https://github.com/sgl-project/sglang/issues/5118).
8
+
9
+ ## Launch Llama 4 with SGLang
10
+
11
+ To serve Llama 4 models on 8xH100/H200 GPUs:
12
+
13
+ ```bash
14
+ python3 -m sglang.launch_server \
15
+ --model-path meta-llama/Llama-4-Scout-17B-16E-Instruct \
16
+ --tp 8 \
17
+ --context-length 1000000
18
+ ```
19
+
20
+ ### Configuration Tips
21
+
22
+ - **OOM Mitigation**: Adjust `--context-length` to avoid GPU out-of-memory errors. For the Scout model, we recommend setting this value up to 1M on 8\*H100 and up to 2.5M on 8\*H200. For the Maverick model, no explicit context length is needed on 8\*H200. When the hybrid KV cache is enabled, `--context-length` can be set up to 5M on 8\*H100 and up to 10M on 8\*H200 for the Scout model.
23
+
24
+ - **Attention Backend Auto-Selection**: SGLang automatically selects the optimal attention backend for Llama 4 based on your hardware. You typically don't need to specify `--attention-backend` manually:
25
+ - **Blackwell GPUs (B200/GB200)**: `trtllm_mha`
26
+ - **Hopper GPUs (H100/H200)**: `fa3`
27
+ - **AMD GPUs**: `aiter`
28
+ - **Intel XPU**: `intel_xpu`
29
+ - **Other platforms**: `triton` (fallback)
30
+
31
+ To override the auto-selection, explicitly specify `--attention-backend` with one of the supported backends: `fa3`, `aiter`, `triton`, `trtllm_mha`, or `intel_xpu`.
32
+
33
+ - **Chat Template**: Add `--chat-template llama-4` for chat completion tasks.
34
+ - **Enable Multi-Modal**: Add `--enable-multimodal` for multi-modal capabilities.
35
+ - **Enable Hybrid KV Cache**: Set `--swa-full-tokens-ratio` to adjust the ratio of SWA-layer KV tokens (for Llama 4, the local attention layers) to full-layer KV tokens (default: 0.8, range: 0-1).
36
+
37
+
38
+ ### EAGLE Speculative Decoding
39
+ **Description**: SGLang has supported Llama 4 Maverick (400B) with [EAGLE speculative decoding](https://docs.sglang.io/advanced_features/speculative_decoding.html#EAGLE-Decoding).
40
+
41
+ **Usage**:
42
+ Add arguments `--speculative-draft-model-path`, `--speculative-algorithm`, `--speculative-num-steps`, `--speculative-eagle-topk` and `--speculative-num-draft-tokens` to enable this feature. For example:
43
+ ```
44
+ python3 -m sglang.launch_server \
45
+ --model-path meta-llama/Llama-4-Maverick-17B-128E-Instruct \
46
+ --speculative-algorithm EAGLE3 \
47
+ --speculative-draft-model-path nvidia/Llama-4-Maverick-17B-128E-Eagle3 \
48
+ --speculative-num-steps 3 \
49
+ --speculative-eagle-topk 1 \
50
+ --speculative-num-draft-tokens 4 \
51
+ --trust-remote-code \
52
+ --tp 8 \
53
+ --context-length 1000000
54
+ ```
55
+
56
+ - **Note**: The Llama 4 draft model *nvidia/Llama-4-Maverick-17B-128E-Eagle3* only recognizes conversations in chat mode.
57
+
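With `--speculative-eagle-topk 1`, the draft model proposes a chain of `--speculative-num-steps` tokens, and together with the bonus token this yields `--speculative-num-draft-tokens` (3 + 1 = 4 above). A toy sketch of the verify-and-accept step (illustrative only, not SGLang's implementation):

```python
def verify_draft(draft_tokens, target_tokens):
    """Accept the longest draft prefix the target model agrees with,
    then append the target's token at the first mismatch (or the
    bonus token after a full match)."""
    accepted = []
    for i, d in enumerate(draft_tokens):
        if d != target_tokens[i]:
            accepted.append(target_tokens[i])  # target's correction
            return accepted
        accepted.append(d)
    accepted.append(target_tokens[len(draft_tokens)])  # bonus token
    return accepted

# Mismatch at position 2: accept [5, 7], then take the correction 2.
print(verify_draft([5, 7, 9], [5, 7, 2, 4]))  # [5, 7, 2]
# Full match: accept all 3 draft tokens plus the bonus token 4.
print(verify_draft([5, 7, 9], [5, 7, 9, 4]))  # [5, 7, 9, 4]
```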
58
+ ## Benchmarking Results
59
+
60
+ ### Accuracy Test with `lm_eval`
61
+
62
+ The accuracy on SGLang for both Llama4 Scout and Llama4 Maverick can match the [official benchmark numbers](https://ai.meta.com/blog/llama-4-multimodal-intelligence/).
63
+
64
+ Benchmark results on MMLU Pro dataset with 8*H100:
65
+ | | Llama-4-Scout-17B-16E-Instruct | Llama-4-Maverick-17B-128E-Instruct |
66
+ |--------------------|--------------------------------|-------------------------------------|
67
+ | Official Benchmark | 74.3 | 80.5 |
68
+ | SGLang | 75.2 | 80.7 |
69
+
70
+ Commands:
71
+
72
+ ```bash
73
+ # Llama-4-Scout-17B-16E-Instruct model
74
+ python -m sglang.launch_server \
75
+ --model-path meta-llama/Llama-4-Scout-17B-16E-Instruct \
76
+ --port 30000 \
77
+ --tp 8 \
78
+ --mem-fraction-static 0.8 \
79
+ --context-length 65536
80
+ lm_eval --model local-chat-completions --model_args model=meta-llama/Llama-4-Scout-17B-16E-Instruct,base_url=http://localhost:30000/v1/chat/completions,num_concurrent=128,timeout=999999,max_gen_toks=2048 --tasks mmlu_pro --batch_size 128 --apply_chat_template --num_fewshot 0
81
+
82
+ # Llama-4-Maverick-17B-128E-Instruct
83
+ python -m sglang.launch_server \
84
+ --model-path meta-llama/Llama-4-Maverick-17B-128E-Instruct \
85
+ --port 30000 \
86
+ --tp 8 \
87
+ --mem-fraction-static 0.8 \
88
+ --context-length 65536
89
+ lm_eval --model local-chat-completions --model_args model=meta-llama/Llama-4-Maverick-17B-128E-Instruct,base_url=http://localhost:30000/v1/chat/completions,num_concurrent=128,timeout=999999,max_gen_toks=2048 --tasks mmlu_pro --batch_size 128 --apply_chat_template --num_fewshot 0
90
+ ```
91
+
92
+ Details can be seen in [this PR](https://github.com/sgl-project/sglang/pull/5092).
sglang/docs/basic_usage/minimax_m2.md ADDED
@@ -0,0 +1,85 @@
1
+ # MiniMax M2.5/M2.1/M2 Usage
2
+
3
+ [MiniMax-M2.5](https://huggingface.co/MiniMaxAI/MiniMax-M2.5), [MiniMax-M2.1](https://huggingface.co/MiniMaxAI/MiniMax-M2.1), and [MiniMax-M2](https://huggingface.co/MiniMaxAI/MiniMax-M2) are advanced large language models created by [MiniMax](https://www.minimax.io/).
4
+
5
+ The MiniMax-M2 series redefines efficiency for agents. These compact, fast, and cost-effective MoE models (230 billion total parameters with 10 billion active parameters) are built for elite performance in coding and agentic tasks, all while maintaining powerful general intelligence. With just 10 billion activated parameters, the MiniMax-M2 series provides sophisticated, end-to-end tool use performance expected from today's leading models, but in a streamlined form factor that makes deployment and scaling easier than ever.
6
+
7
+ ## Supported Models
8
+
9
+ This guide applies to the following models. You only need to update the model name during deployment. The following examples use **MiniMax-M2**:
10
+
11
+ - [MiniMaxAI/MiniMax-M2.5](https://huggingface.co/MiniMaxAI/MiniMax-M2.5)
12
+ - [MiniMaxAI/MiniMax-M2.1](https://huggingface.co/MiniMaxAI/MiniMax-M2.1)
13
+ - [MiniMaxAI/MiniMax-M2](https://huggingface.co/MiniMaxAI/MiniMax-M2)
14
+
15
+ ## System Requirements
16
+
17
+ The following are recommended configurations; actual requirements should be adjusted based on your use case:
18
+
19
+ - 4x 96GB GPUs: supports context lengths of up to 400K tokens.
20
+ - 8x 144GB GPUs: supports context lengths of up to 3M tokens.
21
+
22
+ ## Deployment with Python
23
+
24
+ 4-GPU deployment command:
25
+
26
+ ```bash
27
+ python -m sglang.launch_server \
28
+ --model-path MiniMaxAI/MiniMax-M2 \
29
+ --tp-size 4 \
30
+ --tool-call-parser minimax-m2 \
31
+ --reasoning-parser minimax-append-think \
32
+ --host 0.0.0.0 \
33
+ --trust-remote-code \
34
+ --port 8000 \
35
+ --mem-fraction-static 0.85
36
+ ```
37
+
38
+ 8-GPU deployment command:
39
+
40
+ ```bash
41
+ python -m sglang.launch_server \
42
+ --model-path MiniMaxAI/MiniMax-M2 \
43
+ --tp-size 8 \
44
+ --ep-size 8 \
45
+ --tool-call-parser minimax-m2 \
46
+ --reasoning-parser minimax-append-think \
47
+ --host 0.0.0.0 \
48
+ --trust-remote-code \
49
+ --port 8000 \
50
+ --mem-fraction-static 0.85
51
+ ```
52
+
53
+ ### AMD GPUs (MI300X/MI325X/MI355X)
54
+
55
+ 8-GPU deployment command:
56
+
57
+ ```bash
58
+ SGLANG_USE_AITER=1 python -m sglang.launch_server \
59
+ --model-path MiniMaxAI/MiniMax-M2.5 \
60
+ --tp-size 8 \
61
+ --ep-size 8 \
62
+ --attention-backend aiter \
63
+ --tool-call-parser minimax-m2 \
64
+ --reasoning-parser minimax-append-think \
65
+ --host 0.0.0.0 \
66
+ --trust-remote-code \
67
+ --port 8000 \
68
+ --mem-fraction-static 0.85
69
+ ```
70
+
71
+ ## Testing Deployment
72
+
73
+ After startup, you can test the SGLang OpenAI-compatible API with the following command:
74
+
75
+ ```bash
76
+ curl http://localhost:8000/v1/chat/completions \
77
+ -H "Content-Type: application/json" \
78
+ -d '{
79
+ "model": "MiniMaxAI/MiniMax-M2",
80
+ "messages": [
81
+ {"role": "system", "content": [{"type": "text", "text": "You are a helpful assistant."}]},
82
+ {"role": "user", "content": [{"type": "text", "text": "Who won the world series in 2020?"}]}
83
+ ]
84
+ }'
85
+ ```
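Since the endpoint is OpenAI-compatible, the same request can be built and sent from Python. The sketch below only constructs the payload; the actual POST (which assumes the server launched above) is left commented out:

```python
import json

payload = {
    "model": "MiniMaxAI/MiniMax-M2",
    "messages": [
        {"role": "system",
         "content": [{"type": "text", "text": "You are a helpful assistant."}]},
        {"role": "user",
         "content": [{"type": "text", "text": "Who won the world series in 2020?"}]},
    ],
}
body = json.dumps(payload)
print(json.loads(body)["model"])  # MiniMaxAI/MiniMax-M2

# To actually send it against a running server:
# import requests
# r = requests.post("http://localhost:8000/v1/chat/completions",
#                   headers={"Content-Type": "application/json"}, data=body)
# print(r.json()["choices"][0]["message"]["content"])
```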
sglang/docs/basic_usage/native_api.ipynb ADDED
@@ -0,0 +1,667 @@
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "markdown",
5
+ "metadata": {},
6
+ "source": [
7
+ "# SGLang Native APIs\n",
8
+ "\n",
9
+ "Apart from the OpenAI-compatible APIs, the SGLang Runtime also provides its native server APIs. We introduce the following APIs:\n",
10
+ "\n",
11
+ "- `/generate` (text generation model)\n",
12
+ "- `/get_model_info`\n",
13
+ "- `/get_server_info`\n",
14
+ "- `/health`\n",
15
+ "- `/health_generate`\n",
16
+ "- `/flush_cache`\n",
17
+ "- `/update_weights`\n",
18
+ "- `/encode` (embedding model)\n",
19
+ "- `/v1/rerank` (cross-encoder rerank model)\n",
20
+ "- `/v1/score` (decoder-only scoring)\n",
21
+ "- `/classify` (reward model)\n",
22
+ "- `/start_expert_distribution_record`\n",
23
+ "- `/stop_expert_distribution_record`\n",
24
+ "- `/dump_expert_distribution_record`\n",
25
+ "- `/tokenize`\n",
26
+ "- `/detokenize`\n",
27
+ "- A full list of these APIs can be found at [http_server.py](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/entrypoints/http_server.py)\n",
28
+ "\n",
29
+ "We mainly use `requests` to test these APIs in the following examples. You can also use `curl`.\n"
30
+ ]
31
+ },
32
+ {
33
+ "cell_type": "markdown",
34
+ "metadata": {},
35
+ "source": [
36
+ "## Launch A Server"
37
+ ]
38
+ },
39
+ {
40
+ "cell_type": "code",
41
+ "execution_count": null,
42
+ "metadata": {},
43
+ "outputs": [],
44
+ "source": [
45
+ "from sglang.test.doc_patch import launch_server_cmd\n",
46
+ "from sglang.utils import wait_for_server, print_highlight, terminate_process\n",
47
+ "\n",
48
+ "server_process, port = launch_server_cmd(\n",
49
+ " \"python3 -m sglang.launch_server --model-path qwen/qwen2.5-0.5b-instruct --host 0.0.0.0 --log-level warning\"\n",
50
+ ")\n",
51
+ "\n",
52
+ "wait_for_server(f\"http://localhost:{port}\", process=server_process)"
53
+ ]
54
+ },
55
+ {
56
+ "cell_type": "markdown",
57
+ "metadata": {},
58
+ "source": [
59
+ "## Generate (text generation model)\n",
60
+ "Generate completions. This is similar to `/v1/completions` in the OpenAI API. Detailed parameters can be found in the [sampling parameters](sampling_params.md)."
61
+ ]
62
+ },
63
+ {
64
+ "cell_type": "code",
65
+ "execution_count": null,
66
+ "metadata": {},
67
+ "outputs": [],
68
+ "source": [
69
+ "import requests\n",
70
+ "\n",
71
+ "url = f\"http://localhost:{port}/generate\"\n",
72
+ "data = {\"text\": \"What is the capital of France?\"}\n",
73
+ "\n",
74
+ "response = requests.post(url, json=data)\n",
75
+ "print_highlight(response.json())"
76
+ ]
77
+ },
78
+ {
79
+ "cell_type": "markdown",
80
+ "metadata": {},
81
+ "source": [
82
+ "## Get Model Info\n",
83
+ "\n",
84
+ "Get the information of the model.\n",
85
+ "\n",
86
+ "- `model_path`: The path/name of the model.\n",
87
+ "- `is_generation`: Whether the model is used as generation model or embedding model.\n",
88
+ "- `tokenizer_path`: The path/name of the tokenizer.\n",
89
+ "- `preferred_sampling_params`: The default sampling params specified via `--preferred-sampling-params`. `None` is returned in this example as we did not explicitly configure it in server args.\n",
90
+ "- `weight_version`: This field contains the version of the model weights. This is often used to track changes or updates to the model’s trained parameters.\n",
91
+ "- `has_image_understanding`: Whether the model has image-understanding capability.\n",
92
+ "- `has_audio_understanding`: Whether the model has audio-understanding capability.\n",
93
+ "- `model_type`: The model type from the HuggingFace config (e.g., \"qwen2\", \"llama\").\n",
94
+ "- `architectures`: The model architectures from the HuggingFace config (e.g., [\"Qwen2ForCausalLM\"])."
95
+ ]
96
+ },
97
+ {
98
+ "cell_type": "code",
99
+ "execution_count": null,
100
+ "metadata": {},
101
+ "outputs": [],
102
+ "source": [
103
+ "url = f\"http://localhost:{port}/get_model_info\"\n",
104
+ "\n",
105
+ "response = requests.get(url)\n",
106
+ "response_json = response.json()\n",
107
+ "print_highlight(response_json)\n",
108
+ "assert response_json[\"model_path\"] == \"qwen/qwen2.5-0.5b-instruct\"\n",
109
+ "assert response_json[\"is_generation\"] is True\n",
110
+ "assert response_json[\"tokenizer_path\"] == \"qwen/qwen2.5-0.5b-instruct\"\n",
111
+ "assert response_json[\"preferred_sampling_params\"] is None\n",
112
+ "assert response_json.keys() == {\n",
113
+ " \"model_path\",\n",
114
+ " \"is_generation\",\n",
115
+ " \"tokenizer_path\",\n",
116
+ " \"preferred_sampling_params\",\n",
117
+ " \"weight_version\",\n",
118
+ " \"has_image_understanding\",\n",
119
+ " \"has_audio_understanding\",\n",
120
+ " \"model_type\",\n",
121
+ " \"architectures\",\n",
122
+ "}"
123
+ ]
124
+ },
125
+ {
126
+ "cell_type": "markdown",
127
+ "metadata": {},
128
+ "source": [
129
+ "## Get Server Info\n",
130
+ "Get the server information, including CLI arguments, token limits, and memory pool sizes.\n",
131
+ "- Note: `get_server_info` merges the following deprecated endpoints:\n",
132
+ " - `get_server_args`\n",
133
+ " - `get_memory_pool_size`\n",
134
+ " - `get_max_total_num_tokens`"
135
+ ]
136
+ },
137
+ {
138
+ "cell_type": "code",
139
+ "execution_count": null,
140
+ "metadata": {},
141
+ "outputs": [],
142
+ "source": [
143
+ "url = f\"http://localhost:{port}/get_server_info\"\n",
144
+ "\n",
145
+ "response = requests.get(url)\n",
146
+ "print_highlight(response.text)"
147
+ ]
148
+ },
149
+ {
150
+ "cell_type": "markdown",
151
+ "metadata": {},
152
+ "source": [
153
+ "## Health Check\n",
154
+ "- `/health`: Check the health of the server.\n",
155
+ "- `/health_generate`: Check the health of the server by generating one token."
156
+ ]
157
+ },
158
+ {
159
+ "cell_type": "code",
160
+ "execution_count": null,
161
+ "metadata": {},
162
+ "outputs": [],
163
+ "source": [
164
+ "url = f\"http://localhost:{port}/health_generate\"\n",
165
+ "\n",
166
+ "response = requests.get(url)\n",
167
+ "print_highlight(response.text)"
168
+ ]
169
+ },
170
+ {
171
+ "cell_type": "code",
172
+ "execution_count": null,
173
+ "metadata": {},
174
+ "outputs": [],
175
+ "source": [
176
+ "url = f\"http://localhost:{port}/health\"\n",
177
+ "\n",
178
+ "response = requests.get(url)\n",
179
+ "print_highlight(response.text)"
180
+ ]
181
+ },
182
+ {
183
+ "cell_type": "markdown",
184
+ "metadata": {},
185
+ "source": [
186
+ "## Flush Cache\n",
187
+ "\n",
188
+ "Flush the radix cache. It will be automatically triggered when the model weights are updated by the `/update_weights` API."
189
+ ]
190
+ },
191
+ {
192
+ "cell_type": "code",
193
+ "execution_count": null,
194
+ "metadata": {},
195
+ "outputs": [],
196
+ "source": [
197
+ "url = f\"http://localhost:{port}/flush_cache\"\n",
198
+ "\n",
199
+ "response = requests.post(url)\n",
200
+ "print_highlight(response.text)"
201
+ ]
202
+ },
203
+ {
204
+ "cell_type": "markdown",
205
+ "metadata": {},
206
+ "source": [
207
+ "## Update Weights From Disk\n",
208
+ "\n",
209
+ "Update model weights from disk without restarting the server. Only applicable for models with the same architecture and parameter size.\n",
210
+ "\n",
211
+ "SGLang supports the `update_weights_from_disk` API for continuous evaluation during training: save a checkpoint to disk, then update the weights from disk.\n"
212
+ ]
213
+ },
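The save-then-reload loop this API enables can be sketched as below. Note that `save_checkpoint` and `post_update` are hypothetical stand-ins for illustration, not SGLang functions; in real code `post_update` would POST to `/update_weights_from_disk` with `requests`.

```python
def save_checkpoint(step: int) -> str:
    # In real training code this would write model weights to disk.
    return f"/checkpoints/step_{step}"


def post_update(model_path: str) -> dict:
    # In real code: requests.post(f"{base_url}/update_weights_from_disk",
    #                             json={"model_path": model_path}).json()
    return {"success": True, "message": "Succeeded to update model weights."}


updated = []
for step in (100, 200):
    path = save_checkpoint(step)
    result = post_update(path)
    assert result["success"]
    updated.append(path)

print(updated)
```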
214
+ {
215
+ "cell_type": "code",
216
+ "execution_count": null,
217
+ "metadata": {},
218
+ "outputs": [],
219
+ "source": [
220
+ "# successful update with same architecture and size\n",
221
+ "\n",
222
+ "url = f\"http://localhost:{port}/update_weights_from_disk\"\n",
223
+ "data = {\"model_path\": \"qwen/qwen2.5-0.5b-instruct\"}\n",
224
+ "\n",
225
+ "response = requests.post(url, json=data)\n",
226
+ "print_highlight(response.text)\n",
227
+ "assert response.json()[\"success\"] is True\n",
228
+ "assert response.json()[\"message\"] == \"Succeeded to update model weights.\""
229
+ ]
230
+ },
231
+ {
232
+ "cell_type": "code",
233
+ "execution_count": null,
234
+ "metadata": {},
235
+ "outputs": [],
236
+ "source": [
237
+ "# failed update with different parameter size or wrong name\n",
238
+ "\n",
239
+ "url = f\"http://localhost:{port}/update_weights_from_disk\"\n",
240
+ "data = {\"model_path\": \"qwen/qwen2.5-0.5b-instruct-wrong\"}\n",
241
+ "\n",
242
+ "response = requests.post(url, json=data)\n",
243
+ "response_json = response.json()\n",
244
+ "print_highlight(response_json)\n",
245
+ "assert response_json[\"success\"] is False\n",
246
+ "assert response_json[\"message\"] == (\n",
247
+ " \"Failed to get weights iterator: \"\n",
248
+ " \"qwen/qwen2.5-0.5b-instruct-wrong\"\n",
249
+ " \" (repository not found).\"\n",
250
+ ")"
251
+ ]
252
+ },
253
+ {
254
+ "cell_type": "code",
255
+ "execution_count": null,
256
+ "metadata": {},
257
+ "outputs": [],
258
+ "source": [
259
+ "terminate_process(server_process)"
260
+ ]
261
+ },
262
+ {
263
+ "cell_type": "markdown",
264
+ "metadata": {},
265
+ "source": [
266
+ "## Encode (embedding model)\n",
267
+ "\n",
268
+ "Encode text into embeddings. Note that this API is only available for [embedding models](openai_api_embeddings.ipynb) and will raise an error for generation models.\n",
269
+ "Therefore, we launch a new server to serve an embedding model."
270
+ ]
271
+ },
272
+ {
273
+ "cell_type": "code",
274
+ "execution_count": null,
275
+ "metadata": {},
276
+ "outputs": [],
277
+ "source": [
278
+ "embedding_process, port = launch_server_cmd(\"\"\"\n",
279
+ "python3 -m sglang.launch_server --model-path Alibaba-NLP/gte-Qwen2-1.5B-instruct \\\n",
280
+ " --host 0.0.0.0 --is-embedding --log-level warning\n",
281
+ "\"\"\")\n",
282
+ "\n",
283
+ "wait_for_server(f\"http://localhost:{port}\", process=embedding_process)"
284
+ ]
285
+ },
286
+ {
287
+ "cell_type": "code",
288
+ "execution_count": null,
289
+ "metadata": {},
290
+ "outputs": [],
291
+ "source": [
292
+ "# successful encode for embedding model\n",
293
+ "\n",
294
+ "url = f\"http://localhost:{port}/encode\"\n",
295
+ "data = {\"model\": \"Alibaba-NLP/gte-Qwen2-1.5B-instruct\", \"text\": \"Once upon a time\"}\n",
296
+ "\n",
297
+ "response = requests.post(url, json=data)\n",
298
+ "response_json = response.json()\n",
299
+ "print_highlight(f\"Text embedding (first 10): {response_json['embedding'][:10]}\")"
300
+ ]
301
+ },
302
+ {
303
+ "cell_type": "code",
304
+ "execution_count": null,
305
+ "metadata": {},
306
+ "outputs": [],
307
+ "source": [
308
+ "terminate_process(embedding_process)"
309
+ ]
310
+ },
311
+ {
312
+ "cell_type": "markdown",
313
+ "metadata": {},
314
+ "source": [
315
+ "## v1/rerank (cross encoder rerank model)\n",
316
+ "Rerank a list of documents given a query using a cross-encoder model. Note that this API is only available for cross-encoder models such as [BAAI/bge-reranker-v2-m3](https://huggingface.co/BAAI/bge-reranker-v2-m3) with `attention-backend` set to `triton` or `torch_native`.\n"
317
+ ]
318
+ },
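Cross-encoder relevance scores are typically unbounded logits; if you want values in (0, 1) for thresholding, a common post-processing step is to map them through a sigmoid yourself. This is a sketch of client-side post-processing with made-up scores, not something the API does for you, and whether the returned score is a raw logit may depend on the model.

```python
import math


def sigmoid(x: float) -> float:
    """Map an unbounded relevance logit to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


# Hypothetical raw scores for two documents (higher = more relevant).
raw_scores = [-4.0, 6.5]
probs = [sigmoid(s) for s in raw_scores]

# Rank documents by their mapped relevance.
ranked = sorted(zip(probs, ["doc_a", "doc_b"]), reverse=True)
print(ranked)
```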
319
+ {
320
+ "cell_type": "code",
321
+ "execution_count": null,
322
+ "metadata": {},
323
+ "outputs": [],
324
+ "source": [
325
+ "reranker_process, port = launch_server_cmd(\"\"\"\n",
326
+ "python3 -m sglang.launch_server --model-path BAAI/bge-reranker-v2-m3 \\\n",
327
+ " --host 0.0.0.0 --disable-radix-cache --chunked-prefill-size -1 --attention-backend triton --is-embedding --log-level warning\n",
328
+ "\"\"\")\n",
329
+ "\n",
330
+ "wait_for_server(f\"http://localhost:{port}\", process=reranker_process)"
331
+ ]
332
+ },
333
+ {
334
+ "cell_type": "code",
335
+ "execution_count": null,
336
+ "metadata": {},
337
+ "outputs": [],
338
+ "source": [
339
+ "# compute rerank scores for query and documents\n",
340
+ "\n",
341
+ "url = f\"http://localhost:{port}/v1/rerank\"\n",
342
+ "data = {\n",
343
+ " \"model\": \"BAAI/bge-reranker-v2-m3\",\n",
344
+ " \"query\": \"what is panda?\",\n",
345
+ " \"documents\": [\n",
346
+ " \"hi\",\n",
347
+ " \"The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.\",\n",
348
+ " ],\n",
349
+ "}\n",
350
+ "\n",
351
+ "response = requests.post(url, json=data)\n",
352
+ "response_json = response.json()\n",
353
+ "for item in response_json:\n",
354
+ " print_highlight(f\"Score: {item['score']:.2f} - Document: '{item['document']}'\")"
355
+ ]
356
+ },
357
+ {
358
+ "cell_type": "code",
359
+ "execution_count": null,
360
+ "metadata": {},
361
+ "outputs": [],
362
+ "source": [
363
+ "terminate_process(reranker_process)"
364
+ ]
365
+ },
366
+ {
367
+ "cell_type": "markdown",
368
+ "metadata": {},
369
+ "source": [
370
+ "## v1/score (decoder-only scoring)\n",
371
+ "\n",
372
+ "Compute token probabilities for specified tokens given a query and items. This is useful for classification tasks, scoring responses, or computing log-probabilities.\n",
373
+ "\n",
374
+ "Parameters:\n",
375
+ "- `query`: Query text\n",
376
+ "- `items`: Item text(s) to score\n",
377
+ "- `label_token_ids`: Token IDs to compute probabilities for\n",
378
+ "- `apply_softmax`: Whether to apply softmax to get normalized probabilities (default: False)\n",
379
+ "- `item_first`: Whether items come first in concatenation order (default: False)\n",
380
+ "- `model`: Model name\n",
381
+ "\n",
382
+ "The response contains `scores` - a list of probability lists, one per item, each in the order of `label_token_ids`."
383
+ ]
384
+ },
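To make `apply_softmax` concrete: the probabilities are renormalized over just the requested `label_token_ids`, not the full vocabulary. A self-contained sketch with made-up logits (the logit values and the helper below are for illustration only, not server output):

```python
import math


def softmax_over_labels(logits: dict, label_token_ids: list) -> list:
    """Renormalize over only the requested label tokens."""
    selected = [logits[t] for t in label_token_ids]
    m = max(selected)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in selected]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up logits for token ids 9454 ("Yes") and 2753 ("No").
logits = {9454: 3.2, 2753: 1.1}
scores = softmax_over_labels(logits, [9454, 2753])
print(scores)  # two probabilities summing to 1
```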
385
+ {
386
+ "cell_type": "code",
387
+ "execution_count": null,
388
+ "metadata": {},
389
+ "outputs": [],
390
+ "source": [
391
+ "score_process, port = launch_server_cmd(\"\"\"\n",
392
+ "python3 -m sglang.launch_server --model-path qwen/qwen2.5-0.5b-instruct \\\n",
393
+ " --host 0.0.0.0 --log-level warning\n",
394
+ "\"\"\")\n",
395
+ "\n",
396
+ "wait_for_server(f\"http://localhost:{port}\", process=score_process)"
397
+ ]
398
+ },
399
+ {
400
+ "cell_type": "code",
401
+ "execution_count": null,
402
+ "metadata": {},
403
+ "outputs": [],
404
+ "source": [
405
+ "# Score the probability of different completions given a query\n",
406
+ "query = \"The capital of France is\"\n",
407
+ "items = [\"Paris\", \"London\", \"Berlin\"]\n",
408
+ "\n",
409
+ "url = f\"http://localhost:{port}/v1/score\"\n",
410
+ "data = {\n",
411
+ " \"model\": \"qwen/qwen2.5-0.5b-instruct\",\n",
412
+ " \"query\": query,\n",
413
+ " \"items\": items,\n",
414
+ " \"label_token_ids\": [9454, 2753], # e.g. \"Yes\" and \"No\" token ids\n",
415
+ " \"apply_softmax\": True, # Normalize probabilities to sum to 1\n",
416
+ "}\n",
417
+ "\n",
418
+ "response = requests.post(url, json=data)\n",
419
+ "response_json = response.json()\n",
420
+ "\n",
421
+ "# Display scores for each item\n",
422
+ "for item, scores in zip(items, response_json[\"scores\"]):\n",
423
+ " print_highlight(f\"Item '{item}': probabilities = {[f'{s:.4f}' for s in scores]}\")"
424
+ ]
425
+ },
426
+ {
427
+ "cell_type": "code",
428
+ "execution_count": null,
429
+ "metadata": {},
430
+ "outputs": [],
431
+ "source": [
432
+ "terminate_process(score_process)"
433
+ ]
434
+ },
435
+ {
436
+ "cell_type": "markdown",
437
+ "metadata": {},
438
+ "source": [
439
+ "## Classify (reward model)\n",
440
+ "\n",
441
+ "SGLang Runtime also supports reward models. Here we use a reward model to classify the quality of pairwise generations."
442
+ ]
443
+ },
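The scalar rewards returned for each conversation can be compared directly; under a Bradley-Terry preference model, the probability that one response is preferred over another is the sigmoid of the reward difference. A sketch with made-up reward values:

```python
import math


def preference_prob(reward_1: float, reward_2: float) -> float:
    """Bradley-Terry probability that response 1 is preferred over response 2."""
    return 1.0 / (1.0 + math.exp(-(reward_1 - reward_2)))


# Made-up rewards: the factually correct response should score higher.
r_wrong, r_right = -7.5, 3.1
p = preference_prob(r_right, r_wrong)
print(f"P(correct preferred) = {p:.4f}")
```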
444
+ {
445
+ "cell_type": "code",
446
+ "execution_count": null,
447
+ "metadata": {},
448
+ "outputs": [],
449
+ "source": [
450
+ "# Note that SGLang now treats embedding models and reward models as the same type of models.\n",
451
+ "# This will be updated in the future.\n",
452
+ "\n",
453
+ "reward_process, port = launch_server_cmd(\"\"\"\n",
454
+ "python3 -m sglang.launch_server --model-path Skywork/Skywork-Reward-Llama-3.1-8B-v0.2 --host 0.0.0.0 --is-embedding --log-level warning\n",
455
+ "\"\"\")\n",
456
+ "\n",
457
+ "wait_for_server(f\"http://localhost:{port}\", process=reward_process)"
458
+ ]
459
+ },
460
+ {
461
+ "cell_type": "code",
462
+ "execution_count": null,
463
+ "metadata": {},
464
+ "outputs": [],
465
+ "source": [
466
+ "from transformers import AutoTokenizer\n",
467
+ "\n",
468
+ "PROMPT = (\n",
469
+ " \"What is the range of the numeric output of a sigmoid node in a neural network?\"\n",
470
+ ")\n",
471
+ "\n",
472
+ "RESPONSE1 = \"The output of a sigmoid node is bounded between -1 and 1.\"\n",
473
+ "RESPONSE2 = \"The output of a sigmoid node is bounded between 0 and 1.\"\n",
474
+ "\n",
475
+ "CONVS = [\n",
476
+ " [{\"role\": \"user\", \"content\": PROMPT}, {\"role\": \"assistant\", \"content\": RESPONSE1}],\n",
477
+ " [{\"role\": \"user\", \"content\": PROMPT}, {\"role\": \"assistant\", \"content\": RESPONSE2}],\n",
478
+ "]\n",
479
+ "\n",
480
+ "tokenizer = AutoTokenizer.from_pretrained(\"Skywork/Skywork-Reward-Llama-3.1-8B-v0.2\")\n",
481
+ "prompts = tokenizer.apply_chat_template(CONVS, tokenize=False, return_dict=False)\n",
482
+ "\n",
483
+ "url = f\"http://localhost:{port}/classify\"\n",
484
+ "data = {\"model\": \"Skywork/Skywork-Reward-Llama-3.1-8B-v0.2\", \"text\": prompts}\n",
485
+ "\n",
486
+ "responses = requests.post(url, json=data).json()\n",
487
+ "for response in responses:\n",
488
+ " print_highlight(f\"reward: {response['embedding'][0]}\")"
489
+ ]
490
+ },
491
+ {
492
+ "cell_type": "code",
493
+ "execution_count": null,
494
+ "metadata": {},
495
+ "outputs": [],
496
+ "source": [
497
+ "terminate_process(reward_process)"
498
+ ]
499
+ },
500
+ {
501
+ "cell_type": "markdown",
502
+ "metadata": {},
503
+ "source": [
504
+ "## Capture expert selection distribution in MoE models\n",
505
+ "\n",
506
+ "SGLang Runtime supports recording how many times each expert in a MoE model is selected during a run. This is useful for analyzing the throughput of the model and planning optimizations.\n",
507
+ "\n",
508
+ "*Note: We only print the first 10 lines of the CSV below for readability. Please adjust accordingly if you want to analyze the results more deeply.*"
509
+ ]
510
+ },
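Once the dump completes, analysis is ordinary CSV work. The exact column layout may vary by SGLang version, so the column names below are assumptions; the sketch aggregates selection counts per expert from an inline sample rather than a real dump file.

```python
import csv
import io
from collections import Counter

# Hypothetical dump excerpt; real column names may differ by version.
sample = """layer_id,expert_id,count
0,0,12
0,1,3
1,0,5
1,1,20
"""

# Sum selection counts per expert across all layers.
totals = Counter()
for row in csv.DictReader(io.StringIO(sample)):
    totals[int(row["expert_id"])] += int(row["count"])

print(totals.most_common())
```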
511
+ {
512
+ "cell_type": "code",
513
+ "execution_count": null,
514
+ "metadata": {},
515
+ "outputs": [],
516
+ "source": [
517
+ "expert_record_server_process, port = launch_server_cmd(\n",
518
+ " \"python3 -m sglang.launch_server --model-path Qwen/Qwen1.5-MoE-A2.7B --host 0.0.0.0 --expert-distribution-recorder-mode stat --log-level warning\"\n",
519
+ ")\n",
520
+ "\n",
521
+ "wait_for_server(f\"http://localhost:{port}\", process=expert_record_server_process)"
522
+ ]
523
+ },
524
+ {
525
+ "cell_type": "code",
526
+ "execution_count": null,
527
+ "metadata": {},
528
+ "outputs": [],
529
+ "source": [
530
+ "response = requests.post(f\"http://localhost:{port}/start_expert_distribution_record\")\n",
531
+ "print_highlight(response)\n",
532
+ "\n",
533
+ "url = f\"http://localhost:{port}/generate\"\n",
534
+ "data = {\"text\": \"What is the capital of France?\"}\n",
535
+ "\n",
536
+ "response = requests.post(url, json=data)\n",
537
+ "print_highlight(response.json())\n",
538
+ "\n",
539
+ "response = requests.post(f\"http://localhost:{port}/stop_expert_distribution_record\")\n",
540
+ "print_highlight(response)\n",
541
+ "\n",
542
+ "response = requests.post(f\"http://localhost:{port}/dump_expert_distribution_record\")\n",
543
+ "print_highlight(response)"
544
+ ]
545
+ },
546
+ {
547
+ "cell_type": "code",
548
+ "execution_count": null,
549
+ "metadata": {},
550
+ "outputs": [],
551
+ "source": [
552
+ "terminate_process(expert_record_server_process)"
553
+ ]
554
+ },
555
+ {
556
+ "cell_type": "markdown",
557
+ "metadata": {},
558
+ "source": [
559
+ "## Tokenize/Detokenize Example (Round Trip)\n",
560
+ "\n",
561
+ "This example demonstrates how to use the `/tokenize` and `/detokenize` endpoints together. We first tokenize a string, then detokenize the resulting IDs to reconstruct the original text. This workflow is useful when you need to handle tokenization externally but still leverage the server for detokenization."
562
+ ]
563
+ },
564
+ {
565
+ "cell_type": "code",
566
+ "execution_count": null,
567
+ "metadata": {},
568
+ "outputs": [],
569
+ "source": [
570
+ "tokenizer_free_server_process, port = launch_server_cmd(\"\"\"\n",
571
+ "python3 -m sglang.launch_server --model-path qwen/qwen2.5-0.5b-instruct\n",
572
+ "\"\"\")\n",
573
+ "\n",
574
+ "wait_for_server(f\"http://localhost:{port}\", process=tokenizer_free_server_process)"
575
+ ]
576
+ },
577
+ {
578
+ "cell_type": "code",
579
+ "execution_count": null,
580
+ "metadata": {},
581
+ "outputs": [],
582
+ "source": [
583
+ "import requests\n",
584
+ "from sglang.utils import print_highlight\n",
585
+ "\n",
586
+ "base_url = f\"http://localhost:{port}\"\n",
587
+ "tokenize_url = f\"{base_url}/tokenize\"\n",
588
+ "detokenize_url = f\"{base_url}/detokenize\"\n",
589
+ "\n",
590
+ "model_name = \"qwen/qwen2.5-0.5b-instruct\"\n",
591
+ "input_text = \"SGLang provides efficient tokenization endpoints.\"\n",
592
+ "print_highlight(f\"Original Input Text:\\n'{input_text}'\")\n",
593
+ "\n",
594
+ "# --- tokenize the input text ---\n",
595
+ "tokenize_payload = {\n",
596
+ " \"model\": model_name,\n",
597
+ " \"prompt\": input_text,\n",
598
+ " \"add_special_tokens\": False,\n",
599
+ "}\n",
600
+ "try:\n",
601
+ " tokenize_response = requests.post(tokenize_url, json=tokenize_payload)\n",
602
+ " tokenize_response.raise_for_status()\n",
603
+ " tokenization_result = tokenize_response.json()\n",
604
+ " token_ids = tokenization_result.get(\"tokens\")\n",
605
+ "\n",
606
+ " if not token_ids:\n",
607
+ " raise ValueError(\"Tokenization returned empty tokens.\")\n",
608
+ "\n",
609
+ " print_highlight(f\"\\nTokenized Output (IDs):\\n{token_ids}\")\n",
610
+ " print_highlight(f\"Token Count: {tokenization_result.get('count')}\")\n",
611
+ " print_highlight(f\"Max Model Length: {tokenization_result.get('max_model_len')}\")\n",
612
+ "\n",
613
+ " # --- detokenize the obtained token IDs ---\n",
614
+ " detokenize_payload = {\n",
615
+ " \"model\": model_name,\n",
616
+ " \"tokens\": token_ids,\n",
617
+ " \"skip_special_tokens\": True,\n",
618
+ " }\n",
619
+ "\n",
620
+ " detokenize_response = requests.post(detokenize_url, json=detokenize_payload)\n",
621
+ " detokenize_response.raise_for_status()\n",
622
+ " detokenization_result = detokenize_response.json()\n",
623
+ " reconstructed_text = detokenization_result.get(\"text\")\n",
624
+ "\n",
625
+ " print_highlight(f\"\\nDetokenized Output (Text):\\n'{reconstructed_text}'\")\n",
626
+ "\n",
627
+ " if input_text == reconstructed_text:\n",
628
+ " print_highlight(\n",
629
+ " \"\\nRound Trip Successful: Original and reconstructed text match.\"\n",
630
+ " )\n",
631
+ " else:\n",
632
+ " print_highlight(\n",
633
+ " \"\\nRound Trip Mismatch: Original and reconstructed text differ.\"\n",
634
+ " )\n",
635
+ "\n",
636
+ "except requests.exceptions.RequestException as e:\n",
637
+ " print_highlight(f\"\\nHTTP Request Error: {e}\")\n",
638
+ "except Exception as e:\n",
639
+ " print_highlight(f\"\\nAn error occurred: {e}\")"
640
+ ]
641
+ },
642
+ {
643
+ "cell_type": "code",
644
+ "execution_count": null,
645
+ "metadata": {},
646
+ "outputs": [],
647
+ "source": [
648
+ "terminate_process(tokenizer_free_server_process)"
649
+ ]
650
+ }
651
+ ],
652
+ "metadata": {
653
+ "language_info": {
654
+ "codemirror_mode": {
655
+ "name": "ipython",
656
+ "version": 3
657
+ },
658
+ "file_extension": ".py",
659
+ "mimetype": "text/x-python",
660
+ "name": "python",
661
+ "nbconvert_exporter": "python",
662
+ "pygments_lexer": "ipython3"
663
+ }
664
+ },
665
+ "nbformat": 4,
666
+ "nbformat_minor": 4
667
+ }
sglang/docs/basic_usage/offline_engine_api.ipynb ADDED
@@ -0,0 +1,235 @@
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "markdown",
5
+ "metadata": {},
6
+ "source": [
7
+ "# Offline Engine API\n",
8
+ "\n",
9
+ "SGLang provides a direct inference engine without the need for an HTTP server, especially for use cases where an additional HTTP server adds unnecessary complexity or overhead. Here are two general use cases:\n",
10
+ "\n",
11
+ "- Offline Batch Inference\n",
12
+ "- Custom Server on Top of the Engine\n",
13
+ "\n",
14
+ "This document focuses on offline batch inference, demonstrating four inference modes:\n",
15
+ "\n",
16
+ "- Non-streaming synchronous generation\n",
17
+ "- Streaming synchronous generation\n",
18
+ "- Non-streaming asynchronous generation\n",
19
+ "- Streaming asynchronous generation\n",
20
+ "\n",
21
+ "Additionally, you can easily build a custom server on top of the SGLang offline engine. A detailed example as a standalone Python script can be found in [custom_server](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/custom_server.py).\n",
22
+ "\n"
23
+ ]
24
+ },
25
+ {
26
+ "cell_type": "markdown",
27
+ "metadata": {},
28
+ "source": [
29
+ "## Nest Asyncio\n",
30
+ "Note that if you want to use the **Offline Engine** in IPython or other code with a nested event loop, you need to add the following code:\n",
31
+ "```python\n",
32
+ "import nest_asyncio\n",
33
+ "\n",
34
+ "nest_asyncio.apply()\n",
35
+ "\n",
36
+ "```"
37
+ ]
38
+ },
39
+ {
40
+ "cell_type": "markdown",
41
+ "metadata": {},
42
+ "source": [
43
+ "## Advanced Usage\n",
44
+ "\n",
45
+ "The engine supports [VLM inference](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py) as well as [extracting hidden states](https://github.com/sgl-project/sglang/blob/main/examples/runtime/hidden_states).\n",
46
+ "\n",
47
+ "Please see [the examples](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine) for further use cases."
48
+ ]
49
+ },
50
+ {
51
+ "cell_type": "markdown",
52
+ "metadata": {},
53
+ "source": [
54
+ "## Offline Batch Inference\n",
55
+ "\n",
56
+ "SGLang offline engine supports batch inference with efficient scheduling."
57
+ ]
58
+ },
59
+ {
60
+ "cell_type": "code",
61
+ "execution_count": null,
62
+ "metadata": {},
63
+ "outputs": [],
64
+ "source": [
65
+ "# launch the offline engine\n",
66
+ "import asyncio\n",
67
+ "\n",
68
+ "import sglang as sgl\n",
69
+ "import sglang.test.doc_patch\n",
70
+ "from sglang.utils import async_stream_and_merge, stream_and_merge\n",
71
+ "\n",
72
+ "llm = sgl.Engine(model_path=\"qwen/qwen2.5-0.5b-instruct\")"
73
+ ]
74
+ },
75
+ {
76
+ "cell_type": "markdown",
77
+ "metadata": {},
78
+ "source": [
79
+ "### Non-streaming Synchronous Generation"
80
+ ]
81
+ },
82
+ {
83
+ "cell_type": "code",
84
+ "execution_count": null,
85
+ "metadata": {},
86
+ "outputs": [],
87
+ "source": [
88
+ "prompts = [\n",
89
+ " \"Hello, my name is\",\n",
90
+ " \"The president of the United States is\",\n",
91
+ " \"The capital of France is\",\n",
92
+ " \"The future of AI is\",\n",
93
+ "]\n",
94
+ "\n",
95
+ "sampling_params = {\"temperature\": 0.8, \"top_p\": 0.95}\n",
96
+ "\n",
97
+ "outputs = llm.generate(prompts, sampling_params)\n",
98
+ "for prompt, output in zip(prompts, outputs):\n",
99
+ " print(\"===============================\")\n",
100
+ " print(f\"Prompt: {prompt}\\nGenerated text: {output['text']}\")"
101
+ ]
102
+ },
103
+ {
104
+ "cell_type": "markdown",
105
+ "metadata": {},
106
+ "source": [
107
+ "### Streaming Synchronous Generation"
108
+ ]
109
+ },
110
+ {
111
+ "cell_type": "code",
112
+ "execution_count": null,
113
+ "metadata": {},
114
+ "outputs": [],
115
+ "source": [
116
+ "prompts = [\n",
117
+ " \"Write a short, neutral self-introduction for a fictional character. Hello, my name is\",\n",
118
+ " \"Provide a concise factual statement about France’s capital city. The capital of France is\",\n",
119
+ " \"Explain possible future trends in artificial intelligence. The future of AI is\",\n",
120
+ "]\n",
121
+ "\n",
122
+ "sampling_params = {\n",
123
+ " \"temperature\": 0.2,\n",
124
+ " \"top_p\": 0.9,\n",
125
+ "}\n",
126
+ "\n",
127
+ "print(\"\\n=== Testing synchronous streaming generation with overlap removal ===\\n\")\n",
128
+ "\n",
129
+ "for prompt in prompts:\n",
130
+ " print(f\"Prompt: {prompt}\")\n",
131
+ " merged_output = stream_and_merge(llm, prompt, sampling_params)\n",
132
+ " print(\"Generated text:\", merged_output)\n",
133
+ " print()"
134
+ ]
135
+ },
136
+ {
137
+ "cell_type": "markdown",
138
+ "metadata": {},
139
+ "source": [
140
+ "### Non-streaming Asynchronous Generation"
141
+ ]
142
+ },
143
+ {
144
+ "cell_type": "code",
145
+ "execution_count": null,
146
+ "metadata": {},
147
+ "outputs": [],
148
+ "source": [
149
+ "prompts = [\n",
150
+ " \"Write a short, neutral self-introduction for a fictional character. Hello, my name is\",\n",
151
+ " \"Provide a concise factual statement about France’s capital city. The capital of France is\",\n",
152
+ " \"Explain possible future trends in artificial intelligence. The future of AI is\",\n",
153
+ "]\n",
154
+ "\n",
155
+ "sampling_params = {\"temperature\": 0.8, \"top_p\": 0.95}\n",
156
+ "\n",
157
+ "print(\"\\n=== Testing asynchronous batch generation ===\")\n",
158
+ "\n",
159
+ "\n",
160
+ "async def main():\n",
161
+ " outputs = await llm.async_generate(prompts, sampling_params)\n",
162
+ "\n",
163
+ " for prompt, output in zip(prompts, outputs):\n",
164
+ " print(f\"\\nPrompt: {prompt}\")\n",
165
+ " print(f\"Generated text: {output['text']}\")\n",
166
+ "\n",
167
+ "\n",
168
+ "asyncio.run(main())"
169
+ ]
170
+ },
171
+ {
172
+ "cell_type": "markdown",
173
+ "metadata": {},
174
+ "source": [
175
+ "### Streaming Asynchronous Generation"
176
+ ]
177
+ },
178
+ {
179
+ "cell_type": "code",
180
+ "execution_count": null,
181
+ "metadata": {},
182
+ "outputs": [],
183
+ "source": [
184
+ "prompts = [\n",
185
+ " \"Write a short, neutral self-introduction for a fictional character. Hello, my name is\",\n",
186
+ " \"Provide a concise factual statement about France’s capital city. The capital of France is\",\n",
187
+ " \"Explain possible future trends in artificial intelligence. The future of AI is\",\n",
188
+ "]\n",
189
+ "\n",
190
+ "sampling_params = {\"temperature\": 0.8, \"top_p\": 0.95}\n",
191
+ "\n",
192
+ "print(\"\\n=== Testing asynchronous streaming generation (no repeats) ===\")\n",
193
+ "\n",
194
+ "\n",
195
+ "async def main():\n",
196
+ " for prompt in prompts:\n",
197
+ " print(f\"\\nPrompt: {prompt}\")\n",
198
+ " print(\"Generated text: \", end=\"\", flush=True)\n",
199
+ "\n",
200
+ " # Replace direct calls to async_generate with our custom overlap-aware version\n",
201
+ " async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):\n",
202
+ " print(cleaned_chunk, end=\"\", flush=True)\n",
203
+ "\n",
204
+ " print() # New line after each prompt\n",
205
+ "\n",
206
+ "\n",
207
+ "asyncio.run(main())"
208
+ ]
209
+ },
210
+ {
211
+ "cell_type": "code",
212
+ "execution_count": null,
213
+ "metadata": {},
214
+ "outputs": [],
215
+ "source": [
216
+ "llm.shutdown()"
217
+ ]
218
+ }
219
+ ],
220
+ "metadata": {
221
+ "language_info": {
222
+ "codemirror_mode": {
223
+ "name": "ipython",
224
+ "version": 3
225
+ },
226
+ "file_extension": ".py",
227
+ "mimetype": "text/x-python",
228
+ "name": "python",
229
+ "nbconvert_exporter": "python",
230
+ "pygments_lexer": "ipython3"
231
+ }
232
+ },
233
+ "nbformat": 4,
234
+ "nbformat_minor": 2
235
+ }
sglang/docs/basic_usage/ollama_api.md ADDED
@@ -0,0 +1,91 @@
1
+ # Ollama-Compatible API
2
+
3
+ SGLang provides Ollama API compatibility, allowing you to use the Ollama CLI and Python library with SGLang as the inference backend.
4
+
5
+ ## Prerequisites
6
+
7
+ ```bash
8
+ # Install the Ollama Python library (for Python client usage)
9
+ pip install ollama
10
+ ```
11
+
12
+ > **Note**: You don't need the Ollama server installed - SGLang acts as the backend. You only need the `ollama` CLI or Python library as the client.
13
+
14
+ ## Endpoints
15
+
16
+ | Endpoint | Method | Description |
17
+ |----------|--------|-------------|
18
+ | `/` | GET, HEAD | Health check for Ollama CLI |
19
+ | `/api/tags` | GET | List available models |
20
+ | `/api/chat` | POST | Chat completions (streaming & non-streaming) |
21
+ | `/api/generate` | POST | Text generation (streaming & non-streaming) |
22
+ | `/api/show` | POST | Model information |
23
+
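For clients that speak raw HTTP rather than the `ollama` library, `/api/chat` accepts the standard Ollama-style request body. A minimal non-streaming payload might look like the following (field names follow the upstream Ollama API; the model name must match what you passed to `--model`):

```json
{
  "model": "Qwen/Qwen2.5-1.5B-Instruct",
  "messages": [
    {"role": "user", "content": "Hello!"}
  ],
  "stream": false
}
```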
24
+ ## Quick Start
25
+
26
+ ### 1. Launch SGLang Server
27
+
28
+ ```bash
29
+ python -m sglang.launch_server \
30
+ --model Qwen/Qwen2.5-1.5B-Instruct \
31
+ --port 30001 \
32
+ --host 0.0.0.0
33
+ ```
34
+
35
+ > **Note**: The model name used with `ollama run` must match exactly what you passed to `--model`.
36
+
37
+ ### 2. Use Ollama CLI
38
+
39
+ ```bash
40
+ # List available models
41
+ OLLAMA_HOST=http://localhost:30001 ollama list
42
+
43
+ # Interactive chat
44
+ OLLAMA_HOST=http://localhost:30001 ollama run "Qwen/Qwen2.5-1.5B-Instruct"
45
+ ```
46
+
47
+ If connecting to a remote server behind a firewall:
48
+
49
+ ```bash
50
+ # SSH tunnel
51
+ ssh -L 30001:localhost:30001 user@gpu-server -N &
52
+
53
+ # Then use Ollama CLI as above
54
+ OLLAMA_HOST=http://localhost:30001 ollama list
55
+ ```
56
+
57
+ ### 3. Use Ollama Python Library
58
+
59
+ ```python
60
+ import ollama
61
+
62
+ client = ollama.Client(host='http://localhost:30001')
63
+
64
+ # Non-streaming
65
+ response = client.chat(
66
+ model='Qwen/Qwen2.5-1.5B-Instruct',
67
+ messages=[{'role': 'user', 'content': 'Hello!'}]
68
+ )
69
+ print(response['message']['content'])
70
+
71
+ # Streaming
72
+ stream = client.chat(
73
+ model='Qwen/Qwen2.5-1.5B-Instruct',
74
+ messages=[{'role': 'user', 'content': 'Tell me a story'}],
75
+ stream=True
76
+ )
77
+ for chunk in stream:
78
+ print(chunk['message']['content'], end='', flush=True)
79
+ ```
80
+
81
+ ## Smart Router
82
+
83
+ For intelligent routing between local Ollama (fast) and remote SGLang (powerful) using an LLM judge, see the [Smart Router documentation](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/entrypoints/ollama/README.md).
84
+
85
+ ## Summary
86
+
87
+ | Component | Purpose |
88
+ |-----------|---------|
89
+ | **Ollama API** | Familiar CLI/API that developers already know |
90
+ | **SGLang Backend** | High-performance inference engine |
91
+ | **Smart Router** | Intelligent routing - fast local for simple tasks, powerful remote for complex tasks |
sglang/docs/basic_usage/openai_api.rst ADDED
@@ -0,0 +1,9 @@
1
+ OpenAI-Compatible APIs
2
+ ======================
3
+
4
+ .. toctree::
5
+ :maxdepth: 1
6
+
7
+ openai_api_completions.ipynb
8
+ openai_api_vision.ipynb
9
+ openai_api_embeddings.ipynb
sglang/docs/basic_usage/openai_api_completions.ipynb ADDED
@@ -0,0 +1,552 @@
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "markdown",
5
+ "metadata": {},
6
+ "source": [
7
+ "# OpenAI APIs - Completions\n",
8
+ "\n",
9
+ "SGLang provides OpenAI-compatible APIs to enable a smooth transition from OpenAI services to self-hosted local models.\n",
10
+ "A complete reference for the API is available in the [OpenAI API Reference](https://platform.openai.com/docs/api-reference).\n",
11
+ "\n",
12
+ "This tutorial covers the following popular APIs:\n",
13
+ "\n",
14
+ "- `chat/completions`\n",
15
+ "- `completions`\n",
16
+ "\n",
17
+ "Check out other tutorials to learn about [vision APIs](openai_api_vision.ipynb) for vision-language models and [embedding APIs](openai_api_embeddings.ipynb) for embedding models."
18
+ ]
19
+ },
20
+ {
21
+ "cell_type": "markdown",
22
+ "metadata": {},
23
+ "source": [
24
+ "## Launch A Server\n",
25
+ "\n",
26
+ "Launch the server in your terminal and wait for it to initialize."
27
+ ]
28
+ },
29
+ {
30
+ "cell_type": "code",
31
+ "execution_count": null,
32
+ "metadata": {},
33
+ "outputs": [],
34
+ "source": [
35
+ "from sglang.test.doc_patch import launch_server_cmd\n",
36
+ "from sglang.utils import wait_for_server, print_highlight, terminate_process\n",
37
+ "\n",
38
+ "server_process, port = launch_server_cmd(\n",
39
+ " \"python3 -m sglang.launch_server --model-path qwen/qwen2.5-0.5b-instruct --host 0.0.0.0 --log-level warning\"\n",
40
+ ")\n",
41
+ "\n",
42
+ "wait_for_server(f\"http://localhost:{port}\", process=server_process)\n",
43
+ "print(f\"Server started on http://localhost:{port}\")"
44
+ ]
45
+ },
46
+ {
47
+ "cell_type": "markdown",
48
+ "metadata": {},
49
+ "source": [
50
+ "## Chat Completions\n",
51
+ "\n",
52
+ "### Usage\n",
53
+ "\n",
54
+ "The server fully implements the OpenAI API.\n",
55
+ "It will automatically apply the chat template specified in the Hugging Face tokenizer, if one is available.\n",
56
+ "You can also specify a custom chat template with `--chat-template` when launching the server."
57
+ ]
58
+ },
59
+ {
60
+ "cell_type": "code",
61
+ "execution_count": null,
62
+ "metadata": {},
63
+ "outputs": [],
64
+ "source": [
65
+ "import openai\n",
66
+ "\n",
67
+ "client = openai.Client(base_url=f\"http://127.0.0.1:{port}/v1\", api_key=\"None\")\n",
68
+ "\n",
69
+ "response = client.chat.completions.create(\n",
70
+ " model=\"qwen/qwen2.5-0.5b-instruct\",\n",
71
+ " messages=[\n",
72
+ " {\"role\": \"user\", \"content\": \"List 3 countries and their capitals.\"},\n",
73
+ " ],\n",
74
+ " temperature=0,\n",
75
+ " max_tokens=64,\n",
76
+ ")\n",
77
+ "\n",
78
+ "print_highlight(f\"Response: {response}\")"
79
+ ]
80
+ },
81
+ {
82
+ "cell_type": "markdown",
83
+ "metadata": {},
84
+ "source": [
85
+ "### Model Thinking/Reasoning Support\n",
86
+ "\n",
87
+ "Some models support internal reasoning or thinking processes that can be exposed in the API response. SGLang provides unified support for various reasoning models through the `chat_template_kwargs` parameter and compatible reasoning parsers.\n",
88
+ "\n",
89
+ "#### Supported Models and Configuration\n",
90
+ "\n",
91
+ "| Model Family | Chat Template Parameter | Reasoning Parser | Notes |\n",
92
+ "|--------------|------------------------|------------------|--------|\n",
93
+ "| DeepSeek-R1 (R1, R1-0528, R1-Distill) | `enable_thinking` | `--reasoning-parser deepseek-r1` | Standard reasoning models |\n",
94
+ "| DeepSeek-V3.1 | `thinking` | `--reasoning-parser deepseek-v3` | Hybrid model (thinking/non-thinking modes) |\n",
95
+ "| Qwen3 (standard) | `enable_thinking` | `--reasoning-parser qwen3` | Hybrid model (thinking/non-thinking modes) |\n",
96
+ "| Qwen3-Thinking | N/A (always enabled) | `--reasoning-parser qwen3-thinking` | Always generates reasoning |\n",
97
+ "| Kimi | N/A (always enabled) | `--reasoning-parser kimi` | Kimi thinking models |\n",
98
+ "| GPT-OSS | N/A (always enabled) | `--reasoning-parser gpt-oss` | GPT-OSS thinking models |\n",
99
+ "\n",
100
+ "#### Basic Usage\n",
101
+ "\n",
102
+ "To enable reasoning output, you need to:\n",
103
+ "1. Launch the server with the appropriate reasoning parser\n",
104
+ "2. Set the model-specific parameter in `chat_template_kwargs`\n",
105
+ "3. Optionally set `separate_reasoning: False` to return the reasoning inline instead of in a separate field (defaults to `True`)\n",
106
+ "\n",
107
+ "**Note for Qwen3-Thinking models:** These models always generate thinking content and do not support the `enable_thinking` parameter. Use `--reasoning-parser qwen3-thinking` or `--reasoning-parser qwen3` to parse the thinking content.\n"
108
+ ]
109
+ },
110
+ {
111
+ "cell_type": "markdown",
112
+ "metadata": {},
113
+ "source": [
114
+ "#### Example: Qwen3 Models\n",
115
+ "\n",
116
+ "```python\n",
117
+ "# Launch server:\n",
118
+ "# python3 -m sglang.launch_server --model Qwen/Qwen3-4B --reasoning-parser qwen3\n",
119
+ "\n",
120
+ "from openai import OpenAI\n",
121
+ "\n",
122
+ "client = OpenAI(\n",
123
+ " api_key=\"EMPTY\",\n",
124
+ " base_url=f\"http://127.0.0.1:30000/v1\",\n",
125
+ ")\n",
126
+ "\n",
127
+ "model = \"Qwen/Qwen3-4B\"\n",
128
+ "messages = [{\"role\": \"user\", \"content\": \"How many r's are in 'strawberry'?\"}]\n",
129
+ "\n",
130
+ "response = client.chat.completions.create(\n",
131
+ " model=model,\n",
132
+ " messages=messages,\n",
133
+ " extra_body={\n",
134
+ " \"chat_template_kwargs\": {\"enable_thinking\": True},\n",
135
+ " \"separate_reasoning\": True\n",
136
+ " }\n",
137
+ ")\n",
138
+ "\n",
139
+ "print(\"Reasoning:\", response.choices[0].message.reasoning_content)\n",
140
+ "print(\"-\"*100)\n",
141
+ "print(\"Answer:\", response.choices[0].message.content)\n",
142
+ "```\n",
143
+ "\n",
144
+ "**Example Output:**\n",
145
+ "```\n",
146
+ "Reasoning: Okay, so the user is asking how many 'r's are in the word 'strawberry'. Let me think. First, I need to make sure I have the word spelled correctly. Strawberry... S-T-R-A-W-B-E-R-R-Y. Wait, is that right? Let me break it down.\n",
147
+ "\n",
148
+ "Starting with 'strawberry', let's write out the letters one by one. S, T, R, A, W, B, E, R, R, Y. Hmm, wait, that's 10 letters. Let me check again. S (1), T (2), R (3), A (4), W (5), B (6), E (7), R (8), R (9), Y (10). So the letters are S-T-R-A-W-B-E-R-R-Y. \n",
149
+ "...\n",
150
+ "Therefore, the answer should be three R's in 'strawberry'. But I need to make sure I'm not counting any other letters as R. Let me check again. S, T, R, A, W, B, E, R, R, Y. No other R's. So three in total. Yeah, that seems right.\n",
151
+ "\n",
152
+ "----------------------------------------------------------------------------------------------------\n",
153
+ "Answer: The word \"strawberry\" contains **three** letters 'r'. Here's the breakdown:\n",
154
+ "\n",
155
+ "1. **S-T-R-A-W-B-E-R-R-Y** \n",
156
+ " - The **third letter** is 'R'. \n",
157
+ " - The **eighth and ninth letters** are also 'R's. \n",
158
+ "\n",
159
+ "Thus, the total count is **3**. \n",
160
+ "\n",
161
+ "**Answer:** 3.\n",
162
+ "```\n",
163
+ "\n",
164
+ "**Note:** Setting `\"enable_thinking\": False` (or omitting it) will result in `reasoning_content` being `None`. Qwen3-Thinking models always generate reasoning content and don't support the `enable_thinking` parameter.\n"
165
+ ]
166
+ },
167
+ {
168
+ "cell_type": "markdown",
169
+ "metadata": {},
170
+ "source": [
171
+ "#### Logit Bias Support\n",
172
+ "\n",
173
+ "SGLang supports the `logit_bias` parameter for both chat completions and completions APIs. This parameter allows you to modify the likelihood of specific tokens being generated by adding bias values to their logits. The bias values can range from -100 to 100, where:\n",
174
+ "\n",
175
+ "- **Positive values** (0 to 100) increase the likelihood of the token being selected\n",
176
+ "- **Negative values** (-100 to 0) decrease the likelihood of the token being selected\n",
177
+ "- **-100** effectively prevents the token from being generated\n",
178
+ "\n",
179
+ "The `logit_bias` parameter accepts a dictionary where keys are token IDs (as strings) and values are the bias amounts (as floats).\n"
180
+ ]
181
+ },
182
+ {
183
+ "cell_type": "markdown",
184
+ "metadata": {},
185
+ "source": [
186
+ "#### Getting Token IDs\n",
187
+ "\n",
188
+ "To use `logit_bias` effectively, you need to know the token IDs for the words you want to bias. Here's how to get token IDs:\n",
189
+ "\n",
190
+ "```python\n",
191
+ "# Get tokenizer to find token IDs\n",
192
+ "import tiktoken\n",
193
+ "\n",
194
+ "# For OpenAI models, use the appropriate encoding\n",
195
+ "tokenizer = tiktoken.encoding_for_model(\"gpt-3.5-turbo\") # or your model\n",
196
+ "\n",
197
+ "# Get token IDs for specific words\n",
198
+ "word = \"sunny\"\n",
199
+ "token_ids = tokenizer.encode(word)\n",
200
+ "print(f\"Token IDs for '{word}': {token_ids}\")\n",
201
+ "\n",
202
+ "# For SGLang models, you can access the tokenizer through the client\n",
203
+ "# and get token IDs for bias\n",
204
+ "```\n",
205
+ "\n",
206
+ "**Important:** The `logit_bias` parameter uses token IDs as string keys, not the actual words.\n"
207
+ ]
208
+ },
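The rules above (string token-ID keys, values clamped to the [-100, 100] range) can be sketched as a small helper. `make_logit_bias` is an illustrative name, not part of SGLang or the OpenAI client:

```python
def make_logit_bias(token_bias):
    """Build a logit_bias payload: string token-ID keys, values clamped to [-100, 100]."""
    return {str(tid): float(max(-100.0, min(100.0, b))) for tid, b in token_bias.items()}


# Out-of-range values are clamped to the documented limits.
bias = make_logit_bias({12345: 150, 67890: -250, 11111: 25})
print(bias)  # {'12345': 100.0, '67890': -100.0, '11111': 25.0}
```

The resulting dictionary can be passed directly as the `logit_bias` argument of a chat completions or completions request.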
209
+ {
210
+ "cell_type": "markdown",
211
+ "metadata": {},
212
+ "source": [
213
+ "#### Example: DeepSeek-V3 Models\n",
214
+ "\n",
215
+ "DeepSeek-V3 models support thinking mode through the `thinking` parameter:\n",
216
+ "\n",
217
+ "```python\n",
218
+ "# Launch server:\n",
219
+ "# python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-V3.1 --tp 8 --reasoning-parser deepseek-v3\n",
220
+ "\n",
221
+ "from openai import OpenAI\n",
222
+ "\n",
223
+ "client = OpenAI(\n",
224
+ " api_key=\"EMPTY\",\n",
225
+ " base_url=f\"http://127.0.0.1:30000/v1\",\n",
226
+ ")\n",
227
+ "\n",
228
+ "model = \"deepseek-ai/DeepSeek-V3.1\"\n",
229
+ "messages = [{\"role\": \"user\", \"content\": \"How many r's are in 'strawberry'?\"}]\n",
230
+ "\n",
231
+ "response = client.chat.completions.create(\n",
232
+ " model=model,\n",
233
+ " messages=messages,\n",
234
+ " extra_body={\n",
235
+ " \"chat_template_kwargs\": {\"thinking\": True},\n",
236
+ " \"separate_reasoning\": True\n",
237
+ " }\n",
238
+ ")\n",
239
+ "\n",
240
+ "print(\"Reasoning:\", response.choices[0].message.reasoning_content)\n",
241
+ "print(\"-\"*100)\n",
242
+ "print(\"Answer:\", response.choices[0].message.content)\n",
243
+ "```\n",
244
+ "\n",
245
+ "**Example Output:**\n",
246
+ "```\n",
247
+ "Reasoning: First, the question is: \"How many r's are in 'strawberry'?\"\n",
248
+ "\n",
249
+ "I need to count the number of times the letter 'r' appears in the word \"strawberry\".\n",
250
+ "\n",
251
+ "Let me write out the word: S-T-R-A-W-B-E-R-R-Y.\n",
252
+ "\n",
253
+ "Now, I'll go through each letter and count the 'r's.\n",
254
+ "...\n",
255
+ "So, I have three 'r's in \"strawberry\".\n",
256
+ "\n",
257
+ "I should double-check. The word is spelled S-T-R-A-W-B-E-R-R-Y. The letters are at positions: 3, 8, and 9 are 'r's. Yes, that's correct.\n",
258
+ "\n",
259
+ "Therefore, the answer should be 3.\n",
260
+ "----------------------------------------------------------------------------------------------------\n",
261
+ "Answer: The word \"strawberry\" contains **3** instances of the letter \"r\". Here's a breakdown for clarity:\n",
262
+ "\n",
263
+ "- The word is spelled: S-T-R-A-W-B-E-R-R-Y\n",
264
+ "- The \"r\" appears at the 3rd, 8th, and 9th positions.\n",
265
+ "```\n",
266
+ "\n",
267
+ "**Note:** DeepSeek-V3 models use the `thinking` parameter (not `enable_thinking`) to control reasoning output.\n"
268
+ ]
269
+ },
270
+ {
271
+ "cell_type": "code",
272
+ "execution_count": null,
273
+ "metadata": {},
274
+ "outputs": [],
275
+ "source": [
276
+ "# Example with logit_bias parameter\n",
277
+ "# Note: You need to get the actual token IDs from your tokenizer\n",
278
+ "# For demonstration, we'll use some example token IDs\n",
279
+ "response = client.chat.completions.create(\n",
280
+ " model=\"qwen/qwen2.5-0.5b-instruct\",\n",
281
+ " messages=[\n",
282
+ " {\"role\": \"user\", \"content\": \"Complete this sentence: The weather today is\"}\n",
283
+ " ],\n",
284
+ " temperature=0.7,\n",
285
+ " max_tokens=20,\n",
286
+ " logit_bias={\n",
287
+ " \"12345\": 50, # Increase likelihood of token ID 12345\n",
288
+ " \"67890\": -50, # Decrease likelihood of token ID 67890\n",
289
+ " \"11111\": 25, # Slightly increase likelihood of token ID 11111\n",
290
+ " },\n",
291
+ ")\n",
292
+ "\n",
293
+ "print_highlight(f\"Response with logit bias: {response.choices[0].message.content}\")"
294
+ ]
295
+ },
296
+ {
297
+ "cell_type": "markdown",
298
+ "metadata": {},
299
+ "source": [
300
+ "### Parameters\n",
301
+ "\n",
302
+ "The chat completions API accepts OpenAI Chat Completions API's parameters. Refer to [OpenAI Chat Completions API](https://platform.openai.com/docs/api-reference/chat/create) for more details.\n",
303
+ "\n",
304
+ "SGLang extends the standard API with the `extra_body` parameter, allowing for additional customization. One key option within `extra_body` is `chat_template_kwargs`, which can be used to pass arguments to the chat template processor."
305
+ ]
306
+ },
307
+ {
308
+ "cell_type": "code",
309
+ "execution_count": null,
310
+ "metadata": {},
311
+ "outputs": [],
312
+ "source": [
313
+ "response = client.chat.completions.create(\n",
314
+ " model=\"qwen/qwen2.5-0.5b-instruct\",\n",
315
+ " messages=[\n",
316
+ " {\n",
317
+ " \"role\": \"system\",\n",
318
+ " \"content\": \"You are a knowledgeable historian who provides concise responses.\",\n",
319
+ " },\n",
320
+ " {\"role\": \"user\", \"content\": \"Tell me about ancient Rome\"},\n",
321
+ " {\n",
322
+ " \"role\": \"assistant\",\n",
323
+ " \"content\": \"Ancient Rome was a civilization centered in Italy.\",\n",
324
+ " },\n",
325
+ " {\"role\": \"user\", \"content\": \"What were their major achievements?\"},\n",
326
+ " ],\n",
327
+ " temperature=0.3, # Lower temperature for more focused responses\n",
328
+ " max_tokens=128, # Reasonable length for a concise response\n",
329
+ " top_p=0.95, # Slightly higher for better fluency\n",
330
+ " presence_penalty=0.2, # Mild penalty to avoid repetition\n",
331
+ " frequency_penalty=0.2, # Mild penalty for more natural language\n",
332
+ " n=1, # Single response is usually more stable\n",
333
+ " seed=42, # Keep for reproducibility\n",
334
+ ")\n",
335
+ "\n",
336
+ "print_highlight(response.choices[0].message.content)"
337
+ ]
338
+ },
339
+ {
340
+ "cell_type": "markdown",
341
+ "metadata": {},
342
+ "source": [
343
+ "Streaming mode is also supported."
344
+ ]
345
+ },
346
+ {
347
+ "cell_type": "markdown",
348
+ "metadata": {},
349
+ "source": [
350
+ "#### Logit Bias Support\n",
351
+ "\n",
352
+ "The completions API also supports the `logit_bias` parameter with the same functionality as described in the chat completions section above.\n"
353
+ ]
354
+ },
355
+ {
356
+ "cell_type": "code",
357
+ "execution_count": null,
358
+ "metadata": {},
359
+ "outputs": [],
360
+ "source": [
361
+ "stream = client.chat.completions.create(\n",
362
+ " model=\"qwen/qwen2.5-0.5b-instruct\",\n",
363
+ " messages=[{\"role\": \"user\", \"content\": \"Say this is a test\"}],\n",
364
+ " stream=True,\n",
365
+ ")\n",
366
+ "for chunk in stream:\n",
367
+ " if chunk.choices[0].delta.content is not None:\n",
368
+ " print(chunk.choices[0].delta.content, end=\"\")"
369
+ ]
370
+ },
371
+ {
372
+ "cell_type": "markdown",
373
+ "metadata": {},
374
+ "source": [
375
+ "#### Returning Routed Experts (MoE Models)\n",
376
+ "\n",
377
+ "For MoE models, set `return_routed_experts: true` in `extra_body` to return expert routing data. Requires `--enable-return-routed-experts` server flag. The `routed_experts` field will be returned in the `sgl_ext` object on each choice, containing base64-encoded int32 expert IDs as a flattened array with logical shape `[num_tokens, num_layers, top_k]`."
378
+ ]
379
+ },
380
+ {
381
+ "cell_type": "code",
382
+ "execution_count": null,
383
+ "metadata": {},
384
+ "outputs": [],
385
+ "source": [
386
+ "# Example with logit_bias parameter for completions API\n",
387
+ "# Note: You need to get the actual token IDs from your tokenizer\n",
388
+ "# For demonstration, we'll use some example token IDs\n",
389
+ "response = client.completions.create(\n",
390
+ " model=\"qwen/qwen2.5-0.5b-instruct\",\n",
391
+ " prompt=\"The best programming language for AI is\",\n",
392
+ " temperature=0.7,\n",
393
+ " max_tokens=20,\n",
394
+ " logit_bias={\n",
395
+ " \"12345\": 75, # Strongly favor token ID 12345\n",
396
+ " \"67890\": -100, # Completely avoid token ID 67890\n",
397
+ " \"11111\": -25, # Slightly discourage token ID 11111\n",
398
+ " },\n",
399
+ ")\n",
400
+ "\n",
401
+ "print_highlight(f\"Response with logit bias: {response.choices[0].text}\")"
402
+ ]
403
+ },
404
+ {
405
+ "cell_type": "markdown",
406
+ "metadata": {},
407
+ "source": [
408
+ "## Completions\n",
409
+ "\n",
410
+ "### Usage\n",
411
+ "The Completions API is similar to the Chat Completions API, but it uses a raw `prompt` instead of the `messages` parameter and applies no chat template."
412
+ ]
413
+ },
414
+ {
415
+ "cell_type": "code",
416
+ "execution_count": null,
417
+ "metadata": {},
418
+ "outputs": [],
419
+ "source": [
420
+ "response = client.completions.create(\n",
421
+ " model=\"qwen/qwen2.5-0.5b-instruct\",\n",
422
+ " prompt=\"List 3 countries and their capitals.\",\n",
423
+ " temperature=0,\n",
424
+ " max_tokens=64,\n",
425
+ " n=1,\n",
426
+ " stop=None,\n",
427
+ ")\n",
428
+ "\n",
429
+ "print_highlight(f\"Response: {response}\")"
430
+ ]
431
+ },
432
+ {
433
+ "cell_type": "markdown",
434
+ "metadata": {},
435
+ "source": [
436
+ "### Parameters\n",
437
+ "\n",
438
+ "The completions API accepts OpenAI Completions API's parameters. Refer to [OpenAI Completions API](https://platform.openai.com/docs/api-reference/completions/create) for more details.\n",
439
+ "\n",
440
+ "Here is an example of a detailed completions request:"
441
+ ]
442
+ },
443
+ {
444
+ "cell_type": "code",
445
+ "execution_count": null,
446
+ "metadata": {},
447
+ "outputs": [],
448
+ "source": [
449
+ "response = client.completions.create(\n",
450
+ " model=\"qwen/qwen2.5-0.5b-instruct\",\n",
451
+ " prompt=\"Write a short story about a space explorer.\",\n",
452
+ " temperature=0.7, # Moderate temperature for creative writing\n",
453
+ " max_tokens=150, # Longer response for a story\n",
454
+ " top_p=0.9, # Balanced diversity in word choice\n",
455
+ " stop=[\"\\n\\n\", \"THE END\"], # Multiple stop sequences\n",
456
+ " presence_penalty=0.3, # Encourage novel elements\n",
457
+ " frequency_penalty=0.3, # Reduce repetitive phrases\n",
458
+ " n=1, # Generate one completion\n",
459
+ " seed=123, # For reproducible results\n",
460
+ ")\n",
461
+ "\n",
462
+ "print_highlight(f\"Response: {response}\")"
463
+ ]
464
+ },
465
+ {
466
+ "cell_type": "markdown",
467
+ "metadata": {},
468
+ "source": [
469
+ "#### Returning Routed Experts (MoE Models)\n",
470
+ "\n",
471
+ "For MoE models, set `return_routed_experts: true` in `extra_body` to return expert routing data. Requires `--enable-return-routed-experts` server flag. The `routed_experts` field will be returned in the `sgl_ext` object on each choice, containing base64-encoded int32 expert IDs as a flattened array with logical shape `[num_tokens, num_layers, top_k]`."
472
+ ]
473
+ },
474
+ {
475
+ "cell_type": "markdown",
476
+ "metadata": {},
477
+ "source": [
478
+ "## Structured Outputs (JSON, Regex, EBNF)\n",
479
+ "\n",
480
+ "For OpenAI compatible structured outputs API, refer to [Structured Outputs](../advanced_features/structured_outputs.ipynb) for more details.\n"
481
+ ]
482
+ },
483
+ {
484
+ "cell_type": "markdown",
485
+ "metadata": {},
486
+ "source": [
487
+ "## Using LoRA Adapters\n",
488
+ "\n",
489
+ "SGLang supports LoRA (Low-Rank Adaptation) adapters with OpenAI-compatible APIs. You can specify which adapter to use directly in the `model` parameter using the `base-model:adapter-name` syntax.\n",
490
+ "\n",
491
+ "**Server Setup:**\n",
492
+ "```bash\n",
493
+ "python -m sglang.launch_server \\\n",
494
+ " --model-path qwen/qwen2.5-0.5b-instruct \\\n",
495
+ " --enable-lora \\\n",
496
+ " --lora-paths adapter_a=/path/to/adapter_a adapter_b=/path/to/adapter_b\n",
497
+ "```\n",
498
+ "\n",
499
+ "For more details on LoRA serving configuration, see the [LoRA documentation](../advanced_features/lora.ipynb).\n",
500
+ "\n",
501
+ "**API Call:**\n",
502
+ "\n",
503
+ "(Recommended) Use the `model:adapter` syntax to specify which adapter to use:\n",
504
+ "```python\n",
505
+ "response = client.chat.completions.create(\n",
506
+ " model=\"qwen/qwen2.5-0.5b-instruct:adapter_a\", # ← base-model:adapter-name\n",
507
+ " messages=[{\"role\": \"user\", \"content\": \"Convert to SQL: show all users\"}],\n",
508
+ " max_tokens=50,\n",
509
+ ")\n",
510
+ "```\n",
511
+ "\n",
512
+ "**Backward Compatible: Using `extra_body`**\n",
513
+ "\n",
514
+ "The old `extra_body` method is still supported for backward compatibility:\n",
515
+ "```python\n",
516
+ "# Backward compatible method\n",
517
+ "response = client.chat.completions.create(\n",
518
+ " model=\"qwen/qwen2.5-0.5b-instruct\",\n",
519
+ " messages=[{\"role\": \"user\", \"content\": \"Convert to SQL: show all users\"}],\n",
520
+ " extra_body={\"lora_path\": \"adapter_a\"}, # ← old method\n",
521
+ " max_tokens=50,\n",
522
+ ")\n",
523
+ "```\n",
524
+ "**Note:** When both `model:adapter` and `extra_body[\"lora_path\"]` are specified, the `model:adapter` syntax takes precedence."
525
+ ]
526
+ },
527
+ {
528
+ "cell_type": "code",
529
+ "execution_count": null,
530
+ "metadata": {},
531
+ "outputs": [],
532
+ "source": [
533
+ "terminate_process(server_process)"
534
+ ]
535
+ }
536
+ ],
537
+ "metadata": {
538
+ "language_info": {
539
+ "codemirror_mode": {
540
+ "name": "ipython",
541
+ "version": 3
542
+ },
543
+ "file_extension": ".py",
544
+ "mimetype": "text/x-python",
545
+ "name": "python",
546
+ "nbconvert_exporter": "python",
547
+ "pygments_lexer": "ipython3"
548
+ }
549
+ },
550
+ "nbformat": 4,
551
+ "nbformat_minor": 2
552
+ }
sglang/docs/basic_usage/openai_api_embeddings.ipynb ADDED
@@ -0,0 +1,193 @@
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "markdown",
5
+ "metadata": {},
6
+ "source": [
7
+ "# OpenAI APIs - Embedding\n",
8
+ "\n",
9
+ "SGLang provides OpenAI-compatible APIs to enable a smooth transition from OpenAI services to self-hosted local models.\n",
10
+ "A complete reference for the API is available in the [OpenAI API Reference](https://platform.openai.com/docs/guides/embeddings).\n",
11
+ "\n",
12
+ "This tutorial covers the embedding APIs for embedding models. For a list of the supported models see the [corresponding overview page](../supported_models/retrieval_ranking/embedding_models.md)\n"
13
+ ]
14
+ },
15
+ {
16
+ "cell_type": "markdown",
17
+ "metadata": {},
18
+ "source": [
19
+ "## Launch A Server\n",
20
+ "\n",
21
+ "Launch the server in your terminal and wait for it to initialize. Remember to add `--is-embedding` to the command."
22
+ ]
23
+ },
24
+ {
25
+ "cell_type": "code",
26
+ "execution_count": null,
27
+ "metadata": {},
28
+ "outputs": [],
29
+ "source": [
30
+ "from sglang.test.doc_patch import launch_server_cmd\n",
31
+ "from sglang.utils import wait_for_server, print_highlight, terminate_process\n",
32
+ "\n",
33
+ "embedding_process, port = launch_server_cmd(\"\"\"\n",
34
+ "python3 -m sglang.launch_server --model-path Alibaba-NLP/gte-Qwen2-1.5B-instruct \\\n",
35
+ " --host 0.0.0.0 --is-embedding --log-level warning\n",
36
+ "\"\"\")\n",
37
+ "\n",
38
+ "wait_for_server(f\"http://localhost:{port}\", process=embedding_process)"
39
+ ]
40
+ },
41
+ {
42
+ "cell_type": "markdown",
43
+ "metadata": {},
44
+ "source": [
45
+ "## Using cURL"
46
+ ]
47
+ },
48
+ {
49
+ "cell_type": "code",
50
+ "execution_count": null,
51
+ "metadata": {},
52
+ "outputs": [],
53
+ "source": [
54
+ "import subprocess, json\n",
55
+ "\n",
56
+ "text = \"Once upon a time\"\n",
57
+ "\n",
58
+ "curl_text = f\"\"\"curl -s http://localhost:{port}/v1/embeddings \\\n",
59
+ " -H \"Content-Type: application/json\" \\\n",
60
+ " -d '{{\"model\": \"Alibaba-NLP/gte-Qwen2-1.5B-instruct\", \"input\": \"{text}\"}}'\"\"\"\n",
61
+ "\n",
62
+ "result = subprocess.check_output(curl_text, shell=True)\n",
63
+ "\n",
64
+ "print(result)\n",
65
+ "\n",
66
+ "text_embedding = json.loads(result)[\"data\"][0][\"embedding\"]\n",
67
+ "\n",
68
+ "print_highlight(f\"Text embedding (first 10): {text_embedding[:10]}\")"
69
+ ]
70
+ },
71
+ {
72
+ "cell_type": "markdown",
73
+ "metadata": {},
74
+ "source": [
75
+ "## Using Python Requests"
76
+ ]
77
+ },
78
+ {
79
+ "cell_type": "code",
80
+ "execution_count": null,
81
+ "metadata": {},
82
+ "outputs": [],
83
+ "source": [
84
+ "import requests\n",
85
+ "\n",
86
+ "text = \"Once upon a time\"\n",
87
+ "\n",
88
+ "response = requests.post(\n",
89
+ " f\"http://localhost:{port}/v1/embeddings\",\n",
90
+ " json={\"model\": \"Alibaba-NLP/gte-Qwen2-1.5B-instruct\", \"input\": text},\n",
91
+ ")\n",
92
+ "\n",
93
+ "text_embedding = response.json()[\"data\"][0][\"embedding\"]\n",
94
+ "\n",
95
+ "print_highlight(f\"Text embedding (first 10): {text_embedding[:10]}\")"
96
+ ]
97
+ },
98
+ {
99
+ "cell_type": "markdown",
100
+ "metadata": {},
101
+ "source": [
102
+ "## Using OpenAI Python Client"
103
+ ]
104
+ },
105
+ {
106
+ "cell_type": "code",
107
+ "execution_count": null,
108
+ "metadata": {},
109
+ "outputs": [],
110
+ "source": [
111
+ "import openai\n",
112
+ "\n",
113
+ "client = openai.Client(base_url=f\"http://127.0.0.1:{port}/v1\", api_key=\"None\")\n",
114
+ "\n",
115
+ "# Text embedding example\n",
116
+ "response = client.embeddings.create(\n",
117
+ " model=\"Alibaba-NLP/gte-Qwen2-1.5B-instruct\",\n",
118
+ " input=text,\n",
119
+ ")\n",
120
+ "\n",
121
+ "embedding = response.data[0].embedding[:10]\n",
122
+ "print_highlight(f\"Text embedding (first 10): {embedding}\")"
123
+ ]
124
+ },
125
+ {
126
+ "cell_type": "markdown",
127
+ "metadata": {},
128
+ "source": [
129
+ "## Using Input IDs\n",
130
+ "\n",
131
+ "SGLang also supports `input_ids` as input to get the embedding."
132
+ ]
133
+ },
134
+ {
135
+ "cell_type": "code",
136
+ "execution_count": null,
137
+ "metadata": {},
138
+ "outputs": [],
139
+ "source": [
140
+ "import json\n",
141
+ "import os\n",
142
+ "from transformers import AutoTokenizer\n",
143
+ "\n",
144
+ "os.environ[\"TOKENIZERS_PARALLELISM\"] = \"false\"\n",
145
+ "\n",
146
+ "tokenizer = AutoTokenizer.from_pretrained(\"Alibaba-NLP/gte-Qwen2-1.5B-instruct\")\n",
147
+ "input_ids = tokenizer.encode(text)\n",
148
+ "\n",
149
+ "curl_ids = f\"\"\"curl -s http://localhost:{port}/v1/embeddings \\\n",
150
+ " -H \"Content-Type: application/json\" \\\n",
151
+ " -d '{{\"model\": \"Alibaba-NLP/gte-Qwen2-1.5B-instruct\", \"input\": {json.dumps(input_ids)}}}'\"\"\"\n",
152
+ "\n",
153
+ "input_ids_embedding = json.loads(subprocess.check_output(curl_ids, shell=True))[\"data\"][\n",
154
+ " 0\n",
155
+ "][\"embedding\"]\n",
156
+ "\n",
157
+ "print_highlight(f\"Input IDs embedding (first 10): {input_ids_embedding[:10]}\")"
158
+ ]
159
+ },
160
+ {
161
+ "cell_type": "code",
162
+ "execution_count": null,
163
+ "metadata": {},
164
+ "outputs": [],
165
+ "source": [
166
+ "terminate_process(embedding_process)"
167
+ ]
168
+ },
169
+ {
170
+ "cell_type": "markdown",
171
+ "metadata": {},
172
+ "source": [
173
+ "## Multi-Modal Embedding Model\n",
174
+ "Please refer to [Multi-Modal Embedding Model](../supported_models/retrieval_ranking/embedding_models.md)"
175
+ ]
176
+ }
177
+ ],
178
+ "metadata": {
179
+ "language_info": {
180
+ "codemirror_mode": {
181
+ "name": "ipython",
182
+ "version": 3
183
+ },
184
+ "file_extension": ".py",
185
+ "mimetype": "text/x-python",
186
+ "name": "python",
187
+ "nbconvert_exporter": "python",
188
+ "pygments_lexer": "ipython3"
189
+ }
190
+ },
191
+ "nbformat": 4,
192
+ "nbformat_minor": 2
193
+ }
sglang/docs/basic_usage/openai_api_vision.ipynb ADDED
@@ -0,0 +1,252 @@
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "markdown",
5
+ "metadata": {},
6
+ "source": [
7
+ "# OpenAI APIs - Vision\n",
8
+ "\n",
9
+ "SGLang provides OpenAI-compatible APIs to enable a smooth transition from OpenAI services to self-hosted local models.\n",
10
+ "A complete reference for the API is available in the [OpenAI API Reference](https://platform.openai.com/docs/guides/vision).\n",
11
+ "This tutorial covers the vision APIs for vision language models.\n",
12
+ "\n",
13
+ "SGLang supports various vision language models such as Llama 3.2, LLaVA-OneVision, Qwen2.5-VL, Gemma3 and [more](../supported_models/text_generation/multimodal_language_models.md).\n",
14
+ "\n",
15
+ "As an alternative to the OpenAI API, you can also use the [SGLang offline engine](https://github.com/sgl-project/sglang/blob/main/examples/runtime/engine/offline_batch_inference_vlm.py)."
16
+ ]
17
+ },
18
+ {
19
+ "cell_type": "markdown",
20
+ "metadata": {},
21
+ "source": [
22
+ "## Launch A Server\n",
23
+ "\n",
24
+ "Launch the server in your terminal and wait for it to initialize."
25
+ ]
26
+ },
27
+ {
28
+ "cell_type": "code",
29
+ "execution_count": null,
30
+ "metadata": {},
31
+ "outputs": [],
32
+ "source": [
33
+ "from sglang.test.doc_patch import launch_server_cmd\n",
34
+ "from sglang.utils import wait_for_server, print_highlight, terminate_process\n",
35
+ "\n",
36
+ "vision_process, port = launch_server_cmd(\"\"\"\n",
37
+ "python3 -m sglang.launch_server --model-path Qwen/Qwen2.5-VL-7B-Instruct --log-level warning\n",
38
+ "\"\"\")\n",
39
+ "\n",
40
+ "wait_for_server(f\"http://localhost:{port}\", process=vision_process)"
41
+ ]
42
+ },
43
+ {
44
+ "cell_type": "markdown",
45
+ "metadata": {},
46
+ "source": [
47
+ "## Using cURL\n",
48
+ "\n",
49
+ "Once the server is up, you can send test requests using curl or requests."
50
+ ]
51
+ },
52
+ {
53
+ "cell_type": "code",
54
+ "execution_count": null,
55
+ "metadata": {},
56
+ "outputs": [],
57
+ "source": [
58
+ "import subprocess\n",
59
+ "\n",
60
+ "curl_command = f\"\"\"\n",
61
+ "curl -s http://localhost:{port}/v1/chat/completions \\\\\n",
62
+ " -H \"Content-Type: application/json\" \\\\\n",
63
+ " -d '{{\n",
64
+ " \"model\": \"Qwen/Qwen2.5-VL-7B-Instruct\",\n",
65
+ " \"messages\": [\n",
66
+ " {{\n",
67
+ " \"role\": \"user\",\n",
68
+ " \"content\": [\n",
69
+ " {{\n",
70
+ " \"type\": \"text\",\n",
71
+ " \"text\": \"What’s in this image?\"\n",
72
+ " }},\n",
73
+ " {{\n",
74
+ " \"type\": \"image_url\",\n",
75
+ " \"image_url\": {{\n",
76
+ " \"url\": \"https://github.com/sgl-project/sglang/blob/main/examples/assets/example_image.png?raw=true\"\n",
77
+ " }}\n",
78
+ " }}\n",
79
+ " ]\n",
80
+ " }}\n",
81
+ " ],\n",
82
+ " \"max_tokens\": 300\n",
83
+ " }}'\n",
84
+ "\"\"\"\n",
85
+ "\n",
86
+ "response = subprocess.check_output(curl_command, shell=True).decode()\n",
87
+ "print_highlight(response)\n",
88
+ "\n",
89
+ "\n",
90
+ "response = subprocess.check_output(curl_command, shell=True).decode()\n",
91
+ "print_highlight(response)"
92
+ ]
93
+ },
94
+ {
95
+ "cell_type": "markdown",
96
+ "metadata": {},
97
+ "source": [
98
+ "## Using Python Requests"
99
+ ]
100
+ },
101
+ {
102
+ "cell_type": "code",
103
+ "execution_count": null,
104
+ "metadata": {},
105
+ "outputs": [],
106
+ "source": [
107
+ "import requests\n",
108
+ "\n",
109
+ "url = f\"http://localhost:{port}/v1/chat/completions\"\n",
110
+ "\n",
111
+ "data = {\n",
112
+ " \"model\": \"Qwen/Qwen2.5-VL-7B-Instruct\",\n",
113
+ " \"messages\": [\n",
114
+ " {\n",
115
+ " \"role\": \"user\",\n",
116
+ " \"content\": [\n",
117
+ " {\"type\": \"text\", \"text\": \"What’s in this image?\"},\n",
118
+ " {\n",
119
+ " \"type\": \"image_url\",\n",
120
+ " \"image_url\": {\n",
121
+ " \"url\": \"https://github.com/sgl-project/sglang/blob/main/examples/assets/example_image.png?raw=true\"\n",
122
+ " },\n",
123
+ " },\n",
124
+ " ],\n",
125
+ " }\n",
126
+ " ],\n",
127
+ " \"max_tokens\": 300,\n",
128
+ "}\n",
129
+ "\n",
130
+ "response = requests.post(url, json=data)\n",
131
+ "print_highlight(response.text)"
132
+ ]
133
+ },
134
+ {
135
+ "cell_type": "markdown",
136
+ "metadata": {},
137
+ "source": [
138
+ "## Using OpenAI Python Client"
139
+ ]
140
+ },
141
+ {
142
+ "cell_type": "code",
143
+ "execution_count": null,
144
+ "metadata": {},
145
+ "outputs": [],
146
+ "source": [
147
+ "from openai import OpenAI\n",
148
+ "\n",
149
+ "client = OpenAI(base_url=f\"http://localhost:{port}/v1\", api_key=\"None\")\n",
150
+ "\n",
151
+ "response = client.chat.completions.create(\n",
152
+ " model=\"Qwen/Qwen2.5-VL-7B-Instruct\",\n",
153
+ " messages=[\n",
154
+ " {\n",
155
+ " \"role\": \"user\",\n",
156
+ " \"content\": [\n",
157
+ " {\n",
158
+ " \"type\": \"text\",\n",
159
+ " \"text\": \"What is in this image?\",\n",
160
+ " },\n",
161
+ " {\n",
162
+ " \"type\": \"image_url\",\n",
163
+ " \"image_url\": {\n",
164
+ " \"url\": \"https://github.com/sgl-project/sglang/blob/main/examples/assets/example_image.png?raw=true\"\n",
165
+ " },\n",
166
+ " },\n",
167
+ " ],\n",
168
+ " }\n",
169
+ " ],\n",
170
+ " max_tokens=300,\n",
171
+ ")\n",
172
+ "\n",
173
+ "print_highlight(response.choices[0].message.content)"
174
+ ]
175
+ },
176
+ {
177
+ "cell_type": "markdown",
178
+ "metadata": {},
179
+ "source": [
180
+ "## Multiple-Image Inputs\n",
181
+ "\n",
182
+ "The server also supports multiple images and interleaved text and images if the model supports it."
183
+ ]
184
+ },
185
+ {
186
+ "cell_type": "code",
187
+ "execution_count": null,
188
+ "metadata": {},
189
+ "outputs": [],
190
+ "source": [
191
+ "from openai import OpenAI\n",
192
+ "\n",
193
+ "client = OpenAI(base_url=f\"http://localhost:{port}/v1\", api_key=\"None\")\n",
194
+ "\n",
195
+ "response = client.chat.completions.create(\n",
196
+ " model=\"Qwen/Qwen2.5-VL-7B-Instruct\",\n",
197
+ " messages=[\n",
198
+ " {\n",
199
+ " \"role\": \"user\",\n",
200
+ " \"content\": [\n",
201
+ " {\n",
202
+ " \"type\": \"image_url\",\n",
203
+ " \"image_url\": {\n",
204
+ " \"url\": \"https://github.com/sgl-project/sglang/blob/main/examples/assets/example_image.png?raw=true\",\n",
205
+ " },\n",
206
+ " },\n",
207
+ " {\n",
208
+ " \"type\": \"image_url\",\n",
209
+ " \"image_url\": {\n",
210
+ " \"url\": \"https://raw.githubusercontent.com/sgl-project/sglang/main/assets/logo.png\",\n",
211
+ " },\n",
212
+ " },\n",
213
+ " {\n",
214
+ " \"type\": \"text\",\n",
215
+ " \"text\": \"I have two very different images. They are not related at all. \"\n",
216
+ " \"Please describe the first image in one sentence, and then describe the second image in another sentence.\",\n",
217
+ " },\n",
218
+ " ],\n",
219
+ " }\n",
220
+ " ],\n",
221
+ " temperature=0,\n",
222
+ ")\n",
223
+ "\n",
224
+ "print_highlight(response.choices[0].message.content)"
225
+ ]
226
+ },
227
+ {
228
+ "cell_type": "code",
229
+ "execution_count": null,
230
+ "metadata": {},
231
+ "outputs": [],
232
+ "source": [
233
+ "terminate_process(vision_process)"
234
+ ]
235
+ }
236
+ ],
237
+ "metadata": {
238
+ "language_info": {
239
+ "codemirror_mode": {
240
+ "name": "ipython",
241
+ "version": 3
242
+ },
243
+ "file_extension": ".py",
244
+ "mimetype": "text/x-python",
245
+ "name": "python",
246
+ "nbconvert_exporter": "python",
247
+ "pygments_lexer": "ipython3"
248
+ }
249
+ },
250
+ "nbformat": 4,
251
+ "nbformat_minor": 2
252
+ }
sglang/docs/basic_usage/popular_model_usage.rst ADDED
@@ -0,0 +1,19 @@
1
+ Popular Model Usage (DeepSeek, GPT-OSS, GLM, Llama, MiniMax, Qwen, and more)
2
+ ============================================================================
3
+
4
+ For more usage examples and recipes, visit the `SGLang Cookbook <https://cookbook.sglang.io/>`_.
5
+
6
+ .. toctree::
7
+ :maxdepth: 1
8
+
9
+ deepseek_v3.md
10
+ deepseek_v32.md
11
+ glm45.md
12
+ glmv.md
13
+ gpt_oss.md
14
+ minimax_m2.md
15
+ qwen3.md
16
+ qwen3_5.md
17
+ qwen3_vl.md
18
+ deepseek_ocr.md
19
+ llama4.md
sglang/docs/basic_usage/qwen3.md ADDED
@@ -0,0 +1,39 @@
1
+ # Qwen3-Next Usage
2
+
3
+ SGLang has supported Qwen3-Next-80B-A3B-Instruct and Qwen3-Next-80B-A3B-Thinking since [this PR](https://github.com/sgl-project/sglang/pull/10233).
4
+
5
+ ## Launch Qwen3-Next with SGLang
6
+
7
+ To serve Qwen3-Next models on 4xH100/H200 GPUs:
8
+
9
+ ```bash
10
+ python3 -m sglang.launch_server --model Qwen/Qwen3-Next-80B-A3B-Instruct --tp 4
11
+ ```
12
+
13
+ ### Configuration Tips
14
+ - `--max-mamba-cache-size`: Increases the mamba cache space and the maximum number of running requests, at the cost of reduced KV cache space. Adjust it according to your workload.
15
+ - `--mamba-ssm-dtype`: `bfloat16` or `float32`. Use `bfloat16` to reduce the mamba cache size, or `float32` for more accurate results. The default is `float32`.
16
+ - `--mamba-full-memory-ratio`: The ratio of mamba state memory to full kv cache memory. The default is 0.9.
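For example, a launch command combining these flags might look like this (the cache size value is illustrative; tune it for your workload):

```bash
python3 -m sglang.launch_server \
  --model Qwen/Qwen3-Next-80B-A3B-Instruct \
  --tp 4 \
  --max-mamba-cache-size 512 \
  --mamba-ssm-dtype bfloat16 \
  --mamba-full-memory-ratio 0.9
```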
17
+
18
+ ### Mamba Radix Cache
19
+ SGLang supports prefix caching for Qwen3-Next models named `MambaRadixCache`, which improves inference speed by reusing computation results. There are two versions of `MambaRadixCache`:
20
+ - `no_buffer`: The default version, which other hybrid linear models also use. When it is enabled, SGLang automatically disables the overlap schedule for compatibility reasons.
21
+ - `extra_buffer`: An optimized version that is compatible with features like page size > 1, overlap schedule, and speculative decoding. It also supports storing the mamba state at branching positions. However, it requires two extra mamba state slots per request for a ping-pong buffer. To enable it, add the argument `--mamba-scheduler-strategy extra_buffer` when launching the server.
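To enable the `extra_buffer` strategy, the flag is simply appended to the usual launch command, e.g.:

```bash
python3 -m sglang.launch_server \
  --model Qwen/Qwen3-Next-80B-A3B-Instruct \
  --tp 4 \
  --mamba-scheduler-strategy extra_buffer
```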
22
+
23
+ ### EAGLE Speculative Decoding
24
+ **Description**: SGLang has supported Qwen3-Next models with [EAGLE speculative decoding](https://docs.sglang.io/advanced_features/speculative_decoding.html#EAGLE-Decoding).
25
+
26
+ **Usage**:
27
+ Add arguments `--speculative-algorithm`, `--speculative-num-steps`, `--speculative-eagle-topk` and `--speculative-num-draft-tokens` to enable this feature. For example:
28
+
29
+ ``` bash
30
+ python3 -m sglang.launch_server \
31
+ --model Qwen/Qwen3-Next-80B-A3B-Instruct \
32
+ --tp 4 \
33
+ --speculative-num-steps 3 \
34
+ --speculative-eagle-topk 1 \
35
+ --speculative-num-draft-tokens 4 \
36
+ --speculative-algo NEXTN
37
+ ```
38
+
39
+ Details can be seen in [this PR](https://github.com/sgl-project/sglang/pull/10233).
sglang/docs/basic_usage/qwen3_vl.md ADDED
@@ -0,0 +1,130 @@
1
+ # Qwen3-VL Usage
2
+
3
+ [Qwen3-VL](https://huggingface.co/collections/Qwen/qwen3-vl)
4
+ is Alibaba’s latest multimodal large language model with strong text, vision, and reasoning capabilities.
5
+ SGLang supports the Qwen3-VL family of models with image and video input.
6
+
7
+ ## Launch commands for SGLang
8
+
9
+ Below are suggested launch commands tailored for different hardware and precision modes.
10
+
11
+ ### FP8 (quantised) mode
12
+ For memory-efficient, latency-optimized deployments (e.g., on H100 or H200) where the FP8 checkpoint is supported:
13
+ ```bash
14
+ python3 -m sglang.launch_server \
15
+ --model-path Qwen/Qwen3-VL-235B-A22B-Instruct-FP8 \
16
+ --tp 8 \
17
+ --ep 8 \
18
+ --host 0.0.0.0 \
19
+ --port 30000 \
20
+ --keep-mm-feature-on-device
21
+ ```
22
+
23
+ ### Non-FP8 (BF16 / full precision) mode
24
+ For deployments on A100/H100 where BF16 is used (or FP8 snapshot not used):
25
+ ```bash
26
+ python3 -m sglang.launch_server \
27
+ --model-path Qwen/Qwen3-VL-235B-A22B-Instruct \
28
+ --tp 8 \
29
+ --ep 8 \
30
+ --host 0.0.0.0 \
31
+ --port 30000 \
32
+ ```
33
+
34
+ ## Hardware-specific notes / recommendations
35
+
36
+ - On H100 with FP8: Use the FP8 checkpoint for best memory efficiency.
37
+ - On A100 / H100 with BF16 (non-FP8): It’s recommended to use `--mm-max-concurrent-calls` to control parallel throughput and GPU memory usage during image/video inference.
38
+ - On H200 & B200: The model can be run “out of the box”, supporting full context length plus concurrent image + video processing.
39
+
40
+ ## Sending Image/Video Requests
41
+
42
+ ### Image input:
43
+
44
+ ```python
45
+ import requests
46
+
47
+ url = "http://localhost:30000/v1/chat/completions"
48
+
49
+ data = {
50
+ "model": "Qwen/Qwen3-VL-30B-A3B-Instruct",
51
+ "messages": [
52
+ {
53
+ "role": "user",
54
+ "content": [
55
+ {"type": "text", "text": "What’s in this image?"},
56
+ {
57
+ "type": "image_url",
58
+ "image_url": {
59
+ "url": "https://github.com/sgl-project/sglang/blob/main/examples/assets/example_image.png?raw=true"
60
+ },
61
+ },
62
+ ],
63
+ }
64
+ ],
65
+ "max_tokens": 300,
66
+ }
67
+
68
+ response = requests.post(url, json=data)
69
+ print(response.text)
70
+ ```
71
+
72
+ ### Video Input:
73
+
74
+ ```python
75
+ import requests
76
+
77
+ url = "http://localhost:30000/v1/chat/completions"
78
+
79
+ data = {
80
+ "model": "Qwen/Qwen3-VL-30B-A3B-Instruct",
81
+ "messages": [
82
+ {
83
+ "role": "user",
84
+ "content": [
85
+ {"type": "text", "text": "What’s happening in this video?"},
86
+ {
87
+ "type": "video_url",
88
+ "video_url": {
89
+ "url": "https://github.com/sgl-project/sgl-test-files/raw/refs/heads/main/videos/jobs_presenting_ipod.mp4"
90
+ },
91
+ },
92
+ ],
93
+ }
94
+ ],
95
+ "max_tokens": 300,
96
+ }
97
+
98
+ response = requests.post(url, json=data)
99
+ print(response.text)
100
+ ```
101
+
102
+ ## Important Server Parameters and Flags
103
+
104
+ When launching the model server for **multimodal support**, you can use the following command-line arguments to fine-tune performance and behavior:
105
+
106
+ - `--mm-attention-backend`: Specifies the multimodal attention backend, e.g. `fa3` (FlashAttention 3).
107
+ - `--mm-max-concurrent-calls <value>`: Specifies the **maximum number of concurrent asynchronous multimodal data processing calls** allowed on the server. Use this to control parallel throughput and GPU memory usage during image/video inference.
108
+ - `--mm-per-request-timeout <seconds>`: Defines the **timeout duration (in seconds)** for each multimodal request. If a request exceeds this time limit (e.g., for very large video inputs), it will be automatically terminated.
109
+ - `--keep-mm-feature-on-device`: Instructs the server to **retain multimodal feature tensors on the GPU** after processing. This avoids device-to-host (D2H) memory copies and improves performance for repeated or high-frequency inference workloads.
110
+ - `SGLANG_USE_CUDA_IPC_TRANSPORT=1`: Enables shared-memory-pool-based CUDA IPC for multimodal data transport, which can significantly improve end-to-end latency.
111
+
112
+ ### Example usage with the above optimizations:
113
+ ```bash
114
+ SGLANG_USE_CUDA_IPC_TRANSPORT=1 \
115
+ SGLANG_VLM_CACHE_SIZE_MB=0 \
116
+ python -m sglang.launch_server \
117
+ --model-path Qwen/Qwen3-VL-235B-A22B-Instruct \
118
+ --host 0.0.0.0 \
119
+ --port 30000 \
120
+ --trust-remote-code \
121
+ --tp-size 8 \
122
+ --enable-cache-report \
123
+ --log-level info \
124
+ --max-running-requests 64 \
125
+ --mem-fraction-static 0.65 \
126
+ --chunked-prefill-size 8192 \
127
+ --attention-backend fa3 \
128
+ --mm-attention-backend fa3 \
129
+ --enable-metrics
130
+ ```
sglang/docs/basic_usage/sampling_params.md ADDED
@@ -0,0 +1,347 @@
1
+ # Sampling Parameters
2
+
3
+ This doc describes the sampling parameters of the SGLang Runtime. It is the low-level endpoint of the runtime.
4
+ If you want a high-level endpoint that can automatically handle chat templates, consider using the [OpenAI Compatible API](openai_api_completions.ipynb).
5
+
6
+ ## `/generate` Endpoint
7
+
8
+ The `/generate` endpoint accepts the following parameters in JSON format. For detailed usage, see the [native API doc](native_api.ipynb). The object is defined at `io_struct.py::GenerateReqInput`. You can also read the source code to find more arguments and docs.
9
+
10
+ | Argument | Type/Default | Description |
11
+ |----------------------------|------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------|
12
+ | text | `Optional[Union[List[str], str]] = None` | The input prompt. Can be a single prompt or a batch of prompts. |
13
+ | input_ids | `Optional[Union[List[List[int]], List[int]]] = None` | The token IDs for text; one can specify either text or input_ids. |
14
+ | input_embeds | `Optional[Union[List[List[List[float]]], List[List[float]]]] = None` | The embeddings for input_ids; one can specify either text, input_ids, or input_embeds. |
15
+ | image_data | `Optional[Union[List[List[ImageDataItem]], List[ImageDataItem], ImageDataItem]] = None` | The image input. Supports three formats: (1) **Raw images**: PIL Image, file path, URL, or base64 string; (2) **Processor output**: Dict with `format: "processor_output"` containing HuggingFace processor outputs; (3) **Precomputed embeddings**: Dict with `format: "precomputed_embedding"` and `feature` containing pre-calculated visual embeddings. Can be a single image, list of images, or list of lists of images. See [Multimodal Input Formats](#multimodal-input-formats) for details. |
16
+ | audio_data | `Optional[Union[List[AudioDataItem], AudioDataItem]] = None` | The audio input. Can be a file name, URL, or base64 encoded string. |
17
+ | sampling_params | `Optional[Union[List[Dict], Dict]] = None` | The sampling parameters as described in the sections below. |
18
+ | rid | `Optional[Union[List[str], str]] = None` | The request ID. |
19
+ | return_logprob | `Optional[Union[List[bool], bool]] = None` | Whether to return log probabilities for tokens. |
20
+ | logprob_start_len | `Optional[Union[List[int], int]] = None` | If return_logprob, the start location in the prompt for returning logprobs. Default is "-1", which returns logprobs for output tokens only. |
21
+ | top_logprobs_num | `Optional[Union[List[int], int]] = None` | If return_logprob, the number of top logprobs to return at each position. |
22
+ | token_ids_logprob | `Optional[Union[List[List[int]], List[int]]] = None` | If return_logprob, the token IDs to return logprob for. |
23
+ | return_text_in_logprobs | `bool = False` | Whether to detokenize tokens in text in the returned logprobs. |
24
+ | stream | `bool = False` | Whether to stream output. |
25
+ | lora_path | `Optional[Union[List[Optional[str]], Optional[str]]] = None` | The path to the LoRA. |
26
+ | custom_logit_processor | `Optional[Union[List[Optional[str]], str]] = None` | Custom logit processor for advanced sampling control. Must be a serialized instance of `CustomLogitProcessor` using its `to_str()` method. For usage see below. |
27
+ | return_hidden_states | `Union[List[bool], bool] = False` | Whether to return hidden states. |
28
+ | return_routed_experts | `bool = False` | Whether to return routed experts for MoE models. Requires `--enable-return-routed-experts` server flag. Returns base64-encoded int32 expert IDs as a flattened array with logical shape `[num_tokens, num_layers, top_k]`. |
29
+
30
+ ## Sampling parameters
31
+
32
+ The object is defined at `sampling_params.py::SamplingParams`. You can also read the source code to find more arguments and docs.
33
+
34
+ ### Note on defaults
35
+
36
+ By default, SGLang initializes several sampling parameters from the model's `generation_config.json` (when the server is launched with `--sampling-defaults model`, which is the default). To use SGLang/OpenAI constant defaults instead, start the server with `--sampling-defaults openai`. You can always override any parameter per request via `sampling_params`.
37
+
38
+ ```bash
39
+ # Use model-provided defaults from generation_config.json (default behavior)
40
+ python -m sglang.launch_server --model-path <MODEL> --sampling-defaults model
41
+
42
+ # Use SGLang/OpenAI constant defaults instead
43
+ python -m sglang.launch_server --model-path <MODEL> --sampling-defaults openai
44
+ ```
45
+
46
+ ### Core parameters
47
+
48
+ | Argument | Type/Default | Description |
49
+ |-----------------|----------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------|
50
+ | max_new_tokens | `int = 128` | The maximum output length measured in tokens. |
51
+ | stop | `Optional[Union[str, List[str]]] = None` | One or multiple [stop words](https://platform.openai.com/docs/api-reference/chat/create#chat-create-stop). Generation will stop if one of these words is sampled. |
52
+ | stop_token_ids | `Optional[List[int]] = None` | Provide stop words in the form of token IDs. Generation will stop if one of these token IDs is sampled. |
53
+ | stop_regex | `Optional[Union[str, List[str]]] = None` | Stop when hitting any of the regex patterns in this list |
54
+ | temperature | `float (model default; fallback 1.0)` | [Temperature](https://platform.openai.com/docs/api-reference/chat/create#chat-create-temperature) when sampling the next token. `temperature = 0` corresponds to greedy sampling, a higher temperature leads to more diversity. |
55
+ | top_p | `float (model default; fallback 1.0)` | [Top-p](https://platform.openai.com/docs/api-reference/chat/create#chat-create-top_p) selects tokens from the smallest sorted set whose cumulative probability exceeds `top_p`. When `top_p = 1`, this reduces to unrestricted sampling from all tokens. |
56
+ | top_k | `int (model default; fallback -1)` | [Top-k](https://developer.nvidia.com/blog/how-to-get-better-outputs-from-your-large-language-model/#predictability_vs_creativity) randomly selects from the `k` highest-probability tokens. |
57
+ | min_p | `float (model default; fallback 0.0)` | [Min-p](https://github.com/huggingface/transformers/issues/27670) samples from tokens with probability larger than `min_p * highest_token_probability`. |
58
+
59
+ ### Penalizers
60
+
61
+ | Argument | Type/Default | Description |
62
+ |--------------------|------------------------|------------------------------------------------------------------------------------------------------------------------------------------------|
63
+ | frequency_penalty | `float = 0.0` | Penalizes tokens based on their frequency in generation so far. Must be between `-2` and `2` where negative numbers encourage repeatment of tokens and positive number encourages sampling of new tokens. The scaling of penalization grows linearly with each appearance of a token. |
64
+ | presence_penalty | `float = 0.0` | Penalizes tokens if they appeared in the generation so far. Must be between `-2` and `2` where negative numbers encourage repeatment of tokens and positive number encourages sampling of new tokens. The scaling of the penalization is constant if a token occurred. |
65
+ | repetition_penalty | `float = 1.0` | Scales the logits of previously generated tokens to discourage (values > 1) or encourage (values < 1) repetition. Valid range is `[0, 2]`; `1.0` leaves probabilities unchanged. |
66
+ | min_new_tokens | `int = 0` | Forces the model to generate at least `min_new_tokens` until a stop word or EOS token is sampled. Note that this might lead to unintended behavior, for example, if the distribution is highly skewed towards these tokens. |
67
+
68
+ ### Constrained decoding
69
+
70
+ Please refer to our dedicated guide on [constrained decoding](../advanced_features/structured_outputs.ipynb) for the following parameters.
71
+
72
+ | Argument | Type/Default | Description |
73
+ |-----------------|---------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------|
74
+ | json_schema | `Optional[str] = None` | JSON schema for structured outputs. |
75
+ | regex | `Optional[str] = None` | Regex for structured outputs. |
76
+ | ebnf | `Optional[str] = None` | EBNF for structured outputs. |
77
+ | structural_tag | `Optional[str] = None` | The structural tag for structured outputs. |
78
+
79
+ ### Other options
80
+
81
+ | Argument | Type/Default | Description |
82
+ |-------------------------------|---------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------|
83
+ | n | `int = 1` | Specifies the number of output sequences to generate per request. (Generating multiple outputs in one request (n > 1) is discouraged; repeating the same prompts several times offers better control and efficiency.) |
84
+ | ignore_eos | `bool = False` | Don't stop generation when EOS token is sampled. |
85
+ | skip_special_tokens | `bool = True` | Remove special tokens during decoding. |
86
+ | spaces_between_special_tokens | `bool = True` | Whether or not to add spaces between special tokens during detokenization. |
87
+ | no_stop_trim | `bool = False` | Don't trim stop words or EOS token from the generated text. |
88
+ | custom_params | `Optional[List[Optional[Dict[str, Any]]]] = None` | Used when employing `CustomLogitProcessor`. For usage, see below. |
89
+
90
+ ## Examples
91
+
92
+ ### Normal
93
+
94
+ Launch a server:
95
+
96
+ ```bash
97
+ python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --port 30000
98
+ ```
99
+
100
+ Send a request:
101
+
102
+ ```python
103
+ import requests
104
+
105
+ response = requests.post(
106
+ "http://localhost:30000/generate",
107
+ json={
108
+ "text": "The capital of France is",
109
+ "sampling_params": {
110
+ "temperature": 0,
111
+ "max_new_tokens": 32,
112
+ },
113
+ },
114
+ )
115
+ print(response.json())
116
+ ```
117
+
118
+ Detailed example in [send request](./send_request.ipynb).
119
+
120
+ ### Streaming
121
+
122
+ Send a request and stream the output:
123
+
124
+ ```python
125
+ import requests, json
126
+
127
+ response = requests.post(
128
+ "http://localhost:30000/generate",
129
+ json={
130
+ "text": "The capital of France is",
131
+ "sampling_params": {
132
+ "temperature": 0,
133
+ "max_new_tokens": 32,
134
+ },
135
+ "stream": True,
136
+ },
137
+ stream=True,
138
+ )
139
+
140
+ prev = 0
141
+ for chunk in response.iter_lines(decode_unicode=False):
142
+ chunk = chunk.decode("utf-8")
143
+ if chunk and chunk.startswith("data:"):
144
+ if chunk == "data: [DONE]":
145
+ break
146
+ data = json.loads(chunk[5:].strip("\n"))
147
+ output = data["text"].strip()
148
+ print(output[prev:], end="", flush=True)
149
+ prev = len(output)
150
+ print("")
151
+ ```
152
+
153
+ Detailed example in [openai compatible api](openai_api_completions.ipynb).
154
+
155
+ ### Multimodal
156
+
157
+ Launch a server:
158
+
159
+ ```bash
160
+ python3 -m sglang.launch_server --model-path lmms-lab/llava-onevision-qwen2-7b-ov
161
+ ```
162
+
163
+ Download an image:
164
+
165
+ ```bash
166
+ curl -o example_image.png -L https://github.com/sgl-project/sglang/blob/main/examples/assets/example_image.png?raw=true
167
+ ```
168
+
169
+ Send a request:
170
+
171
+ ```python
172
+ import requests
173
+
174
+ response = requests.post(
175
+ "http://localhost:30000/generate",
176
+ json={
177
+ "text": "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
178
+ "<|im_start|>user\n<image>\nDescribe this image in a very short sentence.<|im_end|>\n"
179
+ "<|im_start|>assistant\n",
180
+ "image_data": "example_image.png",
181
+ "sampling_params": {
182
+ "temperature": 0,
183
+ "max_new_tokens": 32,
184
+ },
185
+ },
186
+ )
187
+ print(response.json())
188
+ ```
189
+
190
+ The `image_data` can be a file name, a URL, or a base64 encoded string. See also `python/sglang/srt/utils.py:load_image`.
191
+
192
+ Streaming is supported in a similar manner as [above](#streaming).
193
+
194
+ Detailed example in [OpenAI API Vision](openai_api_vision.ipynb).
195
+
196
+ ### Structured Outputs (JSON, Regex, EBNF)
197
+
198
+ You can specify a JSON schema, regular expression or [EBNF](https://en.wikipedia.org/wiki/Extended_Backus%E2%80%93Naur_form) to constrain the model output. The model output will be guaranteed to follow the given constraints. Only one constraint parameter (`json_schema`, `regex`, or `ebnf`) can be specified for a request.
199
+
200
+ SGLang supports two grammar backends:
201
+
202
+ - [XGrammar](https://github.com/mlc-ai/xgrammar) (default): Supports JSON schema, regular expression, and EBNF constraints.
203
+ - XGrammar currently uses the [GGML BNF format](https://github.com/ggerganov/llama.cpp/blob/master/grammars/README.md).
204
+ - [Outlines](https://github.com/dottxt-ai/outlines): Supports JSON schema and regular expression constraints.
205
+
206
+ If you want to use the Outlines backend instead, pass the `--grammar-backend outlines` flag:
207
+
208
+ ```bash
209
+ python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
210
+ --port 30000 --host 0.0.0.0 --grammar-backend [xgrammar|outlines] # xgrammar or outlines (default: xgrammar)
211
+ ```
212
+
213
+ ```python
214
+ import json
215
+ import requests
216
+
217
+ json_schema = json.dumps({
218
+ "type": "object",
219
+ "properties": {
220
+ "name": {"type": "string", "pattern": "^[\\w]+$"},
221
+ "population": {"type": "integer"},
222
+ },
223
+ "required": ["name", "population"],
224
+ })
225
+
226
+ # JSON (works with both Outlines and XGrammar)
227
+ response = requests.post(
228
+ "http://localhost:30000/generate",
229
+ json={
230
+ "text": "Here is the information of the capital of France in the JSON format.\n",
231
+ "sampling_params": {
232
+ "temperature": 0,
233
+ "max_new_tokens": 64,
234
+ "json_schema": json_schema,
235
+ },
236
+ },
237
+ )
238
+ print(response.json())
239
+
240
+ # Regular expression (Outlines backend only)
241
+ response = requests.post(
242
+ "http://localhost:30000/generate",
243
+ json={
244
+ "text": "Paris is the capital of",
245
+ "sampling_params": {
246
+ "temperature": 0,
247
+ "max_new_tokens": 64,
248
+ "regex": "(France|England)",
249
+ },
250
+ },
251
+ )
252
+ print(response.json())
253
+
254
+ # EBNF (XGrammar backend only)
255
+ response = requests.post(
256
+ "http://localhost:30000/generate",
257
+ json={
258
+ "text": "Write a greeting.",
259
+ "sampling_params": {
260
+ "temperature": 0,
261
+ "max_new_tokens": 64,
262
+ "ebnf": 'root ::= "Hello" | "Hi" | "Hey"',
263
+ },
264
+ },
265
+ )
266
+ print(response.json())
267
+ ```
268
+
269
+ Detailed example in [structured outputs](../advanced_features/structured_outputs.ipynb).
270
+
271
+ ### Custom logit processor
272
+
273
+ Launch a server with `--enable-custom-logit-processor` flag on.
274
+
275
+ ```bash
276
+ python -m sglang.launch_server \
277
+ --model-path meta-llama/Meta-Llama-3-8B-Instruct \
278
+ --port 30000 \
279
+ --enable-custom-logit-processor
280
+ ```
281
+
282
+ Define a custom logit processor that will always sample a specific token id.
283
+
284
+ ```python
285
+ from sglang.srt.sampling.custom_logit_processor import CustomLogitProcessor
286
+
287
+ class DeterministicLogitProcessor(CustomLogitProcessor):
288
+ """A dummy logit processor that changes the logits to always
289
+ sample the given token id.
290
+ """
291
+
292
+ def __call__(self, logits, custom_param_list):
293
+ # Check that the number of logits matches the number of custom parameters
294
+ assert logits.shape[0] == len(custom_param_list)
295
+ key = "token_id"
296
+
297
+ for i, param_dict in enumerate(custom_param_list):
298
+ # Mask all other tokens
299
+ logits[i, :] = -float("inf")
300
+ # Assign highest probability to the specified token
301
+ logits[i, param_dict[key]] = 0.0
302
+ return logits
303
+ ```
304
+
305
+ Send a request:
306
+
307
+ ```python
308
+ import requests
309
+
310
+ response = requests.post(
311
+ "http://localhost:30000/generate",
312
+ json={
313
+ "text": "The capital of France is",
314
+ "custom_logit_processor": DeterministicLogitProcessor().to_str(),
315
+ "sampling_params": {
316
+ "temperature": 0.0,
317
+ "max_new_tokens": 32,
318
+ "custom_params": {"token_id": 5},
319
+ },
320
+ },
321
+ )
322
+ print(response.json())
323
+ ```
324
+
325
+ Send an OpenAI chat completion request:
326
+
327
+ ```python
328
+ import openai
329
+ from sglang.utils import print_highlight
330
+
331
+ client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="None")
332
+
333
+ response = client.chat.completions.create(
334
+ model="meta-llama/Meta-Llama-3-8B-Instruct",
335
+ messages=[
336
+ {"role": "user", "content": "List 3 countries and their capitals."},
337
+ ],
338
+ temperature=0.0,
339
+ max_tokens=32,
340
+ extra_body={
341
+ "custom_logit_processor": DeterministicLogitProcessor().to_str(),
342
+ "custom_params": {"token_id": 5},
343
+ },
344
+ )
345
+
346
+ print_highlight(f"Response: {response}")
347
+ ```
sglang/docs/basic_usage/send_request.ipynb ADDED
@@ -0,0 +1,251 @@
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "markdown",
5
+ "metadata": {},
6
+ "source": [
7
+ "# Sending Requests\n",
8
+ "This notebook provides a quick-start guide to use SGLang in chat completions after installation. Once your server is running, API documentation is available at `http://localhost:30000/docs` (Swagger UI), `http://localhost:30000/redoc` (ReDoc), or `http://localhost:30000/openapi.json` (OpenAPI spec, useful for AI agents). Replace `30000` with your port if using a different one.\n",
9
+ "\n",
10
+ "- For Vision Language Models, see [OpenAI APIs - Vision](openai_api_vision.ipynb).\n",
11
+ "- For Embedding Models, see [OpenAI APIs - Embedding](openai_api_embeddings.ipynb) and [Encode (embedding model)](native_api.html#Encode-(embedding-model)).\n",
12
+ "- For Reward Models, see [Classify (reward model)](native_api.html#Classify-(reward-model))."
13
+ ]
14
+ },
15
+ {
16
+ "cell_type": "markdown",
17
+ "metadata": {},
18
+ "source": [
19
+ "## Launch A Server"
20
+ ]
21
+ },
22
+ {
23
+ "cell_type": "code",
24
+ "execution_count": null,
25
+ "metadata": {},
26
+ "outputs": [],
27
+ "source": [
28
+ "from sglang.test.doc_patch import launch_server_cmd\n",
29
+ "from sglang.utils import wait_for_server, print_highlight, terminate_process\n",
30
+ "\n",
31
+ "# This is equivalent to running the following command in your terminal\n",
32
+ "# python3 -m sglang.launch_server --model-path qwen/qwen2.5-0.5b-instruct --host 0.0.0.0\n",
33
+ "\n",
34
+ "server_process, port = launch_server_cmd(\"\"\"\n",
35
+ "python3 -m sglang.launch_server --model-path qwen/qwen2.5-0.5b-instruct \\\n",
36
+ " --host 0.0.0.0 --log-level warning\n",
37
+ "\"\"\")\n",
38
+ "\n",
39
+ "wait_for_server(f\"http://localhost:{port}\", process=server_process)"
40
+ ]
41
+ },
42
+ {
43
+ "cell_type": "markdown",
44
+ "metadata": {},
45
+ "source": [
46
+ "## Using cURL\n"
47
+ ]
48
+ },
49
+ {
50
+ "cell_type": "code",
51
+ "execution_count": null,
52
+ "metadata": {},
53
+ "outputs": [],
54
+ "source": [
55
+ "import subprocess, json\n",
56
+ "\n",
57
+ "curl_command = f\"\"\"\n",
58
+ "curl -s http://localhost:{port}/v1/chat/completions \\\n",
59
+ " -H \"Content-Type: application/json\" \\\n",
60
+ " -d '{{\"model\": \"qwen/qwen2.5-0.5b-instruct\", \"messages\": [{{\"role\": \"user\", \"content\": \"What is the capital of France?\"}}]}}'\n",
61
+ "\"\"\"\n",
62
+ "\n",
63
+ "response = json.loads(subprocess.check_output(curl_command, shell=True))\n",
64
+ "print_highlight(response)"
65
+ ]
66
+ },
67
+ {
68
+ "cell_type": "markdown",
69
+ "metadata": {},
70
+ "source": [
71
+ "## Using Python Requests"
72
+ ]
73
+ },
74
+ {
75
+ "cell_type": "code",
76
+ "execution_count": null,
77
+ "metadata": {},
78
+ "outputs": [],
79
+ "source": [
80
+ "import requests\n",
81
+ "\n",
82
+ "url = f\"http://localhost:{port}/v1/chat/completions\"\n",
83
+ "\n",
84
+ "data = {\n",
85
+ " \"model\": \"qwen/qwen2.5-0.5b-instruct\",\n",
86
+ " \"messages\": [{\"role\": \"user\", \"content\": \"What is the capital of France?\"}],\n",
87
+ "}\n",
88
+ "\n",
89
+ "response = requests.post(url, json=data)\n",
90
+ "print_highlight(response.json())"
91
+ ]
92
+ },
93
+ {
94
+ "cell_type": "markdown",
95
+ "metadata": {},
96
+ "source": [
97
+ "## Using OpenAI Python Client"
98
+ ]
99
+ },
100
+ {
101
+ "cell_type": "code",
102
+ "execution_count": null,
103
+ "metadata": {},
104
+ "outputs": [],
105
+ "source": [
106
+ "import openai\n",
107
+ "\n",
108
+ "client = openai.Client(base_url=f\"http://127.0.0.1:{port}/v1\", api_key=\"None\")\n",
109
+ "\n",
110
+ "response = client.chat.completions.create(\n",
111
+ " model=\"qwen/qwen2.5-0.5b-instruct\",\n",
112
+ " messages=[\n",
113
+ " {\"role\": \"user\", \"content\": \"List 3 countries and their capitals.\"},\n",
114
+ " ],\n",
115
+ " temperature=0,\n",
116
+ " max_tokens=64,\n",
117
+ ")\n",
118
+ "print_highlight(response)"
119
+ ]
120
+ },
121
+ {
122
+ "cell_type": "markdown",
123
+ "metadata": {},
124
+ "source": [
125
+ "### Streaming"
126
+ ]
127
+ },
128
+ {
129
+ "cell_type": "code",
130
+ "execution_count": null,
131
+ "metadata": {},
132
+ "outputs": [],
133
+ "source": [
134
+ "import openai\n",
135
+ "\n",
136
+ "client = openai.Client(base_url=f\"http://127.0.0.1:{port}/v1\", api_key=\"None\")\n",
137
+ "\n",
138
+ "# Use stream=True for streaming responses\n",
139
+ "response = client.chat.completions.create(\n",
140
+ " model=\"qwen/qwen2.5-0.5b-instruct\",\n",
141
+ " messages=[\n",
142
+ " {\"role\": \"user\", \"content\": \"List 3 countries and their capitals.\"},\n",
143
+ " ],\n",
144
+ " temperature=0,\n",
145
+ " max_tokens=64,\n",
146
+ " stream=True,\n",
147
+ ")\n",
148
+ "\n",
149
+ "# Handle the streaming output\n",
150
+ "for chunk in response:\n",
151
+ " if chunk.choices[0].delta.content:\n",
152
+ " print(chunk.choices[0].delta.content, end=\"\", flush=True)"
153
+ ]
154
+ },
155
+ {
156
+ "cell_type": "markdown",
157
+ "metadata": {},
158
+ "source": [
159
+ "## Using Native Generation APIs\n",
160
+ "\n",
161
+ "You can also use the native `/generate` endpoint with requests, which provides more flexibility. An API reference is available at [Sampling Parameters](sampling_params.md)."
162
+ ]
163
+ },
164
+ {
165
+ "cell_type": "code",
166
+ "execution_count": null,
167
+ "metadata": {},
168
+ "outputs": [],
169
+ "source": [
170
+ "import requests\n",
171
+ "\n",
172
+ "response = requests.post(\n",
173
+ " f\"http://localhost:{port}/generate\",\n",
174
+ " json={\n",
175
+ " \"text\": \"The capital of France is\",\n",
176
+ " \"sampling_params\": {\n",
177
+ " \"temperature\": 0,\n",
178
+ " \"max_new_tokens\": 32,\n",
179
+ " },\n",
180
+ " },\n",
181
+ ")\n",
182
+ "\n",
183
+ "print_highlight(response.json())"
184
+ ]
185
+ },
186
+ {
187
+ "cell_type": "markdown",
188
+ "metadata": {},
189
+ "source": [
190
+ "### Streaming"
191
+ ]
192
+ },
193
+ {
194
+ "cell_type": "code",
195
+ "execution_count": null,
196
+ "metadata": {},
197
+ "outputs": [],
198
+ "source": [
199
+ "import requests, json\n",
200
+ "\n",
201
+ "response = requests.post(\n",
202
+ " f\"http://localhost:{port}/generate\",\n",
203
+ " json={\n",
204
+ " \"text\": \"The capital of France is\",\n",
205
+ " \"sampling_params\": {\n",
206
+ " \"temperature\": 0,\n",
207
+ " \"max_new_tokens\": 32,\n",
208
+ " },\n",
209
+ " \"stream\": True,\n",
210
+ " },\n",
211
+ " stream=True,\n",
212
+ ")\n",
213
+ "\n",
214
+ "prev = 0\n",
215
+ "for chunk in response.iter_lines(decode_unicode=False):\n",
216
+ " chunk = chunk.decode(\"utf-8\")\n",
217
+ " if chunk and chunk.startswith(\"data:\"):\n",
218
+ " if chunk == \"data: [DONE]\":\n",
219
+ " break\n",
220
+ " data = json.loads(chunk[5:].strip(\"\\n\"))\n",
221
+ " output = data[\"text\"]\n",
222
+ " print(output[prev:], end=\"\", flush=True)\n",
223
+ " prev = len(output)"
224
+ ]
225
+ },
226
+ {
227
+ "cell_type": "code",
228
+ "execution_count": null,
229
+ "metadata": {},
230
+ "outputs": [],
231
+ "source": [
232
+ "terminate_process(server_process)"
233
+ ]
234
+ }
235
+ ],
236
+ "metadata": {
237
+ "language_info": {
238
+ "codemirror_mode": {
239
+ "name": "ipython",
240
+ "version": 3
241
+ },
242
+ "file_extension": ".py",
243
+ "mimetype": "text/x-python",
244
+ "name": "python",
245
+ "nbconvert_exporter": "python",
246
+ "pygments_lexer": "ipython3"
247
+ }
248
+ },
249
+ "nbformat": 4,
250
+ "nbformat_minor": 2
251
+ }
sglang/docs/developer_guide/bench_serving.md ADDED
@@ -0,0 +1,355 @@
1
+ # Bench Serving Guide
2
+
3
+ This guide explains how to benchmark online serving throughput and latency using `python -m sglang.bench_serving`. It supports multiple inference backends via OpenAI-compatible and native endpoints, and produces both console metrics and optional JSONL outputs.
4
+
5
+ ### What it does
6
+
7
+ - Generates synthetic or dataset-driven prompts and submits them to a target serving endpoint
8
+ - Measures throughput, time-to-first-token (TTFT), inter-token latency (ITL), per-request end-to-end latency, and more
9
+ - Supports streaming or non-streaming modes, rate control, and concurrency limits
10
+
11
+ ### Supported backends and endpoints
12
+
13
+ - `sglang` / `sglang-native`: `POST /generate`
14
+ - `sglang-oai`, `vllm`, `lmdeploy`: `POST /v1/completions`
15
+ - `sglang-oai-chat`, `vllm-chat`, `lmdeploy-chat`: `POST /v1/chat/completions`
16
+ - `trt` (TensorRT-LLM): `POST /v2/models/ensemble/generate_stream`
17
+ - `gserver`: Custom server (Not Implemented yet in this script)
18
+ - `truss`: `POST /v1/models/model:predict`
19
+
20
+ If `--base-url` is provided, requests are sent to it. Otherwise, `--host` and `--port` are used. When `--model` is not provided, the script will attempt to query `GET /v1/models` for an available model ID (OpenAI-compatible endpoints).
21
+
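The model-autodetection fallback described above can be reproduced by hand. A minimal sketch of the response parsing, assuming an OpenAI-compatible `GET /v1/models` body (the model ID shown is hypothetical):

```python
import json

def first_model_id(models_json: str) -> str:
    # Parse a GET /v1/models response body and return the first model ID,
    # mirroring the fallback bench_serving uses when --model is omitted.
    return json.loads(models_json)["data"][0]["id"]

# Hypothetical response body from an OpenAI-compatible server:
body = '{"object": "list", "data": [{"id": "meta-llama/Llama-3.1-8B-Instruct", "object": "model"}]}'
print(first_model_id(body))
```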
22
+ ### Prerequisites
23
+
24
+ - Python 3.8+
25
+ - Dependencies typically used by this script: `aiohttp`, `numpy`, `requests`, `tqdm`, `transformers`, and for some datasets `datasets`, `pillow`, `pybase64`. Install as needed.
26
+ - An inference server running and reachable via the endpoints above
27
+ - If your server requires authentication, set environment variable `OPENAI_API_KEY` (used as `Authorization: Bearer <key>`)
28
+
29
+ ### Quick start
30
+
31
+ Run a basic benchmark against an sglang server exposing `/generate`:
32
+
33
+ ```bash
34
+ python3 -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct
35
+ ```
36
+
37
+ ```bash
38
+ python3 -m sglang.bench_serving \
39
+ --backend sglang \
40
+ --host 127.0.0.1 --port 30000 \
41
+ --num-prompts 1000 \
42
+ --model meta-llama/Llama-3.1-8B-Instruct
43
+ ```
44
+
45
+ Or, using an OpenAI-compatible endpoint (completions):
46
+
47
+ ```bash
48
+ python3 -m sglang.bench_serving \
49
+ --backend vllm \
50
+ --base-url http://127.0.0.1:8000 \
51
+ --num-prompts 1000 \
52
+ --model meta-llama/Llama-3.1-8B-Instruct
53
+ ```
54
+
55
+ ### Datasets
56
+
57
+ Select with `--dataset-name`:
58
+
59
+ - `sharegpt` (default): loads ShareGPT-style pairs; optionally restrict with `--sharegpt-context-len` and override outputs with `--sharegpt-output-len`
60
+ - `random`: random text lengths; sampled from ShareGPT token space
61
+ - `random-ids`: random token ids (can lead to gibberish)
62
+ - `image`: generates images and wraps them in chat messages; supports custom resolutions, multiple formats, and different content types
63
+ - `generated-shared-prefix`: synthetic dataset with shared long system prompts and short questions
64
+ - `mmmu`: samples from MMMU (Math split) and includes images
65
+
66
+ Common dataset flags:
67
+
68
+ - `--num-prompts N`: number of requests
69
+ - `--random-input-len`, `--random-output-len`, `--random-range-ratio`: for random/random-ids/image
70
+ - `--apply-chat-template`: apply tokenizer chat template when constructing prompts
73
+ - `--dataset-path PATH`: file path for ShareGPT json; if blank and missing, it will be downloaded and cached
74
+
75
+ Generated Shared Prefix flags (for `generated-shared-prefix`):
76
+
77
+ - `--gsp-num-groups`
78
+ - `--gsp-prompts-per-group`
79
+ - `--gsp-system-prompt-len`
80
+ - `--gsp-question-len`
81
+ - `--gsp-output-len`
82
+
83
+ Image dataset flags (for `image`):
84
+
85
+ - `--image-count`: Number of images per request
86
+ - `--image-resolution`: Image resolution; supports presets (4k, 1080p, 720p, 360p) or custom 'heightxwidth' format (e.g., 1080x1920, 512x768)
87
+ - `--image-format`: Image format (jpeg or png)
88
+ - `--image-content`: Image content type (random or blank)
89
+
90
+ ### Examples
91
+
92
+ 1. To benchmark the image dataset with 3 images per request, 500 prompts, an input length of 512, and an output length of 512, run:
93
+
94
+ ```bash
95
+ python -m sglang.launch_server --model-path Qwen/Qwen2.5-VL-3B-Instruct --disable-radix-cache
96
+ ```
97
+
98
+ ```bash
99
+ python -m sglang.bench_serving \
100
+ --backend sglang-oai-chat \
101
+ --dataset-name image \
102
+ --num-prompts 500 \
103
+ --image-count 3 \
104
+ --image-resolution 720p \
105
+ --random-input-len 512 \
106
+ --random-output-len 512
107
+ ```
108
+
109
+ 2. To benchmark the random dataset with 3000 prompts, an input length of 1024, and an output length of 1024, run:
110
+
111
+ ```bash
112
+ python -m sglang.launch_server --model-path Qwen/Qwen2.5-3B-Instruct
113
+ ```
114
+
115
+ ```bash
116
+ python3 -m sglang.bench_serving \
117
+ --backend sglang \
118
+ --dataset-name random \
119
+ --num-prompts 3000 \
120
+ --random-input-len 1024 \
+ --random-output-len 1024 \
122
+ --random-range-ratio 0.5
123
+ ```
124
+
125
+ ### Choosing model and tokenizer
126
+
127
+ - `--model` is required unless the backend exposes `GET /v1/models`, in which case the first model ID is auto-selected.
128
+ - `--tokenizer` defaults to `--model`. Both can be HF model IDs or local paths.
129
+ - For ModelScope workflows, setting `SGLANG_USE_MODELSCOPE=true` enables fetching via ModelScope (weights are skipped for speed).
130
+ - If your tokenizer lacks a chat template, the script warns because token counting can be less robust for gibberish outputs.
131
+
132
+ ### Rate, concurrency, and streaming
133
+
134
+ - `--request-rate`: requests per second. `inf` sends all immediately (burst). Non-infinite rate uses a Poisson process for arrival times.
135
+ - `--max-concurrency`: caps concurrent in-flight requests regardless of arrival rate.
136
+ - `--disable-stream`: switch to non-streaming mode when supported; TTFT then equals total latency for chat completions.
137
+
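The Poisson arrival process mentioned above amounts to drawing exponentially distributed inter-arrival gaps with mean `1 / request_rate` and accumulating them into send times. A sketch under that assumption (illustrative, not the script's exact code):

```python
import random

def poisson_arrival_times(num_requests, request_rate, seed=0):
    # Exponential gaps with mean 1/rate yield a Poisson arrival process;
    # the running sum gives each request's send time in seconds.
    rng = random.Random(seed)
    t, times = 0.0, []
    for _ in range(num_requests):
        t += rng.expovariate(request_rate)
        times.append(t)
    return times

# 1000 requests at 100 req/s should span roughly 10 seconds
times = poisson_arrival_times(1000, request_rate=100.0)
```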
138
+ ### Other key options
139
+
140
+ - `--output-file FILE.jsonl`: append JSONL results to file; auto-named if unspecified
141
+ - `--output-details`: include per-request arrays (generated texts, errors, ttfts, itls, input/output lens)
142
+ - `--extra-request-body '{"top_p":0.9,"temperature":0.6}'`: merged into payload (sampling params, etc.)
143
+ - `--disable-ignore-eos`: pass through EOS behavior (varies by backend)
144
+ - `--warmup-requests N`: run warmup requests with short output first (default 1)
145
+ - `--flush-cache`: call `/flush_cache` (sglang) before main run
146
+ - `--profile`: call `/start_profile` and `/stop_profile` (requires server to enable profiling, e.g., `SGLANG_TORCH_PROFILER_DIR`)
147
+ - `--lora-name name1 name2 ...`: randomly pick one per request and pass to backend (e.g., `lora_path` for sglang)
148
+ - `--tokenize-prompt`: send integer IDs instead of text (currently supports `--backend sglang` only)
149
+
150
+ ### Authentication
151
+
152
+ If your target endpoint requires OpenAI-style auth, set:
153
+
154
+ ```bash
155
+ export OPENAI_API_KEY=sk-...yourkey...
156
+ ```
157
+
158
+ The script will add `Authorization: Bearer $OPENAI_API_KEY` automatically for OpenAI-compatible routes.
159
+
160
+ ### Metrics explained
161
+
162
+ Printed after each run:
163
+
164
+ - Request throughput (req/s)
165
+ - Input token throughput (tok/s) - includes both text and vision tokens
166
+ - Output token throughput (tok/s)
167
+ - Total token throughput (tok/s) - includes both text and vision tokens
168
+ - Total input text tokens and Total input vision tokens - per-modality breakdown
169
+ - Concurrency: aggregate time of all requests divided by wall time
170
+ - End-to-End Latency (ms): mean/median/std/p99 per-request total latency
171
+ - Time to First Token (TTFT, ms): mean/median/std/p99 for streaming mode
172
+ - Inter-Token Latency (ITL, ms): mean/median/std/p95/p99/max between tokens
173
+ - TPOT (ms): Token processing time after first token, i.e., `(latency - ttft)/(tokens-1)`
174
+ - Accept length (sglang-only, if available): speculative decoding accept length
175
+
176
+ The script also retokenizes generated text with the configured tokenizer and reports "retokenized" counts.
177
+
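For example, TPOT as defined above can be computed per request from the total latency, TTFT, and output token count. A small sketch with hypothetical numbers (not bench_serving's internal code):

```python
def tpot_ms(latency_ms, ttft_ms, output_tokens):
    # Token processing time after the first token:
    # (latency - ttft) / (tokens - 1)
    return (latency_ms - ttft_ms) / (output_tokens - 1)

# Hypothetical request: 2 s total, 200 ms to first token, 101 output tokens
print(tpot_ms(2000.0, 200.0, 101))  # 18.0 ms per subsequent token
```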
178
+ ### JSONL output format
179
+
180
+ When `--output-file` is set, one JSON object is appended per run. Base fields:
181
+
182
+ - Arguments summary: backend, dataset, request_rate, max_concurrency, etc.
183
+ - Duration and totals: completed, total_input_tokens, total_output_tokens, retokenized totals
184
+ - Throughputs and latency statistics as printed in the console
185
+ - `accept_length` when available (sglang)
186
+
187
+ With `--output-details`, an extended object also includes arrays:
188
+
189
+ - `input_lens`, `output_lens`
190
+ - `ttfts`, `itls` (per request: ITL arrays)
191
+ - `generated_texts`, `errors`
192
+
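Because the output is plain JSONL, runs can be loaded and compared with a few lines of Python. A sketch; the `request_throughput` field name here is an assumption and may differ across versions, so check your own output file:

```python
import json

def load_runs(path):
    # Each bench_serving run appends one JSON object per line.
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

# Example with two hypothetical run records:
lines = [
    '{"backend": "sglang", "request_throughput": 12.5}',
    '{"backend": "vllm", "request_throughput": 10.1}',
]
runs = [json.loads(line) for line in lines]
best = max(runs, key=lambda r: r["request_throughput"])
print(best["backend"])
```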
193
+ ### End-to-end examples
194
+
195
+ 1) sglang native `/generate` (streaming):
196
+
197
+ ```bash
198
+ python3 -m sglang.bench_serving \
199
+ --backend sglang \
200
+ --host 127.0.0.1 --port 30000 \
201
+ --model meta-llama/Llama-3.1-8B-Instruct \
202
+ --dataset-name random \
203
+ --random-input-len 1024 --random-output-len 1024 --random-range-ratio 0.5 \
204
+ --num-prompts 2000 \
205
+ --request-rate 100 \
206
+ --max-concurrency 512 \
207
+ --output-file sglang_random.jsonl --output-details
208
+ ```
209
+
210
+ 2) OpenAI-compatible Completions (e.g., vLLM):
211
+
212
+ ```bash
213
+ python3 -m sglang.bench_serving \
214
+ --backend vllm \
215
+ --base-url http://127.0.0.1:8000 \
216
+ --model meta-llama/Llama-3.1-8B-Instruct \
217
+ --dataset-name sharegpt \
218
+ --num-prompts 1000 \
219
+ --sharegpt-output-len 256
220
+ ```
221
+
222
+ 3) OpenAI-compatible Chat Completions (streaming):
223
+
224
+ ```bash
225
+ python3 -m sglang.bench_serving \
226
+ --backend vllm-chat \
227
+ --base-url http://127.0.0.1:8000 \
228
+ --model meta-llama/Llama-3.1-8B-Instruct \
229
+ --dataset-name random \
230
+ --num-prompts 500 \
231
+ --apply-chat-template
232
+ ```
233
+
234
+ 4) Images (VLM) with chat template:
235
+
236
+ ```bash
237
+ python3 -m sglang.bench_serving \
238
+ --backend sglang \
239
+ --host 127.0.0.1 --port 30000 \
240
+ --model your-vlm-model \
241
+ --dataset-name image \
242
+ --image-count 2 \
243
+ --image-resolution 720p \
244
+ --random-input-len 128 --random-output-len 256 \
245
+ --num-prompts 200 \
246
+ --apply-chat-template
247
+ ```
248
+
249
+ 4a) Images with custom resolution:
250
+
251
+ ```bash
252
+ python3 -m sglang.bench_serving \
253
+ --backend sglang \
254
+ --host 127.0.0.1 --port 30000 \
255
+ --model your-vlm-model \
256
+ --dataset-name image \
257
+ --image-count 1 \
258
+ --image-resolution 512x768 \
259
+ --random-input-len 64 --random-output-len 128 \
260
+ --num-prompts 100 \
261
+ --apply-chat-template
262
+ ```
263
+
264
+ 4b) 1080p images with PNG format and blank content:
265
+
266
+ ```bash
267
+ python3 -m sglang.bench_serving \
268
+ --backend sglang \
269
+ --host 127.0.0.1 --port 30000 \
270
+ --model your-vlm-model \
271
+ --dataset-name image \
272
+ --image-count 1 \
273
+ --image-resolution 1080p \
274
+ --image-format png \
275
+ --image-content blank \
276
+ --random-input-len 64 --random-output-len 128 \
277
+ --num-prompts 100 \
278
+ --apply-chat-template
279
+ ```
280
+
281
+ 5) Generated shared prefix (long system prompts + short questions):
282
+
283
+ ```bash
284
+ python3 -m sglang.bench_serving \
285
+ --backend sglang \
286
+ --host 127.0.0.1 --port 30000 \
287
+ --model meta-llama/Llama-3.1-8B-Instruct \
288
+ --dataset-name generated-shared-prefix \
289
+ --gsp-num-groups 64 --gsp-prompts-per-group 16 \
290
+ --gsp-system-prompt-len 2048 --gsp-question-len 128 --gsp-output-len 256 \
291
+ --num-prompts 1024
292
+ ```
293
+
294
+ 6) Tokenized prompts (ids) for strict length control (sglang only):
295
+
296
+ ```bash
297
+ python3 -m sglang.bench_serving \
298
+ --backend sglang \
299
+ --host 127.0.0.1 --port 30000 \
300
+ --model meta-llama/Llama-3.1-8B-Instruct \
301
+ --dataset-name random \
302
+ --tokenize-prompt \
303
+ --random-input-len 2048 --random-output-len 256 --random-range-ratio 0.2
304
+ ```
305
+
306
+ 7) Profiling and cache flush (sglang):
307
+
308
+ ```bash
309
+ python3 -m sglang.bench_serving \
310
+ --backend sglang \
311
+ --host 127.0.0.1 --port 30000 \
312
+ --model meta-llama/Llama-3.1-8B-Instruct \
313
+ --profile \
314
+ --flush-cache
315
+ ```
316
+
317
+ 8) TensorRT-LLM streaming endpoint:
318
+
319
+ ```bash
320
+ python3 -m sglang.bench_serving \
321
+ --backend trt \
322
+ --base-url http://127.0.0.1:8000 \
323
+ --model your-trt-llm-model \
324
+ --dataset-name random \
325
+ --num-prompts 100 \
326
+ --disable-ignore-eos
327
+ ```
328
+
329
+ 9) Evaluating large-scale KVCache sharing with mooncake trace (sglang only):
330
+
331
+ ```bash
332
+ python3 -m sglang.bench_serving \
333
+ --backend sglang \
334
+ --host 127.0.0.1 --port 30000 \
335
+ --model model-name \
336
+ --dataset-name mooncake \
337
+ --mooncake-slowdown-factor 1.0 \
338
+ --mooncake-num-rounds 1000 \
339
+ --mooncake-workload <conversation|mooncake|agent|synthetic> \
340
+ --use-trace-timestamps true \
341
+ --random-output-len 256
342
+ ```
343
+
344
+ ### Troubleshooting
345
+
346
+ - All requests failed: verify `--backend`, server URL/port, `--model`, and authentication. Check warmup errors printed by the script.
347
+ - Throughput seems too low: adjust `--request-rate` and `--max-concurrency`; verify server batch size/scheduling; ensure streaming is enabled if appropriate.
348
+ - Token counts look odd: prefer chat/instruct models with proper chat templates; otherwise tokenization of gibberish may be inconsistent.
349
+ - Image/MMMU datasets: ensure you installed extra deps (`pillow`, `datasets`, `pybase64`).
350
+ - Authentication errors (401/403): set `OPENAI_API_KEY` or disable auth on your server.
351
+
352
+ ### Notes
353
+
354
+ - The script raises the file descriptor soft limit (`RLIMIT_NOFILE`) to help with many concurrent connections.
355
+ - For sglang, `/get_server_info` is queried post-run to report speculative decoding accept length when available.
sglang/docs/developer_guide/benchmark_and_profiling.md ADDED
@@ -0,0 +1,467 @@
1
+ # Benchmark and Profiling
2
+
3
+ ## Benchmark
4
+
5
+ SGLang provides four benchmark tools that operate at different levels of the stack. The table below summarizes their key differences:
6
+
7
+ | Tool | HTTP Server | Scheduler | Use Case |
8
+ | -------------------------- | --------------------------------------------- | --------------------------------------- | -------------------------------------------------------------------------- |
9
+ | `bench_serving` | Yes (async HTTP client to a running server) | Yes (indirectly, via server) | Realistic online serving benchmarks with latency metrics (TTFT, TPOT, ITL) |
10
+ | `bench_one_batch_server` | Yes (sends HTTP requests to a running server) | Yes (indirectly, via server) | End-to-end single-batch latency including HTTP and scheduler overhead |
11
+ | `bench_offline_throughput` | No | Yes (directly uses `Engine` in-process) | Maximum throughput measurement without HTTP overhead |
12
+ | `bench_one_batch` | No | No (directly calls `ModelRunner`) | Kernel-level latency profiling of a single static batch |
13
+
14
+ Use `bench_serving` by default unless there are specific needs.
15
+
16
+ **`bench_serving`** is an async HTTP load-testing client that sends requests at controlled rates with configurable concurrency to a running server. It measures realistic online serving metrics including time-to-first-token (TTFT), time-per-output-token (TPOT), inter-token latency (ITL), and throughput. Use `num-prompts >= 5 * max-concurrency` to measure steady-state performance. Launch a server with `sglang.launch_server` first.
17
+
18
+ ```bash
19
+ python3 -m sglang.bench_serving --backend sglang --max-concurrency 16 --num-prompts 80 --random-input-len 256 --random-output-len 32 --dataset-name random
20
+ ```
21
+
22
+ **`bench_one_batch_server`** sends a single batch as one HTTP request to a running server. Because only a single batch is sent, the server never reaches a steady state, so the reported metrics are biased. Launch a server with `sglang.launch_server` first.
23
+
24
+ ```bash
25
+ python3 -m sglang.bench_one_batch_server --base-url http://127.0.0.1:30000 --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --batch-size 32 --input-len 256 --output-len 32
26
+ ```
27
+
28
+ **`bench_offline_throughput`** directly instantiates the `Engine` object in-process (no HTTP server) and submits all requests at once via `engine.generate()`. The engine's scheduler handles batching and execution. This measures maximum achievable throughput without any network overhead.
29
+
30
+ ```bash
31
+ python3 -m sglang.bench_offline_throughput --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --num-prompts 10
32
+ ```
33
+
34
+ **`bench_one_batch`** is the lowest-level tool. It directly instantiates a `ModelRunner` and calls `extend()` / `decode()` on a fixed static batch, bypassing the scheduler entirely. The prefill and decode phases are run separately, making profiling easier but rendering the metrics unrealistic. Because there is no dynamic batching, it may run out of memory for batch sizes that a real server can handle (a real server chunks prefill into smaller batches). This is best suited for profiling individual kernel performance.
35
+
36
+ ```bash
37
+ python3 -m sglang.bench_one_batch --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --batch-size 32 --input-len 256 --output-len 32
38
+ ```
39
+
40
+ ## Profile with PyTorch Profiler
41
+
42
+ [Pytorch Profiler](https://pytorch.org/tutorials/recipes/recipes/profiler_recipe.html) is a convenient basic tool to inspect kernel execution time, call stack, and kernel overlap and occupancy.
43
+
44
+ ### Profile a server with `sglang.bench_serving`
45
+
46
+ ```bash
47
+ # set trace path
48
+ export SGLANG_TORCH_PROFILER_DIR=/root/sglang/profile_log
49
+
50
+ # start server
51
+ python -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct
52
+
53
+ # send profiling request from client
54
+ python -m sglang.bench_serving --backend sglang --model meta-llama/Llama-3.1-8B-Instruct --num-prompts 10 --sharegpt-output-len 100 --profile
55
+ ```
56
+
57
+ The `SGLANG_TORCH_PROFILER_DIR` environment variable must be set on both the server and client side; otherwise, the trace file will not be generated correctly. A reliable way to do this is to set it in your shell's resource file (e.g., `~/.bashrc` for bash).
58
+
59
+ For more details, please refer to [Bench Serving Guide](./bench_serving.md).
60
+
61
+ ### Profile In PD Disaggregation Mode
62
+
63
+ When profiling in PD disaggregation mode, prefill and decode workers **must be profiled separately** due to torch profiler limitations. The `bench_serving` command provides dedicated options for this:
64
+
65
+ #### Profile Prefill Workers
66
+
67
+ ```bash
68
+ # set trace path
69
+ export SGLANG_TORCH_PROFILER_DIR=/root/sglang/profile_log
70
+
71
+ # start prefill and decode servers (see PD disaggregation docs for setup)
72
+ python -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --disaggregation-mode prefill
73
+ python -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --disaggregation-mode decode --port 30001 --base-gpu-id 1
74
+
75
+ # start router
76
+ python -m sglang_router.launch_router --pd-disaggregation --prefill http://127.0.0.1:30000 --decode http://127.0.0.1:30001 --host 0.0.0.0 --port 8000
77
+
78
+ # send profiling request targeting prefill workers
79
+ python -m sglang.bench_serving --backend sglang --model meta-llama/Llama-3.1-8B-Instruct --num-prompts 10 --sharegpt-output-len 100 --profile --pd-separated --profile-prefill-url http://127.0.0.1:30000
80
+ ```
81
+
82
+ #### Profile Decode Workers
83
+
84
+ ```bash
85
+ # send profiling request targeting decode workers
86
+ python -m sglang.bench_serving --backend sglang --model meta-llama/Llama-3.1-8B-Instruct --num-prompts 10 --sharegpt-output-len 100 --profile --pd-separated --profile-decode-url http://127.0.0.1:30001
87
+ ```
88
+
89
+ #### Important Notes
90
+
91
+ - `--profile-prefill-url` and `--profile-decode-url` are **mutually exclusive** - you cannot profile both at the same time
92
+ - Both options support multiple worker URLs for multi-instance setups:
93
+ ```bash
94
+ # Profile multiple prefill workers
95
+ python -m sglang.bench_serving --backend sglang --model meta-llama/Llama-3.1-8B-Instruct --num-prompts 10 --profile --pd-separated --profile-prefill-url http://127.0.0.1:30000 http://127.0.0.1:30002
96
+
97
+ # Profile multiple decode workers
98
+ python -m sglang.bench_serving --backend sglang --model meta-llama/Llama-3.1-8B-Instruct --num-prompts 10 --profile --pd-separated --profile-decode-url http://127.0.0.1:30001 http://127.0.0.1:30003
99
+ ```
100
+ - Make sure `SGLANG_TORCH_PROFILER_DIR` is set on all worker nodes before starting the servers
101
+ - For more details on setting up PD disaggregation, see [PD Disaggregation Guide](../advanced_features/pd_disaggregation.md)
102
+
103
+ ### Profile with `sglang.bench_one_batch` and `sglang.bench_offline_throughput`
104
+ ```bash
105
+ export SGLANG_TORCH_PROFILER_DIR=/root/sglang/profile_log
106
+
107
+ # profile one batch with bench_one_batch.py
108
+ # batch size can be controlled with --batch argument
109
+ python3 -m sglang.bench_one_batch --model-path meta-llama/Llama-3.1-8B-Instruct --batch 32 --input-len 1024 --output-len 10 --profile
110
+
111
+ # profile multiple batches with bench_offline_throughput.py
112
+ python -m sglang.bench_offline_throughput --model-path meta-llama/Llama-3.1-8B-Instruct --dataset-name random --num-prompts 10 --profile --mem-frac=0.8
113
+ ```
114
+
115
+ ### Profile a server with `sglang.profiler`
116
+
117
+ When the server is running (e.g., processing a decoding request), you can start live profiling immediately by sending a profile request to the server.
118
+
119
+ You can do this by running `python3 -m sglang.profiler`. For example:
120
+
121
+ ```bash
122
+ # Terminal 1: Send a generation request
123
+ python3 -m sglang.test.send_one
124
+
125
+ # Terminal 2: Before the above request finishes, quickly launch the following command in a separate terminal.
126
+ # It will generate a profile of the above request for several decoding batches.
127
+ python3 -m sglang.profiler
128
+ ```
129
+
130
+ You can also combine the above operations into a single command:
131
+
132
+ ```bash
133
+ python3 -m sglang.test.send_one --profile
134
+ ```
135
+
136
+ ### Profile a server with HTTP API endpoints
137
+
138
+ SGLang provides HTTP API endpoints to control profiling on a running server. This allows you to start and stop profiling programmatically, which is useful for capturing specific workload patterns.
139
+
140
+ #### Using `/start_profile` endpoint
141
+
142
+ The `/start_profile` endpoint starts profiling on the server. You can control when profiling begins and how long it runs using the following parameters:
143
+
144
+ **Basic usage:**
145
+
146
+ ```bash
147
+ # Start profiling immediately for 10 steps
148
+ curl -X POST http://127.0.0.1:30000/start_profile \
149
+ -H "Content-Type: application/json" \
150
+ -d '{
151
+ "num_steps": 10
152
+ }'
153
+ ```
154
+
155
+ **Parameters:**
156
+
157
+ - `output_dir` (optional): Directory where profile traces will be saved. If not specified, uses `SGLANG_TORCH_PROFILER_DIR` environment variable, or `/tmp` as the default
158
+ - `num_steps` (optional): Number of steps to profile. If not specified, profiling continues until manually stopped with `/end_profile`
159
+ - `start_step` (optional): Step number at which to start profiling (inclusive). Useful for skipping warmup iterations
160
+ - `activities` (optional): List of activities to profile, e.g., `["CPU", "GPU"]`. Default is `["CPU", "GPU"]`
161
+ - `merge_profiles` (optional): Whether to merge distributed traces. Default is `false`
162
+
163
+ **Note on step ranges:** Profiling starts at `start_step` (inclusive) and continues for `num_steps` iterations. For example, with `start_step=3` and `num_steps=10`, profiling captures steps 3, 4, 5, 6, 7, 8, 9, 10, 11, and 12 (10 steps total, starting from step 3).
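The step-range rule above can be sketched in a couple of lines (illustrative only; the server implements this logic internally):

```python
def profiled_steps(start_step: int, num_steps: int) -> list:
    # Profiling begins at start_step (inclusive) and runs for num_steps iterations.
    return list(range(start_step, start_step + num_steps))

# Matches the example above: start_step=3, num_steps=10 -> steps 3 through 12
print(profiled_steps(3, 10))
```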
164
+
165
+ **Advanced usage with `start_step`:**
166
+
167
+ ```bash
168
+ # Wait 5 steps (warmup), then profile for 10 steps
169
+ curl -X POST http://127.0.0.1:30000/start_profile \
170
+ -H "Content-Type: application/json" \
171
+ -d '{
172
+ "output_dir": "/tmp/profiles",
173
+ "start_step": 5,
174
+ "num_steps": 10,
175
+ "activities": ["CPU", "GPU"]
176
+ }'
177
+ ```
178
+
179
+ **Continuous profiling (manual stop):**
180
+
181
+ ```bash
182
+ # Start profiling without num_steps - must manually stop with /end_profile
183
+ curl -X POST http://127.0.0.1:30000/start_profile
184
+ ```
185
+
186
+ #### Using `/end_profile` endpoint
187
+
188
+ The `/end_profile` endpoint stops an ongoing profiling session and saves the trace file.
189
+
190
+ ```bash
191
+ # Stop profiling and save traces
192
+ curl -X POST http://127.0.0.1:30000/end_profile
193
+ ```
194
+
195
+ This is only needed when you start profiling without specifying `num_steps`. If `num_steps` is specified, profiling will automatically stop after that many steps.
196
+
197
+ #### Example workflow
198
+
199
+ ```bash
200
+ # Terminal 1: Start the server
201
+ export SGLANG_TORCH_PROFILER_DIR=/tmp/profiles
202
+ python -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct
203
+
204
+ # Terminal 2: Start continuous profiling
205
+ curl -X POST http://127.0.0.1:30000/start_profile \
206
+ -H "Content-Type: application/json" \
207
+ -d '{
208
+ "start_step": 3
209
+ }'
210
+
211
+ # Terminal 3: Send requests to generate load
212
+ python -m sglang.bench_serving --backend sglang --num-prompts 100
213
+
214
+ # Terminal 2: Stop profiling when done
215
+ curl -X POST http://127.0.0.1:30000/end_profile
216
+ ```
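The same workflow can be driven from Python instead of `curl`. The sketch below only builds the JSON request bodies; sending them to the assumed address `http://127.0.0.1:30000` (e.g., via `requests.post`) requires a running server:

```python
def start_profile_payload(output_dir=None, num_steps=None, start_step=None,
                          activities=None, merge_profiles=False):
    # Build the JSON body for POST /start_profile; omit unset optional fields
    # so the server falls back to its documented defaults.
    payload = {"merge_profiles": merge_profiles}
    if output_dir is not None:
        payload["output_dir"] = output_dir
    if num_steps is not None:
        payload["num_steps"] = num_steps
    if start_step is not None:
        payload["start_step"] = start_step
    if activities is not None:
        payload["activities"] = activities
    return payload

# Equivalent of the continuous-profiling curl command above: warm up 3 steps, no step limit
print(start_profile_payload(start_step=3))
```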
217
+
218
+ ### Profiler Trace Merger for Distributed Traces
219
+
220
+ SGLang supports automatic merging of profiling traces from distributed setups with multiple parallelism types (TP, DP, PP, EP). This feature is particularly useful for analyzing performance across distributed runs.
221
+
222
+ #### Multi-Node Profiling and Shared Storage Considerations
223
+
224
+ Merging profiler output on a single node is fully supported. When profiling distributed environments that span multiple nodes, the output directory must reside on shared storage (e.g., NFS, Lustre) accessible by all nodes so that the trace files can be merged.
225
+
226
+ If no shared storage is accessible across the nodes, automatic merging of trace files during profiling is currently not supported.
227
+
228
+ #### HTTP API Usage
229
+
230
+ ```bash
231
+ # Start profiling with automatic trace merging enabled
232
+ curl -X POST <BASE_URL>/start_profile \
233
+ -H "Content-Type: application/json" \
234
+ -d '{
235
+ "output_dir": "/tmp/profiles", # where to store profile traces
236
+ "num_steps": 10,
237
+ "activities": ["CPU", "GPU"],
238
+ "merge_profiles": true # optional argument to merge profile traces (default=False)
239
+ }'
240
+ ```
241
+
242
+ #### Command Line Usage
243
+
244
+ ```bash
245
+ # Start profiling with merge enabled
246
+ python -m sglang.profiler \
247
+ --num-steps 10 \
248
+ --cpu \
249
+ --gpu \
250
+ --output-dir /tmp/profiles \
251
+ --merge-profiles # optional argument to merge profile traces (default=False)
252
+ ```
253
+
254
+ #### Output Files
255
+
256
+ The profile merger generates:
257
+ - Individual rank trace files: `{profile_id}-TP-{tp}-DP-{dp}-PP-{pp}-EP-{ep}.trace.json.gz`
258
+ - Merged trace file: `merged-{profile_id}.trace.json.gz`
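For scripting around the merger output, the file-name patterns listed above can be reproduced directly (a sketch based only on the templates shown here):

```python
def rank_trace_filename(profile_id: str, tp: int, dp: int, pp: int, ep: int) -> str:
    # Per-rank trace file, one per (TP, DP, PP, EP) coordinate
    return f"{profile_id}-TP-{tp}-DP-{dp}-PP-{pp}-EP-{ep}.trace.json.gz"

def merged_trace_filename(profile_id: str) -> str:
    # Single merged trace produced when merge_profiles is enabled
    return f"merged-{profile_id}.trace.json.gz"

print(rank_trace_filename("prof123", 0, 0, 0, 0))
```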
259
+
260
+ ### Possible PyTorch bugs
261
+ If you encounter the following error (for example, when profiling Qwen2.5-VL):
262
+ ```bash
263
+ RuntimeError: !stack.empty() INTERNAL ASSERT FAILED at "/pytorch/torch/csrc/autograd/profiler_python.cpp":983, please report a bug to PyTorch. Python replay stack is empty.
264
+ ```
265
+ This is likely a PyTorch bug reported in [Bug: vLLM Profiler](https://github.com/vllm-project/vllm/issues/18240) and [Bug: torch.profiler.profile](https://github.com/pytorch/pytorch/issues/101632). As a workaround, you can disable `with_stack` with an environment variable, for example:
266
+ ```bash
267
+ export SGLANG_PROFILE_WITH_STACK=False
268
+ python -m sglang.bench_offline_throughput --model-path meta-llama/Llama-3.1-8B-Instruct --dataset-name random --num-prompts 10 --profile --mem-frac=0.8
269
+ ```
270
+
271
+ ### View traces
272
+
273
+ Trace files can be loaded and visualized from:
274
+
275
+ 1. https://ui.perfetto.dev/ (any browser)
276
+ 2. chrome://tracing (Chrome browser only)
277
+
278
+ If the browser cannot open a trace file due to its large size,
279
+ you can generate a smaller trace file (<100 MB) by limiting the number of prompts and the lengths of the outputs.
280
+ For example, when profiling a server:
281
+
282
+ ```bash
283
+ python -m sglang.bench_serving --backend sglang --model meta-llama/Llama-3.1-8B-Instruct --num-prompts 2 --sharegpt-output-len 100 --profile
284
+ ```
285
+
286
+ This command sets the number of prompts to 2 with the `--num-prompts` argument and limits the output length to 100 tokens with the `--sharegpt-output-len` argument, producing a trace file small enough for the browser to open smoothly.
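To check a trace's size before loading it in the browser, you can inspect it offline. Exported traces are gzipped Chrome-trace JSON, which is typically either a plain event list or an object with a `traceEvents` field; the snippet below is a small sketch that writes a tiny sample trace and counts its events:

```python
import gzip
import json
import os
import tempfile

def count_trace_events(path):
    # Chrome-trace files are gzipped JSON: either a list of events,
    # or an object whose "traceEvents" field holds the list.
    with gzip.open(path, "rt") as f:
        trace = json.load(f)
    if isinstance(trace, list):
        return len(trace)
    return len(trace.get("traceEvents", []))

# Write a tiny sample trace and inspect it
sample = os.path.join(tempfile.gettempdir(), "sample.trace.json.gz")
with gzip.open(sample, "wt") as f:
    json.dump({"traceEvents": [{"name": "kernel", "ph": "X"}]}, f)
print(count_trace_events(sample))
```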
287
+
288
+ Additionally, if you want to map a CUDA kernel in the trace back to the SGLang Python source code, you need to disable CUDA Graph when starting the server. This can be done with the `--disable-cuda-graph` flag in the launch command.
289
+
290
+ ## Profile with Nsight
291
+
292
+ [Nsight Systems](https://docs.nvidia.com/nsight-systems/) is an advanced tool that exposes more profiling details, such as register and shared memory usage, annotated code regions, and low-level CUDA APIs and events.
293
+
294
+ 1. Prerequisite:
295
+
296
+ Install using apt, or run inside an [NVIDIA Docker container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch/tags) or an [SGLang Docker container](https://github.com/sgl-project/sglang/tree/main/docker).
297
+
298
+ ```bash
299
+ # install nsys
300
+ # https://docs.nvidia.com/nsight-systems/InstallationGuide/index.html
301
+ apt update
302
+ apt install -y --no-install-recommends gnupg
303
+ echo "deb http://developer.download.nvidia.com/devtools/repos/ubuntu$(source /etc/lsb-release; echo "$DISTRIB_RELEASE" | tr -d .)/$(dpkg --print-architecture) /" | tee /etc/apt/sources.list.d/nvidia-devtools.list
304
+ apt-key adv --fetch-keys http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/7fa2af80.pub
305
+ apt update
306
+ apt install nsight-systems-cli
307
+ ```
308
+
309
+ 2. To profile a single batch, use
310
+
311
+ ```bash
312
+ nsys profile --trace-fork-before-exec=true --cuda-graph-trace=node python3 -m sglang.bench_one_batch --model meta-llama/Meta-Llama-3-8B --batch-size 64 --input-len 512
313
+ ```
314
+
315
+ 3. To profile a server, e.g.
316
+
317
+ ```bash
318
+ # Launch the server; set the delay and duration according to your needs.
319
+ # After the duration has elapsed, the server will be killed by nsys.
320
+
321
+ nsys profile --trace-fork-before-exec=true --cuda-graph-trace=node -o sglang.out --delay 60 --duration 70 python3 -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --disable-radix-cache
322
+
323
+ # client
324
+ python3 -m sglang.bench_serving --backend sglang --num-prompts 1000 --dataset-name random --random-input 1024 --random-output 512
325
+ ```
326
+
327
+ In practice, we recommend setting the `--duration` argument to a large value. When you want the server to stop profiling, first run:
328
+
329
+ ```bash
330
+ nsys sessions list
331
+ ```
332
+
333
+ to get the session id in the form of `profile-XXXXX`, then run:
334
+
335
+ ```bash
336
+ nsys stop --session=profile-XXXXX
337
+ ```
338
+
339
+ to stop the profiler manually and generate the `.nsys-rep` file immediately.
340
+
341
+ 4. Use NVTX to annotate code regions, e.g. to see their execution time.
342
+
343
+ ```bash
344
+ # install nvtx
345
+ pip install nvtx
346
+ ```
347
+
348
+ ```python
349
+ # Example: annotate a critical region so its span shows up in the Nsight Systems timeline
350
+ import nvtx
351
+ with nvtx.annotate("my_critical_region", color="green"):
352
+     pass  # replace with the critical code to measure
353
+ ```
354
+
355
+ ### Layer-wise NVTX Profiling with Nsight Systems
356
+
357
+ SGLang provides built-in layerwise NVTX annotations that can be combined with the CUDA Profiler for detailed per-layer profiling in Nsight Systems. This is particularly useful for identifying performance bottlenecks at the layer level.
358
+
359
+ #### Using `--enable-layerwise-nvtx-marker` with Nsight Systems and `/start_profile`
360
+
361
+ The `--enable-layerwise-nvtx-marker` flag automatically adds NVTX markers to every layer in your model. This is particularly powerful when combined with Nsight Systems profiling to see detailed per-layer performance.
362
+
363
+ **Method 1: Using `/start_profile` with CUDA_PROFILER (for programmatic control)**
364
+
365
+ This method allows you to control exactly when profiling starts/stops via HTTP API while Nsight Systems is running.
366
+
367
+ 1. Launch the server with layerwise NVTX enabled under Nsight Systems:
368
+
369
+ ```bash
370
+ # Terminal 1: Start server with nsys and capture-range option
371
+ nsys profile --trace-fork-before-exec=true \
372
+ --cuda-graph-trace=node \
373
+ --capture-range=cudaProfilerApi \
374
+ --capture-range-end=stop \
375
+ -o layerwise_profile \
376
+ python -m sglang.launch_server \
377
+ --model-path meta-llama/Llama-3.1-8B-Instruct \
378
+ --enable-layerwise-nvtx-marker \
379
+ --disable-cuda-graph
380
+ ```
381
+
382
+ Note: NVTX markers are not emitted for kernel launches captured by CUDA graphs. Use `--disable-cuda-graph` to ensure all layerwise NVTX markers are emitted in the trace.
383
+
384
+ 2. In another terminal, control profiling via `/start_profile` with `CUDA_PROFILER` activity:
385
+
386
+ ```bash
387
+ # Terminal 2: Wait for server to be ready, then start CUDA profiling
388
+ # Wait 3 steps for warmup, then profile for 10 steps
389
+ curl -X POST http://127.0.0.1:30000/start_profile \
390
+ -H "Content-Type: application/json" \
391
+ -d '{
392
+ "start_step": 3,
393
+ "num_steps": 10,
394
+ "activities": ["CUDA_PROFILER"]
395
+ }'
396
+ ```
397
+
398
+ 3. Send requests to generate load:
399
+
400
+ ```bash
401
+ # Terminal 3: Generate workload
402
+ python -m sglang.bench_serving --backend sglang --num-prompts 100
403
+ ```
404
+
405
+ 4. Profiling will automatically stop after 10 steps (due to `num_steps: 10`). If you hadn't specified `num_steps`, you would need to manually stop it:
406
+
407
+ ```bash
408
+ # Terminal 2: Only needed if num_steps was not specified
409
+ curl -X POST http://127.0.0.1:30000/end_profile
410
+ ```
411
+
412
+ The `--capture-range=cudaProfilerApi` option tells Nsight Systems to only capture data between `cudaProfilerStart()` and `cudaProfilerStop()` calls (triggered by `/start_profile` and `/end_profile`), reducing overhead and file size. The `start_step` parameter skips the first 3 steps to avoid capturing warmup overhead.
413
+
414
+ **Method 2: Simpler approach without `/start_profile` API**
415
+
416
+ For simpler use cases where you don't need fine-grained control over profiling start/stop, you can profile with Nsight Systems capturing the entire workload:
417
+
418
+ ```bash
419
+ # Terminal 1: Start server with layerwise NVTX
420
+ # Note: --disable-cuda-graph ensures all NVTX markers are emitted
421
+ python -m sglang.launch_server \
422
+ --model-path meta-llama/Llama-3.1-8B-Instruct \
423
+ --enable-layerwise-nvtx-marker \
424
+ --disable-cuda-graph
425
+
426
+ # Terminal 2: Profile the benchmarking client
427
+ nsys profile --trace-fork-before-exec=true \
428
+ --cuda-graph-trace=node \
429
+ -o layerwise_profile \
430
+ python -m sglang.bench_serving --backend sglang --num-prompts 10
431
+ ```
432
+
433
+ This approach profiles the entire client execution, including all server interactions. The layerwise NVTX markers will be visible in the Nsight Systems timeline.
434
+
435
+ **Viewing the profiling results:**
436
+
437
+ Open the generated report file (`.qdrep`, or `.nsys-rep` on newer Nsight Systems versions) with Nsight Systems:
438
+
439
+ ```bash
440
+ nsys-ui layerwise_profile.qdrep
441
+ ```
442
+
443
+ In the Nsight Systems GUI, you'll see:
444
+ - **NVTX ranges**: Each layer appears as a labeled range in the timeline with detailed information in the marker metadata
445
+ - **CUDA kernels**: All GPU kernels are shown alongside the layer annotations
446
+ - **Layer hierarchy**: The full module path (e.g., `meta-llama/Meta-Llama-3.1-8B-Instruct.model.layers.0.self_attn.qkv_proj`) helps identify specific layers. The prefix uses the full model path from `--model-path`.
447
+ - **Tensor shapes**: Input/output dimensions and parameter shapes are included in the NVTX marker data
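When post-processing exported NVTX ranges, the layer index can be recovered from the module path shown above (illustrative parsing only; the exact marker format may vary):

```python
def layer_index(marker: str):
    # e.g. "meta-llama/Meta-Llama-3.1-8B-Instruct.model.layers.0.self_attn.qkv_proj" -> 0
    parts = marker.split(".")
    if "layers" not in parts:
        return None  # not a per-layer marker (e.g., embeddings or the lm_head)
    return int(parts[parts.index("layers") + 1])

print(layer_index("meta-llama/Meta-Llama-3.1-8B-Instruct.model.layers.0.self_attn.qkv_proj"))
```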
448
+
449
+ **Benefits of layerwise NVTX profiling:**
450
+
451
+ - **Granular visibility**: See exactly which layers are taking the most time
452
+ - **Memory tracking**: Identify layers with large memory allocations
453
+ - **Bottleneck identification**: Quickly locate inefficient operations
454
+ - **Communication overhead**: In multi-GPU setups, see per-layer communication costs
455
+ - **Development debugging**: Validate that model architecture changes have the expected performance impact
456
+
457
+ ## Other tips
458
+
459
+ 1. You can benchmark a model using dummy weights by only providing the config.json file. This allows for quick testing of model variants without training. To do so, add `--load-format dummy` to the above commands and then you only need a correct `config.json` under the checkpoint folder.
460
+ 2. You can benchmark a model with modified configs (e.g., less layers) by using `--json-model-override-args`. For example, you can benchmark a model with only 2 layers and 2 kv heads using:
461
+
462
+ ```bash
463
+ python -m sglang.bench_one_batch --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --batch 32 --input-len 256 --output-len 32 --load-format dummy --json-model-override-args '{"num_hidden_layers": 1, "num_key_value_heads": 1}'
464
+ ```
465
+
466
+ 3. You can use `--python-backtrace=cuda` to see python call stack for all CUDA kernels, as in PyTorch Profiler. (Caveat: this can cause inaccurately long kernel runtimes for CUDA event based timing)
467
+ 4. For more arguments see [Nsight Systems User Guide](https://docs.nvidia.com/nsight-systems/UserGuide/index.html).
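Conceptually, `--json-model-override-args` (tip 2 above) applies a JSON dictionary on top of the fields loaded from `config.json`. A minimal sketch of that merge, with illustrative config values (the actual loader lives in SGLang's config code):

```python
import json

# Fields as they might appear in a Llama-style config.json (illustrative values)
config = {"num_hidden_layers": 32, "num_key_value_heads": 8, "hidden_size": 4096}

# The override string passed on the command line
override = '{"num_hidden_layers": 1, "num_key_value_heads": 1}'

# Overridden keys replace the originals; untouched keys are kept
config.update(json.loads(override))
print(config["num_hidden_layers"], config["num_key_value_heads"])
```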
sglang/docs/developer_guide/contribution_guide.md ADDED
@@ -0,0 +1,147 @@
1
+ # Contribution Guide
2
+
3
+ Welcome to **SGLang**! We appreciate your interest in contributing. This guide provides a concise overview of how to set up your environment, run tests, build documentation, and open a Pull Request (PR). Whether you’re fixing a small bug or developing a major feature, we encourage following these steps for a smooth contribution process.
4
+
5
+ ## Install SGLang from Source
6
+
7
+ ### Fork and clone the repository
8
+
9
+ **Note**: New contributors do **not** have the write permission to push to the official SGLang repo. Please fork the repository under your GitHub account, then clone your fork locally.
10
+
11
+ ```bash
12
+ git clone https://github.com/<your_user_name>/sglang.git
13
+ ```
14
+
15
+ ### Build from source
16
+
17
+ Refer to [Install SGLang from Source](../get_started/install.md#method-2-from-source).
18
+
19
+ ## Format code with pre-commit
20
+
21
+ We use [pre-commit](https://pre-commit.com/) to maintain consistent code style checks. Before pushing your changes, please run:
22
+
23
+ ```bash
24
+ pip3 install pre-commit
25
+ pre-commit install
26
+ pre-commit run --all-files
27
+ ```
28
+
29
+ - **`pre-commit run --all-files`** manually runs all configured checks, applying fixes if possible. If it fails the first time, re-run it to ensure lint errors are fully resolved. Make sure your code passes all checks **before** creating a Pull Request.
30
+ - **Do not commit** directly to the `main` branch. Always create a new branch (e.g., `feature/my-new-feature`), push your changes, and open a PR from that branch.
31
+
32
+ ## Run and add unit tests
33
+
34
+ If you add a new feature or fix a bug, please add corresponding unit tests to ensure coverage and prevent regression.
35
+ SGLang uses Python's built-in [unittest](https://docs.python.org/3/library/unittest.html) framework.
36
+ For detailed instructions on running tests and integrating them into CI, refer to [test/README.md](https://github.com/sgl-project/sglang/tree/main/test/README.md).
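A minimal skeleton in that framework (the class and test names here are placeholders, not an actual SGLang test):

```python
import unittest

class TestMyFeature(unittest.TestCase):
    def test_basic_output(self):
        # Replace with a real check against your feature's behavior
        result = sorted([3, 1, 2])
        self.assertEqual(result, [1, 2, 3])
```

Run it with `python -m unittest path/to/test_file.py`.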
37
+
38
+ ## Write documentation
39
+
40
+ We recommend that new contributors start by writing documentation, which helps you quickly understand the SGLang codebase.
41
+ For more details, please refer to [docs/README.md](https://github.com/sgl-project/sglang/tree/main/docs/README.md).
42
+
43
+ ## Test the accuracy
44
+ If your code changes the model output, please run the accuracy tests. A quick sanity check is the few-shot GSM8K.
45
+
46
+ ```
47
+ # Launch a server
48
+ python3 -m sglang.launch_server --model Qwen/Qwen2-7B-Instruct
49
+
50
+ # Evaluate
51
+ python3 -m sglang.test.few_shot_gsm8k --num-questions 200
52
+ ```
53
+
54
+ Please note that the above script is primarily a sanity check, not a rigorous accuracy or speed test.
55
+ This test can have significant variance (1%–5%) in accuracy due to batching and the non-deterministic nature of the inference engine.
56
+ Also, do not rely on the "Latency/Output throughput" from this script, as it is not a proper speed test.
57
+
58
+ GSM8K is too easy for state-of-the-art models nowadays. Please try your own more challenging accuracy tests.
59
+ You can find additional accuracy eval examples in:
60
+ - [test_eval_accuracy_large.py](https://github.com/sgl-project/sglang/blob/main/test/srt/test_eval_accuracy_large.py)
61
+ - [test_gpt_oss_1gpu.py](https://github.com/sgl-project/sglang/blob/main/test/srt/test_gpt_oss_1gpu.py)
62
+
63
+ ## Benchmark the speed
64
+ Refer to [Benchmark and Profiling](../developer_guide/benchmark_and_profiling.md).
65
+
66
+ ## Requesting a review for merge
67
+ You can follow the pull request merge process described in [MAINTAINER.md](https://github.com/sgl-project/sglang/blob/main/.github/MAINTAINER.md).
68
+ You will need to work with the Merge Oncall, Codeowner, and other reviewers to get their approvals.
69
+ Then your PR can be merged.
70
+
71
+ ## How to Trigger CI Tests
72
+
73
+ We have a lot of open PRs but limited CI machines, so only top and trusted contributors have permission to trigger CI tests.
74
+ Users with permission are listed in [CI_PERMISSIONS.json](https://github.com/sgl-project/sglang/blob/main/.github/CI_PERMISSIONS.json).
75
+
76
+ **PR authors** can always use `/rerun-failed-ci` on their own PRs, even if they are not listed in `CI_PERMISSIONS.json`.
77
+
78
+ For CI to run on a pull request, it must have the "run-ci" label. Authorized users can add the label or rerun failed tests by commenting on the PR with one of these commands:
79
+
80
+ - `/tag-run-ci-label`: Adds the "run-ci" label. Every future commit will trigger CI.
81
+ - `/rerun-failed-ci`: Reruns the failed or flaky tests from the most recent commit.
82
+ - `/tag-and-rerun-ci`: A single command that performs both `/tag-run-ci-label` and `/rerun-failed-ci`.
83
+ - `/rerun-stage <stage-name>`: Reruns a specific test stage without waiting for its dependencies. This is useful when you want to quickly validate a fix for a specific test failure instead of waiting ~30 minutes for preceding stages to complete.
84
+
85
+ If you have permission, the [Slash Command Handler](https://github.com/sgl-project/sglang/actions/workflows/slash-command-handler.yml) will run your command and react with a 👍 to your comment. It may take up to a few minutes for the reaction to appear. Here’s a usage [example](https://github.com/sgl-project/sglang/pull/14253#issuecomment-3599509302).
86
+
87
+ To avoid spamming a PR with too many `/rerun-failed-ci` comments, you can also trigger the command by editing an existing comment and adding any suffix (e.g., `/rerun-failed-ci try again`).
88
+
89
+ Example of rerunning a single test stage: `/rerun-stage unit-test-backend-4-gpu`.
90
+
91
+ If you don’t have permission and you’re not the PR author, please ask maintainers to trigger CI for you.
92
+
93
+ ### CI rate limits
94
+
95
+ Due to CI scheduling and limited resources, higher-priority PRs may preempt running jobs. In such cases, you may need to rerun the tests.
96
+
97
+ We apply CI rate limits to prevent abuse and ensure fair usage of our CI resources.
98
+
99
+ Each CI workflow has a default limit defined in its workflow configuration file. For example, in [pr-gate.yml](https://github.com/sgl-project/sglang/blob/main/.github/workflows/pr-gate.yml), the default cooldown period is 120 minutes, and each workflow can override it via the `cool-down-minutes` input parameter:
100
+
101
+ ```yaml
102
+ cool-down-minutes:
103
+ description: "Default cooldown period in minutes; 0 disables rate limiting"
104
+ type: number
105
+ default: 120
106
+ ```
107
+
108
+ Users listed in [CI_PERMISSIONS.json](https://github.com/sgl-project/sglang/blob/main/.github/CI_PERMISSIONS.json) may have a per-user cooldown interval. In practice, we use the minimum of the workflow’s default window and the user-specific interval.
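The effective rate-limit window described above can be expressed as follows (a sketch of the stated rule, not the actual CI code):

```python
def effective_cooldown(workflow_default_minutes, user_interval_minutes=None):
    # A workflow default of 0 disables rate limiting; when a per-user interval
    # exists, the effective window is the minimum of the two values.
    if user_interval_minutes is None:
        return workflow_default_minutes
    return min(workflow_default_minutes, user_interval_minutes)

print(effective_cooldown(120, 30))
```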
109
+
110
+
111
+ ## Code style guidance
112
+ - Avoid code duplication. If the same code snippet (more than five lines) appears multiple times, extract it into a shared function.
113
+ - Minimize device synchronization. Reduce expensive CPU-GPU synchronization operations, such as `tensor.item()` or `tensor.cpu()`, whenever possible. Use vectorized code.
114
+ - Prioritize extreme efficiency. SGLang is a runtime, and most of your code runs on the critical path for every request. Optimize all minor overheads as much as possible, especially in the model forward code.
115
+ - A common pattern is some runtime checks in the model forward pass (e.g., [this](https://github.com/sgl-project/sglang/blob/f1b0eda55c2c4838e8ab90a0fac7fb1e3d7064ab/python/sglang/srt/models/deepseek_v2.py#L486-L491)). These are very likely the same for every layer. Please cache the result as a single boolean value whenever possible.
116
+ - Make functions as pure as possible. Avoid in-place modification of arguments.
117
+ - Keep files concise. If a file exceeds 2,000 lines of code, split it into multiple smaller files. (e.g., `scheduler.py`, `scheduler_output_processor_mixin.py`)
118
+ - Keep tests fast.
119
+   - If a single test file runs longer than 500 seconds, split it into multiple smaller files (e.g., `test_eagle_infer_a.py`, `test_eagle_infer_b.py`).
120
+   - If a single job in a GitHub workflow runs longer than 30 minutes, split it into smaller jobs/steps.
121
+ - Reuse server launches in your unit tests to make tests run faster.
122
+ - When supporting new hardware or features, follow these guidelines:
123
+ - Do not drastically change existing code.
124
+ - Always prefer new files to introduce specific components for your new hardware (e.g., `allocator_ascend.py`).
125
+ - If you write multiple if/else blocks for new features, ensure the common path (e.g., NVIDIA hardware or the existing code path) is the first branch.
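The "cache the runtime check" guidance above can be illustrated as follows (hypothetical class and config names, not actual SGLang code):

```python
class AttentionLayer:
    def __init__(self, config):
        # Evaluate the condition once at construction time instead of on
        # every forward call; it is the same for every layer and every step.
        self.use_fused_kernel = (
            config.get("quantization") == "fp8" and config.get("tp_size", 1) > 1
        )

    def forward(self, x):
        # The hot path only pays for a cheap boolean check.
        if self.use_fused_kernel:
            return self._fused_forward(x)
        return self._default_forward(x)

    def _fused_forward(self, x):
        return x  # placeholder for the optimized path

    def _default_forward(self, x):
        return x  # placeholder for the common path

layer = AttentionLayer({"quantization": "fp8", "tp_size": 4})
print(layer.use_fused_kernel)
```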
126
+
127
+ ## How to update sgl-kernel
128
+ Since sglang and sgl-kernel are separate Python packages, our current GitHub CI infrastructure does not support updating a kernel and using it immediately within the same pull request (PR).
129
+ To add a new kernel or modify an existing one in the sgl-kernel package, you must use multiple PRs.
130
+
131
+ Follow these steps:
132
+
133
+ 1. Submit a PR to update the sgl-kernel source code without using it in the sglang Python package (e.g., [#8884](https://github.com/sgl-project/sglang/pull/8884/files)).
134
+ 2. Bump the version of sgl-kernel (e.g., [#9220](https://github.com/sgl-project/sglang/pull/9220/files)).
135
+ - Once merged, this will trigger an automatic release of the sgl-kernel wheel to PyPI.
136
+ - If not urgent, you can wait for other people to release the wheel. A new version will typically be released within one week.
137
+ 3. Apply the changes:
138
+ - Update the sgl-kernel version in `sglang/python/pyproject.toml` to use the modified kernels.
139
+ - Update the related caller code in sglang to use the new kernel.
140
+
141
+ ## Tips for newcomers
142
+
143
+ If you want to contribute but don’t have a specific idea in mind, pick issues labeled [“good first issue” or “help wanted”](https://github.com/sgl-project/sglang/issues?q=is%3Aissue+label%3A%22good+first+issue%22%2C%22help+wanted%22). These tasks typically have lower complexity and provide an excellent introduction to the codebase. Also check out this [code walk-through](https://github.com/zhaochenyang20/Awesome-ML-SYS-Tutorial/tree/main/sglang/code-walk-through) for a deeper look into SGLang’s workflow.
144
+
145
+ If you have any questions or want to start a discussion, please feel free to ask in our [Slack channel](https://slack.sglang.io).
146
+
147
+ Thank you for your interest in SGLang. Happy coding!
sglang/docs/developer_guide/development_guide_using_docker.md ADDED
@@ -0,0 +1,108 @@
1
+ # Development Guide Using Docker
2
+
3
+ ## Setup VSCode on a Remote Host
4
+ (Optional - you can skip this step if you plan to run sglang dev container locally)
5
+
6
+ 1. In the remote host, download the `code` CLI from [code.visualstudio.com/download](https://code.visualstudio.com/download) and run `code tunnel` in a shell.
7
+
8
+ Example
9
+ ```bash
10
+ wget https://vscode.download.prss.microsoft.com/dbazure/download/stable/fabdb6a30b49f79a7aba0f2ad9df9b399473380f/vscode_cli_alpine_x64_cli.tar.gz
11
+ tar xf vscode_cli_alpine_x64_cli.tar.gz
12
+
13
+ # https://code.visualstudio.com/docs/remote/tunnels
14
+ ./code tunnel
15
+ ```
16
+
17
+ 2. In your local machine, press F1 in VSCode and choose "Remote Tunnels: Connect to Tunnel".
18
+
19
+ ## Setup Docker Container
20
+
21
+ ### Option 1. Use the default dev container automatically from VSCode
22
+ There is a `.devcontainer` folder in the sglang repository root folder to allow VSCode to automatically start up within dev container. You can read more about this VSCode extension in VSCode official document [Developing inside a Container](https://code.visualstudio.com/docs/devcontainers/containers).
23
+ ![image](https://github.com/user-attachments/assets/6a245da8-2d4d-4ea8-8db1-5a05b3a66f6d)
24
+ (*Figure 1: Diagram from VSCode official documentation [Developing inside a Container](https://code.visualstudio.com/docs/devcontainers/containers).*)
25
+
26
+ To enable this, you only need to:
27
+ 1. Start Visual Studio Code and install [VSCode dev container extension](https://marketplace.visualstudio.com/items?itemName=ms-vscode-remote.remote-containers).
28
+ 2. Press F1, then type and choose "Dev Container: Open Folder in Container".
29
+ 3. Enter the path of your local `sglang` repo and press Enter.
30
+
31
+ The first time you open the dev container may take longer due to the Docker image pull and build. Once it succeeds, the status bar at the bottom left should show that you are in a dev container:
32
+
33
+ ![image](https://github.com/user-attachments/assets/650bba0b-c023-455f-91f9-ab357340106b)
34
+
35
+ Now when you run `sglang.launch_server` in the VSCode terminal or start debugging using F5, sglang server will be started in the dev container with all your local changes applied automatically:
36
+
37
+ ![image](https://github.com/user-attachments/assets/748c85ba-7f8c-465e-8599-2bf7a8dde895)
38
+
39
+
40
+ ### Option 2. Start up containers manually (advanced)
41
+
42
+ The following startup command is an example for internal development by the SGLang team. You can **modify or add directory mappings as needed**, especially for model weight downloads, to prevent repeated downloads by different Docker containers.
43
+
44
+ ❗️ **Note on RDMA**
45
+
46
+ 1. `--network host` and `--privileged` are required by RDMA. If you don't need RDMA, you can remove them, but keeping them does no harm. Thus, we enable these two flags by default in the commands below.
47
+ 2. You may need to set `NCCL_IB_GID_INDEX` if you are using RoCE, for example: `export NCCL_IB_GID_INDEX=3`.
48
+
49
+ ```bash
50
+ # Change the name to yours
51
+ docker run -itd --shm-size 32g --gpus all -v <volumes-to-mount> --ipc=host --network=host --privileged --name sglang_dev lmsysorg/sglang:dev /bin/zsh
52
+ docker exec -it sglang_dev /bin/zsh
53
+ ```
54
+ Some useful volumes to mount are:
55
+ 1. **Hugging Face model cache**: mounting the model cache avoids re-downloading models every time the Docker container restarts. The default location on Linux is `~/.cache/huggingface/`.
56
+ 2. **SGLang repository**: code changes in your local SGLang repository will be automatically synced to the dev container.
57
+
58
+ Example 1: Mounting the local cache folder `/opt/dlami/nvme/.cache` but not the SGLang repo. Use this when you prefer to manually transfer local code changes into the dev container.
59
+ ```bash
60
+ docker run -itd --shm-size 32g --gpus all -v /opt/dlami/nvme/.cache:/root/.cache --ipc=host --network=host --privileged --name sglang_zhyncs lmsysorg/sglang:dev /bin/zsh
61
+ docker exec -it sglang_zhyncs /bin/zsh
62
+ ```
63
+ Example 2: Mounting both the Hugging Face cache and the local SGLang repo. Local code changes are automatically synced to the dev container, as SGLang is installed in editable mode in the dev image.
64
+ ```bash
65
+ docker run -itd --shm-size 32g --gpus all -v $HOME/.cache/huggingface/:/root/.cache/huggingface -v $HOME/src/sglang:/sgl-workspace/sglang --ipc=host --network=host --privileged --name sglang_zhyncs lmsysorg/sglang:dev /bin/zsh
66
+ docker exec -it sglang_zhyncs /bin/zsh
67
+ ```
68
+ ## Debug SGLang with VSCode Debugger
69
+ 1. Open `launch.json` in VSCode (create it if it does not exist).
70
+ 2. Add the following config and save. Please note that you can edit the script as needed to apply different parameters or debug a different program (e.g. benchmark script).
71
+ ```JSON
72
+ {
73
+ "version": "0.2.0",
74
+ "configurations": [
75
+ {
76
+ "name": "Python Debugger: launch_server",
77
+ "type": "debugpy",
78
+ "request": "launch",
79
+ "module": "sglang.launch_server",
80
+ "console": "integratedTerminal",
81
+ "args": [
82
+ "--model-path", "meta-llama/Llama-3.2-1B",
83
+ "--host", "0.0.0.0",
84
+ "--port", "30000",
85
+ "--trust-remote-code",
86
+ ],
87
+ "justMyCode": false
88
+ }
89
+ ]
90
+ }
91
+ ```
92
+
93
+ 3. Press "F5" to start. The VSCode debugger will pause the program at breakpoints even when it is running on a remote SSH/Tunnel host or inside a dev container.
94
+
95
+ ## Profile
96
+
97
+ ```bash
98
+ # Change batch size, input, output and add `--disable-cuda-graph` (for easier analysis)
99
+ # e.g. DeepSeek V3
100
+ nsys profile -o deepseek_v3 python3 -m sglang.bench_one_batch --batch-size 1 --input 128 --output 256 --model deepseek-ai/DeepSeek-V3 --trust-remote-code --tp 8 --disable-cuda-graph
101
+ ```
102
+
103
+ ## Evaluation
104
+
105
+ ```bash
106
+ # e.g. gsm8k 8 shot
107
+ python3 benchmark/gsm8k/bench_sglang.py --num-questions 2000 --parallel 2000 --num-shots 8
108
+ ```
sglang/docs/developer_guide/development_jit_kernel_guide.md ADDED
@@ -0,0 +1,259 @@
+ # Development Guide for JIT Kernels
2
+
3
+ ## Environment Setup
4
+
5
+ We strongly recommend using `clangd` as the language server for JIT kernel development.
6
+ For Ubuntu/Debian, you can download clangd from [apt.llvm.org](https://apt.llvm.org/).
7
+ If you are using VS Code, we recommend installing the `clangd` extension for better IDE integration.
8
+
9
+ All JIT-related files are located in `python/sglang/jit_kernel`.
10
+ Unlike `sgl-kernel`, which compiles CUDA/C++ binaries ahead of time (AOT), just-in-time (JIT) kernels are compiled at runtime.
11
+ Consequently, a static `compile_commands.json` cannot be generated.
12
+ To enable code completion with `clangd`, run `python -m sglang.jit_kernel` to generate a `.clangd` configuration file in your current directory.
13
+ After generating the file, restart the clangd language server. It should now recognize all JIT kernel files.
14
+
15
+ ## Code Structure
16
+
17
+ ### C++ Implementation
18
+
19
+ C++ source code is located in `python/sglang/jit_kernel/csrc`.
20
+ Reusable functions should be placed in `python/sglang/jit_kernel/include`.
21
+
22
+ We use [tvm-ffi](https://github.com/apache/tvm-ffi) for efficient foreign language bindings.
23
+ Refer to the [documentation](https://tvm.apache.org/ffi/) for advanced usage, such as exporting C++ objects.
24
+ Typically, `tvm::ffi::TensorView` is sufficient for passing PyTorch Tensors from Python.
25
+
26
+ ### Python Interface
27
+
28
+ Python interfaces are defined in `python/sglang/jit_kernel`.
29
+ The `load_jit` utility function in `python/sglang/jit_kernel/utils.py` loads and returns the compiled module.
30
+ To export a C++ function (e.g., `cpp_func`), pass `cuda_wrappers=[("func", "cpp_func")]` to `load_jit`.
31
+ The function can then be called in Python as `module.func`.
32
+
33
+ For caching compiled modules, prefer `sglang.jit_kernel.utils.cache_once` over `functools.lru_cache`.
34
+ `functools.lru_cache` is not compatible with `torch.compile`.
35
+
36
+ ### C++ Utilities
37
+
38
+ The following C++ utilities are available:
39
+
40
+ #### Integer Range
41
+
42
+ Similar to PyTorch, we provide an `irange` function to represent an integer range.
43
+
44
+ ```C++
45
+ #include <sgl_kernel/utils.h>
46
+
47
+ void test() {
48
+ for (auto i : host::irange(100)) { // [0, 100)
49
+ // do something
50
+ }
51
+ for (auto i : host::irange(0, 100)) { // [0, 100)
52
+ // do something
53
+ }
54
+ }
55
+
56
+ ```
57
+
58
+ #### Runtime Checking
59
+
60
+ `RuntimeCheck` validates conditions at runtime. It accepts optional arguments for error reporting.
61
+ If the check fails, these arguments are output to aid debugging.
62
+ `RuntimeDeviceCheck` verifies the status of the last kernel launch.
63
+
64
+ ```C++
65
+ #include <sgl_kernel/utils.h>
66
+ #include <sgl_kernel/utils.cuh>
67
+
68
+ void test() {
69
+ host::RuntimeCheck(1 + 1 == 2, 1 + 1, " != ", 2);
70
+ host::RuntimeDeviceCheck();
71
+ // check the provided `cudaError_t`
72
+ host::RuntimeDeviceCheck(cudaGetLastError());
73
+ }
74
+
75
+ ```
76
+
77
+ #### Tensor Checking
78
+
79
+ `TensorMatcher` provides a readable way to validate and extract tensor shape information.
80
+
81
+ ```cpp
82
+ #include <sgl_kernel/tensor.h>
83
+
84
+ void test(const tvm::ffi::TensorView k_cache, const tvm::ffi::TensorView v_cache) {
85
+ using namespace host;
86
+
87
+ auto D = SymbolicSize{"D"}; // cache dimension
88
+ auto N = SymbolicSize{"N"}; // kvcache stride
89
+ auto dtype = SymbolicDType{};
90
+ auto device = SymbolicDevice{};
91
+
92
+ TensorMatcher({-1, D}) //
93
+ .with_strides({N, 1})
94
+ .with_dtype<int32_t, int64_t>(dtype)
95
+ .with_device<kDLCUDA, kDLCPU>(device)
96
+ .verify(k_cache)
97
+ .verify(v_cache);
98
+ }
99
+ ```
100
+
101
+ Configure the `TensorMatcher` with expected stride, dtype, and device properties before verification.
102
+ - If `with_strides` is omitted, the tensor is expected to be contiguous.
103
+ - Template arguments in `with_dtype` restrict the allowed data types.
104
+ - Template arguments in `with_device` restrict the allowed devices.
105
+ - Values passed to `with_xxx` methods enforce equality checks.
106
+ - Passing `-1` for size or stride allows matching any value.
107
+
108
+ A `Symbolic` variable must resolve to the same value across all verifications.
109
+ Use `.unwrap()` to retrieve the matched value after verification.
110
+
111
+ > Note: `TensorMatcher` is a temporary expression and should not be stored in a variable.
112
+
113
+ > Tip: Add `//` at the end of the `TensorMatcher` chain to enforce proper indentation.
114
+
115
+ #### Kernel Launching
116
+
117
+ `LaunchKernel::resolve_device` retrieves the current `cudaStream` from PyTorch.
118
+ Kernels can also be launched directly using `LaunchKernel`.
119
+
120
+ ```cpp
121
+ #include <sgl_kernel/utils.cuh>
122
+
123
+ #include <dlpack/dlpack.h>
124
+
125
+ __global__ void kernel() {}
126
+
127
+ void test() {
128
+ const auto num_blocks = 1;
129
+ const auto num_threads = 32;
130
+ const auto dynamic_smem = 0;
131
+
132
+ DLDevice dev; // suppose this is initialized properly
133
+ host::LaunchKernel(num_blocks, num_threads, dev)(kernel);
134
+
135
+ cudaStream_t stream = host::LaunchKernel::resolve_device(dev);
136
+ host::LaunchKernel(num_blocks, num_threads, stream, dynamic_smem)(kernel);
137
+ }
138
+
139
+ ```
140
+
141
+ ## Add new kernels
142
+
143
+ This section walks through a complete, end-to-end example of adding a new JIT kernel to the system.
144
+ We use a simple `add_constant` kernel as a running example, which adds a constant integer value to every element of an input tensor.
145
+
146
+ Conceptually, the Python interface looks like this:
147
+
148
+ ```python
149
+ def add_constant(src: torch.Tensor, c: int):
150
+ return src + c
151
+ ```
152
+
153
+ ### STEP 1: Write the C++ kernel
154
+
155
+ Write your CUDA kernel in [jit_kernel/csrc/add_constant.cuh](../../python/sglang/jit_kernel/csrc/add_constant.cuh). For demonstration purposes, we pass the constant value as a template parameter.
156
+
157
+ ```cpp
158
+ #include <sgl_kernel/tensor.h> // For TensorMatcher, SymbolicSize, SymbolicDevice
159
+ #include <sgl_kernel/utils.cuh> // For LaunchKernel
160
+ #include <sgl_kernel/utils.h> // For div_ceil, RuntimeCheck
161
+
162
+ #include <dlpack/dlpack.h>
163
+ #include <tvm/ffi/container/tensor.h>
164
+
165
+ #include <cstddef>
166
+ #include <cstdint>
167
+
168
+ namespace {
169
+
170
+ template <int32_t kConstant>
171
+ __global__ void add_constant_kernel(int32_t* dst, const int32_t* src, size_t length) {
172
+ size_t idx = blockIdx.x * blockDim.x + threadIdx.x;
173
+ if (idx < length) {
174
+ dst[idx] = src[idx] + kConstant;
175
+ }
176
+ }
177
+
178
+ constexpr size_t kBlockSize = 256;
179
+
180
+ // You can also use struct with static method as an alternative
181
+ template <int32_t kConstant>
182
+ void add_constant(tvm::ffi::TensorView dst, tvm::ffi::TensorView src) {
183
+ using namespace host;
184
+
185
+ // 1. Validate input tensors
186
+ SymbolicSize N = {"num_elements"};
187
+ SymbolicDevice device_;
188
+ TensorMatcher({N}) // 1D tensor, must be contiguous
189
+ .with_dtype<int32_t>() // must be int32
190
+ .with_device<kDLCUDA>(device_) // must be on CUDA device
191
+ .verify(dst) // check tensor dst
192
+ .verify(src); // check tensor src
193
+
194
+ // 2. Extract required parameters, prepare for kernel launch
195
+ const size_t num_elements = N.unwrap();
196
+ const size_t grid_size = div_ceil(num_elements, kBlockSize);
197
+ const DLDevice device = device_.unwrap();
198
+ // some extra runtime checks using host::RuntimeCheck
199
+ RuntimeCheck(num_elements > 0, "We only support non-empty tensors, got num_elements = ", num_elements);
200
+
201
+ // 3. Launch the kernel. Error code will be automatically checked.
202
+ LaunchKernel(grid_size, kBlockSize, device /*, dynamic_smem*/)(
203
+ // kernel function
204
+ add_constant_kernel<kConstant>,
205
+ // kernel arguments
206
+ static_cast<int32_t*>(dst.data_ptr()),
207
+ static_cast<int32_t*>(src.data_ptr()),
208
+ num_elements);
209
+ }
210
+
211
+ } // namespace
212
+
213
+ ```
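In the launch step above, `grid_size = div_ceil(num_elements, kBlockSize)` rounds up so that every element is covered by a thread. In Python terms (an illustrative helper mirroring the C++ `div_ceil`, included here as a sketch rather than the actual implementation):

```python
def div_ceil(a: int, b: int) -> int:
    """Ceiling division: the smallest g such that g * b >= a."""
    return (a + b - 1) // b

# With a block size of 256 threads, a 1000-element tensor needs 4 blocks.
print(div_ceil(1000, 256))  # 4
```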
214
+
215
+ ### STEP 2: Create Python Interfaces
216
+
217
+ Next, expose the kernel through a Python wrapper.
218
+ Create a new file at [jit_kernel/add_constant.py](../../python/sglang/jit_kernel/add_constant.py) and expose the needed interfaces.
219
+
220
+ ```python
221
+ from __future__ import annotations
222
+ from typing import TYPE_CHECKING
223
+
224
+ import torch
225
+
226
+ from sglang.jit_kernel.utils import cache_once, load_jit, make_cpp_args
227
+
228
+ if TYPE_CHECKING:
229
+ from tvm_ffi.module import Module
230
+
231
+
232
+ @cache_once
233
+ def _jit_add_constant_module(constant: int) -> Module:
234
+ args = make_cpp_args(constant) # pass all the template argument
235
+ return load_jit(
236
+ "add_constant",
237
+ *args,
238
+ cuda_files=["add_constant.cuh"],
239
+ cuda_wrappers=[("add_constant", f"add_constant<{args}>")],
240
+ )
241
+
242
+
243
+ def add_constant(src: torch.Tensor, constant: int) -> torch.Tensor:
244
+ dst = torch.empty_like(src)
245
+ module = _jit_add_constant_module(constant)
246
+ module.add_constant(dst, src)
247
+ return dst
248
+
249
+ ```
250
+
251
+ ### STEP 3: Use your kernel
252
+
253
+ Finally, import and use the kernel like a regular Python function:
254
+
255
+ ```python
256
+ from sglang.jit_kernel.add_constant import add_constant
257
+ ```
258
+
259
+ For a complete, runnable example, refer to [test_add_constant.py](../../python/sglang/jit_kernel/tests/test_add_constant.py).
sglang/docs/developer_guide/evaluating_new_models.md ADDED
@@ -0,0 +1,146 @@
+ # Evaluating New Models with SGLang
2
+
3
+ This document provides commands for evaluating models' accuracy and performance. Before open-sourcing new models, we strongly suggest running these commands to verify whether the score matches your internal benchmark results.
4
+
5
+ **For cross-verification, when open-sourcing your models, please submit the installation, server-launch, and benchmark commands, along with all scores and hardware requirements.**
6
+
7
+ [Reference: MiniMax M2](https://github.com/sgl-project/sglang/pull/12129)
8
+
9
+ ## Accuracy
10
+
11
+ ### LLMs
12
+
13
+ SGLang provides built-in scripts to evaluate common benchmarks.
14
+
15
+ **MMLU**
16
+
17
+ ```bash
18
+ python -m sglang.test.run_eval \
19
+ --eval-name mmlu \
20
+ --port 30000 \
21
+ --num-examples 1000 \
22
+ --max-tokens 8192
23
+ ```
24
+
25
+ **GSM8K**
26
+
27
+ ```bash
28
+ python -m sglang.test.few_shot_gsm8k \
29
+ --host 127.0.0.1 \
30
+ --port 30000 \
31
+ --num-questions 200 \
32
+ --num-shots 5
33
+ ```
34
+
35
+ **HellaSwag**
36
+
37
+ ```bash
38
+ python benchmark/hellaswag/bench_sglang.py \
39
+ --host 127.0.0.1 \
40
+ --port 30000 \
41
+ --num-questions 200 \
42
+ --num-shots 20
43
+ ```
44
+
45
+ **GPQA**
46
+
47
+ ```bash
48
+ python -m sglang.test.run_eval \
49
+ --eval-name gpqa \
50
+ --port 30000 \
51
+ --num-examples 198 \
52
+ --max-tokens 120000 \
53
+ --repeat 8
54
+ ```
55
+
56
+ ```{tip}
57
+ For reasoning models, add `--thinking-mode <mode>` (e.g., `qwen3`, `deepseek-v3`). You may skip it if the model has forced thinking enabled.
58
+ ```
59
+
60
+ **HumanEval**
61
+
62
+ ```bash
63
+ pip install human_eval
64
+
65
+ python -m sglang.test.run_eval \
66
+ --eval-name humaneval \
67
+ --num-examples 10 \
68
+ --port 30000
69
+ ```
70
+
71
+ ### VLMs
72
+
73
+ **MMMU**
74
+
75
+ ```bash
76
+ python benchmark/mmmu/bench_sglang.py \
77
+ --port 30000 \
78
+ --concurrency 64
79
+ ```
80
+
81
+ ```{tip}
82
+ You can set max tokens by passing `--extra-request-body '{"max_tokens": 4096}'`.
83
+ ```
84
+
85
+ For models capable of processing video, we recommend extending the evaluation to include `VideoMME`, `MVBench`, and other relevant benchmarks.
86
+
87
+ ## Performance
88
+
89
+ Performance benchmarks measure **Latency** (Time To First Token - TTFT) and **Throughput** (tokens/second).
90
+
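These two metrics can be pictured with a pair of hypothetical helper functions (illustrative only, not part of the SGLang benchmark scripts):

```python
def ttft_ms(request_start_s: float, first_token_s: float) -> float:
    """Time To First Token, in milliseconds."""
    return (first_token_s - request_start_s) * 1000.0

def throughput_tok_per_s(output_tokens: int, wall_time_s: float) -> float:
    """Output tokens generated per second of wall-clock time."""
    return output_tokens / wall_time_s

print(ttft_ms(10.0, 10.25))             # 250.0
print(throughput_tok_per_s(4096, 8.0))  # 512.0
```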
91
+ ### LLMs
92
+
93
+ **Latency-Sensitive Benchmark**
94
+
95
+ This simulates a scenario with low concurrency (e.g., single user) to measure latency.
96
+
97
+ ```bash
98
+ python -m sglang.bench_serving \
99
+ --backend sglang \
100
+ --host 0.0.0.0 \
101
+ --port 30000 \
102
+ --dataset-name random \
103
+ --num-prompts 10 \
104
+ --max-concurrency 1
105
+ ```
106
+
107
+ **Throughput-Sensitive Benchmark**
108
+
109
+ This simulates a high-traffic scenario to measure maximum system throughput.
110
+
111
+ ```bash
112
+ python -m sglang.bench_serving \
113
+ --backend sglang \
114
+ --host 0.0.0.0 \
115
+ --port 30000 \
116
+ --dataset-name random \
117
+ --num-prompts 1000 \
118
+ --max-concurrency 100
119
+ ```
120
+
121
+ **Single Batch Performance**
122
+
123
+ You can also benchmark the performance of processing a single batch offline.
124
+
125
+ ```bash
126
+ python -m sglang.bench_one_batch_server \
127
+ --model <model-path> \
128
+ --batch-size 8 \
129
+ --input-len 1024 \
130
+ --output-len 1024
131
+ ```
132
+
133
+ You can run more granular benchmarks:
134
+
135
+ - **Low Concurrency**: `--num-prompts 10 --max-concurrency 1`
136
+ - **Medium Concurrency**: `--num-prompts 80 --max-concurrency 16`
137
+ - **High Concurrency**: `--num-prompts 500 --max-concurrency 100`
138
+
139
+ ## Reporting Results
140
+
141
+ For each evaluation, please report:
142
+
143
+ 1. **Metric Score**: Accuracy % (LLMs and VLMs); Latency (ms) and Throughput (tok/s) (LLMs only).
144
+ 2. **Environment settings**: GPU type/count, SGLang commit hash.
145
+ 3. **Launch configuration**: Model path, TP size, and any special flags.
146
+ 4. **Evaluation parameters**: Number of shots, examples, max tokens.
sglang/docs/developer_guide/release_process.md ADDED
@@ -0,0 +1,18 @@
+ # PyPI Package Release Process
2
+
3
+ ## Update the version in code
4
+ Update the package version in `python/pyproject.toml` and `python/sglang/__init__.py`.
5
+
6
+ ## Upload the PyPI package
7
+
8
+ ```
9
+ pip install build twine
10
+ ```
11
+
12
+ ```
13
+ cd python
14
+ bash upload_pypi.sh
15
+ ```
16
+
17
+ ## Make a release in GitHub
18
+ Make a new release at https://github.com/sgl-project/sglang/releases/new.
sglang/docs/developer_guide/setup_github_runner.md ADDED
@@ -0,0 +1,51 @@
+ # Set Up Self-Hosted Runners for GitHub Actions
2
+
3
+ ## Add a Runner
4
+
5
+ ### Step 1: Start a docker container.
6
+
7
+ **You can mount a folder for the shared huggingface model weights cache.**
8
+ The command below uses `/tmp/huggingface` as an example.
9
+
10
+ ```
11
+ docker pull nvidia/cuda:12.9.1-devel-ubuntu22.04
12
+ # Nvidia
13
+ docker run --shm-size 128g -it -v /tmp/huggingface:/hf_home --gpus all nvidia/cuda:12.9.1-devel-ubuntu22.04 /bin/bash
14
+ # AMD
15
+ docker run --rm --device=/dev/kfd --device=/dev/dri --group-add video --shm-size 128g -it -v /tmp/huggingface:/hf_home lmsysorg/sglang:v0.5.8-rocm700-mi30x /bin/bash
16
+ # AMD just the last 2 GPUs
17
+ docker run --rm --device=/dev/kfd --device=/dev/dri/renderD176 --device=/dev/dri/renderD184 --group-add video --shm-size 128g -it -v /tmp/huggingface:/hf_home lmsysorg/sglang:v0.5.8-rocm700-mi30x /bin/bash
18
+ ```
19
+
20
+ ### Step 2: Configure the runner with `config.sh`
21
+
22
+ Run these commands inside the container.
23
+
24
+ ```
25
+ apt update && apt install -y curl python3-pip git
26
+ pip install --upgrade pip
27
+ export RUNNER_ALLOW_RUNASROOT=1
28
+ ```
29
+
30
+ Then follow https://github.com/sgl-project/sglang/settings/actions/runners/new?arch=x64&os=linux to run `config.sh`
31
+
32
+ **Notes**
+ - You do not need to specify the runner group.
+ - Give it a name (e.g., `test-sgl-gpu-0`) and some labels (e.g., `1-gpu-runner`). The labels can be edited later in GitHub Settings.
+ - You do not need to change the work folder.
36
+
37
+ ### Step 3: Run the runner with `run.sh`
38
+
39
+ - Set up environment variables
40
+ ```
41
+ export HF_HOME=/hf_home
42
+ export SGLANG_IS_IN_CI=true
43
+ export HF_TOKEN=hf_xxx
44
+ export OPENAI_API_KEY=sk-xxx
45
+ export CUDA_VISIBLE_DEVICES=0
46
+ ```
47
+
48
+ - Run it forever
49
+ ```
50
+ while true; do ./run.sh; echo "Restarting..."; sleep 2; done
51
+ ```
sglang/docs/diffusion/api/cli.md ADDED
@@ -0,0 +1,332 @@
+ # SGLang diffusion CLI Inference
2
+
3
+ The SGLang-diffusion CLI provides a quick way to access the inference pipeline for image and video generation.
4
+
5
+ ## Prerequisites
6
+
7
+ - A working SGLang diffusion installation and the `sglang` CLI available in `$PATH`.
8
+
9
+
10
+ ## Supported Arguments
11
+
12
+ ### Server Arguments
13
+
14
+ - `--model-path {MODEL_PATH}`: Path to the model or model ID
15
+ - `--lora-path {LORA_PATH}`: Path to a LoRA adapter (local path or HuggingFace model ID). If not specified, LoRA will not be applied.
16
+ - `--lora-nickname {NAME}`: Nickname for the LoRA adapter (default: `default`).
17
+ - `--num-gpus {NUM_GPUS}`: Number of GPUs to use
18
+ - `--tp-size {TP_SIZE}`: Tensor parallelism size (only for the encoder; should not be larger than 1 if text encoder offload is enabled, as layer-wise offload plus prefetch is faster)
19
+ - `--sp-degree {SP_SIZE}`: Sequence parallelism size (typically should match the number of GPUs)
20
+ - `--ulysses-degree {ULYSSES_DEGREE}`: The degree of DeepSpeed-Ulysses-style SP in USP
21
+ - `--ring-degree {RING_DEGREE}`: The degree of ring attention-style SP in USP
22
+ - `--attention-backend {BACKEND}`: Attention backend to use. For SGLang-native pipelines use `fa`, `torch_sdpa`, `sage_attn`, etc. For diffusers pipelines use diffusers backend names like `flash`, `_flash_3_hub`, `sage`, `xformers`.
23
+ - `--attention-backend-config {CONFIG}`: Configuration for the attention backend. Can be a JSON string (e.g., '{"k": "v"}'), a path to a JSON/YAML file, or key=value pairs (e.g., "k=v,k2=v2").
24
+ - `--cache-dit-config {PATH}`: Path to a Cache-DiT YAML/JSON config (diffusers backend only)
25
+ - `--dit-precision {DTYPE}`: Precision for the DiT model (currently supports fp32, fp16, and bf16).
26
+
27
+
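The three accepted forms of `--attention-backend-config` can be sketched with a small parser (an illustrative reimplementation, not the actual SGLang code; YAML file handling is omitted):

```python
import json
import os

def parse_backend_config(value: str) -> dict:
    """Parse a JSON file path, an inline JSON string, or 'k=v,k2=v2' pairs."""
    if os.path.exists(value):               # a path to a JSON file
        with open(value) as f:
            return json.load(f)
    try:                                    # an inline JSON string
        return json.loads(value)
    except json.JSONDecodeError:            # comma-separated key=value pairs
        return dict(pair.split("=", 1) for pair in value.split(","))

print(parse_backend_config('{"k": "v"}'))   # {'k': 'v'}
print(parse_backend_config("k=v,k2=v2"))    # {'k': 'v', 'k2': 'v2'}
```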
28
+ ### Sampling Parameters
29
+
30
+ - `--prompt {PROMPT}`: Text description for the video you want to generate
31
+ - `--num-inference-steps {STEPS}`: Number of denoising steps
32
+ - `--negative-prompt {PROMPT}`: Negative prompt to guide generation away from certain concepts
33
+ - `--seed {SEED}`: Random seed for reproducible generation
34
+
35
+
36
+ **Image/Video Configuration**
37
+
38
+ - `--height {HEIGHT}`: Height of the generated output
39
+ - `--width {WIDTH}`: Width of the generated output
40
+ - `--num-frames {NUM_FRAMES}`: Number of frames to generate
41
+ - `--fps {FPS}`: Frames per second for the saved output, if this is a video-generation task
42
+
43
+
44
+ **Frame Interpolation** (video only)
45
+
46
+ Frame interpolation is a post-processing step that synthesizes new frames
47
+ between each pair of consecutive generated frames, producing smoother
48
+ motion without re-running the diffusion model. The `--frame-interpolation-exp`
49
+ flag controls how many rounds of interpolation to apply: each round inserts one
50
+ new frame into every gap between adjacent frames, so the output frame count
51
+ follows the formula **(N − 1) × 2^exp + 1** (e.g. 5 original frames with
52
+ `exp=1` → 4 gaps × 1 new frame + 5 originals = **9** frames; with `exp=2` →
53
+ **17** frames).
54
+
55
+ - `--enable-frame-interpolation`: Enable frame interpolation. Model weights are downloaded automatically on first use.
56
+ - `--frame-interpolation-exp {EXP}`: Interpolation exponent — `1` = 2× temporal resolution, `2` = 4×, etc. (default: `1`)
57
+ - `--frame-interpolation-scale {SCALE}`: RIFE inference scale; use `0.5` for high-resolution inputs to save memory (default: `1.0`)
58
+ - `--frame-interpolation-model-path {PATH}`: Local directory or HuggingFace repo ID containing RIFE `flownet.pkl` weights (default: `elfgum/RIFE-4.22.lite`, downloaded automatically)
59
+
60
+ Example — generate a 5-frame video and interpolate to 9 frames ((5 − 1) × 2¹ + 1 = 9):
61
+
62
+ ```bash
63
+ sglang generate \
64
+ --model-path Wan-AI/Wan2.2-T2V-A14B-Diffusers \
65
+ --prompt "A dog running through a park" \
66
+ --num-frames 5 \
67
+ --enable-frame-interpolation \
68
+ --frame-interpolation-exp 1 \
69
+ --save-output
70
+ ```
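The frame-count formula above can be checked with a one-line helper:

```python
def interpolated_frame_count(num_frames: int, exp: int) -> int:
    """Frames produced after `exp` rounds of pairwise frame interpolation."""
    return (num_frames - 1) * 2**exp + 1

print(interpolated_frame_count(5, 1))  # 9
print(interpolated_frame_count(5, 2))  # 17
```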
71
+
72
+ **Output Options**
73
+
74
+ - `--output-path {PATH}`: Directory to save the generated video
75
+ - `--save-output`: Whether to save the image/video to disk
76
+ - `--return-frames`: Whether to return the raw frames
77
+
78
+ ### Using Configuration Files
79
+
80
+ Instead of specifying all parameters on the command line, you can use a configuration file:
81
+
82
+ ```bash
83
+ sglang generate --config {CONFIG_FILE_PATH}
84
+ ```
85
+
86
+ The configuration file should be in JSON or YAML format with the same parameter names as the CLI options. Command-line arguments take precedence over settings in the configuration file, allowing you to override specific values while keeping the rest from the configuration file.
87
+
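The precedence rule amounts to a simple dictionary merge, where only CLI flags that were actually set override the file (an illustrative sketch, not the actual implementation):

```python
def merge_config(file_config: dict, cli_overrides: dict) -> dict:
    """CLI values override the config file; unset CLI flags (None) are ignored."""
    merged = dict(file_config)
    merged.update({k: v for k, v in cli_overrides.items() if v is not None})
    return merged

cfg = merge_config({"num_frames": 45, "seed": 1024}, {"seed": 42, "fps": None})
print(cfg)  # {'num_frames': 45, 'seed': 42}
```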
88
+ Example configuration file (config.json):
89
+
90
+ ```json
91
+ {
92
+ "model_path": "FastVideo/FastHunyuan-diffusers",
93
+ "prompt": "A beautiful woman in a red dress walking down a street",
94
+ "output_path": "outputs/",
95
+ "num_gpus": 2,
96
+ "sp_size": 2,
97
+ "tp_size": 1,
98
+ "num_frames": 45,
99
+ "height": 720,
100
+ "width": 1280,
101
+ "num_inference_steps": 6,
102
+ "seed": 1024,
103
+ "fps": 24,
104
+ "precision": "bf16",
105
+ "vae_precision": "fp16",
106
+ "vae_tiling": true,
107
+ "vae_sp": true,
108
+ "vae_config": {
109
+ "load_encoder": false,
110
+ "load_decoder": true,
111
+ "tile_sample_min_height": 256,
112
+ "tile_sample_min_width": 256
113
+ },
114
+ "text_encoder_precisions": [
115
+ "fp16",
116
+ "fp16"
117
+ ],
118
+ "mask_strategy_file_path": null,
119
+ "enable_torch_compile": false
120
+ }
121
+ ```
122
+
123
+ Or using YAML format (config.yaml):
124
+
125
+ ```yaml
126
+ model_path: "FastVideo/FastHunyuan-diffusers"
127
+ prompt: "A beautiful woman in a red dress walking down a street"
128
+ output_path: "outputs/"
129
+ num_gpus: 2
130
+ sp_size: 2
131
+ tp_size: 1
132
+ num_frames: 45
133
+ height: 720
134
+ width: 1280
135
+ num_inference_steps: 6
136
+ seed: 1024
137
+ fps: 24
138
+ precision: "bf16"
139
+ vae_precision: "fp16"
140
+ vae_tiling: true
141
+ vae_sp: true
142
+ vae_config:
143
+ load_encoder: false
144
+ load_decoder: true
145
+ tile_sample_min_height: 256
146
+ tile_sample_min_width: 256
147
+ text_encoder_precisions:
148
+ - "fp16"
149
+ - "fp16"
150
+ mask_strategy_file_path: null
151
+ enable_torch_compile: false
152
+ ```
153
+
154
+
155
+ To see all the options, you can use the `--help` flag:
156
+
157
+ ```bash
158
+ sglang generate --help
159
+ ```
160
+
161
+ ## Serve
162
+
163
+ Launch the SGLang diffusion HTTP server and interact with it using the OpenAI SDK and curl.
164
+
165
+ ### Start the server
166
+
167
+ Use the following command to launch the server:
168
+
169
+ ```bash
170
+ SERVER_ARGS=(
171
+ --model-path Wan-AI/Wan2.1-T2V-1.3B-Diffusers
172
+ --text-encoder-cpu-offload
173
+ --pin-cpu-memory
174
+ --num-gpus 4
175
+ --ulysses-degree=2
176
+ --ring-degree=2
177
+ )
178
+
179
+ sglang serve "${SERVER_ARGS[@]}"
180
+ ```
181
+
182
+ - **--model-path**: Which model to load. The example uses `Wan-AI/Wan2.1-T2V-1.3B-Diffusers`.
183
+ - **--port**: HTTP port to listen on (the default here is `30010`).
184
+
185
+ For detailed API usage, including Image, Video Generation and LoRA management, please refer to the [OpenAI API Documentation](openai_api.md).
186
+
187
+ ### Cloud Storage Support
188
+
189
+ SGLang diffusion supports automatically uploading generated images and videos to S3-compatible cloud storage (e.g., AWS S3, MinIO, Alibaba Cloud OSS, Tencent Cloud COS).
190
+
191
+ When enabled, the server follows a **Generate -> Upload -> Delete** workflow:
192
+ 1. The artifact is generated to a temporary local file.
193
+ 2. The file is immediately uploaded to the configured S3 bucket in a background thread.
194
+ 3. Upon successful upload, the local file is deleted.
195
+ 4. The API response returns the public URL of the uploaded object.
196
+
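Steps 2-4 can be sketched as follows, assuming a boto3-style client exposing `upload_file(filename, bucket, key)`; the helper name and the returned URL format are illustrative assumptions, not the actual server code:

```python
import os

def upload_and_cleanup(s3_client, bucket: str, key: str, local_path: str) -> str:
    """Upload a generated artifact, delete the local copy, and return its URL."""
    s3_client.upload_file(local_path, bucket, key)      # step 2: upload in background
    os.remove(local_path)                               # step 3: delete local file
    return f"https://{bucket}.s3.amazonaws.com/{key}"   # step 4: URL for the response
```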
197
+ **Configuration**
198
+
199
+ Cloud storage is enabled via environment variables. Note that `boto3` must be installed separately (`pip install boto3`) to use this feature.
200
+
201
+ ```bash
202
+ # Enable S3 storage
203
+ export SGLANG_CLOUD_STORAGE_TYPE=s3
204
+ export SGLANG_S3_BUCKET_NAME=my-bucket
205
+ export SGLANG_S3_ACCESS_KEY_ID=your-access-key
206
+ export SGLANG_S3_SECRET_ACCESS_KEY=your-secret-key
207
+
208
+ # Optional: Custom endpoint for MinIO/OSS/COS
209
+ export SGLANG_S3_ENDPOINT_URL=https://minio.example.com
210
+ ```
211
+
212
+ See [Environment Variables Documentation](../environment_variables.md) for more details.
213
+
214
+ ## Generate
215
+
216
+ Run a one-off generation task without launching a persistent server.
217
+
218
+ To use it, pass both server arguments and sampling parameters in one command, after the `generate` subcommand, for example:
219
+
220
+ ```bash
221
+ SERVER_ARGS=(
222
+ --model-path Wan-AI/Wan2.2-T2V-A14B-Diffusers
223
+ --text-encoder-cpu-offload
224
+ --pin-cpu-memory
225
+ --num-gpus 4
226
+ --ulysses-degree=2
227
+ --ring-degree=2
228
+ )
229
+
230
+ SAMPLING_ARGS=(
231
+ --prompt "A curious raccoon"
232
+ --save-output
233
+ --output-path outputs
234
+ --output-file-name "A curious raccoon.mp4"
235
+ )
236
+
237
+ sglang generate "${SERVER_ARGS[@]}" "${SAMPLING_ARGS[@]}"
238
+
239
+ # Or, users can set `SGLANG_CACHE_DIT_ENABLED` env as `true` to enable cache acceleration
240
+ SGLANG_CACHE_DIT_ENABLED=true sglang generate "${SERVER_ARGS[@]}" "${SAMPLING_ARGS[@]}"
241
+ ```
242
+
243
+ Once the generation task has finished, the server will shut down automatically.
244
+
245
+ > [!NOTE]
246
+ > The HTTP server-related arguments are ignored in this subcommand.
247
+
248
+ ## Component Path Overrides
249
+
250
+ SGLang diffusion allows you to override any pipeline component (e.g., `vae`, `transformer`, `text_encoder`) by specifying a custom checkpoint path. This is useful for swapping in alternative component weights, such as a distilled VAE for faster decoding.
251
+
252
+ ### Example: FLUX.2-dev with Tiny AutoEncoder
253
+
254
+ You can override **any** component by using `--<component>-path`, where `<component>` matches the key in the model's `model_index.json`:
255
+
256
+ For example, replace the default VAE with a distilled tiny autoencoder for ~3x faster decoding:
257
+
258
+ ```bash
+ # with a Huggingface Repo ID
+ sglang serve \
+   --model-path=black-forest-labs/FLUX.2-dev \
+   --vae-path=fal/FLUX.2-Tiny-AutoEncoder
+
+ # or use a local path
+ sglang serve \
+   --model-path=black-forest-labs/FLUX.2-dev \
+   --vae-path=~/.cache/huggingface/hub/models--fal--FLUX.2-Tiny-AutoEncoder/snapshots/.../vae
+ ```
266
+
267
+ **Important:**
268
+ - The component key must match the one in your model's `model_index.json` (e.g., `vae`).
269
+ - The path must:
270
+ - either be a Huggingface Repo ID (e.g., fal/FLUX.2-Tiny-AutoEncoder)
271
+ - or point to a **complete component folder**, containing `config.json` and safetensors files
272
+
273
+
274
+ ## Diffusers Backend
275
+
276
+ SGLang diffusion supports a **diffusers backend** that allows you to run any diffusers-compatible model through SGLang's infrastructure using vanilla diffusers pipelines. This is useful for running models without native SGLang implementations or models with custom pipeline classes.
277
+
278
+ ### Arguments
279
+
280
+ | Argument | Values | Description |
281
+ |----------|--------|-------------|
282
+ | `--backend` | `auto` (default), `sglang`, `diffusers` | `auto`: prefer native SGLang, fallback to diffusers. `sglang`: force native (fails if unavailable). `diffusers`: force vanilla diffusers pipeline. |
283
+ | `--diffusers-attention-backend` | `flash`, `_flash_3_hub`, `sage`, `xformers`, `native` | Attention backend for diffusers pipelines. See [diffusers attention backends](https://huggingface.co/docs/diffusers/main/en/optimization/attention_backends). |
284
+ | `--trust-remote-code` | flag | Required for models with custom pipeline classes (e.g., Ovis). |
285
+ | `--vae-tiling` | flag | Enable VAE tiling for large image support (decodes tile-by-tile). |
286
+ | `--vae-slicing` | flag | Enable VAE slicing for lower memory usage (decodes slice-by-slice). |
287
+ | `--dit-precision` | `fp16`, `bf16`, `fp32` | Precision for the diffusion transformer. |
288
+ | `--vae-precision` | `fp16`, `bf16`, `fp32` | Precision for the VAE. |
289
+ | `--enable-torch-compile` | flag | Enable `torch.compile` for diffusers pipelines. |
290
+ | `--cache-dit-config` | `{PATH}` | Path to a Cache-DiT YAML/JSON config file for accelerating diffusers pipelines with Cache-DiT. |
291
+
292
+ ### Example: Running Ovis-Image-7B
293
+
294
+ [Ovis-Image-7B](https://huggingface.co/AIDC-AI/Ovis-Image-7B) is a 7B text-to-image model optimized for high-quality text rendering.
295
+
296
+ ```bash
297
+ sglang generate \
298
+ --model-path AIDC-AI/Ovis-Image-7B \
299
+ --backend diffusers \
300
+ --trust-remote-code \
301
+ --diffusers-attention-backend flash \
302
+ --prompt "A serene Japanese garden with cherry blossoms" \
303
+ --height 1024 \
304
+ --width 1024 \
305
+ --num-inference-steps 30 \
306
+ --save-output \
307
+ --output-path outputs \
308
+ --output-file-name ovis_garden.png
309
+ ```
310
+
311
+ ### Extra Diffusers Arguments
312
+
313
+ For pipeline-specific parameters not exposed via CLI, use `diffusers_kwargs` in a config file:
314
+
315
+ ```json
316
+ {
317
+ "model_path": "AIDC-AI/Ovis-Image-7B",
318
+ "backend": "diffusers",
319
+ "prompt": "A beautiful landscape",
320
+ "diffusers_kwargs": {
321
+ "cross_attention_kwargs": {"scale": 0.5}
322
+ }
323
+ }
324
+ ```
325
+
326
+ ```bash
327
+ sglang generate --config config.json
328
+ ```
329
+
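The config file above can also be generated programmatically. A minimal sketch that writes the same keys shown above:

```python
import json

config = {
    "model_path": "AIDC-AI/Ovis-Image-7B",
    "backend": "diffusers",
    "prompt": "A beautiful landscape",
    # Pipeline-specific parameters not exposed via CLI go here.
    "diffusers_kwargs": {"cross_attention_kwargs": {"scale": 0.5}},
}

with open("config.json", "w") as f:
    json.dump(config, f, indent=2)
```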
330
+ ### Cache-DiT Acceleration
331
+
332
+ The diffusers backend can also leverage Cache-DiT acceleration, loading custom cache configs from a YAML file to boost pipeline performance. See the [Cache-DiT Acceleration](https://docs.sglang.io/diffusion/performance/cache/cache_dit.html) documentation for details.
sglang/docs/diffusion/api/openai_api.md ADDED
@@ -0,0 +1,420 @@
1
+ # SGLang Diffusion OpenAI API
2
+
3
+ The SGLang diffusion HTTP server implements an OpenAI-compatible API for image and video generation, as well as LoRA adapter management.
4
+
5
+ ## Prerequisites
6
+
7
+ - Python 3.11+ if you plan to use the OpenAI Python SDK.
8
+
9
+ ## Serve
10
+
11
+ Launch the server using the `sglang serve` command.
12
+
13
+ ### Start the server
14
+
15
+ ```bash
16
+ SERVER_ARGS=(
17
+ --model-path Wan-AI/Wan2.1-T2V-1.3B-Diffusers
18
+ --text-encoder-cpu-offload
19
+ --pin-cpu-memory
20
+ --num-gpus 4
21
+ --ulysses-degree=2
22
+ --ring-degree=2
23
+ --port 30010
24
+ )
25
+
26
+ sglang serve "${SERVER_ARGS[@]}"
27
+ ```
28
+
29
+ - **--model-path**: Path to the model or model ID.
30
+ - **--port**: HTTP port to listen on (default: `30000`).
31
+
32
+ **Get Model Information**
33
+
34
+ **Endpoint:** `GET /models`
35
+
36
+ Returns information about the model served by this server, including model path, task type, pipeline configuration, and precision settings.
37
+
38
+ **Curl Example:**
39
+
40
+ ```bash
41
+ curl -sS -X GET "http://localhost:30010/models"
42
+ ```
43
+
44
+ **Response Example:**
45
+
46
+ ```json
47
+ {
48
+ "model_path": "Wan-AI/Wan2.1-T2V-1.3B-Diffusers",
49
+ "task_type": "T2V",
50
+ "pipeline_name": "wan_pipeline",
51
+ "pipeline_class": "WanPipeline",
52
+ "num_gpus": 4,
53
+ "dit_precision": "bf16",
54
+ "vae_precision": "fp16"
55
+ }
56
+ ```
57
+
58
+ ---
59
+
60
+ ## Endpoints
61
+
62
+ ### Image Generation
63
+
64
+ The server implements an OpenAI-compatible Images API under the `/v1/images` namespace.
65
+
66
+ **Create an image**
67
+
68
+ **Endpoint:** `POST /v1/images/generations`
69
+
70
+ **Python Example (b64_json response):**
71
+
72
+ ```python
73
+ import base64
74
+ from openai import OpenAI
75
+
76
+ client = OpenAI(api_key="sk-proj-1234567890", base_url="http://localhost:30010/v1")
77
+
78
+ img = client.images.generate(
79
+ prompt="A calico cat playing a piano on stage",
80
+ size="1024x1024",
81
+ n=1,
82
+ response_format="b64_json",
83
+ )
84
+
85
+ image_bytes = base64.b64decode(img.data[0].b64_json)
86
+ with open("output.png", "wb") as f:
87
+ f.write(image_bytes)
88
+ ```
89
+
90
+ **Curl Example:**
91
+
92
+ ```bash
93
+ curl -sS -X POST "http://localhost:30010/v1/images/generations" \
94
+ -H "Content-Type: application/json" \
95
+ -H "Authorization: Bearer sk-proj-1234567890" \
96
+ -d '{
97
+ "prompt": "A calico cat playing a piano on stage",
98
+ "size": "1024x1024",
99
+ "n": 1,
100
+ "response_format": "b64_json"
101
+ }'
102
+ ```
103
+
104
+ > **Note**
105
+ > If `response_format=url` is used and cloud storage is not configured, the API returns
106
+ > a relative URL like `/v1/images/<IMAGE_ID>/content`.
107
+
108
+ **Edit an image**
109
+
110
+ **Endpoint:** `POST /v1/images/edits`
111
+
112
+ This endpoint accepts a multipart form upload with input images and a text prompt. The server can return either a base64-encoded image or a URL to download the image.
113
+
114
+ **Curl Example (b64_json response):**
115
+
116
+ ```bash
117
+ curl -sS -X POST "http://localhost:30010/v1/images/edits" \
118
+ -H "Authorization: Bearer sk-proj-1234567890" \
119
+ -F "image=@local_input_image.png" \
120
+ -F "url=image_url.jpg" \
121
+ -F "prompt=A calico cat playing a piano on stage" \
122
+ -F "size=1024x1024" \
123
+ -F "response_format=b64_json"
124
+ ```
125
+
126
+ **Curl Example (URL response):**
127
+
128
+ ```bash
129
+ curl -sS -X POST "http://localhost:30010/v1/images/edits" \
130
+ -H "Authorization: Bearer sk-proj-1234567890" \
131
+ -F "image=@local_input_image.png" \
132
+ -F "url=image_url.jpg" \
133
+ -F "prompt=A calico cat playing a piano on stage" \
134
+ -F "size=1024x1024" \
135
+ -F "response_format=url"
136
+ ```
137
+
138
+ **Download image content**
139
+
140
+ When `response_format=url` is used with `POST /v1/images/generations` or `POST /v1/images/edits`,
141
+ the API returns a relative URL like `/v1/images/<IMAGE_ID>/content`.
142
+
143
+ **Endpoint:** `GET /v1/images/{image_id}/content`
144
+
145
+ **Curl Example:**
146
+
147
+ ```bash
148
+ curl -sS -L "http://localhost:30010/v1/images/<IMAGE_ID>/content" \
149
+ -H "Authorization: Bearer sk-proj-1234567890" \
150
+ -o output.png
151
+ ```
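Since the returned URL is relative, join it with the server's base URL before downloading. A short sketch (`img_123` is a placeholder image ID):

```python
from urllib.parse import urljoin

base_url = "http://localhost:30010"           # the server address
relative = "/v1/images/img_123/content"       # relative URL returned by the API
download_url = urljoin(base_url, relative)    # absolute URL to fetch
```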
152
+
153
+ ### Video Generation
154
+
155
+ The server implements a subset of the OpenAI Videos API under the `/v1/videos` namespace.
156
+
157
+ **Create a video**
158
+
159
+ **Endpoint:** `POST /v1/videos`
160
+
161
+ **Python Example:**
162
+
163
+ ```python
164
+ from openai import OpenAI
165
+
166
+ client = OpenAI(api_key="sk-proj-1234567890", base_url="http://localhost:30010/v1")
167
+
168
+ video = client.videos.create(
169
+ prompt="A calico cat playing a piano on stage",
170
+ size="1280x720"
171
+ )
172
+ print(f"Video ID: {video.id}, Status: {video.status}")
173
+ ```
174
+
175
+ **Curl Example:**
176
+
177
+ ```bash
178
+ curl -sS -X POST "http://localhost:30010/v1/videos" \
179
+ -H "Content-Type: application/json" \
180
+ -H "Authorization: Bearer sk-proj-1234567890" \
181
+ -d '{
182
+ "prompt": "A calico cat playing a piano on stage",
183
+ "size": "1280x720"
184
+ }'
185
+ ```
186
+
187
+ **List videos**
188
+
189
+ **Endpoint:** `GET /v1/videos`
190
+
191
+ **Python Example:**
192
+
193
+ ```python
194
+ videos = client.videos.list()
195
+ for item in videos.data:
196
+ print(item.id, item.status)
197
+ ```
198
+
199
+ **Curl Example:**
200
+
201
+ ```bash
202
+ curl -sS -X GET "http://localhost:30010/v1/videos" \
203
+ -H "Authorization: Bearer sk-proj-1234567890"
204
+ ```
205
+
206
+ **Download video content**
207
+
208
+ **Endpoint:** `GET /v1/videos/{video_id}/content`
209
+
210
+ **Python Example:**
211
+
212
+ ```python
213
+ import time
+
+ video_id = video.id  # ID returned by the create call above
214
+
215
+ # Poll for completion
216
+ while True:
217
+ page = client.videos.list()
218
+ item = next((v for v in page.data if v.id == video_id), None)
219
+ if item and item.status == "completed":
220
+ break
221
+ time.sleep(5)
222
+
223
+ # Download content
224
+ resp = client.videos.download_content(video_id=video_id)
225
+ with open("output.mp4", "wb") as f:
226
+ f.write(resp.read())
227
+ ```
228
+
229
+ **Curl Example:**
230
+
231
+ ```bash
232
+ curl -sS -L "http://localhost:30010/v1/videos/<VIDEO_ID>/content" \
233
+ -H "Authorization: Bearer sk-proj-1234567890" \
234
+ -o output.mp4
235
+ ```
236
+
237
+ ---
238
+
239
+ ### LoRA Management
240
+
241
+ The server supports dynamic loading, merging, and unmerging of LoRA adapters.
242
+
243
+ **Important Notes:**
244
+ - Mutual Exclusion: Only one LoRA configuration (a single adapter or one set of adapters) can be *merged* (active) at a time
245
+ - Switching: To switch LoRAs, you must first `unmerge` the current one, then `set` the new one
246
+ - Caching: The server caches loaded LoRA weights in memory. Switching back to a previously loaded LoRA (same path) has little cost
247
+
248
+ **Set LoRA Adapter**
249
+
250
+ Loads one or more LoRA adapters and merges their weights into the model. Supports both single LoRA (backward compatible) and multiple LoRA adapters.
251
+
252
+ **Endpoint:** `POST /v1/set_lora`
253
+
254
+ **Parameters:**
255
+ - `lora_nickname` (string or list of strings, required): A unique identifier for the LoRA adapter(s). Can be a single string or a list of strings for multiple LoRAs
256
+ - `lora_path` (string or list of strings/None, optional): Path to the `.safetensors` file(s) or Hugging Face repo ID(s). Required for the first load; optional if re-activating a cached nickname. If a list, must match the length of `lora_nickname`
257
+ - `target` (string or list of strings, optional): Which transformer(s) to apply the LoRA to. If a list, must match the length of `lora_nickname`. Valid values:
258
+ - `"all"` (default): Apply to all transformers
259
+ - `"transformer"`: Apply only to the primary transformer (high noise for Wan2.2)
260
+ - `"transformer_2"`: Apply only to transformer_2 (low noise for Wan2.2)
261
+ - `"critic"`: Apply only to the critic model
262
+ - `strength` (float or list of floats, optional): LoRA strength for merge, default 1.0. If a list, must match the length of `lora_nickname`. Values < 1.0 reduce the effect, values > 1.0 amplify the effect
263
+
264
+ **Single LoRA Example:**
265
+
266
+ ```bash
267
+ curl -X POST http://localhost:30010/v1/set_lora \
268
+ -H "Content-Type: application/json" \
269
+ -d '{
270
+ "lora_nickname": "lora_name",
271
+ "lora_path": "/path/to/lora.safetensors",
272
+ "target": "all",
273
+ "strength": 0.8
274
+ }'
275
+ ```
276
+
277
+ **Multiple LoRA Example:**
278
+
279
+ ```bash
280
+ curl -X POST http://localhost:30010/v1/set_lora \
281
+ -H "Content-Type: application/json" \
282
+ -d '{
283
+ "lora_nickname": ["lora_1", "lora_2"],
284
+ "lora_path": ["/path/to/lora1.safetensors", "/path/to/lora2.safetensors"],
285
+ "target": ["transformer", "transformer_2"],
286
+ "strength": [0.8, 1.0]
287
+ }'
288
+ ```
289
+
290
+ **Multiple LoRA with Same Target:**
291
+
292
+ ```bash
293
+ curl -X POST http://localhost:30010/v1/set_lora \
294
+ -H "Content-Type: application/json" \
295
+ -d '{
296
+ "lora_nickname": ["style_lora", "character_lora"],
297
+ "lora_path": ["/path/to/style.safetensors", "/path/to/character.safetensors"],
298
+ "target": "all",
299
+ "strength": [0.7, 0.9]
300
+ }'
301
+ ```
302
+
303
+ > [!NOTE]
304
+ > When using multiple LoRAs:
305
+ > - All list parameters (`lora_nickname`, `lora_path`, `target`, `strength`) must have the same length
306
+ > - If `target` or `strength` is a single value, it will be applied to all LoRAs
307
+ > - Multiple LoRAs applied to the same target will be merged in order
308
+
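The length-matching and broadcasting rules above can be sketched as a small client-side normalizer. This helper is hypothetical, for illustration only; it is not part of the server.

```python
def normalize_lora_params(nicknames, paths=None, target="all", strength=1.0):
    """Normalize single-or-list LoRA parameters per the rules above."""
    if isinstance(nicknames, str):
        nicknames = [nicknames]
    n = len(nicknames)

    def broadcast(value, default):
        # A single value applies to all LoRAs; a list must match in length.
        if value is None:
            value = default
        if not isinstance(value, list):
            return [value] * n
        if len(value) != n:
            raise ValueError("list parameters must match len(lora_nickname)")
        return value

    return {
        "lora_nickname": nicknames,
        "lora_path": broadcast(paths, None),
        "target": broadcast(target, "all"),
        "strength": broadcast(strength, 1.0),
    }
```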
309
+
310
+ **Merge LoRA Weights**
311
+
312
+ Manually merges the currently set LoRA weights into the base model.
313
+
314
+ > [!NOTE]
315
+ > `set_lora` automatically performs a merge, so this is typically only needed if you have manually unmerged but want to re-apply the same LoRA without calling `set_lora` again.
316
+
317
+ **Endpoint:** `POST /v1/merge_lora_weights`
318
+
319
+ **Parameters:**
320
+ - `target` (string, optional): Which transformer(s) to merge. One of "all" (default), "transformer", "transformer_2", "critic"
321
+ - `strength` (float, optional): LoRA strength for merge, default 1.0. Values < 1.0 reduce the effect, values > 1.0 amplify the effect
322
+
323
+ **Curl Example:**
324
+
325
+ ```bash
326
+ curl -X POST http://localhost:30010/v1/merge_lora_weights \
327
+ -H "Content-Type: application/json" \
328
+ -d '{"strength": 0.8}'
329
+ ```
330
+
331
+
332
+ **Unmerge LoRA Weights**
333
+
334
+ Unmerges the currently active LoRA weights from the base model, restoring it to its original state. This **must** be called before setting a different LoRA.
335
+
336
+ **Endpoint:** `POST /v1/unmerge_lora_weights`
337
+
338
+ **Curl Example:**
339
+
340
+ ```bash
341
+ curl -X POST http://localhost:30010/v1/unmerge_lora_weights \
342
+ -H "Content-Type: application/json"
343
+ ```
344
+
345
+ **List LoRA Adapters**
346
+
347
+ Returns loaded LoRA adapters and current application status per module.
348
+
349
+ **Endpoint:** `GET /v1/list_loras`
350
+
351
+ **Curl Example:**
352
+
353
+ ```bash
354
+ curl -sS -X GET "http://localhost:30010/v1/list_loras"
355
+ ```
356
+
357
+ **Response Example:**
358
+
359
+ ```json
360
+ {
361
+ "loaded_adapters": [
362
+ { "nickname": "lora_a", "path": "/weights/lora_a.safetensors" },
363
+ { "nickname": "lora_b", "path": "/weights/lora_b.safetensors" }
364
+ ],
365
+ "active": {
366
+ "transformer": [
367
+ {
368
+ "nickname": "lora2",
369
+ "path": "tarn59/pixel_art_style_lora_z_image_turbo",
370
+ "merged": true,
371
+ "strength": 1.0
372
+ }
373
+ ]
374
+ }
375
+ }
376
+ ```
377
+
378
+ Notes:
379
+ - If LoRA is not enabled for the current pipeline, the server will return an error.
380
+ - `num_lora_layers_with_weights` counts only layers that have LoRA weights applied for the active adapter.
381
+
382
+ ### Example: Switching LoRAs
383
+
384
+ 1. Set LoRA A:
385
+ ```bash
386
+ curl -X POST http://localhost:30010/v1/set_lora -d '{"lora_nickname": "lora_a", "lora_path": "path/to/A"}'
387
+ ```
388
+ 2. Generate with LoRA A...
389
+ 3. Unmerge LoRA A:
390
+ ```bash
391
+ curl -X POST http://localhost:30010/v1/unmerge_lora_weights
392
+ ```
393
+ 4. Set LoRA B:
394
+ ```bash
395
+ curl -X POST http://localhost:30010/v1/set_lora -d '{"lora_nickname": "lora_b", "lora_path": "path/to/B"}'
396
+ ```
397
+ 5. Generate with LoRA B...
398
+
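The unmerge-then-set sequence above can be expressed as a small client-side helper that builds the two requests. The helper is illustrative; sending the payloads is left to your HTTP client.

```python
def lora_switch_requests(new_nickname, new_path):
    """The unmerge-then-set sequence, as (endpoint, JSON payload) pairs."""
    return [
        ("POST /v1/unmerge_lora_weights", {}),
        ("POST /v1/set_lora", {"lora_nickname": new_nickname,
                               "lora_path": new_path}),
    ]
```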
399
+ ### Adjust Output Quality
400
+
401
+ The server supports adjusting output quality and compression levels for both image and video generation through the `output-quality` and `output-compression` parameters.
402
+
403
+ #### Parameters
404
+
405
+ - **`output-quality`** (string, optional): Preset quality level that automatically sets compression. **Default is `"default"`**. Valid values:
406
+ - `"maximum"`: Highest quality (100)
407
+ - `"high"`: High quality (90)
408
+ - `"medium"`: Medium quality (55)
409
+ - `"low"`: Lower quality (35)
410
+ - `"default"`: Auto-adjust based on media type (50 for video, 75 for image)
411
+
412
+ - **`output-compression`** (integer, optional): Direct compression level override (0-100). **Default is `None`**. When provided (not `None`), takes precedence over `output-quality`.
413
+ - `0`: Lowest quality, smallest file size
414
+ - `100`: Highest quality, largest file size
415
+
416
+ #### Notes
417
+
418
+ - **Precedence**: When both `output-quality` and `output-compression` are provided, `output-compression` takes precedence
419
+ - **Format Support**: Quality settings apply to JPEG and video formats. PNG uses lossless compression and ignores these settings
420
+ - **File Size vs Quality**: Lower compression values (or the "low" quality preset) produce smaller files but may show visible artifacts
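The precedence and preset mapping above can be sketched as follows (an illustrative helper, not the server's actual code):

```python
PRESETS = {"maximum": 100, "high": 90, "medium": 55, "low": 35}

def effective_compression(quality="default", compression=None, media="image"):
    """Resolve the final 0-100 level per the documented precedence."""
    if compression is not None:            # explicit override wins
        return compression
    if quality == "default":               # auto-adjust by media type
        return 50 if media == "video" else 75
    return PRESETS[quality]
```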
sglang/docs/diffusion/ci_perf.md ADDED
@@ -0,0 +1,29 @@
1
+ ## Perf Baseline Generation Script
2
+
3
+ `python/sglang/multimodal_gen/test/scripts/gen_perf_baselines.py` starts a local diffusion server, issues requests for selected test cases, aggregates stage/denoise-step/E2E timings from the perf log, and writes the results back to the `scenarios` section of `perf_baselines.json`.
4
+
5
+ ### Usage
6
+
7
+ Update a single case:
8
+
9
+ ```bash
10
+ python python/sglang/multimodal_gen/test/scripts/gen_perf_baselines.py --case qwen_image_t2i
11
+ ```
12
+
13
+ Select by regex:
14
+
15
+ ```bash
16
+ python python/sglang/multimodal_gen/test/scripts/gen_perf_baselines.py --match 'qwen_image_.*'
17
+ ```
18
+
19
+ Run all keys from the baseline file `scenarios`:
20
+
21
+ ```bash
22
+ python python/sglang/multimodal_gen/test/scripts/gen_perf_baselines.py --all-from-baseline
23
+ ```
24
+
25
+ Specify input/output paths and timeout:
26
+
27
+ ```bash
28
+ python python/sglang/multimodal_gen/test/scripts/gen_perf_baselines.py --baseline python/sglang/multimodal_gen/test/server/perf_baselines.json --out /tmp/perf_baselines.json --timeout 600
29
+ ```
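The `--match` selection can be pictured as a regex filter over the `scenarios` keys. This is an illustrative sketch; see the script itself for the exact behavior.

```python
import re

def select_cases(scenarios, pattern):
    """Return the scenario names that fully match the given regex."""
    return [name for name in scenarios if re.fullmatch(pattern, name)]
```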
sglang/docs/diffusion/compatibility_matrix.md ADDED
@@ -0,0 +1,78 @@
1
+ # Compatibility Matrix
2
+
3
+ The table below shows every supported model and the optimizations supported for them.
4
+
5
+ The symbols used have the following meanings:
6
+
7
+ - ✅ = Full compatibility
8
+ - ❌ = No compatibility
9
+ - ⭕ = Does not apply to this model
10
+
11
+ ## Models x Optimization
12
+
13
+ The `HuggingFace Model ID` can be passed directly to `from_pretrained()` methods, and sglang-diffusion will use the
14
+ optimal
15
+ default parameters when initializing and generating videos.
16
+
17
+ ### Video Generation Models
18
+
19
+ | Model Name | Hugging Face Model ID | Resolutions | TeaCache | Sliding Tile Attn | Sage Attn | Video Sparse Attention (VSA) | Sparse Linear Attention (SLA) | Sage Sparse Linear Attention (SageSLA) | Sparse Video Gen 2 (SVG2) |
20
+ |:-----------------------------|:--------------------------------------------------|:--------------------|:--------:|:-----------------:|:---------:|:----------------------------:|:----------------------------:|:-----------------------------------------------:|:----------------------------------:|
21
+ | FastWan2.1 T2V 1.3B | `FastVideo/FastWan2.1-T2V-1.3B-Diffusers` | 480p | ⭕ | ⭕ | ⭕ | ✅ | ❌ | ❌ | ❌ |
22
+ | FastWan2.2 TI2V 5B Full Attn | `FastVideo/FastWan2.2-TI2V-5B-FullAttn-Diffusers` | 720p | ⭕ | ⭕ | ⭕ | ✅ | ❌ | ❌ | ❌ |
23
+ | Wan2.2 TI2V 5B | `Wan-AI/Wan2.2-TI2V-5B-Diffusers` | 720p | ⭕ | ⭕ | ✅ | ⭕ | ❌ | ❌ | ❌ |
24
+ | Wan2.2 T2V A14B | `Wan-AI/Wan2.2-T2V-A14B-Diffusers` | 480p<br>720p | ❌ | ❌ | ✅ | ⭕ | ❌ | ❌ | ❌ |
25
+ | Wan2.2 I2V A14B | `Wan-AI/Wan2.2-I2V-A14B-Diffusers` | 480p<br>720p | ❌ | ❌ | ✅ | ⭕ | ❌ | ❌ | ❌ |
26
+ | HunyuanVideo | `hunyuanvideo-community/HunyuanVideo` | 720×1280<br>544×960 | ❌ | ✅ | ✅ | ⭕ | ❌ | ❌ | ✅ |
27
+ | FastHunyuan | `FastVideo/FastHunyuan-diffusers` | 720×1280<br>544×960 | ❌ | ✅ | ✅ | ⭕ | ❌ | ❌ | ✅ |
28
+ | Wan2.1 T2V 1.3B | `Wan-AI/Wan2.1-T2V-1.3B-Diffusers` | 480p | ✅ | ✅ | ✅ | ⭕ | ❌ | ❌ | ✅ |
29
+ | Wan2.1 T2V 14B | `Wan-AI/Wan2.1-T2V-14B-Diffusers` | 480p, 720p | ✅ | ✅ | ✅ | ⭕ | ❌ | ❌ | ✅ |
30
+ | Wan2.1 I2V 480P | `Wan-AI/Wan2.1-I2V-14B-480P-Diffusers` | 480p | ✅ | ✅ | ✅ | ⭕ | ❌ | ❌ | ✅ |
31
+ | Wan2.1 I2V 720P | `Wan-AI/Wan2.1-I2V-14B-720P-Diffusers` | 720p | ✅ | ✅ | ✅ | ⭕ | ❌ | ❌ | ✅ |
32
+ | TurboWan2.1 T2V 1.3B | `IPostYellow/TurboWan2.1-T2V-1.3B-Diffusers` | 480p | ✅ | ❌ | ❌ | ❌ | ✅ | ✅ | ⭕ |
33
+ | TurboWan2.1 T2V 14B | `IPostYellow/TurboWan2.1-T2V-14B-Diffusers` | 480p | ✅ | ❌ | ❌ | ❌ | ✅ | ✅ | ⭕ |
34
+ | TurboWan2.1 T2V 14B 720P | `IPostYellow/TurboWan2.1-T2V-14B-720P-Diffusers` | 720p | ✅ | ❌ | ❌ | ❌ | ✅ | ✅ | ⭕ |
35
+ | TurboWan2.2 I2V A14B | `IPostYellow/TurboWan2.2-I2V-A14B-Diffusers` | 720p | ✅ | ❌ | ❌ | ❌ | ✅ | ✅ | ⭕ |
36
+
37
+ **Note**:
38
+ 1.Wan2.2 TI2V 5B has some quality issues when performing I2V generation. We are working on fixing this issue.
39
+ 2.SageSLA Based on SpargeAttn. Install it first with `pip install git+https://github.com/thu-ml/SpargeAttn.git --no-build-isolation`
40
+
41
+ ### Image Generation Models
42
+
43
+ | Model Name | HuggingFace Model ID | Resolutions |
44
+ |:-----------------|:----------------------------------------|:---------------|
45
+ | FLUX.1-dev | `black-forest-labs/FLUX.1-dev` | Any resolution |
46
+ | FLUX.2-dev | `black-forest-labs/FLUX.2-dev` | Any resolution |
47
+ | FLUX.2-Klein | `black-forest-labs/FLUX.2-klein-4B` | Any resolution |
48
+ | Z-Image-Turbo | `Tongyi-MAI/Z-Image-Turbo` | Any resolution |
49
+ | GLM-Image | `zai-org/GLM-Image` | Any resolution |
50
+ | Qwen Image | `Qwen/Qwen-Image` | Any resolution |
51
+ | Qwen Image 2512 | `Qwen/Qwen-Image-2512` | Any resolution |
52
+ | Qwen Image Edit | `Qwen/Qwen-Image-Edit` | Any resolution |
53
+
54
+ ## Verified LoRA Examples
55
+
56
+ This section lists example LoRAs that have been explicitly tested and verified with each base model in the **SGLang Diffusion** pipeline.
57
+
58
+ > Important:
59
+ > LoRAs that are not listed here are not necessarily incompatible.
60
+ > In practice, most standard LoRAs are expected to work, especially those following common Diffusers or SD-style conventions.
61
+ > The entries below simply reflect configurations that have been manually validated by the SGLang team.
62
+
63
+ ### Verified LoRAs by Base Model
64
+
65
+ | Base Model | Supported LoRAs |
66
+ |:-----------------|:----------------|
67
+ | Wan2.2 | `lightx2v/Wan2.2-Distill-Loras`<br>`Cseti/wan2.2-14B-Arcane_Jinx-lora-v1` |
68
+ | Wan2.1 | `lightx2v/Wan2.1-Distill-Loras` |
69
+ | Z-Image-Turbo | `tarn59/pixel_art_style_lora_z_image_turbo`<br>`wcde/Z-Image-Turbo-DeJPEG-Lora` |
70
+ | Qwen-Image | `lightx2v/Qwen-Image-Lightning`<br>`flymy-ai/qwen-image-realism-lora`<br>`prithivMLmods/Qwen-Image-HeadshotX`<br>`starsfriday/Qwen-Image-EVA-LoRA` |
71
+ | Qwen-Image-Edit | `ostris/qwen_image_edit_inpainting`<br>`lightx2v/Qwen-Image-Edit-2511-Lightning` |
72
+ | Flux | `dvyio/flux-lora-simple-illustration`<br>`XLabs-AI/flux-furry-lora`<br>`XLabs-AI/flux-RealismLora` |
73
+
74
+ ## Special requirements
75
+
76
+ ### Sliding Tile Attention
77
+
78
+ - Currently, only Hopper GPUs (H100s) are supported.
sglang/docs/diffusion/contributing.md ADDED
@@ -0,0 +1,67 @@
1
+ # Contributing to SGLang Diffusion
2
+
3
+ This guide outlines the requirements for contributing to the SGLang Diffusion module (`sglang.multimodal_gen`).
4
+
5
+ ## On AI-Assisted ("Vibe Coding") PRs
6
+
7
+ Vibe-coded PRs are welcome — we judge code quality, not how it was produced. The bar is the same for all PRs:
8
+
9
+ - **No over-commenting.** If the name says it all, skip the docstring.
10
+ - **No over-catching.** Don't guard against errors that virtually never happen in practice.
11
+ - **Test before submitting.** AI-generated code can be subtly wrong — verify correctness end-to-end.
12
+
13
+ ## Commit Message Convention
14
+
15
+ We follow a structured commit message format to maintain a clean history.
16
+
17
+ **Format:**
18
+ ```text
19
+ [diffusion] <scope>: <subject>
20
+ ```
21
+
22
+ **Examples:**
23
+ - `[diffusion] cli: add --perf-dump-path argument`
24
+ - `[diffusion] scheduler: fix deadlock in batch processing`
25
+ - `[diffusion] model: support Stable Diffusion 3.5`
26
+
27
+ **Rules:**
28
+ - **Prefix**: Always start with `[diffusion]`.
29
+ - **Scope** (Optional): `cli`, `scheduler`, `model`, `pipeline`, `docs`, etc.
30
+ - **Subject**: Imperative mood, short and clear (e.g., "add feature" not "added feature").
31
+
32
+ ## Performance Reporting
33
+
34
+ For PRs that impact **latency**, **throughput**, or **memory usage**, you **should** provide a performance comparison report.
35
+
36
+ ### How to Generate a Report
37
+
38
+ 1. **Baseline**: run the benchmark (for a single generation task)
39
+ ```bash
40
+ $ sglang generate --model-path <model> --prompt "A benchmark prompt" --perf-dump-path baseline.json
41
+ ```
42
+
43
+ 2. **New**: run the same benchmark, without modifying any server_args or sampling_params
44
+ ```bash
45
+ $ sglang generate --model-path <model> --prompt "A benchmark prompt" --perf-dump-path new.json
46
+ ```
47
+
48
+ 3. **Compare**: run the compare script, which will print a Markdown table to the console
49
+ ```bash
50
+ $ python python/sglang/multimodal_gen/benchmarks/compare_perf.py baseline.json new.json [new2.json ...]
51
+ ### Performance Comparison Report
52
+ ...
53
+ ```
54
+ 4. **Paste**: paste the table into the PR description
55
+
56
+ ## CI-Based Change Protection
57
+
58
+ Consider adding tests to the `pr-test` or `nightly-test` suites to safeguard your changes, especially for PRs that:
59
+
60
+ - support a new model
61
+ - add a testcase for this new model to `testcase_configs.py`
62
+ - support or fix important features
63
+ - significantly improve performance
64
+
65
+ Please run the according testcase, then update/add the baseline to `perf_baselines.json` by following the instruction in console if applicable.
66
+
67
+ See [test](https://github.com/sgl-project/sglang/tree/main/python/sglang/multimodal_gen/test) for examples
sglang/docs/diffusion/environment_variables.md ADDED
1
+ ## Caching Acceleration
2
+
3
+ These variables configure caching acceleration for Diffusion Transformer (DiT) models.
4
+ SGLang supports multiple caching strategies - see [caching documentation](performance/cache/index.md) for an overview.
5
+
6
+ ### Cache-DiT Configuration
7
+
8
+ See [cache-dit documentation](performance/cache/cache_dit.md) for detailed configuration.
9
+
10
+ | Environment Variable | Default | Description |
11
+ |-------------------------------------|---------|------------------------------------------|
12
+ | `SGLANG_CACHE_DIT_ENABLED` | false | Enable Cache-DiT acceleration |
13
+ | `SGLANG_CACHE_DIT_FN` | 1 | First N blocks to always compute |
14
+ | `SGLANG_CACHE_DIT_BN` | 0 | Last N blocks to always compute |
15
+ | `SGLANG_CACHE_DIT_WARMUP` | 4 | Warmup steps before caching |
16
+ | `SGLANG_CACHE_DIT_RDT` | 0.24 | Residual difference threshold |
17
+ | `SGLANG_CACHE_DIT_MC` | 3 | Max continuous cached steps |
18
+ | `SGLANG_CACHE_DIT_TAYLORSEER` | false | Enable TaylorSeer calibrator |
19
+ | `SGLANG_CACHE_DIT_TS_ORDER` | 1 | TaylorSeer order (1 or 2) |
20
+ | `SGLANG_CACHE_DIT_SCM_PRESET` | none | SCM preset (none/slow/medium/fast/ultra) |
21
+ | `SGLANG_CACHE_DIT_SCM_POLICY` | dynamic | SCM caching policy |
22
+ | `SGLANG_CACHE_DIT_SCM_COMPUTE_BINS` | not set | Custom SCM compute bins |
23
+ | `SGLANG_CACHE_DIT_SCM_CACHE_BINS` | not set | Custom SCM cache bins |
24
+
25
+ ## Cloud Storage
26
+
27
+ These variables configure S3-compatible cloud storage for automatically uploading generated images and videos.
28
+
29
+ | Environment Variable | Default | Description |
30
+ |---------------------------------|---------|--------------------------------------------------------|
31
+ | `SGLANG_CLOUD_STORAGE_TYPE` | not set | Set to `s3` to enable cloud storage |
32
+ | `SGLANG_S3_BUCKET_NAME` | not set | The name of the S3 bucket |
33
+ | `SGLANG_S3_ENDPOINT_URL` | not set | Custom endpoint URL (for MinIO, OSS, etc.) |
34
+ | `SGLANG_S3_REGION_NAME` | us-east-1 | AWS region name |
35
+ | `SGLANG_S3_ACCESS_KEY_ID` | not set | AWS Access Key ID |
36
+ | `SGLANG_S3_SECRET_ACCESS_KEY` | not set | AWS Secret Access Key |
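A sketch of how these variables resolve into an S3 configuration (the helper and dict layout are illustrative, not SGLang's internal API):

```python
import os

def s3_config(env=os.environ):
    """Return S3 settings, or None when cloud storage is disabled."""
    if env.get("SGLANG_CLOUD_STORAGE_TYPE") != "s3":
        return None
    return {
        "bucket": env["SGLANG_S3_BUCKET_NAME"],          # required
        "endpoint_url": env.get("SGLANG_S3_ENDPOINT_URL"),
        "region": env.get("SGLANG_S3_REGION_NAME", "us-east-1"),
    }
```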
sglang/docs/diffusion/index.md ADDED
@@ -0,0 +1,98 @@
1
+ # SGLang Diffusion
2
+
3
+ SGLang Diffusion is an inference framework for accelerated image and video generation using diffusion models. It provides an end-to-end unified pipeline with optimized kernels and an efficient scheduler loop.
4
+
5
+ ## Key Features
6
+
7
+ - **Broad Model Support**: Wan series, FastWan series, Hunyuan, Qwen-Image, Qwen-Image-Edit, Flux, Z-Image, GLM-Image, and more
8
+ - **Fast Inference**: Optimized kernels, efficient scheduler loop, and Cache-DiT acceleration
9
+ - **Ease of Use**: OpenAI-compatible API, CLI, and Python SDK
10
+ - **Multi-Platform**: NVIDIA GPUs (H100, H200, A100, B200, 4090), AMD GPUs (MI300X, MI325X) and Ascend NPU (A2, A3)
11
+
12
+ ---
13
+
14
+ ## Quick Start
15
+
16
+ ### Installation
17
+
18
+ ```bash
19
+ uv pip install "sglang[diffusion]" --prerelease=allow
20
+ ```
21
+
22
+ See [Installation Guide](installation.md) for more installation methods and ROCm-specific instructions.
23
+
24
+ ### Basic Usage
25
+
26
+ Generate an image with the CLI:
27
+
28
+ ```bash
29
+ sglang generate --model-path Qwen/Qwen-Image \
30
+ --prompt "A beautiful sunset over the mountains" \
31
+ --save-output
32
+ ```
33
+
34
+ Or start a server with the OpenAI-compatible API:
35
+
36
+ ```bash
37
+ sglang serve --model-path Qwen/Qwen-Image --port 30010
38
+ ```
39
+
40
+ ---
41
+
42
+ ## Documentation
43
+
44
+ ### Getting Started
45
+
46
+ - **[Installation](installation.md)** - Install SGLang Diffusion via pip, uv, Docker, or from source
47
+ - **[Compatibility Matrix](compatibility_matrix.md)** - Supported models and optimization compatibility
48
+
49
+ ### Usage
50
+
51
+ - **[CLI Documentation](api/cli.md)** - Command-line interface for `sglang generate` and `sglang serve`
52
+ - **[OpenAI API](api/openai_api.md)** - OpenAI-compatible API for image/video generation and LoRA management
53
+
54
+ ### Performance Optimization
55
+
56
+ - **[Performance Overview](performance/index.md)** - Overview of all performance optimization strategies
57
+ - **[Attention Backends](performance/attention_backends.md)** - Available attention backends (FlashAttention, SageAttention, etc.)
58
+ - **[Caching Strategies](performance/cache/)** - Cache-DiT and TeaCache acceleration
59
+ - **[Profiling](performance/profiling.md)** - Profiling techniques with PyTorch Profiler and Nsight Systems
60
+
61
+ ### Reference
62
+
63
+ - **[Environment Variables](environment_variables.md)** - Configuration via environment variables
64
+ - **[Support New Models](support_new_models.md)** - Guide for adding new diffusion models
65
+ - **[Contributing](contributing.md)** - Contribution guidelines and commit message conventions
66
+ - **[CI Performance](ci_perf.md)** - Performance baseline generation script
67
+
68
+ ---
69
+
70
+ ## CLI Quick Reference
71
+
72
+ ### Generate (one-off generation)
73
+
74
+ ```bash
75
+ sglang generate --model-path <MODEL> --prompt "<PROMPT>" --save-output
76
+ ```
77
+
78
+ ### Serve (HTTP server)
79
+
80
+ ```bash
81
+ sglang serve --model-path <MODEL> --port 30010
82
+ ```
83
+
84
+ ### Enable Cache-DiT acceleration
85
+
86
+ ```bash
87
+ SGLANG_CACHE_DIT_ENABLED=true sglang generate --model-path <MODEL> --prompt "<PROMPT>"
88
+ ```
89
+
90
+ ---
91
+
92
+ ## References
93
+
94
+ - [SGLang GitHub](https://github.com/sgl-project/sglang)
95
+ - [Cache-DiT](https://github.com/vipshop/cache-dit)
96
+ - [FastVideo](https://github.com/hao-ai-lab/FastVideo)
97
+ - [xDiT](https://github.com/xdit-project/xDiT)
98
+ - [Diffusers](https://github.com/huggingface/diffusers)
sglang/docs/diffusion/installation.md ADDED
@@ -0,0 +1,95 @@
1
+ # Install SGLang-Diffusion
2
+
3
+ You can install SGLang-Diffusion using one of the methods below.
4
+
5
+ ## Standard Installation (NVIDIA GPUs)
6
+
7
+ ### Method 1: With pip or uv
8
+
9
+ It is recommended to use uv for a faster installation:
10
+
11
+ ```bash
12
+ pip install --upgrade pip
13
+ pip install uv
14
+ uv pip install "sglang[diffusion]" --prerelease=allow
15
+ ```
16
+
17
+ ### Method 2: From source
18
+
19
+ ```bash
20
+ # Use the latest release branch
21
+ git clone https://github.com/sgl-project/sglang.git
22
+ cd sglang
23
+
24
+ # Install the Python packages
25
+ pip install --upgrade pip
26
+ pip install -e "python[diffusion]"
27
+
28
+ # With uv
29
+ uv pip install -e "python[diffusion]" --prerelease=allow
30
+ ```
31
+
32
+ ### Method 3: Using Docker
33
+
34
+ The Docker images are available on Docker Hub at [lmsysorg/sglang](https://hub.docker.com/r/lmsysorg/sglang), built from the [Dockerfile](https://github.com/sgl-project/sglang/blob/main/docker/Dockerfile).
35
+ Replace `<secret>` below with your HuggingFace Hub [token](https://huggingface.co/docs/hub/en/security-tokens).
36
+
37
+ ```bash
38
+ docker run --gpus all \
39
+ --shm-size 32g \
40
+ -p 30000:30000 \
41
+ -v ~/.cache/huggingface:/root/.cache/huggingface \
42
+ --env "HF_TOKEN=<secret>" \
43
+ --ipc=host \
44
+ lmsysorg/sglang:dev \
45
+ zsh -c '\
46
+ echo "Installing diffusion dependencies..." && \
47
+ pip install -e "python[diffusion]" && \
48
+ echo "Starting SGLang-Diffusion..." && \
49
+ sglang generate \
50
+ --model-path black-forest-labs/FLUX.1-dev \
51
+ --prompt "A logo With Bold Large text: SGL Diffusion" \
52
+ --save-output \
53
+ '
54
+ ```
55
+
56
+ ## Platform-Specific: ROCm (AMD GPUs)
57
+
58
+ For AMD Instinct GPUs (e.g., MI300X), you can use the ROCm-enabled Docker image:
59
+
60
+ ```bash
61
+ docker run --device=/dev/kfd --device=/dev/dri --ipc=host \
62
+ -v ~/.cache/huggingface:/root/.cache/huggingface \
63
+ --env HF_TOKEN=<secret> \
64
+ lmsysorg/sglang:v0.5.5.post2-rocm700-mi30x \
65
+ sglang generate --model-path black-forest-labs/FLUX.1-dev --prompt "A logo With Bold Large text: SGL Diffusion" --save-output
66
+ ```
67
+
68
+ For detailed ROCm system configuration and installation from source, see [AMD GPUs](../platforms/amd_gpu.md).
69
+
70
+ ## Platform-Specific: MUSA (Moore Threads GPUs)
71
+
72
+ For Moore Threads GPUs (MTGPU) with the MUSA software stack:
73
+
74
+ ```bash
75
+ # Clone the repository
76
+ git clone https://github.com/sgl-project/sglang.git
77
+ cd sglang
78
+
79
+ # Install the Python packages
80
+ pip install --upgrade pip
81
+ rm -f python/pyproject.toml && mv python/pyproject_other.toml python/pyproject.toml
82
+ pip install -e "python[all_musa]"
83
+ ```
84
+
85
+ ## Platform-Specific: Ascend NPU
86
+
87
+ For Ascend NPU, please follow the [NPU installation guide](../platforms/ascend_npu.md).
88
+
89
+ Quick test:
90
+
91
+ ```bash
92
+ sglang generate --model-path black-forest-labs/FLUX.1-dev \
93
+ --prompt "A logo With Bold Large text: SGL Diffusion" \
94
+ --save-output
95
+ ```
sglang/docs/diffusion/performance/attention_backends.md ADDED
@@ -0,0 +1,131 @@
1
+ # Attention Backends
2
+
3
+ This document describes the attention backends available in SGLang Diffusion (`sglang.multimodal_gen`) and how to select them.
4
+
5
+ ## Overview
6
+
7
+ Attention backends are defined by `AttentionBackendEnum` (`sglang.multimodal_gen.runtime.platforms.interface.AttentionBackendEnum`) and selected via the CLI flag `--attention-backend`.
8
+
9
+ Backend selection is performed by the shared attention layers (e.g. `LocalAttention` / `USPAttention` / `UlyssesAttention` in `sglang.multimodal_gen.runtime.layers.attention.layer`) and therefore applies to any model component using these layers (e.g. diffusion transformer / DiT and encoders).
10
+
11
+ When using the diffusers backend, `--attention-backend` is passed through to diffusers'
12
+ `set_attention_backend` (e.g., `flash`, `_flash_3_hub`, `sage`, `xformers`, `native`).
13
+
14
+ - **CUDA**: prefers FlashAttention (FA3/FA4) when supported; otherwise falls back to PyTorch SDPA.
15
+ - **ROCm**: uses FlashAttention when available; otherwise falls back to PyTorch SDPA.
16
+ - **MPS**: always uses PyTorch SDPA.
17
+ - **NPU**: always uses PyTorch SDPA.
18
+
19
+ ## Backend options
20
+
21
+ For SGLang-native pipelines, the CLI accepts the lowercase names of `AttentionBackendEnum`. The table below lists the backends implemented by the built-in platforms. `fa3`/`fa4` are accepted as aliases for `fa`.
22
+
23
+ | CLI value | Enum value | Notes |
24
+ |---|---|---|
25
+ | `fa` / `fa3` / `fa4` | `FA` | FlashAttention. `fa3/fa4` are normalized to `fa` during argument parsing (`ServerArgs.__post_init__`). |
26
+ | `torch_sdpa` | `TORCH_SDPA` | PyTorch `scaled_dot_product_attention`. |
27
+ | `sliding_tile_attn` | `SLIDING_TILE_ATTN` | Sliding Tile Attention (STA). Requires `st_attn`. Configure via `--attention-backend-config`. |
28
+ | `sage_attn` | `SAGE_ATTN` | Requires `sageattention`. Upstream SageAttention CUDA extensions target SM80/SM86/SM89/SM90/SM120 (compute capability 8.0/8.6/8.9/9.0/12.0); see upstream `setup.py`: https://github.com/thu-ml/SageAttention/blob/main/setup.py. |
29
+ | `sage_attn_3` | `SAGE_ATTN_3` | Requires SageAttention3 installed per upstream instructions. |
30
+ | `video_sparse_attn` | `VIDEO_SPARSE_ATTN` | Requires `vsa`. Configure `sparsity` via `--attention-backend-config`. |
31
+ | `vmoba_attn` | `VMOBA_ATTN` | Requires `kernel.attn.vmoba_attn.vmoba`. Configure via `--attention-backend-config`. |
32
+ | `aiter` | `AITER` | Requires `aiter`. |
33
+ | `sparse_video_gen_2_attn` | `SPARSE_VIDEO_GEN_2_ATTN` | Requires `svg`. See installation instructions at https://github.com/svg-project/Sparse-VideoGen. |
34
+
35
+ ## Selection priority
36
+
37
+ The selection order in `runtime/layers/attention/selector.py` is:
38
+
39
+ 1. `global_force_attn_backend(...)` / `global_force_attn_backend_context_manager(...)`
40
+ 2. CLI `--attention-backend` (`ServerArgs.attention_backend`)
41
+ 3. Auto selection (platform capability, dtype, and installed packages)
42
+
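The priority above can be sketched as a simple resolution function (illustrative only; the function and parameter names here are hypothetical, not part of the actual selector module):

```python
def resolve_attn_backend(forced, cli_choice, auto_select):
    """Resolve the attention backend in priority order:
    forced override > CLI flag > automatic selection."""
    if forced is not None:        # global_force_attn_backend(...) override
        return forced
    if cli_choice is not None:    # --attention-backend
        return cli_choice
    return auto_select()          # platform capability / dtype / installed packages
```

For example, with no forced override, `resolve_attn_backend(None, "fa", lambda: "torch_sdpa")` yields `"fa"`.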
43
+ ## Configuration
44
+
45
+ Some backends require additional configuration. You can pass these parameters via `--attention-backend-config`. This argument accepts:
46
+ - A path to a JSON or YAML configuration file.
47
+ - A JSON string (e.g., `'{"sparsity": 0.5}'`).
48
+ - Key-value pairs (e.g., `"sparsity=0.5,enable_x=true"`).
49
+
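As a rough sketch, the three accepted forms could be interpreted as follows (a hypothetical helper for illustration, not the actual parser in `ServerArgs`):

```python
import json

def parse_backend_config(value: str) -> dict:
    """Interpret --attention-backend-config style values (illustrative sketch)."""
    if value.lstrip().startswith("{"):
        return json.loads(value)  # JSON string form
    if value.endswith((".json", ".yaml", ".yml")):
        raise NotImplementedError("file loading omitted in this sketch")
    out = {}
    for pair in value.split(","):  # key=value form
        key, raw = pair.split("=", 1)
        if raw in ("true", "false"):
            out[key] = raw == "true"
        else:
            try:
                out[key] = float(raw) if "." in raw else int(raw)
            except ValueError:
                out[key] = raw  # keep as string
    return out
```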
50
+ ### Supported Configuration Parameters
51
+
52
+ **Sliding Tile Attention (`sliding_tile_attn`)**
53
+
54
+ | Parameter | Type | Description | Default |
55
+ | :--- | :--- | :--- | :--- |
56
+ | `mask_strategy_file_path` | `str` | **Required.** Path to the mask strategy JSON file. | - |
57
+ | `sta_mode` | `str` | Mode of STA. | `STA_inference` |
58
+ | `skip_time_steps` | `int` | Number of steps to use full attention before switching to sparse attention. | `15` |
59
+
60
+ **Video Sparse Attention (`video_sparse_attn`)**
61
+
62
+ | Parameter | Type | Description | Default |
63
+ | :--- | :--- | :--- | :--- |
64
+ | `sparsity` | `float` | Validation sparsity (0.0 - 1.0). | `0.0` |
65
+
66
+ **V-MoBA (`vmoba_attn`)**
67
+
68
+ | Parameter | Type | Description | Default |
69
+ | :--- | :--- | :--- | :--- |
70
+ | `temporal_chunk_size` | `int` | Chunk size for temporal dimension. | - |
71
+ | `temporal_topk` | `int` | Top-K tokens to select in temporal dimension. | - |
72
+ | `spatial_chunk_size` | `list[int]` | Chunk size for spatial dimension (H, W). | - |
73
+ | `spatial_topk` | `int` | Top-K tokens to select in spatial dimension. | - |
74
+ | `st_chunk_size` | `list[int]` | Chunk size for spatiotemporal dimension (T, H, W). | - |
75
+ | `st_topk` | `int` | Top-K tokens to select in spatiotemporal dimension. | - |
76
+ | `moba_select_mode` | `str` | Selection mode (e.g., `threshold`). | `threshold` |
77
+ | `moba_threshold` | `float` | Threshold value for selection. | `0.25` |
78
+ | `moba_threshold_type` | `str` | Type of thresholding (e.g., `query_head`). | `query_head` |
79
+ | `first_full_step` | `int` | Number of initial steps to use full attention. | `12` |
80
+ | `first_full_layer` | `int` | Number of initial layers to use full attention. | `0` |
81
+ | `temporal_layer` | `int` | Number of temporal layers. | `1` |
82
+ | `spatial_layer` | `int` | Number of spatial layers. | `1` |
83
+ | `st_layer` | `int` | Number of spatiotemporal layers. | `1` |
84
+
85
+ ## Platform support matrix
86
+
87
+ | Backend | CUDA | ROCm | MPS | NPU | Notes |
88
+ |---|---:|---:|---:|---:|---|
89
+ | `fa` | ✅ | ✅ | ❌ | ❌ | CUDA requires SM80+ and fp16/bf16. FlashAttention is only used when the required runtime is installed; otherwise it falls back to `torch_sdpa`. |
90
+ | `torch_sdpa` | ✅ | ✅ | ✅ | ✅ | Most compatible option across platforms. |
91
+ | `sliding_tile_attn` | ✅ | ❌ | ❌ | ❌ | CUDA-only. Requires `st_attn`. Configure via `--attention-backend-config`. |
92
+ | `sage_attn` | ✅ | ❌ | ❌ | ❌ | CUDA-only (optional dependency). |
93
+ | `sage_attn_3` | ✅ | ❌ | ❌ | ❌ | CUDA-only (optional dependency). |
94
+ | `video_sparse_attn` | ✅ | ❌ | ❌ | ❌ | CUDA-only. Requires `vsa`. Configure `sparsity` via `--attention-backend-config`. |
95
+ | `vmoba_attn` | ✅ | ❌ | ❌ | ❌ | CUDA-only. Requires `kernel.attn.vmoba_attn.vmoba`. Configure via `--attention-backend-config`. |
96
+ | `aiter` | ❌ | ✅ | ❌ | ❌ | Requires `aiter` (AMD's AI Tensor Engine for ROCm). |
97
+ | `sparse_video_gen_2_attn` | ✅ | ❌ | ❌ | ❌ | CUDA-only. Requires `svg`. |
98
+
99
+ ## Usage
100
+
101
+ ### Select a backend via CLI
102
+
103
+ ```bash
104
+ sglang generate \
105
+ --model-path <MODEL_PATH_OR_ID> \
106
+ --prompt "..." \
107
+ --attention-backend fa
108
+ ```
109
+
110
+ ```bash
111
+ sglang generate \
112
+ --model-path <MODEL_PATH_OR_ID> \
113
+ --prompt "..." \
114
+ --attention-backend torch_sdpa
115
+ ```
116
+
117
+ ### Using Sliding Tile Attention (STA)
118
+
119
+ ```bash
120
+ # Pass the mask strategy file path via config
121
+ sglang generate \
122
+ --model-path <MODEL_PATH_OR_ID> \
123
+ --prompt "..." \
124
+ --attention-backend sliding_tile_attn \
125
+ --attention-backend-config "mask_strategy_file_path=/abs/path/to/mask_strategy.json"
126
+ ```
127
+
128
+ ### Notes for ROCm / MPS
129
+
130
+ - ROCm: use `--attention-backend torch_sdpa` or `fa` depending on what is available in your environment.
131
+ - MPS: the platform implementation always uses `torch_sdpa`.
sglang/docs/diffusion/performance/cache/cache_dit.md ADDED
@@ -0,0 +1,273 @@
1
+ # Cache-DiT Acceleration
2
+
3
+ SGLang integrates [Cache-DiT](https://github.com/vipshop/cache-dit), a caching acceleration engine for Diffusion Transformers (DiT), to achieve up to **1.69x inference speedup** with minimal quality loss.
4
+
5
+ ## Overview
6
+
7
+ **Cache-DiT** uses intelligent caching strategies to skip redundant computation in the denoising loop:
8
+
9
+ - **DBCache (Dual Block Cache)**: Dynamically decides when to cache transformer blocks based on residual differences
10
+ - **TaylorSeer**: Uses Taylor expansion for calibration to optimize caching decisions
11
+ - **SCM (Step Computation Masking)**: Step-level caching control for additional speedup
12
+
13
+ ## Basic Usage
14
+
15
+ Enable Cache-DiT by exporting the environment variable and using `sglang generate` or `sglang serve`:
16
+
17
+ ```bash
18
+ SGLANG_CACHE_DIT_ENABLED=true \
19
+ sglang generate --model-path Qwen/Qwen-Image \
20
+ --prompt "A beautiful sunset over the mountains"
21
+ ```
22
+
23
+ ## Diffusers Backend
24
+
25
+ Cache-DiT supports loading acceleration configs from a custom YAML file. For
26
+ diffusers pipelines (`diffusers` backend), pass the YAML/JSON path via `--cache-dit-config`. This
27
+ flow requires cache-dit >= 1.2.0 (`cache_dit.load_configs`).
28
+
29
+ ### Single GPU inference
30
+
31
+ Define a `cache.yaml` file that contains:
32
+
33
+ ```yaml
34
+ cache_config:
35
+ max_warmup_steps: 8
36
+ warmup_interval: 2
37
+ max_cached_steps: -1
38
+ max_continuous_cached_steps: 2
39
+ Fn_compute_blocks: 1
40
+ Bn_compute_blocks: 0
41
+ residual_diff_threshold: 0.12
42
+ enable_taylorseer: true
43
+ taylorseer_order: 1
44
+ ```
45
+
46
+ Then apply the config with:
47
+
48
+ ```bash
49
+ sglang generate \
50
+ --backend diffusers \
51
+ --model-path Qwen/Qwen-Image \
52
+ --cache-dit-config cache.yaml \
53
+ --prompt "A beautiful sunset over the mountains"
54
+ ```
55
+
56
+ ### Distributed inference
57
+
58
+ - 1D Parallelism
59
+
60
+ Define a parallelism-only config YAML file `parallel.yaml` that contains:
61
+
62
+ ```yaml
63
+ parallelism_config:
64
+ ulysses_size: auto
65
+ parallel_kwargs:
66
+ attention_backend: native
67
+ extra_parallel_modules: ["text_encoder", "vae"]
68
+ ```
69
+
70
+ Here `ulysses_size: auto` means cache-dit will auto-detect the `world_size` as the Ulysses size; otherwise, set it to a specific integer, e.g., 4.
71
+
72
+ Then apply the distributed config (note: add `--num-gpus N` to specify the number of GPUs for distributed inference):
73
+
74
+ ```bash
75
+ sglang generate \
76
+ --backend diffusers \
77
+ --num-gpus 4 \
78
+ --model-path Qwen/Qwen-Image \
79
+ --cache-dit-config parallel.yaml \
80
+ --prompt "A futuristic cityscape at sunset"
81
+ ```
82
+
83
+ - 2D Parallelism
84
+
85
+ You can also define a 2D parallelism config yaml `parallel_2d.yaml` file that contains:
86
+
87
+ ```yaml
88
+ parallelism_config:
89
+ ulysses_size: auto
90
+ tp_size: 2
91
+ parallel_kwargs:
92
+ attention_backend: native
93
+ extra_parallel_modules: ["text_encoder", "vae"]
94
+ ```
95
+ Then apply the 2D parallelism config from the YAML. Here `tp_size: 2` enables tensor parallelism of size 2, and `ulysses_size: auto` means cache-dit will auto-detect `world_size // tp_size` as the Ulysses size.
96
+
97
+ - 3D Parallelism
98
+
99
+ You can also define a 3D parallelism config yaml `parallel_3d.yaml` file that contains:
100
+
101
+ ```yaml
102
+ parallelism_config:
103
+ ulysses_size: 2
104
+ ring_size: 2
105
+ tp_size: 2
106
+ parallel_kwargs:
107
+ attention_backend: native
108
+ extra_parallel_modules: ["text_encoder", "vae"]
109
+ ```
110
+ Then apply the 3D parallelism config from the YAML. Here `ulysses_size: 2`, `ring_size: 2`, and `tp_size: 2` enable Ulysses, ring, and tensor parallelism, each with size 2.
111
+
112
+ ### Hybrid Cache and Parallelism
113
+
114
+ Define a hybrid cache and parallel acceleration config yaml `hybrid.yaml` file that contains:
115
+
116
+ ```yaml
117
+ cache_config:
118
+ max_warmup_steps: 8
119
+ warmup_interval: 2
120
+ max_cached_steps: -1
121
+ max_continuous_cached_steps: 2
122
+ Fn_compute_blocks: 1
123
+ Bn_compute_blocks: 0
124
+ residual_diff_threshold: 0.12
125
+ enable_taylorseer: true
126
+ taylorseer_order: 1
127
+ parallelism_config:
128
+ ulysses_size: auto
129
+ parallel_kwargs:
130
+ attention_backend: native
131
+ extra_parallel_modules: ["text_encoder", "vae"]
132
+ ```
133
+
134
+ Then, apply the hybrid cache and parallel acceleration config from yaml.
135
+
136
+ ```bash
137
+ sglang generate \
138
+ --backend diffusers \
139
+ --num-gpus 4 \
140
+ --model-path Qwen/Qwen-Image \
141
+ --cache-dit-config hybrid.yaml \
142
+ --prompt "A beautiful sunset over the mountains"
143
+ ```
144
+
145
+ ## Advanced Configuration
146
+
147
+ ### DBCache Parameters
148
+
149
+ DBCache controls block-level caching behavior:
150
+
151
+ | Parameter | Env Variable | Default | Description |
152
+ |-----------|---------------------------|---------|------------------------------------------|
153
+ | Fn | `SGLANG_CACHE_DIT_FN` | 1 | Number of first blocks to always compute |
154
+ | Bn | `SGLANG_CACHE_DIT_BN` | 0 | Number of last blocks to always compute |
155
+ | W | `SGLANG_CACHE_DIT_WARMUP` | 4 | Warmup steps before caching starts |
156
+ | R | `SGLANG_CACHE_DIT_RDT` | 0.24 | Residual difference threshold |
157
+ | MC | `SGLANG_CACHE_DIT_MC` | 3 | Maximum continuous cached steps |
158
+
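The Fn/Bn split can be illustrated with a small sketch (a hypothetical helper; the real DBCache logic lives in cache-dit):

```python
def blocks_to_compute(num_blocks: int, fn: int, bn: int, cache_hit: bool) -> list:
    """Return the transformer block indices that still run on a given step.

    On a cache hit, only the first Fn and last Bn blocks are computed;
    the middle blocks reuse cached residuals.
    """
    if not cache_hit:
        return list(range(num_blocks))       # compute everything
    head = list(range(fn))                   # Fn leading blocks
    tail = list(range(num_blocks - bn, num_blocks)) if bn else []
    return head + tail
```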
159
+ ### TaylorSeer Configuration
160
+
161
+ TaylorSeer improves caching accuracy using Taylor expansion:
162
+
163
+ | Parameter | Env Variable | Default | Description |
164
+ |-----------|-------------------------------|---------|---------------------------------|
165
+ | Enable | `SGLANG_CACHE_DIT_TAYLORSEER` | false | Enable TaylorSeer calibrator |
166
+ | Order | `SGLANG_CACHE_DIT_TS_ORDER` | 1 | Taylor expansion order (1 or 2) |
167
+
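The idea behind the order parameter can be sketched with a finite-difference Taylor extrapolation (illustrative only; this is not the actual cache-dit implementation):

```python
def taylor_predict(history, order=1):
    """Extrapolate the next cached feature value from its recent history.

    order=1 uses the first difference; order=2 adds a curvature term.
    """
    d1 = history[-1] - history[-2]                        # first-order difference
    if order == 1:
        return history[-1] + d1
    d2 = history[-1] - 2 * history[-2] + history[-3]      # second-order difference
    return history[-1] + d1 + 0.5 * d2
```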
168
+ ### Combined Configuration Example
169
+
170
+ DBCache and TaylorSeer are complementary strategies; you can configure both sets of parameters
171
+ simultaneously:
172
+
173
+ ```bash
174
+ SGLANG_CACHE_DIT_ENABLED=true \
175
+ SGLANG_CACHE_DIT_FN=2 \
176
+ SGLANG_CACHE_DIT_BN=1 \
177
+ SGLANG_CACHE_DIT_WARMUP=4 \
178
+ SGLANG_CACHE_DIT_RDT=0.4 \
179
+ SGLANG_CACHE_DIT_MC=4 \
180
+ SGLANG_CACHE_DIT_TAYLORSEER=true \
181
+ SGLANG_CACHE_DIT_TS_ORDER=2 \
182
+ sglang generate --model-path black-forest-labs/FLUX.1-dev \
183
+ --prompt "A curious raccoon in a forest"
184
+ ```
185
+
186
+ ### SCM (Step Computation Masking)
187
+
188
+ SCM provides step-level caching control for additional speedup. It decides which denoising steps to compute fully and
189
+ which reuse cached results.
190
+
191
+ **SCM Presets**
192
+
193
+ SCM is configured with presets:
194
+
195
+ | Preset | Compute Ratio | Speed | Quality |
196
+ |----------|---------------|----------|------------|
197
+ | `none` | 100% | Baseline | Best |
198
+ | `slow` | ~75% | ~1.3x | High |
199
+ | `medium` | ~50% | ~2x | Good |
200
+ | `fast` | ~35% | ~3x | Acceptable |
201
+ | `ultra` | ~25% | ~4x | Lower |
202
+
203
+ **Usage**
204
+
205
+ ```bash
206
+ SGLANG_CACHE_DIT_ENABLED=true \
207
+ SGLANG_CACHE_DIT_SCM_PRESET=medium \
208
+ sglang generate --model-path Qwen/Qwen-Image \
209
+ --prompt "A futuristic cityscape at sunset"
210
+ ```
211
+
212
+ **Custom SCM Bins**
213
+
214
+ For fine-grained control over which steps to compute vs cache:
215
+
216
+ ```bash
217
+ SGLANG_CACHE_DIT_ENABLED=true \
218
+ SGLANG_CACHE_DIT_SCM_COMPUTE_BINS="8,3,3,2,2" \
219
+ SGLANG_CACHE_DIT_SCM_CACHE_BINS="1,2,2,2,3" \
220
+ sglang generate --model-path Qwen/Qwen-Image \
221
+ --prompt "A futuristic cityscape at sunset"
222
+ ```
223
+
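The bins alternate runs of computed and cached steps. A sketch of how such bins could expand into a per-step compute mask (a hypothetical helper; the actual Cache-DiT scheduling may differ):

```python
def expand_scm_bins(compute_bins, cache_bins):
    """Interleave compute/cache bins into a per-step mask (True = compute)."""
    mask = []
    for computed, cached in zip(compute_bins, cache_bins):
        mask += [True] * computed + [False] * cached
    return mask

# "8,3,3,2,2" vs "1,2,2,2,3" covers 28 steps, 18 of them fully computed
mask = expand_scm_bins([8, 3, 3, 2, 2], [1, 2, 2, 2, 3])
```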
224
+ **SCM Policy**
225
+
226
+ | Policy | Env Variable | Description |
227
+ |-----------|---------------------------------------|---------------------------------------------|
228
+ | `dynamic` | `SGLANG_CACHE_DIT_SCM_POLICY=dynamic` | Adaptive caching based on content (default) |
229
+ | `static` | `SGLANG_CACHE_DIT_SCM_POLICY=static` | Fixed caching pattern |
230
+
231
+ ## Environment Variables
232
+
233
+ All Cache-DiT parameters can be configured via environment variables.
234
+ See [Environment Variables](../../environment_variables.md) for the complete list.
235
+
236
+ ## Supported Models
237
+
238
+ SGLang Diffusion with Cache-DiT supports almost all models that SGLang Diffusion supports natively:
239
+
240
+ | Model Family | Example Models |
241
+ |--------------|-----------------------------|
242
+ | Wan | Wan2.1, Wan2.2 |
243
+ | Flux | FLUX.1-dev, FLUX.2-dev |
244
+ | Z-Image | Z-Image-Turbo |
245
+ | Qwen | Qwen-Image, Qwen-Image-Edit |
246
+ | Hunyuan | HunyuanVideo |
247
+
248
+ ## Performance Tips
249
+
250
+ 1. **Start with defaults**: The default parameters work well for most models
251
+ 2. **Use TaylorSeer**: It typically improves both speed and quality
252
+ 3. **Tune R threshold**: Lower values = better quality, higher values = faster
253
+ 4. **SCM for extra speed**: Use `medium` preset for good speed/quality balance
254
+ 5. **Warmup matters**: Higher warmup = more stable caching decisions
255
+
256
+ ## Limitations
257
+
258
+ - **SGLang-native pipelines**: Distributed support (TP/SP) is not yet validated; Cache-DiT will be automatically
259
+ disabled when `world_size > 1`.
260
+ - **SCM minimum steps**: SCM requires >= 8 inference steps to be effective
261
+ - **Model support**: Only models registered in Cache-DiT's BlockAdapterRegister are supported
262
+
263
+ ## Troubleshooting
264
+
265
+ ### SCM disabled for low step count
266
+
267
+ For models with < 8 inference steps (e.g., DMD distilled models), SCM will be automatically disabled. DBCache
268
+ acceleration still works.
269
+
270
+ ## References
271
+
272
+ - [Cache-DiT](https://github.com/vipshop/cache-dit)
273
+ - [SGLang Diffusion](../index.md)
sglang/docs/diffusion/performance/cache/index.md ADDED
@@ -0,0 +1,60 @@
1
+ # Caching Acceleration for Diffusion Models
2
+
3
+ SGLang provides multiple caching acceleration strategies for Diffusion Transformer (DiT) models. These strategies can significantly reduce inference time by skipping redundant computation.
4
+
5
+ ## Overview
6
+
7
+ SGLang supports two complementary caching approaches:
8
+
9
+ | Strategy | Scope | Mechanism | Best For |
10
+ |----------|-------|-----------|----------|
11
+ | **Cache-DiT** | Block-level | Skip individual transformer blocks dynamically | Advanced, higher speedup |
12
+ | **TeaCache** | Timestep-level | Skip entire denoising steps based on L1 similarity | Simple, built-in |
13
+
14
+
15
+
16
+ ## Cache-DiT
17
+
18
+ [Cache-DiT](https://github.com/vipshop/cache-dit) provides block-level caching with
19
+ advanced strategies like DBCache and TaylorSeer. It can achieve up to **1.69x speedup**.
20
+
21
+ See [cache_dit.md](cache_dit.md) for detailed configuration.
22
+
23
+ ### Quick Start
24
+
25
+ ```bash
26
+ SGLANG_CACHE_DIT_ENABLED=true \
27
+ sglang generate --model-path Qwen/Qwen-Image \
28
+ --prompt "A beautiful sunset over the mountains"
29
+ ```
30
+
31
+ ### Key Features
32
+
33
+ - **DBCache**: Dynamic block-level caching based on residual differences
34
+ - **TaylorSeer**: Taylor expansion-based calibration for optimized caching
35
+ - **SCM**: Step-level computation masking for additional speedup
36
+
37
+ ## TeaCache
38
+
39
+ TeaCache (Temporal similarity-based caching) accelerates diffusion inference by detecting when consecutive denoising steps are similar enough to skip computation entirely.
40
+
41
+ See [teacache.md](teacache.md) for detailed documentation.
42
+
43
+ ### Quick Overview
44
+
45
+ - Tracks L1 distance between modulated inputs across timesteps
46
+ - When accumulated distance is below threshold, reuses cached residual
47
+ - Supports CFG with separate positive/negative caches
48
+
49
+ ### Supported Models
50
+
51
+ - Wan (wan2.1, wan2.2)
52
+ - Hunyuan (HunyuanVideo)
53
+ - Z-Image
54
+
55
+ For Flux and Qwen models, TeaCache is automatically disabled when CFG is enabled.
56
+
57
+ ## References
58
+
59
+ - [Cache-DiT Repository](https://github.com/vipshop/cache-dit)
60
+ - [TeaCache Paper](https://arxiv.org/abs/2411.14324)
sglang/docs/diffusion/performance/cache/teacache.md ADDED
@@ -0,0 +1,84 @@
1
+ # TeaCache Acceleration
2
+
3
+ > **Note**: This is one of two caching strategies available in SGLang.
4
+ > For an overview of all caching options, see the [caching overview](index.md).
5
+
6
+ TeaCache (Temporal similarity-based caching) accelerates diffusion inference by detecting when consecutive denoising steps are similar enough to skip computation entirely.
7
+
8
+ ## Overview
9
+
10
+ TeaCache works by:
11
+ 1. Tracking the L1 distance between modulated inputs across consecutive timesteps
12
+ 2. Accumulating the rescaled L1 distance over steps
13
+ 3. When accumulated distance is below a threshold, reusing the cached residual
14
+ 4. Supporting CFG (Classifier-Free Guidance) with separate positive/negative caches
15
+
16
+ ## How It Works
17
+
18
+ ### L1 Distance Tracking
19
+
20
+ At each denoising step, TeaCache computes the relative L1 distance between the current and previous modulated inputs:
21
+
22
+ ```
23
+ rel_l1 = |current - previous|.mean() / |previous|.mean()
24
+ ```
25
+
26
+ This distance is then rescaled using polynomial coefficients and accumulated:
27
+
28
+ ```
29
+ accumulated += poly(coefficients)(rel_l1)
30
+ ```
31
+
32
+ ### Cache Decision
33
+
34
+ - If `accumulated >= threshold`: Force computation, reset accumulator
35
+ - If `accumulated < threshold`: Skip computation, use cached residual
36
+
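A minimal sketch of the decision logic described above (the class and attribute names here are illustrative; the real implementation lives in SGLang's model code):

```python
import numpy as np

class TeaCacheState:
    """Track modulated inputs and decide when a step can reuse the cache."""

    def __init__(self, thresh, coefficients):
        self.thresh = thresh
        self.poly = np.polynomial.Polynomial(coefficients)  # rescaling polynomial
        self.accumulated = 0.0
        self.previous = None

    def should_compute(self, modulated_input):
        if self.previous is None:
            self.previous = modulated_input
            return True  # first step always computes
        rel_l1 = (np.abs(modulated_input - self.previous).mean()
                  / np.abs(self.previous).mean())
        self.accumulated += float(self.poly(rel_l1))
        self.previous = modulated_input
        if self.accumulated >= self.thresh:
            self.accumulated = 0.0  # force computation, reset accumulator
            return True
        return False  # reuse cached residual
```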
37
+ ### CFG Support
38
+
39
+ For models that support CFG cache separation (Wan, Hunyuan, Z-Image), TeaCache maintains separate caches for positive and negative branches:
40
+ - `previous_modulated_input` / `previous_residual` for positive branch
41
+ - `previous_modulated_input_negative` / `previous_residual_negative` for negative branch
42
+
43
+ For models that don't support CFG separation (Flux, Qwen), TeaCache is automatically disabled when CFG is enabled.
44
+
45
+ ## Configuration
46
+
47
+ TeaCache is configured via `TeaCacheParams` in the sampling parameters:
48
+
49
+ ```python
50
+ from sglang.multimodal_gen.configs.sample.teacache import TeaCacheParams
51
+
52
+ params = TeaCacheParams(
53
+ teacache_thresh=0.1, # Threshold for accumulated L1 distance
54
+ coefficients=[1.0, 0.0, 0.0], # Polynomial coefficients for L1 rescaling
55
+ )
56
+ ```
57
+
58
+ ### Parameters
59
+
60
+ | Parameter | Type | Description |
61
+ |-----------|------|-------------|
62
+ | `teacache_thresh` | float | Threshold for accumulated L1 distance. Lower = more caching, faster but potentially lower quality |
63
+ | `coefficients` | list[float] | Polynomial coefficients for L1 rescaling. Model-specific tuning |
64
+
65
+ ### Model-Specific Configurations
66
+
67
+ Different models may have different optimal configurations. The coefficients are typically tuned per-model to balance speed and quality.
68
+
69
+ ## Supported Models
70
+
71
+ TeaCache is built into the following model families:
72
+
73
+ | Model Family | CFG Cache Separation | Notes |
74
+ |--------------|---------------------|-------|
75
+ | Wan (wan2.1, wan2.2) | Yes | Full support |
76
+ | Hunyuan (HunyuanVideo) | Yes | To be supported |
77
+ | Z-Image | Yes | To be supported |
78
+ | Flux | No | To be supported |
79
+ | Qwen | No | To be supported |
80
+
81
+
82
+ ## References
83
+
84
+ - [TeaCache: Accelerating Diffusion Models with Temporal Similarity](https://arxiv.org/abs/2411.14324)
sglang/docs/diffusion/performance/index.md ADDED
@@ -0,0 +1,72 @@
1
+ # Performance Optimization
2
+
3
+ SGLang-Diffusion provides multiple performance optimization strategies to accelerate inference. This section covers all available performance tuning options.
4
+
5
+ ## Overview
6
+
7
+ | Optimization | Type | Description |
8
+ |--------------|------|-------------|
9
+ | **Cache-DiT** | Caching | Block-level caching with DBCache, TaylorSeer, and SCM |
10
+ | **TeaCache** | Caching | Timestep-level caching using L1 similarity |
11
+ | **Attention Backends** | Kernel | Optimized attention implementations (FlashAttention, SageAttention, etc.) |
12
+ | **Profiling** | Diagnostics | PyTorch Profiler and Nsight Systems guidance |
13
+
14
+ ## Caching Strategies
15
+
16
+ SGLang supports two complementary caching approaches:
17
+
18
+ ### Cache-DiT
19
+
20
+ [Cache-DiT](https://github.com/vipshop/cache-dit) provides block-level caching with advanced strategies. It can achieve up to **1.69x speedup**.
21
+
22
+ **Quick Start:**
23
+ ```bash
24
+ SGLANG_CACHE_DIT_ENABLED=true \
25
+ sglang generate --model-path Qwen/Qwen-Image \
26
+ --prompt "A beautiful sunset over the mountains"
27
+ ```
28
+
29
+ **Key Features:**
30
+ - **DBCache**: Dynamic block-level caching based on residual differences
31
+ - **TaylorSeer**: Taylor expansion-based calibration for optimized caching
32
+ - **SCM**: Step-level computation masking for additional speedup
33
+
34
+ See [Cache-DiT Documentation](cache/cache_dit.md) for detailed configuration.
35
+
36
+ ### TeaCache
37
+
38
+ TeaCache (Temporal similarity-based caching) accelerates diffusion inference by detecting when consecutive denoising steps are similar enough to skip computation entirely.
39
+
40
+ **Quick Overview:**
41
+ - Tracks L1 distance between modulated inputs across timesteps
42
+ - When accumulated distance is below threshold, reuses cached residual
43
+ - Supports CFG with separate positive/negative caches
44
+
45
+ **Supported Models:** Wan (wan2.1, wan2.2), Hunyuan (HunyuanVideo), Z-Image
46
+
47
+ See [TeaCache Documentation](cache/teacache.md) for detailed configuration.
48
+
49
+ ## Attention Backends
50
+
51
+ Different attention backends offer varying performance characteristics depending on your hardware and model:
52
+
53
+ - **FlashAttention**: Fastest on NVIDIA GPUs with fp16/bf16
54
+ - **SageAttention**: Alternative optimized implementation
55
+ - **xformers**: Memory-efficient attention
56
+ - **SDPA**: PyTorch native scaled dot-product attention
57
+
58
+ See [Attention Backends](attention_backends.md) for platform support and configuration options.
59
+
60
+ ## Profiling
61
+
62
+ To diagnose performance bottlenecks, SGLang-Diffusion supports profiling tools:
63
+
64
+ - **PyTorch Profiler**: Built-in Python profiling
65
+ - **Nsight Systems**: GPU kernel-level analysis
66
+
67
+ See [Profiling Guide](profiling.md) for detailed instructions.
68
+
69
+ ## References
70
+
71
+ - [Cache-DiT Repository](https://github.com/vipshop/cache-dit)
72
+ - [TeaCache Paper](https://arxiv.org/abs/2411.14324)