# Attention Backends
This document describes the attention backends available in sglang diffusion (`sglang.multimodal_gen`) and how to select them.
## Overview
Attention backends are defined by `AttentionBackendEnum` (`sglang.multimodal_gen.runtime.platforms.interface.AttentionBackendEnum`) and selected via the CLI flag `--attention-backend`.

Backend selection is performed by the shared attention layers (e.g. `LocalAttention` / `USPAttention` / `UlyssesAttention` in `sglang.multimodal_gen.runtime.layers.attention.layer`) and therefore applies to any model component using these layers (e.g. diffusion transformer / DiT and encoders).
When using the diffusers backend, `--attention-backend` is passed through to diffusers' `set_attention_backend` (e.g., `flash`, `_flash_3_hub`, `sage`, `xformers`, `native`).
When no backend is forced or passed on the CLI, auto selection picks a platform default:

- CUDA: prefers FlashAttention (FA3/FA4) when supported; otherwise falls back to PyTorch SDPA.
- ROCm: uses FlashAttention when available; otherwise falls back to PyTorch SDPA.
- MPS: always uses PyTorch SDPA.
- NPU: always uses PyTorch SDPA.
## Backend options
For SGLang-native pipelines, the CLI accepts the lowercase names of `AttentionBackendEnum`. The table below lists the backends implemented by the built-in platforms. `fa3`/`fa4` are accepted as aliases for `fa`.
| CLI value | Enum value | Notes |
|---|---|---|
| `fa` / `fa3` / `fa4` | `FA` | FlashAttention. `fa3`/`fa4` are normalized to `fa` during argument parsing (`ServerArgs.__post_init__`). |
| `torch_sdpa` | `TORCH_SDPA` | PyTorch `scaled_dot_product_attention`. |
| `sliding_tile_attn` | `SLIDING_TILE_ATTN` | Sliding Tile Attention (STA). Requires `st_attn`. Configure via `--attention-backend-config`. |
| `sage_attn` | `SAGE_ATTN` | Requires `sageattention`. Upstream SageAttention CUDA extensions target SM80/SM86/SM89/SM90/SM120 (compute capability 8.0/8.6/8.9/9.0/12.0); see upstream `setup.py`: https://github.com/thu-ml/SageAttention/blob/main/setup.py. |
| `sage_attn_3` | `SAGE_ATTN_3` | Requires SageAttention3 installed per upstream instructions. |
| `video_sparse_attn` | `VIDEO_SPARSE_ATTN` | Requires `vsa`. Configure sparsity via `--attention-backend-config`. |
| `vmoba_attn` | `VMOBA_ATTN` | Requires `kernel.attn.vmoba_attn.vmoba`. Configure via `--attention-backend-config`. |
| `aiter` | `AITER` | Requires `aiter`. |
| `sparse_video_gen_2_attn` | `SPARSE_VIDEO_GEN_2_ATTN` | Requires `svg`. See installation instructions at https://github.com/svg-project/Sparse-VideoGen. |
## Selection priority
The selection order in `runtime/layers/attention/selector.py` is:

1. `global_force_attn_backend(...)` / `global_force_attn_backend_context_manager(...)`
2. CLI `--attention-backend` (`ServerArgs.attention_backend`)
3. Auto selection (platform capability, dtype, and installed packages)
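The priority above can be sketched as a toy Python model. This is illustrative only: the function names mirror the selector functions mentioned above, but the bodies are simplified stand-ins, and `torch_sdpa` here stands in for the real platform- and dtype-aware auto selection.

```python
# Toy sketch of the three-level selection priority (not the sglang implementation):
# 1. global override  >  2. CLI flag  >  3. auto selection fallback.
from contextlib import contextmanager
from typing import Optional

_forced_backend: Optional[str] = None  # process-wide override


def global_force_attn_backend(backend: Optional[str]) -> None:
    """Force a backend globally; pass None to clear the override."""
    global _forced_backend
    _forced_backend = backend


@contextmanager
def global_force_attn_backend_context_manager(backend: str):
    """Temporarily force a backend, restoring the previous value on exit."""
    prev = _forced_backend
    global_force_attn_backend(backend)
    try:
        yield
    finally:
        global_force_attn_backend(prev)


def select_backend(cli_backend: Optional[str]) -> str:
    """Resolve the backend according to the documented priority order."""
    if _forced_backend is not None:       # 1. global override
        return _forced_backend
    if cli_backend is not None:           # 2. CLI --attention-backend
        return cli_backend
    return "torch_sdpa"                   # 3. stand-in for auto selection
```

For example, inside `global_force_attn_backend_context_manager("sage_attn")` the forced backend wins even if the CLI requested `fa`; outside it, the CLI value applies again.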
## Configuration
Some backends require additional configuration. You can pass these parameters via `--attention-backend-config`. This argument accepts:

- A path to a JSON or YAML configuration file.
- A JSON string (e.g., `'{"sparsity": 0.5}'`).
- Key-value pairs (e.g., `"sparsity=0.5,enable_x=true"`).
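For the file form, a minimal configuration could look like the following sketch (hypothetical file name and value; `sparsity` is the `video_sparse_attn` parameter documented below):

```yaml
# vsa_config.yaml -- example values only
sparsity: 0.5
```

This file would then be passed as `--attention-backend-config /path/to/vsa_config.yaml`; the JSON-string and key-value forms above express the same setting inline.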
### Supported Configuration Parameters

#### Sliding Tile Attention (`sliding_tile_attn`)

| Parameter | Type | Description | Default |
|---|---|---|---|
| `mask_strategy_file_path` | `str` | Required. Path to the mask strategy JSON file. | - |
| `sta_mode` | `str` | Mode of STA. | `STA_inference` |
| `skip_time_steps` | `int` | Number of steps to use full attention before switching to sparse attention. | `15` |
#### Video Sparse Attention (`video_sparse_attn`)

| Parameter | Type | Description | Default |
|---|---|---|---|
| `sparsity` | `float` | Validation sparsity (0.0 - 1.0). | `0.0` |
#### V-MoBA (`vmoba_attn`)

| Parameter | Type | Description | Default |
|---|---|---|---|
| `temporal_chunk_size` | `int` | Chunk size for temporal dimension. | - |
| `temporal_topk` | `int` | Top-K tokens to select in temporal dimension. | - |
| `spatial_chunk_size` | `list[int]` | Chunk size for spatial dimension (H, W). | - |
| `spatial_topk` | `int` | Top-K tokens to select in spatial dimension. | - |
| `st_chunk_size` | `list[int]` | Chunk size for spatiotemporal dimension (T, H, W). | - |
| `st_topk` | `int` | Top-K tokens to select in spatiotemporal dimension. | - |
| `moba_select_mode` | `str` | Selection mode (e.g., `threshold`). | `threshold` |
| `moba_threshold` | `float` | Threshold value for selection. | `0.25` |
| `moba_threshold_type` | `str` | Type of thresholding (e.g., `query_head`). | `query_head` |
| `first_full_step` | `int` | Number of initial steps to use full attention. | `12` |
| `first_full_layer` | `int` | Number of initial layers to use full attention. | `0` |
| `temporal_layer` | `int` | Number of temporal layers. | `1` |
| `spatial_layer` | `int` | Number of spatial layers. | `1` |
| `st_layer` | `int` | Number of spatiotemporal layers. | `1` |
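Putting these parameters together, a V-MoBA configuration file might look like the following sketch. The chunk sizes and top-K values are illustrative placeholders that must be tuned per model and resolution; only the commented entries reflect the defaults listed in the table above.

```yaml
# vmoba_config.yaml -- illustrative values
temporal_chunk_size: 4        # placeholder, tune per model
temporal_topk: 2              # placeholder, tune per model
spatial_chunk_size: [8, 8]    # placeholder (H, W)
spatial_topk: 4               # placeholder
st_chunk_size: [4, 8, 8]      # placeholder (T, H, W)
st_topk: 4                    # placeholder
moba_select_mode: threshold      # default
moba_threshold: 0.25             # default
moba_threshold_type: query_head  # default
first_full_step: 12              # default
first_full_layer: 0              # default
```

It would be passed via `--attention-backend vmoba_attn --attention-backend-config /path/to/vmoba_config.yaml`.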
## Platform support matrix

| Backend | CUDA | ROCm | MPS | NPU | Notes |
|---|---|---|---|---|---|
| `fa` | ✅ | ✅ | ❌ | ❌ | CUDA requires SM80+ and fp16/bf16. FlashAttention is only used when the required runtime is installed; otherwise it falls back to `torch_sdpa`. |
| `torch_sdpa` | ✅ | ✅ | ✅ | ✅ | Most compatible option across platforms. |
| `sliding_tile_attn` | ✅ | ❌ | ❌ | ❌ | CUDA-only. Requires `st_attn`. Configure via `--attention-backend-config`. |
| `sage_attn` | ✅ | ❌ | ❌ | ❌ | CUDA-only (optional dependency). |
| `sage_attn_3` | ✅ | ❌ | ❌ | ❌ | CUDA-only (optional dependency). |
| `video_sparse_attn` | ✅ | ❌ | ❌ | ❌ | CUDA-only. Requires `vsa`. Configure sparsity via `--attention-backend-config`. |
| `vmoba_attn` | ✅ | ❌ | ❌ | ❌ | CUDA-only. Requires `kernel.attn.vmoba_attn.vmoba`. Configure via `--attention-backend-config`. |
| `aiter` | ❌ | ✅ | ❌ | ❌ | ROCm-only. Requires `aiter`. |
| `sparse_video_gen_2_attn` | ✅ | ❌ | ❌ | ❌ | CUDA-only. Requires `svg`. |
## Usage

### Select a backend via CLI
```shell
sglang generate \
  --model-path <MODEL_PATH_OR_ID> \
  --prompt "..." \
  --attention-backend fa
```

```shell
sglang generate \
  --model-path <MODEL_PATH_OR_ID> \
  --prompt "..." \
  --attention-backend torch_sdpa
```
### Using Sliding Tile Attention (STA)

```shell
# Pass the mask strategy file path via config
sglang generate \
  --model-path <MODEL_PATH_OR_ID> \
  --prompt "..." \
  --attention-backend sliding_tile_attn \
  --attention-backend-config "mask_strategy_file_path=/abs/path/to/mask_strategy.json"
```
### Notes for ROCm / MPS
- ROCm: use `--attention-backend torch_sdpa` or `fa` depending on what is available in your environment.
- MPS: the platform implementation always uses `torch_sdpa`.