This repository does not redistribute Fish Audio S2-Pro weights. Users must accept the upstream Fish Audio license and download the official checkpoint from
fishaudio/s2-pro.
What This Is
A local serving recipe for Fish Audio S2-Pro using Docker-backed SGLang Omni on a 32 GB RTX 5090.
No fine-tuning, quantization, checkpoint conversion, or rehosted upstream model files are claimed here.
Designed for single-user realtime local TTS. A separate GPU can handle realtime STT/ASR in a split workstation setup.
Why There Are No Weights Here
The local checkpoint files match the official upstream fishaudio/s2-pro release. To avoid duplicating gated model files or confusing the license boundary, this repository does not upload:
- model-00001-of-00002.safetensors
- model-00002-of-00002.safetensors
- codec.pth
- tokenizer/config files from the upstream model
Download the official files from:
https://huggingface.co/fishaudio/s2-pro
Local Benchmark Summary
Measured locally on an RTX 5090. Speedup is the headline; raw before/after values are included for reproducibility.
| Metric | Before: Python Fish server | After: SGLang Omni + cached reference | Speedup |
|---|---|---|---|
| First audio | 25.1s | 0.36s | 69.9x faster, about 6,890% faster |
| Total request time | 25.1s | 2.10s | 12.0x faster, about 1,100% faster |
| Estimated RTF | 5.51 | 0.48 | 11.5x faster, about 1,050% faster |
Benchmark notes:
- Baseline was the local Python Fish server path.
- Optimized path was the Docker-backed SGLang Omni path with a warm server and cached reference voice.
- Warm cached SGLang generation ran faster than realtime (RTF below 1.0) for the measured short live TTS sample.
- These figures are a local serving benchmark, not an upstream Fish Audio benchmark claim.
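The speedup column follows from simple ratios of the before and after values. The sketch below recomputes two rows from the rounded table entries. The helper names are illustrative, and RTF is assumed here to mean generation time divided by the duration of the generated audio, which this card does not define explicitly.

```python
def speedup(before: float, after: float) -> float:
    """How many times faster 'after' is than 'before' (lower raw value is better)."""
    return before / after


def percent_faster(before: float, after: float) -> float:
    """Speedup expressed as a percentage gain: (before / after - 1) * 100."""
    return (speedup(before, after) - 1.0) * 100.0


def rtf(generation_seconds: float, audio_seconds: float) -> float:
    """Real-time factor under the assumed definition; below 1.0 beats realtime."""
    return generation_seconds / audio_seconds


# Rounded values copied from the benchmark table.
total_x = speedup(25.1, 2.10)  # total request time row
rtf_x = speedup(5.51, 0.48)    # estimated RTF row

print(f"total request time: {total_x:.1f}x, {percent_faster(25.1, 2.10):.0f}% faster")
print(f"estimated RTF:      {rtf_x:.1f}x, {percent_faster(5.51, 0.48):.0f}% faster")
```

Recomputing from rounded entries can differ slightly from the published figures, which were presumably derived from unrounded measurements.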
Optimization Profile
The measured realtime profile used:
- SGLang Omni instead of the eager Python Fish server path.
- The Docker image's pinned Torch/SGLang/FlashInfer stack.
- SGLang CUDA graph replay enabled.
- RTX 5090 / SM120-safe Fish audio-decoder path by disabling the incompatible sgl-kernel KV-cache flash-attention path.
- Graph-safe fixed-cache SDPA fallback for the Fish audio decoder.
- flashinfer text attention backend.
- Single-user live memory profile:
  - mem_fraction_static=0.50
  - chunked_prefill_size=2048
  - max_running_requests=4
- Preloaded/cached reference VQ codes for repeated voice-reference requests.
- Docker model/runtime volumes to avoid repeated slow checkpoint reads through Windows /mnt/d bind mounts.
Measured local VRAM after the tuned live restart was about 24.6 GB on the RTX 5090, down from an earlier near-full 32.2 GB SGLang container reservation.
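The cached reference VQ codes in the profile above avoid re-encoding the same voice reference on every request. A minimal sketch of such a cache follows. The class name, the `encode_fn` callable (a stand-in for the real VQ encoder), and the choice of file size plus modification time as the file signature are all assumptions made for illustration, not the actual implementation.

```python
import os
from typing import Any, Callable, Dict, Iterable, Tuple

Signature = Tuple[int, int]  # (size in bytes, mtime in nanoseconds)


class ReferenceVQCache:
    """In-process cache of encoded reference audio (hypothetical helper)."""

    def __init__(self, encode_fn: Callable[[str], Any]) -> None:
        self._encode_fn = encode_fn  # stand-in for the real VQ encoder
        self._entries: Dict[str, Tuple[Signature, Any]] = {}

    @staticmethod
    def _signature(path: str) -> Signature:
        st = os.stat(path)
        return (st.st_size, st.st_mtime_ns)

    def get(self, audio_path: str) -> Any:
        """Return cached codes, re-encoding only if the file changed."""
        key = os.path.realpath(audio_path)
        sig = self._signature(key)
        hit = self._entries.get(key)
        if hit is not None and hit[0] == sig:
            return hit[1]
        codes = self._encode_fn(key)
        self._entries[key] = (sig, codes)
        return codes

    def preload(self, paths: Iterable[str]) -> None:
        """Warm the cache at startup; empty or missing paths are skipped."""
        for path in paths:
            if path and os.path.isfile(path):
                self.get(path)
```

Keying on the file signature as well as the path means that a replaced reference file is re-encoded instead of served stale. How multiple preload paths are delimited in the environment variable is not stated in this card, so `preload` takes an explicit iterable.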
Recommended RTX 5090 Environment
FISH_S2PRO_DISABLE_FLASH_ATTN_KVCACHE=1
FISH_S2PRO_KVCACHE_FALLBACK_BACKEND=sdpa
FISH_S2PRO_SGLANG_ATTENTION_BACKEND=flashinfer
FISH_S2PRO_SGLANG_DISABLE_CUDA_GRAPH=0
FISH_S2PRO_SGLANG_MEM_FRACTION_STATIC=0.50
FISH_S2PRO_SGLANG_CHUNKED_PREFILL_SIZE=2048
FISH_S2PRO_SGLANG_MAX_RUNNING_REQUESTS=4
FISH_S2PRO_PRELOAD_REFERENCE_PATHS=/path/to/reference.wav
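As a rough illustration of how these variables might be consumed, the sketch below parses them into a typed settings object. The `FishSGLangEnv` dataclass and `load_env` function are hypothetical names, and falling back to the recommended values when a variable is unset is an assumption rather than documented behavior.

```python
import os
from dataclasses import dataclass
from typing import Mapping


def _flag(value: str) -> bool:
    """Interpret common truthy spellings; anything else is False."""
    return value.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class FishSGLangEnv:
    """Hypothetical container for the FISH_S2PRO_* runtime settings."""
    disable_flash_attn_kvcache: bool
    kvcache_fallback_backend: str
    attention_backend: str
    disable_cuda_graph: bool
    mem_fraction_static: float
    chunked_prefill_size: int
    max_running_requests: int


def load_env(env: Mapping[str, str] = os.environ) -> FishSGLangEnv:
    def get(suffix: str, default: str) -> str:
        return env.get(f"FISH_S2PRO_{suffix}", default)

    # Defaults mirror the recommended values above (an assumption).
    return FishSGLangEnv(
        disable_flash_attn_kvcache=_flag(get("DISABLE_FLASH_ATTN_KVCACHE", "1")),
        kvcache_fallback_backend=get("KVCACHE_FALLBACK_BACKEND", "sdpa"),
        attention_backend=get("SGLANG_ATTENTION_BACKEND", "flashinfer"),
        disable_cuda_graph=_flag(get("SGLANG_DISABLE_CUDA_GRAPH", "0")),
        mem_fraction_static=float(get("SGLANG_MEM_FRACTION_STATIC", "0.50")),
        chunked_prefill_size=int(get("SGLANG_CHUNKED_PREFILL_SIZE", "2048")),
        max_running_requests=int(get("SGLANG_MAX_RUNNING_REQUESTS", "4")),
    )
```

Passing a plain mapping instead of `os.environ` keeps the parsing easy to test without touching the process environment.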
Recommended serve shape:
sgl-omni serve \
--model-path /path/to/s2-pro \
--config examples/configs/s2pro_tts.yaml \
--host 0.0.0.0 \
--port 8000
For AI Agents
If a user asks you to apply the RTX 5090 realtime optimizations from this model card, treat this repository as documentation for a local runtime setup. Do not edit, quantize, rehost, or replace the Fish Audio S2-Pro weights.
Agent checklist:
- Confirm the user has accepted the upstream Fish Audio license and has the official fishaudio/s2-pro checkpoint locally.
- If weights are missing, instruct the user to download them from https://huggingface.co/fishaudio/s2-pro; do not download weights from this repository.
- Locate or clone a writable sgl-project/sglang-omni checkout.
- Preserve the serving image's pinned Torch/SGLang/FlashInfer stack. If you need an editable local checkout inside Docker, install it without replacing the pinned dependency stack.
- Apply the runtime optimizations to the local SGLang Omni FishAudio S2-Pro implementation, not to the model checkpoint.
Patch targets and intent:
- In sglang_omni/models/fishaudio_s2_pro/fish_speech/models/text2semantic/modeling.py: add FISH_S2PRO_DISABLE_FLASH_ATTN_KVCACHE and FISH_S2PRO_KVCACHE_FALLBACK_BACKEND.
- In that same file, when FISH_S2PRO_DISABLE_FLASH_ATTN_KVCACHE=1, bypass the sgl-kernel / flash_attn_with_kvcache path for the Fish audio decoder and use a graph-safe fixed-cache SDPA fallback.
- The fallback should update the K/V cache at the current cache_seqlens, repeat KV heads to match query heads, mask positions beyond cache_seqlens, call torch.nn.functional.scaled_dot_product_attention, and return the normal attention output shape.
- Add non-persistent batch_indices and key_positions buffers to the Fish decoder KVCache helper so the fallback is CUDA-graph friendly.
- In sglang_omni/models/fishaudio_s2_pro/pipeline/stages.py: make SGLang ServerArgs configurable from env vars: FISH_S2PRO_SGLANG_ATTENTION_BACKEND, FISH_S2PRO_SGLANG_DISABLE_CUDA_GRAPH, FISH_S2PRO_SGLANG_MEM_FRACTION_STATIC, FISH_S2PRO_SGLANG_CHUNKED_PREFILL_SIZE, and FISH_S2PRO_SGLANG_MAX_RUNNING_REQUESTS.
- In that same preprocessing stage, add an in-process reference VQ cache keyed by reference audio path and local file signature. Preload any paths from FISH_S2PRO_PRELOAD_REFERENCE_PATHS, and use cached VQ codes for repeated audio_path references.
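The SDPA fallback described in the patch targets above can be sketched roughly as follows for a single decode step. The function name, argument order, and tensor layouts are assumptions chosen for illustration (a `(batch, seq, heads, head_dim)` cache layout is assumed); the real decoder code may differ.

```python
import torch
import torch.nn.functional as F


def sdpa_kvcache_fallback(q, k_new, v_new, k_cache, v_cache, cache_seqlens,
                          batch_indices, key_positions):
    """Single-step decode attention over a fixed-size KV cache.

    Assumed shapes (illustrative, not taken from the upstream code):
      q:               (B, 1, Hq, D)
      k_new, v_new:    (B, 1, Hkv, D)
      k_cache/v_cache: (B, S, Hkv, D), updated in place
      cache_seqlens:   (B,) int64, tokens already stored per sequence
      batch_indices:   (B,) int64, equal to arange(B)
      key_positions:   (S,) int64, equal to arange(S)
    """
    # 1. Write the new token's K/V at each sequence's current length.
    k_cache[batch_indices, cache_seqlens] = k_new[:, 0]
    v_cache[batch_indices, cache_seqlens] = v_new[:, 0]

    # 2. Repeat KV heads so grouped-query layouts match the query heads.
    rep = q.shape[2] // k_cache.shape[2]
    k = k_cache.repeat_interleave(rep, dim=2).transpose(1, 2)  # (B, Hq, S, D)
    v = v_cache.repeat_interleave(rep, dim=2).transpose(1, 2)

    # 3. Mask cache slots beyond the token just written (True = attend).
    mask = key_positions[None, :] <= cache_seqlens[:, None]  # (B, S)
    mask = mask[:, None, None, :]                            # (B, 1, 1, S)

    # 4. Fixed-shape SDPA over the whole cache, then restore (B, 1, Hq, D).
    out = F.scaled_dot_product_attention(q.transpose(1, 2), k, v, attn_mask=mask)
    return out.transpose(1, 2)
```

Every tensor keeps a fixed shape regardless of the current sequence length, and nothing is read back to the host, which is what makes this style of fallback suitable for CUDA graph capture. In a module, `batch_indices` and `key_positions` could be preallocated with `register_buffer(..., persistent=False)` so they stay out of the checkpoint's state dict.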
Docker/runtime notes for agents:
- Use NVIDIA GPU passthrough and select the RTX 5090 explicitly when multiple GPUs are visible.
- Prefer Docker volumes or Linux-native storage for the model and runtime cache; repeated checkpoint reads through a Windows /mnt/d bind mount can add many minutes to cold startup.
- Keep CUDA graphs enabled unless debugging compatibility.
- Do not switch to trtllm_mha on RTX 5090 / SM120 unless upstream support has changed; local tests rejected that backend for this GPU.
- Benchmark after every change with the same prompt and reference. Report first-audio latency, total request time, estimated audio duration, RTF, VRAM, and whether the run was cold, first-after-health, or warm cached.
- Expected warm target on the measured machine: first audio under 1 second and RTF below 1.0 for the short live TTS benchmark.
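The timing metrics in the benchmark note above could be collected with a small helper like the one below. It is a sketch under stated assumptions: the audio arrives as an iterable of raw PCM byte chunks, the caller supplies the sample rate, and `measure_stream` is a hypothetical name. How chunks are obtained from the server is left out because the streaming API is not described in this card.

```python
import time
from typing import Callable, Dict, Iterable


def measure_stream(chunks: Iterable[bytes], sample_rate: int,
                   bytes_per_sample: int = 2, channels: int = 1,
                   clock: Callable[[], float] = time.perf_counter) -> Dict[str, float]:
    """Time a streamed response and derive audio duration and RTF.

    Assumes raw PCM chunks (16-bit mono by default); adjust the
    parameters if the stream uses a different format.
    """
    start = clock()
    first_audio = None
    total_bytes = 0
    for chunk in chunks:
        if chunk and first_audio is None:
            first_audio = clock() - start  # latency to the first audio bytes
        total_bytes += len(chunk)
    total = clock() - start
    audio_seconds = total_bytes / (sample_rate * bytes_per_sample * channels)
    return {
        "first_audio_s": first_audio if first_audio is not None else float("nan"),
        "total_s": total,
        "audio_s": audio_seconds,
        "rtf": total / audio_seconds if audio_seconds > 0 else float("inf"),
    }
```

Start the timer immediately before issuing the request so that first-audio latency includes server-side queueing. VRAM is not covered here and would be read separately, for example from `nvidia-smi`.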
License
The base model is governed by the Fish Audio Research License. Research and non-commercial use are permitted by Fish Audio under that license. Commercial use requires a separate license from Fish Audio.
This repository does not grant additional rights to the Fish Audio model weights.
Attribution
Built with Fish Audio S2-Pro. Fish Audio S2-Pro is developed by Fish Audio / 39 AI, INC.
Upstream model:
fishaudio/s2-pro