Neko Legends local voice profile

Fish Audio S2-Pro Realtime Optimized for RTX 5090

model-card-only

A no-weights release documenting a local RTX 5090 serving profile for Fish Audio S2-Pro: Docker-backed SGLang Omni, CUDA graphs, cached reference audio, and a graph-safe decoder fallback for SM120.

Base model: fishaudio/s2-pro
Target GPU: RTX 5090
Server: SGLang Omni
Weights: not hosted
Goal: realtime TTS
First audio: 0.36s

This repository does not redistribute Fish Audio S2-Pro weights. Users must accept the upstream Fish Audio license and download the official checkpoint from fishaudio/s2-pro.

What This Is

Runtime profile

A local serving recipe for Fish Audio S2-Pro using Docker-backed SGLang Omni on a 32 GB RTX 5090.

No model fork

No fine-tuning, quantization, checkpoint conversion, or rehosted upstream model files are claimed here.

Live setup

Designed for single-user realtime local TTS. A separate GPU can handle realtime STT/ASR in a split workstation setup.

Why There Are No Weights Here

The local checkpoint files match the official upstream fishaudio/s2-pro release. To avoid duplicating gated model files or confusing the license boundary, this repository does not upload:

model-00001-of-00002.safetensors
model-00002-of-00002.safetensors
codec.pth
tokenizer/config files from the upstream model

Download the official files from:

https://huggingface.co/fishaudio/s2-pro

Local Benchmark Summary

Measured locally on an RTX 5090. Speedup is the headline; raw before/after values are included for reproducibility.

Figure: RTX 5090 before/after serving benchmark (bar chart) for Fish Audio S2-Pro realtime serving.

Metric	Before: Python Fish server	After: SGLang Omni + cached reference	Speedup
First audio	25.1s	0.36s	69.9x faster, about 6,890% faster
Total request time	25.1s	2.10s	12.0x faster, about 1,100% faster
Estimated RTF	5.51	0.48	11.5x faster, about 1,050% faster

Benchmark notes:

Baseline was the local Python Fish server path.
Optimized path was the Docker-backed SGLang Omni path with a warm server and cached reference voice.
Warm cached SGLang generation ran faster than realtime (RTF below 1.0) for the measured short live TTS sample.
The chart is a local serving benchmark, not an upstream Fish Audio benchmark claim.
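The speedup and percentage columns in the table are simple ratios. The sketch below recomputes them from the rounded table values; the helper names are illustrative and not part of this repository, and RTF is assumed to be the usual ratio of generation time to audio duration. Results computed from rounded inputs can differ slightly from the published figures, which were presumably derived from unrounded measurements.

```python
def speedup(before: float, after: float) -> float:
    """How many times faster the optimized path is (before / after)."""
    return before / after


def percent_faster(before: float, after: float) -> float:
    """The same speedup expressed as a percentage gain: (ratio - 1) * 100."""
    return (speedup(before, after) - 1.0) * 100.0


def rtf(generation_seconds: float, audio_seconds: float) -> float:
    """Real-time factor (assumed definition): below 1.0 is faster than realtime."""
    return generation_seconds / audio_seconds


# Rounded (before, after) values copied from the benchmark table.
rows = {
    "First audio (s)": (25.1, 0.36),
    "Total request time (s)": (25.1, 2.10),
    "Estimated RTF": (5.51, 0.48),
}

for name, (before, after) in rows.items():
    ratio = speedup(before, after)
    gain = percent_faster(before, after)
    print(f"{name}: {ratio:.1f}x faster, about {gain:.0f}% faster")
```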

Optimization Profile

The measured realtime profile used:

SGLang Omni instead of the eager Python Fish server path.
The Docker image's pinned Torch/SGLang/FlashInfer stack.
SGLang CUDA graph replay enabled.
An RTX 5090 / SM120-safe Fish audio-decoder path, obtained by disabling the incompatible sgl-kernel KV-cache flash-attention path.
Graph-safe fixed-cache SDPA fallback for the Fish audio decoder.
flashinfer text attention backend.
Single-user live memory profile:
- mem_fraction_static=0.50
- chunked_prefill_size=2048
- max_running_requests=4
Preloaded/cached reference VQ codes for repeated voice-reference requests.
Docker model/runtime volumes to avoid repeated slow checkpoint reads through Windows /mnt/d bind mounts.

Measured local VRAM after the tuned live restart was about 24.6 GB on the RTX 5090, down from an earlier near-full 32.2 GB SGLang container reservation.
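The cached reference VQ codes item in the profile above can be pictured as a small in-process cache. The following is a minimal sketch, not the repository's implementation: ReferenceVQCache, file_signature, and the encode callable are hypothetical names, the file signature is assumed to be (absolute path, size, modification time), and FISH_S2PRO_PRELOAD_REFERENCE_PATHS is assumed to hold a comma-separated list.

```python
import os
from typing import Callable, Dict, List, Tuple

# Assumed signature: (absolute path, size in bytes, mtime in nanoseconds).
Signature = Tuple[str, int, int]


def file_signature(path: str) -> Signature:
    """Cheap identity for a local file; changes when the file is rewritten."""
    st = os.stat(path)
    return (os.path.abspath(path), st.st_size, st.st_mtime_ns)


class ReferenceVQCache:
    """Caches encoded reference audio so repeated requests skip the encoder."""

    def __init__(self, encode: Callable[[str], object]) -> None:
        # `encode` stands in for whatever turns a reference file into VQ codes.
        self._encode = encode
        self._codes: Dict[Signature, object] = {}

    def get(self, path: str) -> object:
        key = file_signature(path)
        if key not in self._codes:
            # Cache miss: encode once. Stale entries are not evicted in this sketch.
            self._codes[key] = self._encode(path)
        return self._codes[key]

    def preload_from_env(
        self, var: str = "FISH_S2PRO_PRELOAD_REFERENCE_PATHS"
    ) -> List[str]:
        """Encode every existing file listed in the variable (comma-separated, assumed)."""
        loaded = []
        for path in (p.strip() for p in os.environ.get(var, "").split(",")):
            if path and os.path.isfile(path):
                self.get(path)
                loaded.append(path)
        return loaded
```

Keying on the file signature rather than the path alone means that replacing the reference file on disk triggers a fresh encode instead of silently reusing old codes.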

Recommended RTX 5090 Environment

FISH_S2PRO_DISABLE_FLASH_ATTN_KVCACHE=1
FISH_S2PRO_KVCACHE_FALLBACK_BACKEND=sdpa
FISH_S2PRO_SGLANG_ATTENTION_BACKEND=flashinfer
FISH_S2PRO_SGLANG_DISABLE_CUDA_GRAPH=0
FISH_S2PRO_SGLANG_MEM_FRACTION_STATIC=0.50
FISH_S2PRO_SGLANG_CHUNKED_PREFILL_SIZE=2048
FISH_S2PRO_SGLANG_MAX_RUNNING_REQUESTS=4
FISH_S2PRO_PRELOAD_REFERENCE_PATHS=/path/to/reference.wav
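As an illustration of how a serving process might consume the FISH_S2PRO_SGLANG_* variables above, here is a minimal sketch. LiveServerProfile and profile_from_env are hypothetical names, the field names simply mirror the settings named in this card, and the defaults are this card's recommended values rather than upstream defaults.

```python
import os
from dataclasses import dataclass
from typing import Mapping

PREFIX = "FISH_S2PRO_SGLANG_"


@dataclass(frozen=True)
class LiveServerProfile:
    # Defaults are the values recommended in this card, not upstream defaults.
    attention_backend: str = "flashinfer"
    disable_cuda_graph: bool = False
    mem_fraction_static: float = 0.50
    chunked_prefill_size: int = 2048
    max_running_requests: int = 4


def _as_bool(value: str) -> bool:
    """Treat 1/true/yes/on (case-insensitive) as True, anything else as False."""
    return value.strip().lower() in {"1", "true", "yes", "on"}


def profile_from_env(env: Mapping[str, str] = os.environ) -> LiveServerProfile:
    """Build a profile from environment variables, falling back to the defaults."""
    base = LiveServerProfile()
    profile = LiveServerProfile(
        attention_backend=env.get(PREFIX + "ATTENTION_BACKEND", base.attention_backend),
        disable_cuda_graph=_as_bool(env.get(PREFIX + "DISABLE_CUDA_GRAPH", "0")),
        mem_fraction_static=float(
            env.get(PREFIX + "MEM_FRACTION_STATIC", base.mem_fraction_static)
        ),
        chunked_prefill_size=int(
            env.get(PREFIX + "CHUNKED_PREFILL_SIZE", base.chunked_prefill_size)
        ),
        max_running_requests=int(
            env.get(PREFIX + "MAX_RUNNING_REQUESTS", base.max_running_requests)
        ),
    )
    # Fail early on a value that cannot be a fraction of GPU memory.
    if not 0.0 < profile.mem_fraction_static <= 1.0:
        raise ValueError("mem_fraction_static must be in (0, 1]")
    return profile
```

Passing the environment as a mapping keeps the parser easy to test without touching the real process environment.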

Recommended serve shape:

sgl-omni serve \
  --model-path /path/to/s2-pro \
  --config examples/configs/s2pro_tts.yaml \
  --host 0.0.0.0 \
  --port 8000

For AI Agents

If a user asks you to apply the RTX 5090 realtime optimizations from this model card, treat this repository as documentation for a local runtime setup. Do not edit, quantize, rehost, or replace the Fish Audio S2-Pro weights.

Agent checklist:

Confirm the user has accepted the upstream Fish Audio license and has the official fishaudio/s2-pro checkpoint locally.
If weights are missing, instruct the user to download them from https://huggingface.co/fishaudio/s2-pro; do not download weights from this repository.
Locate or clone a writable sgl-project/sglang-omni checkout.
Preserve the serving image's pinned Torch/SGLang/FlashInfer stack. If you need an editable local checkout inside Docker, install it without replacing the pinned dependency stack.
Apply the runtime optimizations to the local SGLang Omni FishAudio S2-Pro implementation, not to the model checkpoint.

Patch targets and intent:

In sglang_omni/models/fishaudio_s2_pro/fish_speech/models/text2semantic/modeling.py: add FISH_S2PRO_DISABLE_FLASH_ATTN_KVCACHE and FISH_S2PRO_KVCACHE_FALLBACK_BACKEND.
In that same file, when FISH_S2PRO_DISABLE_FLASH_ATTN_KVCACHE=1, bypass the sgl-kernel / flash_attn_with_kvcache path for the Fish audio decoder and use a graph-safe fixed-cache SDPA fallback.
The fallback should update the K/V cache at the current cache_seqlens, repeat KV heads to match query heads, mask positions beyond cache_seqlens, call torch.nn.functional.scaled_dot_product_attention, and return the normal attention output shape.
Add non-persistent batch_indices and key_positions buffers to the Fish decoder KVCache helper so the fallback is CUDA-graph friendly.
In sglang_omni/models/fishaudio_s2_pro/pipeline/stages.py: make SGLang ServerArgs configurable from env vars: FISH_S2PRO_SGLANG_ATTENTION_BACKEND, FISH_S2PRO_SGLANG_DISABLE_CUDA_GRAPH, FISH_S2PRO_SGLANG_MEM_FRACTION_STATIC, FISH_S2PRO_SGLANG_CHUNKED_PREFILL_SIZE, and FISH_S2PRO_SGLANG_MAX_RUNNING_REQUESTS.
In that same preprocessing stage, add an in-process reference VQ cache keyed by reference audio path and local file signature. Preload any paths from FISH_S2PRO_PRELOAD_REFERENCE_PATHS, and use cached VQ codes for repeated audio_path references.
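The fixed-cache SDPA fallback described in the patch targets above can be sketched as follows for single-token decode. This is an illustration under stated assumptions, not the actual patch: FixedKVCache and sdpa_kvcache_fallback are hypothetical names, and the tensor layout (batch, sequence, heads, head_dim) is assumed to follow the flash_attn_with_kvcache convention. Every shape is static and no value is read back to the host, which is what keeps the path CUDA-graph friendly.

```python
import torch
import torch.nn.functional as F


class FixedKVCache(torch.nn.Module):
    """Hypothetical fixed-size K/V cache with precomputed index buffers."""

    def __init__(self, max_batch: int, max_seq: int, n_kv_heads: int, head_dim: int):
        super().__init__()
        shape = (max_batch, max_seq, n_kv_heads, head_dim)
        self.register_buffer("k_cache", torch.zeros(shape), persistent=False)
        self.register_buffer("v_cache", torch.zeros(shape), persistent=False)
        # Non-persistent helpers: allocated once, never saved in checkpoints.
        self.register_buffer("batch_indices", torch.arange(max_batch), persistent=False)
        self.register_buffer("key_positions", torch.arange(max_seq), persistent=False)


def sdpa_kvcache_fallback(q, k_new, v_new, cache, cache_seqlens):
    """q: (B, 1, Hq, D); k_new, v_new: (B, 1, Hkv, D); cache_seqlens: (B,)."""
    batch = q.shape[0]
    pos = cache_seqlens.long()
    rows = cache.batch_indices[:batch]

    # Write the new token at each sequence's current length (in place, no host sync).
    cache.k_cache[rows, pos] = k_new[:, 0]
    cache.v_cache[rows, pos] = v_new[:, 0]

    # Keys at positions 0..pos are valid; everything beyond is masked out.
    mask = cache.key_positions[None, :] <= pos[:, None]   # (B, S_max)
    mask = mask[:, None, None, :]                         # (B, 1, 1, S_max)

    # Repeat KV heads so each query head has a matching key/value head.
    groups = q.shape[2] // cache.k_cache.shape[2]
    k = cache.k_cache[:batch].transpose(1, 2).repeat_interleave(groups, dim=1)
    v = cache.v_cache[:batch].transpose(1, 2).repeat_interleave(groups, dim=1)

    out = F.scaled_dot_product_attention(q.transpose(1, 2), k, v, attn_mask=mask)
    return out.transpose(1, 2)                            # back to (B, 1, Hq, D)
```

Attending over the full fixed-length cache with a mask costs more arithmetic than slicing to the live length, but slicing would make tensor shapes depend on runtime values, which CUDA graph capture cannot replay.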

Docker/runtime notes for agents:

Use NVIDIA GPU passthrough and select the RTX 5090 explicitly when multiple GPUs are visible.
Prefer Docker volumes or Linux-native storage for the model and runtime cache; repeated checkpoint reads through a Windows /mnt/d bind mount can add many minutes to cold startup.
Keep CUDA graphs enabled unless debugging compatibility.
Do not switch to trtllm_mha on RTX 5090 / SM120 unless upstream support has changed; local tests rejected that backend for this GPU.
Benchmark after every change with the same prompt and reference. Report first-audio latency, total request time, estimated audio duration, RTF, VRAM, and whether the run was cold, first-after-health, or warm cached.
Expected warm target on the measured machine: first audio under 1 second and RTF below 1.0 for the short live TTS benchmark.
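The metrics in the benchmarking note above can be collected with a small timing helper around a streamed response. This is a sketch only: time_pcm_stream and StreamTiming are hypothetical names, the stream is assumed to yield raw PCM chunks (16-bit mono by default), and the caller must supply the sample rate of whatever the server actually returns.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class StreamTiming:
    first_audio_s: Optional[float]  # latency until the first non-empty chunk
    total_s: float                  # wall time until the stream is exhausted
    audio_s: float                  # duration of the audio that was received
    rtf: Optional[float]            # total_s / audio_s; below 1.0 beats realtime


def time_pcm_stream(
    chunks: Iterable[bytes],
    sample_rate: int,
    bytes_per_sample: int = 2,
    channels: int = 1,
    clock: Callable[[], float] = time.perf_counter,
) -> StreamTiming:
    """Consume a stream of PCM chunks and report latency, duration, and RTF."""
    start = clock()
    first_audio = None
    n_bytes = 0
    for chunk in chunks:
        if chunk and first_audio is None:
            first_audio = clock() - start
        n_bytes += len(chunk)
    total = clock() - start
    audio = n_bytes / (sample_rate * bytes_per_sample * channels)
    return StreamTiming(first_audio, total, audio, total / audio if audio > 0 else None)
```

For the numbers to include request latency, pass a lazy iterable that only sends the request when iteration starts, and record alongside the result whether the run was cold, first-after-health, or warm cached.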

License

The base model is governed by the Fish Audio Research License. Research and non-commercial use are permitted by Fish Audio under that license. Commercial use requires a separate license from Fish Audio.

This repository does not grant additional rights to the Fish Audio model weights.

Attribution

Built with Fish Audio S2-Pro. Fish Audio S2-Pro is developed by Fish Audio / 39 AI, INC.

Upstream model:

fishaudio/s2-pro

Repository: neko-legends/fish-audio-s2-realtime-optimized-5090 (base model: fishaudio/s2-pro)