Neko Legends local voice profile

Fish Audio S2-Pro Realtime Optimized for RTX 5090

model-card-only

A no-weights release documenting a local RTX 5090 serving profile for Fish Audio S2-Pro: Docker-backed SGLang Omni, CUDA graphs, cached reference audio, and a graph-safe decoder fallback for SM120.

Base model: fishaudio/s2-pro
Target GPU: RTX 5090
Server: SGLang Omni
Weights: not hosted
Goal: realtime TTS
First audio: 0.36s

This repository does not redistribute Fish Audio S2-Pro weights. Users must accept the upstream Fish Audio license and download the official checkpoint from fishaudio/s2-pro.

What This Is

Runtime profile

A local serving recipe for Fish Audio S2-Pro using Docker-backed SGLang Omni on a 32 GB RTX 5090.

No model fork

This release involves no fine-tuning, quantization, or checkpoint conversion, and it rehosts no upstream model files.

Live setup

Designed for single-user realtime local TTS. A separate GPU can handle realtime STT/ASR in a split workstation setup.

Why There Are No Weights Here

The local checkpoint files match the official upstream fishaudio/s2-pro release. To avoid duplicating gated model files or confusing the license boundary, this repository does not upload:

  • model-00001-of-00002.safetensors
  • model-00002-of-00002.safetensors
  • codec.pth
  • tokenizer/config files from the upstream model

Download the official files from:

https://huggingface.co/fishaudio/s2-pro

Local Benchmark Summary

Measured locally on an RTX 5090. Speedup is the headline; raw before/after values are included for reproducibility.

RTX 5090 before/after serving benchmark

[Figure: RTX 5090 benchmark bar chart for Fish Audio S2-Pro realtime serving]

Metric              | Before: Python Fish server | After: SGLang Omni + cached reference | Speedup
First audio         | 25.1s                      | 0.36s                                 | 69.9x faster, about 6,890% faster
Total request time  | 25.1s                      | 2.10s                                 | 12.0x faster, about 1,100% faster
Estimated RTF       | 5.51                       | 0.48                                  | 11.5x faster, about 1,050% faster

Benchmark notes:

  • Baseline was the local Python Fish server path.
  • Optimized path was the Docker-backed SGLang Omni path with a warm server and cached reference voice.
  • Warm cached SGLang generation ran faster than realtime (estimated RTF 0.48, below 1.0) for the measured short live TTS sample.
  • The chart is a local serving benchmark, not an upstream Fish Audio benchmark claim.
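The speedup and percent-faster columns follow from the before/after values by simple arithmetic. The sketch below shows that arithmetic, assuming the conventional definition of RTF (generation time divided by the duration of the audio produced). The function names are illustrative and not part of any Fish Audio or SGLang API. The first-audio row is left out of the loop because its published 69.9x was presumably computed from unrounded timings; the rounded table values alone give a slightly lower ratio.

```python
def speedup(before: float, after: float) -> tuple[float, float]:
    """Return (ratio, percent_faster) for a lower-is-better metric."""
    if before <= 0 or after <= 0:
        raise ValueError("timings must be positive")
    ratio = before / after
    percent_faster = (ratio - 1.0) * 100.0
    return ratio, percent_faster


def real_time_factor(generation_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent generating / duration of the audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return generation_seconds / audio_seconds


if __name__ == "__main__":
    # Rounded values copied from the table above.
    rows = [
        ("Total request time", 25.1, 2.10),
        ("Estimated RTF", 5.51, 0.48),
    ]
    for name, before, after in rows:
        ratio, pct = speedup(before, after)
        print(f"{name}: {ratio:.1f}x faster, about {pct:.0f}% faster")
```

An RTF below 1.0 means audio is generated faster than it plays back, which is the condition the warm cached path met in this benchmark.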

Optimization Profile

The measured realtime profile used:

  • SGLang Omni instead of the eager Python Fish server path.
  • The Docker image's pinned Torch/SGLang/FlashInfer stack.
  • SGLang CUDA graph replay enabled.
  • An SM120-safe (RTX 5090) Fish audio-decoder path, obtained by disabling the incompatible sgl-kernel KV-cache flash-attention path.
  • Graph-safe fixed-cache SDPA fallback for the Fish audio decoder.
  • flashinfer text attention backend.
  • Single-user live memory profile:
    • mem_fraction_static=0.50
    • chunked_prefill_size=2048
    • max_running_requests=4
  • Preloaded/cached reference VQ codes for repeated voice-reference requests.
  • Docker model/runtime volumes to avoid repeated slow checkpoint reads through Windows /mnt/d bind mounts.
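The cached-reference item in the list above can be pictured as a small in-process cache. The sketch below is a hypothetical helper, not the code in sglang-omni: the `encode` callable stands in for whatever turns a reference audio file into VQ codes, the file signature is assumed to be (absolute path, size, modification time), and multiple preload paths are assumed to be separated by `os.pathsep`.

```python
import os
from typing import Any, Callable, Dict, Tuple

# Assumed signature: (absolute path, size in bytes, mtime in nanoseconds).
Signature = Tuple[str, int, int]


def file_signature(path: str) -> Signature:
    """Cheap identity for a local file; changes when the file is rewritten."""
    st = os.stat(path)
    return (os.path.abspath(path), st.st_size, st.st_mtime_ns)


class ReferenceVQCache:
    """In-process cache of encoded reference audio (hypothetical helper)."""

    def __init__(self, encode: Callable[[str], Any]) -> None:
        self._encode = encode  # hypothetical: audio path -> VQ codes
        self._codes: Dict[Signature, Any] = {}

    def get(self, path: str) -> Any:
        key = file_signature(path)
        if key not in self._codes:
            # Slow path: runs once per distinct version of the file.
            self._codes[key] = self._encode(path)
        return self._codes[key]

    def preload_from_env(
        self, var: str = "FISH_S2PRO_PRELOAD_REFERENCE_PATHS"
    ) -> int:
        """Encode every existing file listed in the env var; return the count."""
        loaded = 0
        # Assumption: several paths may be joined with os.pathsep.
        for path in filter(None, os.environ.get(var, "").split(os.pathsep)):
            if os.path.isfile(path):
                self.get(path)
                loaded += 1
        return loaded
```

Keying on the file signature rather than the path alone means an overwritten reference file is re-encoded instead of silently serving stale codes. This sketch never evicts old entries, which is acceptable for a handful of voices but would need a bound for larger sets.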

Measured local VRAM after the tuned live restart was about 24.6 GB on the RTX 5090, down from an earlier near-full 32.2 GB SGLang container reservation.

Recommended RTX 5090 Environment

FISH_S2PRO_DISABLE_FLASH_ATTN_KVCACHE=1
FISH_S2PRO_KVCACHE_FALLBACK_BACKEND=sdpa
FISH_S2PRO_SGLANG_ATTENTION_BACKEND=flashinfer
FISH_S2PRO_SGLANG_DISABLE_CUDA_GRAPH=0
FISH_S2PRO_SGLANG_MEM_FRACTION_STATIC=0.50
FISH_S2PRO_SGLANG_CHUNKED_PREFILL_SIZE=2048
FISH_S2PRO_SGLANG_MAX_RUNNING_REQUESTS=4
FISH_S2PRO_PRELOAD_REFERENCE_PATHS=/path/to/reference.wav
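As a rough illustration of how a patched server might consume these variables, the sketch below parses them into a typed profile. The `RuntimeProfile` dataclass and `load_runtime` function are hypothetical names, and the defaults simply mirror the recommended values above; the real defaults in a given sglang-omni checkout may differ.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

PREFIX = "FISH_S2PRO_"


def _flag(value: str) -> bool:
    """Interpret common truthy strings such as "1" or "true"."""
    return value.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class RuntimeProfile:
    """Hypothetical container for the settings listed above."""
    disable_flash_attn_kvcache: bool
    kvcache_fallback_backend: str
    attention_backend: str
    disable_cuda_graph: bool
    mem_fraction_static: float
    chunked_prefill_size: int
    max_running_requests: int


def load_runtime(env: Optional[Mapping[str, str]] = None) -> RuntimeProfile:
    """Read the FISH_S2PRO_* variables, falling back to the recommended values."""
    env = os.environ if env is None else env

    def get(name: str, default: str) -> str:
        return env.get(PREFIX + name, default)

    profile = RuntimeProfile(
        disable_flash_attn_kvcache=_flag(get("DISABLE_FLASH_ATTN_KVCACHE", "1")),
        kvcache_fallback_backend=get("KVCACHE_FALLBACK_BACKEND", "sdpa"),
        attention_backend=get("SGLANG_ATTENTION_BACKEND", "flashinfer"),
        disable_cuda_graph=_flag(get("SGLANG_DISABLE_CUDA_GRAPH", "0")),
        mem_fraction_static=float(get("SGLANG_MEM_FRACTION_STATIC", "0.50")),
        chunked_prefill_size=int(get("SGLANG_CHUNKED_PREFILL_SIZE", "2048")),
        max_running_requests=int(get("SGLANG_MAX_RUNNING_REQUESTS", "4")),
    )
    # Fail early on values that cannot be valid.
    if not 0.0 < profile.mem_fraction_static <= 1.0:
        raise ValueError("mem_fraction_static must be in (0, 1]")
    if profile.chunked_prefill_size <= 0 or profile.max_running_requests <= 0:
        raise ValueError("sizes and request limits must be positive")
    return profile
```

Validating at startup keeps a typo in the environment from surfacing later as an out-of-memory error or a silently ignored setting.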

Recommended serve shape:

sgl-omni serve \
  --model-path /path/to/s2-pro \
  --config examples/configs/s2pro_tts.yaml \
  --host 0.0.0.0 \
  --port 8000

For AI Agents

If a user asks you to apply the RTX 5090 realtime optimizations from this model card, treat this repository as documentation for a local runtime setup. Do not edit, quantize, rehost, or replace the Fish Audio S2-Pro weights.

Agent checklist:

  1. Confirm the user has accepted the upstream Fish Audio license and has the official fishaudio/s2-pro checkpoint locally.
  2. If weights are missing, instruct the user to download them from https://huggingface.co/fishaudio/s2-pro; do not download weights from this repository.
  3. Locate or clone a writable sgl-project/sglang-omni checkout.
  4. Preserve the serving image's pinned Torch/SGLang/FlashInfer stack. If you need an editable local checkout inside Docker, install it without replacing the pinned dependency stack.
  5. Apply the runtime optimizations to the local SGLang Omni FishAudio S2-Pro implementation, not to the model checkpoint.

Patch targets and intent:

  • In sglang_omni/models/fishaudio_s2_pro/fish_speech/models/text2semantic/modeling.py: add FISH_S2PRO_DISABLE_FLASH_ATTN_KVCACHE and FISH_S2PRO_KVCACHE_FALLBACK_BACKEND.
  • In that same file, when FISH_S2PRO_DISABLE_FLASH_ATTN_KVCACHE=1, bypass the sgl-kernel / flash_attn_with_kvcache path for the Fish audio decoder and use a graph-safe fixed-cache SDPA fallback.
  • The fallback should update the K/V cache at the current cache_seqlens, repeat KV heads to match query heads, mask positions beyond cache_seqlens, call torch.nn.functional.scaled_dot_product_attention, and return the normal attention output shape.
  • Add non-persistent batch_indices and key_positions buffers to the Fish decoder KVCache helper so the fallback is CUDA-graph friendly.
  • In sglang_omni/models/fishaudio_s2_pro/pipeline/stages.py: make SGLang ServerArgs configurable from env vars: FISH_S2PRO_SGLANG_ATTENTION_BACKEND, FISH_S2PRO_SGLANG_DISABLE_CUDA_GRAPH, FISH_S2PRO_SGLANG_MEM_FRACTION_STATIC, FISH_S2PRO_SGLANG_CHUNKED_PREFILL_SIZE, and FISH_S2PRO_SGLANG_MAX_RUNNING_REQUESTS.
  • In the preprocessing stage in that same file, add an in-process reference VQ cache keyed by reference audio path and local file signature. Preload any paths from FISH_S2PRO_PRELOAD_REFERENCE_PATHS, and use cached VQ codes for repeated audio_path references.
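The fallback described in the patch targets above can be sketched in PyTorch as follows. This is a minimal illustration for a single decode step, not the code in modeling.py: the `FixedKVCache` class, the function name, and the tensor layout (batch, sequence, heads, head_dim, with `cache_seqlens` holding the number of tokens already cached) are assumptions. Only `torch.nn.functional.scaled_dot_product_attention` and the buffer names `batch_indices` and `key_positions` come from the description.

```python
import torch
import torch.nn.functional as F


class FixedKVCache(torch.nn.Module):
    """Hypothetical fixed-size KV cache with preallocated index buffers."""

    def __init__(self, max_batch, max_seq, n_kv_heads, head_dim,
                 dtype=torch.float32):
        super().__init__()
        shape = (max_batch, max_seq, n_kv_heads, head_dim)
        self.register_buffer("k_cache", torch.zeros(shape, dtype=dtype),
                             persistent=False)
        self.register_buffer("v_cache", torch.zeros(shape, dtype=dtype),
                             persistent=False)
        # Preallocated so a decode step creates no new index tensors.
        self.register_buffer("batch_indices", torch.arange(max_batch),
                             persistent=False)
        self.register_buffer("key_positions", torch.arange(max_seq),
                             persistent=False)


def sdpa_decode_fallback(q, k_new, v_new, cache, cache_seqlens):
    """One decode step.

    q: (B, 1, Hq, D); k_new, v_new: (B, 1, Hkv, D);
    cache_seqlens: (B,) count of tokens already in the cache.
    Returns the attention output with shape (B, 1, Hq, D).
    """
    bsz, _, n_q_heads, _ = q.shape
    n_kv_heads = k_new.shape[2]
    rows = cache.batch_indices[:bsz]
    pos = cache_seqlens.long()

    # 1. Write the new K/V at each sequence's current length.
    cache.k_cache[rows, pos] = k_new[:, 0]
    cache.v_cache[rows, pos] = v_new[:, 0]

    # 2. (B, S, Hkv, D) -> (B, Hq, S, D), repeating KV heads per query group.
    rep = n_q_heads // n_kv_heads
    k = cache.k_cache[:bsz].transpose(1, 2).repeat_interleave(rep, dim=1)
    v = cache.v_cache[:bsz].transpose(1, 2).repeat_interleave(rep, dim=1)

    # 3. Boolean mask (True = attend) that hides slots past the new token.
    mask = (cache.key_positions[None, :] <= pos[:, None])[:, None, None, :]

    # 4. Attend over the full fixed-size cache; shapes never vary per step.
    out = F.scaled_dot_product_attention(q.transpose(1, 2), k, v,
                                         attn_mask=mask)
    return out.transpose(1, 2)
```

Because the cache, mask, and index buffers keep the same shapes on every step, the step is intended to be replayable under CUDA graphs; whether a particular Torch build captures it cleanly on SM120 should be verified locally rather than assumed from this sketch.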

Docker/runtime notes for agents:

  • Use NVIDIA GPU passthrough and select the RTX 5090 explicitly when multiple GPUs are visible.
  • Prefer Docker volumes or Linux-native storage for the model and runtime cache; repeated checkpoint reads through a Windows /mnt/d bind mount can add many minutes to cold startup.
  • Keep CUDA graphs enabled unless debugging compatibility.
  • Do not switch to trtllm_mha on RTX 5090 / SM120 unless upstream support has changed; local tests rejected that backend for this GPU.
  • Benchmark after every change with the same prompt and reference. Report first-audio latency, total request time, estimated audio duration, RTF, VRAM, and whether the run was cold, first-after-health, or warm cached.
  • Expected warm target on the measured machine: first audio under 1 second and RTF below 1.0 for the short live TTS benchmark.
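The reporting items in the list above can be collected with a small harness. The sketch below assumes the server's response can be consumed as an iterable of raw PCM byte chunks and that the caller knows the sample rate; both are assumptions, as are the `benchmark_stream` and `TTSBenchmark` names. Pass a generator that sends the request when first iterated, so the clock starts before the request does.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class TTSBenchmark:
    first_audio_s: float  # latency until the first non-empty chunk
    total_s: float        # wall time until the stream ended
    audio_s: float        # estimated duration of the audio received

    @property
    def rtf(self) -> float:
        """Real-time factor: below 1.0 means faster than playback."""
        return self.total_s / self.audio_s


def benchmark_stream(
    chunks: Iterable[bytes],
    sample_rate: int,
    bytes_per_sample: int = 2,  # assumption: 16-bit PCM
    channels: int = 1,          # assumption: mono output
    clock: Callable[[], float] = time.perf_counter,
) -> TTSBenchmark:
    """Time a stream of audio chunks and estimate its RTF."""
    start = clock()
    first_audio = None
    n_bytes = 0
    for chunk in chunks:
        if first_audio is None and chunk:
            first_audio = clock() - start
        n_bytes += len(chunk)
    total = clock() - start
    if first_audio is None:
        raise ValueError("stream produced no audio")
    audio_s = n_bytes / (sample_rate * bytes_per_sample * channels)
    return TTSBenchmark(first_audio, total, audio_s)
```

Record alongside each result whether the run was cold, first-after-health, or warm cached, and read VRAM separately (for example from nvidia-smi), since this harness only measures timing.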

License

The base model is governed by the Fish Audio Research License. Research and non-commercial use are permitted by Fish Audio under that license. Commercial use requires a separate license from Fish Audio.

This repository does not grant additional rights to the Fish Audio model weights.

Attribution

Built with Fish Audio S2-Pro. Fish Audio S2-Pro is developed by Fish Audio / 39 AI, INC.

Upstream model:

fishaudio/s2-pro