Kokoro-82M Streaming ONNX

Fixed-bucket ONNX exports of hexgrad/Kokoro-82M for low-latency local streaming TTS with FastKokoro and ONNX Runtime.

These checkpoints use static token buckets, length-aware padding masks, decoder FP16 weights, opset 17, and graph patches that remove the previous dynamic tail and scatter-heavy export patterns. The current family uses 480 output samples per predicted alignment frame plus a 3120-sample fixed tail margin.

TensorRT EP with engine/timing cache is the recommended GPU runtime. The models also run with CUDA EP and CPU EP, but the latency expectations below are from TensorRT cache-hit measurements.

Files

onnx/kokoro-82m-streaming-b64-fp16.onnx is the recommended default: it favors smoother short-phrase continuity while still measuring about 105 ms cache-hit model-call latency with TensorRT EP on the measured GTX 1650 host. b48 remains the lower-latency balanced option. b24 is available for extremely low latency when 2-3 word chunks are acceptable. b16 is experimental and should be treated as a single-word checkpoint.

| File | Bucket | Intended use |
| --- | --- | --- |
| onnx/kokoro-82m-streaming-b16-fp16.onnx | 16 | Lowest latency, single words only |
| onnx/kokoro-82m-streaming-b24-fp16.onnx | 24 | Ultra-low-latency opt-in |
| onnx/kokoro-82m-streaming-b32-fp16.onnx | 32 | Short phrases |
| onnx/kokoro-82m-streaming-b48-fp16.onnx | 48 | Balanced low-latency option |
| onnx/kokoro-82m-streaming-b64-fp16.onnx | 64 | Recommended default |
| onnx/kokoro-82m-streaming-b96-fp16.onnx | 96 | Recommended large bucket |
| onnx/kokoro-82m-streaming-b128-fp16.onnx | 128 | Optional long-continuity bucket |

voices.npz - FastKokoro voice/style embeddings
voices.txt - Voice index compatible with Kokoro/NVIDIA voice ordering

Model Inputs and Outputs

Each bucket has the same tensor contract; only token width, duration width, and fixed waveform length change.

| Input tensor | Shape | Type | Description |
| --- | --- | --- | --- |
| input_ids | [1, bucket] | int64 | Token IDs padded to the selected bucket |
| style | [1, 256] | float32 | Voice/style embedding from voices.npz |
| speed | [1] | float32 | Speed factor, usually 1.0 |
| input_lengths | [1] | int64 | Number of valid tokens before padding |

| Output tensor | Shape | Type | Description |
| --- | --- | --- | --- |
| waveform | [samples] | float32 | Raw 24 kHz mono waveform |
| duration | [bucket] | int64 | Predicted token durations |
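
As a rough illustration of the input contract above, the sketch below pads a phoneme token list to a bucket and derives input_lengths. The helper name pad_to_bucket is hypothetical. It assumes the two reserved positions are a leading and a trailing 0 (as in upstream Kokoro), that input_lengths counts them, and that the padding ID is 0; FastKokoro's own preprocessing may differ, so treat this as a sketch rather than the reference behavior.

```python
from typing import List, Tuple


def pad_to_bucket(
    token_ids: List[int], bucket: int, pad_id: int = 0
) -> Tuple[List[List[int]], List[int]]:
    """Return (input_ids, input_lengths) shaped [1, bucket] and [1]."""
    # Assumption: reserved positions are a leading and trailing 0 token.
    wrapped = [0, *token_ids, 0]
    if len(wrapped) > bucket:
        raise ValueError(
            f"{len(token_ids)} tokens exceed the {bucket - 2} usable tokens "
            f"of bucket {bucket}"
        )
    # Assumption: padding uses pad_id and input_lengths includes the wrappers.
    padded = wrapped + [pad_id] * (bucket - len(wrapped))
    return [padded], [len(wrapped)]
```

The nested lists can be converted to int64 arrays (for example with numpy.asarray(..., dtype="int64")) before being passed to an ONNX Runtime session together with style and speed.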

Bucket Guidance

Two token positions are reserved by the model, so usable text capacity is bucket - 2 phoneme tokens. Practical word capacity is lower and depends on language, punctuation, voice, and speed because the model also predicts duration. The table below is the observed safe expectation from English/Portuguese probes at speed 0.85; speed 1.0 can fit roughly one extra word in some buckets. The default sentence strategy uses this conservative capacity before falling back to the model's real phonemized token width, so long sentences are split before late words land at the tail of a near-full fixed-output window.
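
The hard token ceiling described above (bucket - 2) can be sketched as a simple lookup. The function name smallest_bucket_for is hypothetical and is not part of FastKokoro. It only checks token width, so it is an upper bound; the practical word capacity discussed above is lower.

```python
from typing import Iterable, Optional

# Bucket sizes published in this repository.
BUCKETS = (16, 24, 32, 48, 64, 96, 128)
RESERVED_TOKENS = 2  # two positions are reserved by the model


def smallest_bucket_for(
    n_tokens: int, buckets: Iterable[int] = BUCKETS
) -> Optional[int]:
    """Return the smallest bucket whose usable width fits n_tokens, else None."""
    for bucket in sorted(buckets):
        if n_tokens <= bucket - RESERVED_TOKENS:
            return bucket
    return None  # the text must be split before synthesis
```

For example, a 62-token phrase fits b64, while a 63-token phrase already needs b96 or a split.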

| Bucket | Usable tokens | Alignment frames | Output samples | Expected words | TensorRT p50 latency | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| 16 | 14 | 40 | 22,320 | 1-2 | 33 ms | Experimental; single words only |
| 24 | 22 | 56 | 30,000 | 3 | 45 ms | Ultra-low-latency opt-in |
| 32 | 30 | 72 | 37,680 | 4-5 | 54 ms | Short phrases |
| 48 | 46 | 104 | 53,040 | 6 | 76 ms | Balanced low-latency option |
| 64 | 62 | 136 | 68,400 | 6 | 105 ms | Recommended default |
| 96 | 94 | 200 | 99,120 | 14 | 134 ms | Recommended large bucket |
| 128 | 126 | 264 | 129,840 | 18 | 272 ms | Optional long-continuity bucket |
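
The token, frame, and sample columns follow from the export arithmetic used for this family (2 * bucket + 8 alignment frames, 480 samples per frame, 3120-sample tail margin). A small sketch that reproduces them, with bucket_geometry as a hypothetical helper name:

```python
SAMPLE_RATE = 24_000  # output waveform is 24 kHz mono


def bucket_geometry(
    bucket: int, samples_per_frame: int = 480, tail_margin: int = 3120
) -> dict:
    """Derive the fixed shapes of one bucket from its token width."""
    frames = 2 * bucket + 8
    samples = frames * samples_per_frame + tail_margin
    return {
        "usable_tokens": bucket - 2,
        "alignment_frames": frames,
        "output_samples": samples,
        # Length of the fixed output window, not of the spoken audio.
        "window_seconds": samples / SAMPLE_RATE,
    }


for b in (16, 24, 32, 48, 64, 96, 128):
    print(b, bucket_geometry(b))
```

Note that window_seconds is the size of the fixed output buffer; the audible speech inside it is usually shorter.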

Latency was measured on a GTX 1650 (SM75), TensorRT 10.11 / ONNX Runtime 1.22, with TensorRT engine and timing cache already built. First-time TensorRT engine builds took roughly 2-4 minutes per bucket on that host.

Quickstart

This repository is intended to be used with msgflux/fastkokoro, an OpenAI-compatible local TTS server for Kokoro ONNX models.

Install FastKokoro with an ONNX Runtime backend:

uv pip install "fastkokoro[gpu]"

For CPU-only environments:

uv pip install "fastkokoro[cpu]"

Start the server:

fastkokoro

Generate speech through the OpenAI-compatible endpoint:

curl http://localhost:8880/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "model": "kokoro",
    "voice": "pf_dora",
    "input": "Oi, tudo bem? Este audio esta sendo gerado com o bucket streaming b64.",
    "response_format": "pcm"
  }' \
  --output speech.pcm
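
The same request can be issued from Python using only the standard library. This is a minimal sketch that mirrors the curl call above; the function names build_speech_request and synthesize are illustrative, and the base URL assumes the server is listening on localhost:8880 as in the curl example.

```python
import json
import urllib.request


def build_speech_request(
    text: str,
    voice: str = "pf_dora",
    base_url: str = "http://localhost:8880",
    response_format: str = "pcm",
) -> urllib.request.Request:
    """Build the POST request for the OpenAI-compatible speech endpoint."""
    payload = {
        "model": "kokoro",
        "voice": voice,
        "input": text,
        "response_format": response_format,
    }
    return urllib.request.Request(
        f"{base_url}/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize(text: str, out_path: str, **kwargs) -> None:
    """Send the request and write the response body to out_path in chunks."""
    request = build_speech_request(text, **kwargs)
    with urllib.request.urlopen(request) as response, open(out_path, "wb") as fh:
        while chunk := response.read(4096):
            fh.write(chunk)
```

With the server running, synthesize("Oi, tudo bem?", "speech.pcm") writes the same kind of file as the curl command.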

Configuration

Recent FastKokoro releases use this repository as the default streaming model. If you need to configure it explicitly, set:

FASTKOKORO_MODEL_REPO=msgflux/Kokoro-82M-streaming-onnx
FASTKOKORO_MODEL_FILE=onnx/kokoro-82m-streaming-b64-fp16.onnx
FASTKOKORO_VOICES_FILE=voices.npz

TensorRT EP with CUDA and CPU fallback:

FASTKOKORO_ONNX_PROVIDERS=TensorrtExecutionProvider,CUDAExecutionProvider,CPUExecutionProvider
FASTKOKORO_ONNX_PROVIDER_OPTIONS='{"TensorrtExecutionProvider":{"trt_engine_cache_enable":"True","trt_engine_cache_path":"/models/trt-cache","trt_timing_cache_enable":"True","trt_timing_cache_path":"/models/trt-cache"}}'

CUDA EP fallback:

FASTKOKORO_ONNX_PROVIDERS=CUDAExecutionProvider,CPUExecutionProvider
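
If you load a checkpoint directly with ONNX Runtime instead of going through FastKokoro, the same two variables can be turned into the providers argument of onnxruntime.InferenceSession, which accepts provider names or (name, options) tuples in priority order. The helper below is a sketch of that translation, not FastKokoro's own parsing code; providers_from_env is a hypothetical name.

```python
import json
import os
from typing import Mapping, Optional


def providers_from_env(env: Optional[Mapping[str, str]] = None) -> list:
    """Translate the FASTKOKORO_ONNX_* variables into an ORT providers list."""
    env = os.environ if env is None else env
    raw_names = env.get("FASTKOKORO_ONNX_PROVIDERS", "CPUExecutionProvider")
    names = [name.strip() for name in raw_names.split(",") if name.strip()]
    options = json.loads(env.get("FASTKOKORO_ONNX_PROVIDER_OPTIONS") or "{}")
    # Providers with options become (name, options) tuples; others stay strings.
    return [(n, options[n]) if n in options else n for n in names]


# Usage (requires onnxruntime and a downloaded checkpoint):
#   import onnxruntime as ort
#   session = ort.InferenceSession(
#       "onnx/kokoro-82m-streaming-b64-fp16.onnx",
#       providers=providers_from_env(),
#   )
```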

To choose a different latency/continuity tradeoff, set FASTKOKORO_MODEL_FILE to exactly one of the following:

FASTKOKORO_MODEL_FILE=onnx/kokoro-82m-streaming-b16-fp16.onnx
FASTKOKORO_MODEL_FILE=onnx/kokoro-82m-streaming-b24-fp16.onnx
FASTKOKORO_MODEL_FILE=onnx/kokoro-82m-streaming-b32-fp16.onnx
FASTKOKORO_MODEL_FILE=onnx/kokoro-82m-streaming-b48-fp16.onnx
FASTKOKORO_MODEL_FILE=onnx/kokoro-82m-streaming-b64-fp16.onnx
FASTKOKORO_MODEL_FILE=onnx/kokoro-82m-streaming-b96-fp16.onnx
FASTKOKORO_MODEL_FILE=onnx/kokoro-82m-streaming-b128-fp16.onnx

Use response_format: "pcm" for the lowest-overhead streaming response. Use wav when you need a self-contained audio file with headers. Pass lang in each request to force a specific language or locale, for example pt-br, en-us, ja-jp, or es-es.
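
If you request pcm and later need a playable file, the raw stream can be wrapped in a WAV header with the standard library. This sketch assumes the PCM response is 16-bit signed little-endian, 24 kHz, mono (the usual convention for OpenAI-style pcm output); verify the sample format against your FastKokoro version. pcm_to_wav is an illustrative helper name.

```python
import wave


def pcm_to_wav(
    pcm_bytes: bytes,
    wav_path: str,
    sample_rate: int = 24_000,
    channels: int = 1,
    sample_width: int = 2,  # bytes per sample; 2 assumes 16-bit PCM
) -> None:
    """Wrap raw PCM bytes in a WAV container without touching the samples."""
    with wave.open(wav_path, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(sample_width)
        wav.setframerate(sample_rate)
        wav.writeframes(pcm_bytes)
```

For example, reading speech.pcm into memory and calling pcm_to_wav(data, "speech.wav") produces a self-contained file.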

FastKokoro uses FASTKOKORO_STREAM_STRATEGY=sentence by default. It synthesizes one sentence at a time and only splits further when the loaded ONNX bucket's real phonemized token width requires it. The adaptive and chunk strategies remain available for lower first-audio latency, but they use scheduled word-boundary chunks and can sound less natural.

FASTKOKORO_STREAM_MAX_SEGMENT_WORDS and FASTKOKORO_STREAM_MAX_SEGMENT_CHARS are intentionally left unset by default; set them only when you want an explicit user override for the scheduled strategies. Static buckets still cap each request to their safe token-width capacity.

PCM streaming inserts silence between adjacent generated text segments (FASTKOKORO_STREAM_BOUNDARY_SILENCE_MS=80 by default), which keeps split transitions from sounding glued together. Explicit [pause:...] tags control their own silence.
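
To give a sense of scale for the boundary silence, the sketch below computes the byte length of an 80 ms gap and joins PCM segments with it. It illustrates the effect only and is not FastKokoro's implementation; both helper names are hypothetical, and the sample format is assumed to be 16-bit mono at 24 kHz.

```python
from typing import Iterable


def silence_bytes(
    ms: int, sample_rate: int = 24_000, sample_width: int = 2, channels: int = 1
) -> bytes:
    """Return ms milliseconds of digital silence as raw PCM bytes."""
    frames = sample_rate * ms // 1000
    return b"\x00" * (frames * sample_width * channels)


def join_with_silence(segments: Iterable[bytes], ms: int = 80) -> bytes:
    """Concatenate PCM segments, inserting silence between neighbors only."""
    return silence_bytes(ms).join(segments)
```

Under these assumptions, 80 ms corresponds to 1920 samples, or 3840 bytes, per boundary.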

Export Recipe

The current checkpoints were exported with PyTorch 2.5.1, Transformers 4.48.3, opset 17, legacy ONNX export, decoder-only FP16, and Albert kept in FP32. Do not use global --precision fp16; it converts Albert to FP16 and can break downstream models.

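For a bucket B, the fixed alignment frame count is 2 * B + 8, and the fixed output sample count is that frame count times 480 plus the 3120-sample tail margin. The default release bucket is 64.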

B=64
ALIGN=$((2 * B + 8))
SAMPLES=$((ALIGN * 480 + 3120))

uv run \
  --with torch==2.5.1 \
  --with transformers==4.48.3 \
  --with onnx \
  --with numpy \
  --with huggingface-hub \
  --with loguru \
  --with 'misaki[en]>=0.9.4' \
  python scripts/export_kokoro_torch_ttfc.py \
    --kokoro-repo demo-output/reexport/hexgrad-kokoro \
    --output "demo-output/reexport/family-frame480-margin3120/kokoro-82m-streaming-b${B}-decoder-fp16-frame480-margin3120.onnx" \
    --bucket "$B" \
    --fixed-alignment-frames "$ALIGN" \
    --fixed-output-samples "$SAMPLES" \
    --output-samples-per-frame 480 \
    --output-tail-margin-samples 3120 \
    --precision decoder-fp16 \
    --opset 17 \
    --legacy-export \
    --length-aware \
    --patch-fixed-lstm \
    --patch-scatterless-sine-source \
    --patch-split-adain \
    --patch-albert-sdpa-bool-mask-scale \
    --device cuda

All exported checkpoints in this family have 2707 ONNX nodes and 593 initializers.

Opset 24 / Attention Note

TensorRT 10.16 can parse ONNX opset 24 Attention when --stronglyTyped is enabled, but TensorRT's native IAttention layer requires GPUs with CUDA capability 8.0 or higher. To keep the published checkpoints usable on older hardware such as Turing/SM75, this repository intentionally publishes the opset 17 graph family instead of an opset 24 Attention export.

Voices and Languages

The included voice list follows Kokoro's language-prefix convention.

| Language | Request lang values | Voices |
| --- | --- | --- |
| American English | a, en-us, en | af_*, am_* |
| British English | b, en-gb | bf_*, bm_* |
| Japanese | j, ja, ja-jp | jf_*, jm_* |
| Mandarin Chinese | z, zh, zh-cn | zf_*, zm_* |
| Spanish | e, es, es-es | ef_dora, em_alex |
| French | f, fr, fr-fr | ff_siwis |
| Hindi | h, hi, hi-in | hf_*, hm_* |
| Italian | i, it, it-it | if_sara, im_nicola |
| Brazilian Portuguese | p, pt, pt-br | pf_dora, pm_alex, pm_santa |

See voices.txt for the exact voice ordering.
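
Because the first letter of a voice name encodes its language, a client can derive a lang value from the voice when none is given. The mapping below is taken from the table above; lang_for_voice is a hypothetical helper and does not describe how FastKokoro resolves languages internally.

```python
# First letter of the voice name -> one of the accepted request lang values.
VOICE_PREFIX_TO_LANG = {
    "a": "en-us",
    "b": "en-gb",
    "j": "ja-jp",
    "z": "zh-cn",
    "e": "es-es",
    "f": "fr-fr",
    "h": "hi-in",
    "i": "it-it",
    "p": "pt-br",
}


def lang_for_voice(voice: str) -> str:
    """Map a voice such as 'pf_dora' to a request lang value."""
    prefix = voice[:1].lower()
    if prefix not in VOICE_PREFIX_TO_LANG:
        raise ValueError(f"unknown language prefix in voice {voice!r}")
    return VOICE_PREFIX_TO_LANG[prefix]
```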
