Kokoro-82M Streaming ONNX

Fixed-bucket ONNX exports of hexgrad/Kokoro-82M for low-latency local streaming TTS with FastKokoro and ONNX Runtime.

These checkpoints use static token buckets, length-aware padding masks, FP16 decoder weights, opset 17, and graph patches that remove the dynamic tail and scatter-heavy patterns of earlier exports. The current family emits 480 output samples per predicted alignment frame plus a fixed 3120-sample tail margin.

TensorRT EP with the engine and timing caches enabled is the recommended GPU runtime. The models also run with CUDA EP and CPU EP, but the latency figures below come from TensorRT cache-hit measurements.

Files

onnx/kokoro-82m-streaming-b64-fp16.onnx is the recommended default: it prioritizes smoother short-phrase continuity while still measuring about 105 ms cache-hit model-call latency with TensorRT EP on the GTX 1650 test host. b48 remains the balanced lower-latency option. b24 is available for extremely low latency when 2-3 word chunks are acceptable. b16 is experimental and should be treated as a single-word checkpoint.

File	Bucket	Intended use
`onnx/kokoro-82m-streaming-b16-fp16.onnx`	16	Lowest latency, single words only
`onnx/kokoro-82m-streaming-b24-fp16.onnx`	24	Ultra-low-latency opt-in
`onnx/kokoro-82m-streaming-b32-fp16.onnx`	32	Short phrases
`onnx/kokoro-82m-streaming-b48-fp16.onnx`	48	Balanced low-latency option
`onnx/kokoro-82m-streaming-b64-fp16.onnx`	64	Recommended default
`onnx/kokoro-82m-streaming-b96-fp16.onnx`	96	Recommended large bucket
`onnx/kokoro-82m-streaming-b128-fp16.onnx`	128	Optional long-continuity bucket
`voices.npz`	-	FastKokoro voice/style embeddings
`voices.txt`	-	Voice index compatible with Kokoro/NVIDIA voice ordering

Model Inputs and Outputs

Each bucket has the same tensor contract; only token width, duration width, and fixed waveform length change.

Tensor	Shape	Type	Description
`input_ids`	`[1, bucket]`	`int64`	Token IDs padded to the selected bucket
`style`	`[1, 256]`	`float32`	Voice/style embedding from `voices.npz`
`speed`	`[1]`	`float32`	Speed factor, usually `1.0`
`input_lengths`	`[1]`	`int64`	Number of valid tokens before padding

Output	Shape	Type	Description
`waveform`	`[samples]`	`float32`	Raw 24 kHz mono waveform
`duration`	`[bucket]`	`int64`	Predicted token durations
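
The input contract above can be sketched with plain Python lists. This is a minimal illustration, not FastKokoro's implementation: `build_feed` is a hypothetical helper, and it assumes that the pad ID is 0, that the two reserved positions hold a leading and trailing boundary token with ID 0, and that `input_lengths` counts those boundary tokens. Verify each assumption against FastKokoro before relying on it.

```python
from typing import Dict, List

PAD_ID = 0       # assumption: padding token ID
BOUNDARY_ID = 0  # assumption: ID stored in the two reserved positions


def build_feed(phoneme_ids: List[int], style: List[float],
               bucket: int = 64, speed: float = 1.0) -> Dict[str, list]:
    """Return plain-list inputs shaped like the tensor table above."""
    capacity = bucket - 2  # two positions are reserved by the model
    if len(phoneme_ids) > capacity:
        raise ValueError(f"{len(phoneme_ids)} tokens exceed capacity {capacity}")
    if len(style) != 256:
        raise ValueError("style embedding must have 256 values")
    tokens = [BOUNDARY_ID, *phoneme_ids, BOUNDARY_ID]
    padded = tokens + [PAD_ID] * (bucket - len(tokens))
    return {
        "input_ids": [padded],           # [1, bucket], int64
        "style": [list(style)],          # [1, 256], float32
        "speed": [speed],                # [1], float32
        "input_lengths": [len(tokens)],  # [1], int64 (assumed to include boundaries)
    }


# Hypothetical phoneme IDs and an all-zero style vector, for illustration only.
feed = build_feed([43, 50, 83, 54, 156], style=[0.0] * 256, bucket=16)
print(len(feed["input_ids"][0]), feed["input_lengths"])
```

Before a call to ONNX Runtime, each value would still need to be converted to an array with the dtype listed in the table, for example with `numpy.asarray`.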

Bucket Guidance

Two token positions are reserved by the model, so usable text capacity is bucket - 2 phoneme tokens. Practical word capacity is lower and depends on language, punctuation, voice, and speed, because the predicted durations must also fit the fixed output window. The table below lists the observed safe capacity from English and Portuguese probes at speed 0.85; at speed 1.0 some buckets fit roughly one extra word. The default sentence strategy applies this conservative word capacity first and then falls back to the model's real phonemized token width, so long sentences are split before late words land at the tail of a nearly full fixed-output window.

Bucket	Usable tokens	Alignment frames	Output samples	Expected words	TensorRT p50	Notes
16	14	40	22,320	1-2	33 ms	Experimental; single words only
24	22	56	30,000	3	45 ms	Ultra-low-latency opt-in
32	30	72	37,680	4-5	54 ms	Short phrases
48	46	104	53,040	6	76 ms	Balanced low-latency option
64	62	136	68,400	6	105 ms	Recommended default
96	94	200	99,120	14	134 ms	Recommended large bucket
128	126	264	129,840	18	272 ms	Optional long-continuity bucket

Latency was measured on a GTX 1650 (SM75), TensorRT 10.11 / ONNX Runtime 1.22, with TensorRT engine and timing cache already built. First-time TensorRT engine builds took roughly 2-4 minutes per bucket on that host.
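
The geometry columns in the table follow from three constants: 2 * bucket + 8 alignment frames (the formula used by the export recipe), 480 samples per frame, and a 3120-sample tail margin. The sketch below recomputes them and relates the p50 latency to the length of the fixed output window. `bucket_geometry` is a hypothetical helper, and the window length is the fixed waveform size, not the duration of the speech inside it.

```python
SAMPLE_RATE = 24_000     # waveform output is 24 kHz mono
SAMPLES_PER_FRAME = 480  # output samples per alignment frame
TAIL_MARGIN = 3_120      # fixed tail margin in samples

# TensorRT p50 latency in milliseconds, copied from the table above.
P50_MS = {16: 33, 24: 45, 32: 54, 48: 76, 64: 105, 96: 134, 128: 272}


def bucket_geometry(bucket: int) -> dict:
    """Derive the fixed shapes of one bucket from its token width."""
    frames = 2 * bucket + 8
    samples = frames * SAMPLES_PER_FRAME + TAIL_MARGIN
    return {
        "usable_tokens": bucket - 2,
        "alignment_frames": frames,
        "output_samples": samples,
        "window_seconds": samples / SAMPLE_RATE,
    }


for bucket, p50_ms in P50_MS.items():
    g = bucket_geometry(bucket)
    # Ratio of model-call latency to the length of the fixed window it fills.
    ratio = (p50_ms / 1000) / g["window_seconds"]
    print(f"b{bucket}: {g['alignment_frames']} frames, "
          f"{g['output_samples']} samples, "
          f"{g['window_seconds']:.2f} s window, latency/window = {ratio:.3f}")
```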

Quickstart

This repository is intended to be used with msgflux/fastkokoro, an OpenAI-compatible local TTS server for Kokoro ONNX models.

Install FastKokoro with an ONNX Runtime backend:

uv pip install "fastkokoro[gpu]"

For CPU-only environments:

uv pip install "fastkokoro[cpu]"

Start the server:

fastkokoro

Generate speech through the OpenAI-compatible endpoint:

curl http://localhost:8880/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "model": "kokoro",
    "voice": "pf_dora",
    "input": "Oi, tudo bem? Este audio esta sendo gerado com o bucket streaming b64.",
    "response_format": "pcm"
  }' \
  --output speech.pcm
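
The same request can be issued from Python with only the standard library. This is a sketch that mirrors the curl call above; `build_speech_request` and `synthesize_to_file` are hypothetical helper names, and the base URL assumes the server listens on port 8880 as in the curl example.

```python
import json
import urllib.request


def build_speech_request(text: str, voice: str = "pf_dora",
                         base_url: str = "http://localhost:8880",
                         response_format: str = "pcm") -> urllib.request.Request:
    """Build the POST request for the OpenAI-compatible speech endpoint."""
    payload = {
        "model": "kokoro",
        "voice": voice,
        "input": text,
        "response_format": response_format,
    }
    return urllib.request.Request(
        f"{base_url}/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize_to_file(text: str, path: str, **kwargs) -> None:
    """Send the request and write the response body to disk in chunks."""
    request = build_speech_request(text, **kwargs)
    with urllib.request.urlopen(request) as response, open(path, "wb") as out:
        while chunk := response.read(4096):
            out.write(chunk)


request = build_speech_request("Oi, tudo bem?")
print(request.get_method(), request.full_url)
```

Calling `synthesize_to_file("Oi, tudo bem?", "speech.pcm")` requires a running FastKokoro server; building the request does not.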

Configuration

Recent FastKokoro releases use this repository as the default streaming model. If you need to configure it explicitly, set:

FASTKOKORO_MODEL_REPO=msgflux/Kokoro-82M-streaming-onnx
FASTKOKORO_MODEL_FILE=onnx/kokoro-82m-streaming-b64-fp16.onnx
FASTKOKORO_VOICES_FILE=voices.npz

TensorRT EP with CUDA and CPU fallback:

FASTKOKORO_ONNX_PROVIDERS=TensorrtExecutionProvider,CUDAExecutionProvider,CPUExecutionProvider
FASTKOKORO_ONNX_PROVIDER_OPTIONS='{"TensorrtExecutionProvider":{"trt_engine_cache_enable":"True","trt_engine_cache_path":"/models/trt-cache","trt_timing_cache_enable":"True","trt_timing_cache_path":"/models/trt-cache"}}'

CUDA EP fallback:

FASTKOKORO_ONNX_PROVIDERS=CUDAExecutionProvider,CPUExecutionProvider
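
When the models are loaded with ONNX Runtime directly, the same settings map onto the `providers` argument of `onnxruntime.InferenceSession`, which accepts provider names or `(name, options)` tuples in priority order. The sketch below shows one way to derive that list from the two strings; `build_providers` is a hypothetical helper, and how FastKokoro parses these variables internally is not documented here.

```python
import json
from typing import Dict, List, Tuple, Union

Provider = Union[str, Tuple[str, Dict[str, str]]]


def build_providers(names_csv: str, options_json: str = "{}") -> List[Provider]:
    """Turn a comma-separated provider list and a JSON options map into
    the list form accepted by onnxruntime.InferenceSession(providers=...)."""
    options = json.loads(options_json)
    providers: List[Provider] = []
    for name in (part.strip() for part in names_csv.split(",")):
        if not name:
            continue
        provider_options = options.get(name)
        # Providers without options are passed as bare names.
        providers.append((name, provider_options) if provider_options else name)
    return providers


names = "TensorrtExecutionProvider,CUDAExecutionProvider,CPUExecutionProvider"
options = json.dumps({
    "TensorrtExecutionProvider": {
        "trt_engine_cache_enable": "True",
        "trt_engine_cache_path": "/models/trt-cache",
        "trt_timing_cache_enable": "True",
        "trt_timing_cache_path": "/models/trt-cache",
    }
})
providers = build_providers(names, options)
print(providers)
```

The resulting list could then be passed as `onnxruntime.InferenceSession(model_path, providers=providers)`. Keeping the cache path on persistent storage avoids repeating the multi-minute engine build on every start.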

To choose a different latency/continuity tradeoff, set FASTKOKORO_MODEL_FILE to exactly one of the following:

FASTKOKORO_MODEL_FILE=onnx/kokoro-82m-streaming-b16-fp16.onnx
FASTKOKORO_MODEL_FILE=onnx/kokoro-82m-streaming-b24-fp16.onnx
FASTKOKORO_MODEL_FILE=onnx/kokoro-82m-streaming-b32-fp16.onnx
FASTKOKORO_MODEL_FILE=onnx/kokoro-82m-streaming-b48-fp16.onnx
FASTKOKORO_MODEL_FILE=onnx/kokoro-82m-streaming-b64-fp16.onnx
FASTKOKORO_MODEL_FILE=onnx/kokoro-82m-streaming-b96-fp16.onnx
FASTKOKORO_MODEL_FILE=onnx/kokoro-82m-streaming-b128-fp16.onnx

Use response_format: "pcm" for the lowest-overhead streaming response. Use wav when you need a self-contained audio file with headers. Pass lang in each request to force a specific language or locale, for example pt-br, en-us, ja-jp, or es-es.
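
A raw PCM response can be wrapped in a WAV container after the fact with Python's standard `wave` module. This sketch assumes the PCM stream is 16-bit signed little-endian, 24 kHz, mono; only the sample rate and channel count are stated in this document, so confirm the sample width against the server before use. `pcm_to_wav` is a hypothetical helper name.

```python
import io
import wave


def pcm_to_wav(pcm: bytes, sample_rate: int = 24_000,
               channels: int = 1, sample_width: int = 2) -> bytes:
    """Wrap raw PCM bytes in a WAV header (assumes 16-bit samples by default)."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(sample_width)
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
    return buffer.getvalue()


# 100 ms of silence as stand-in PCM data: 2400 samples of 2 bytes each.
wav_bytes = pcm_to_wav(b"\x00\x00" * 2400)
print(len(wav_bytes))
```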

FastKokoro uses FASTKOKORO_STREAM_STRATEGY=sentence by default. It synthesizes one sentence at a time and splits further only when the sentence's real phonemized token width exceeds the loaded ONNX bucket. The adaptive and chunk strategies remain available for lower first-audio latency, but they use scheduled word-boundary chunks and can sound less natural.

FASTKOKORO_STREAM_MAX_SEGMENT_WORDS and FASTKOKORO_STREAM_MAX_SEGMENT_CHARS are unset by default; set them only as an explicit override for the scheduled strategies. Static buckets still cap the request to their safe token-width capacity.

PCM streaming inserts 80 ms of silence between adjacent generated text segments by default (FASTKOKORO_STREAM_BOUNDARY_SILENCE_MS=80), which keeps split transitions from sounding glued together. Explicit [pause:...] tags control their own silence.
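
To make the boundary-silence arithmetic concrete: at 24 kHz, 80 ms corresponds to 1920 samples. The sketch below joins PCM segments with that gap. It illustrates the arithmetic only and is not FastKokoro's implementation; it assumes 16-bit PCM, and `silence_bytes` and `join_segments` are hypothetical names.

```python
from typing import Iterable

SAMPLE_RATE = 24_000
BYTES_PER_SAMPLE = 2  # assumption: 16-bit PCM


def silence_bytes(ms: int = 80) -> bytes:
    """Return zero-valued PCM covering the requested number of milliseconds."""
    samples = SAMPLE_RATE * ms // 1000
    return b"\x00" * (samples * BYTES_PER_SAMPLE)


def join_segments(segments: Iterable[bytes], boundary_ms: int = 80) -> bytes:
    """Concatenate PCM segments with silence between adjacent segments."""
    return silence_bytes(boundary_ms).join(segments)


gap = silence_bytes(80)
print(len(gap) // BYTES_PER_SAMPLE, "samples,", len(gap), "bytes")
```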

Export Recipe

The current checkpoints were exported with PyTorch 2.5.1, Transformers 4.48.3, opset 17, legacy ONNX export, decoder-only FP16, and Albert kept in FP32. Do not use global --precision fp16; it converts Albert to FP16 and can break downstream models.

For a bucket B, the export uses 2 * B + 8 fixed alignment frames and ALIGN * 480 + 3120 fixed output samples. The default release bucket is 64:

B=64
ALIGN=$((2 * B + 8))
SAMPLES=$((ALIGN * 480 + 3120))

uv run \
  --with torch==2.5.1 \
  --with transformers==4.48.3 \
  --with onnx \
  --with numpy \
  --with huggingface-hub \
  --with loguru \
  --with 'misaki[en]>=0.9.4' \
  python scripts/export_kokoro_torch_ttfc.py \
    --kokoro-repo demo-output/reexport/hexgrad-kokoro \
    --output "demo-output/reexport/family-frame480-margin3120/kokoro-82m-streaming-b${B}-decoder-fp16-frame480-margin3120.onnx" \
    --bucket "$B" \
    --fixed-alignment-frames "$ALIGN" \
    --fixed-output-samples "$SAMPLES" \
    --output-samples-per-frame 480 \
    --output-tail-margin-samples 3120 \
    --precision decoder-fp16 \
    --opset 17 \
    --legacy-export \
    --length-aware \
    --patch-fixed-lstm \
    --patch-scatterless-sine-source \
    --patch-split-adain \
    --patch-albert-sdpa-bool-mask-scale \
    --device cuda

All exported checkpoints in this family have 2707 ONNX nodes and 593 initializers.

Opset 24 / Attention Note

TensorRT 10.16 can parse ONNX opset 24 Attention when --stronglyTyped is enabled, but TensorRT's native IAttention layer requires GPUs with CUDA capability 8.0 or higher. To keep the published checkpoints usable on older hardware such as Turing/SM75, this repository intentionally publishes the opset 17 graph family instead of an opset 24 Attention export.

Voices and Languages

The included voice list follows Kokoro's language-prefix convention.

Language	Request `lang` values	Voices
American English	`a`, `en-us`, `en`	`af_`, `am_`
British English	`b`, `en-gb`	`bf_`, `bm_`
Japanese	`j`, `ja`, `ja-jp`	`jf_`, `jm_`
Mandarin Chinese	`z`, `zh`, `zh-cn`	`zf_`, `zm_`
Spanish	`e`, `es`, `es-es`	`ef_dora`, `em_alex`
French	`f`, `fr`, `fr-fr`	`ff_siwis`
Hindi	`h`, `hi`, `hi-in`	`hf_`, `hm_`
Italian	`i`, `it`, `it-it`	`if_sara`, `im_nicola`
Brazilian Portuguese	`p`, `pt`, `pt-br`	`pf_dora`, `pm_alex`, `pm_santa`

See voices.txt for the exact voice ordering.
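
The table above implies a simple mapping from request `lang` values to a one-letter language code, which is also the first letter of each voice prefix. A minimal sketch of that lookup follows; `normalize_lang` and `voice_lang` are hypothetical helpers, and treating underscores and upper case as equivalent spellings is an assumption, not documented server behavior.

```python
# One-letter language code -> accepted request values, copied from the table.
LANG_ALIASES = {
    "a": ("a", "en-us", "en"),
    "b": ("b", "en-gb"),
    "j": ("j", "ja", "ja-jp"),
    "z": ("z", "zh", "zh-cn"),
    "e": ("e", "es", "es-es"),
    "f": ("f", "fr", "fr-fr"),
    "h": ("h", "hi", "hi-in"),
    "i": ("i", "it", "it-it"),
    "p": ("p", "pt", "pt-br"),
}
_ALIAS_TO_CODE = {
    alias: code for code, aliases in LANG_ALIASES.items() for alias in aliases
}


def normalize_lang(lang: str) -> str:
    """Map a request value such as 'pt-br' to its one-letter code."""
    key = lang.strip().lower().replace("_", "-")  # assumption: lenient spelling
    if key not in _ALIAS_TO_CODE:
        raise ValueError(f"unsupported lang value: {lang!r}")
    return _ALIAS_TO_CODE[key]


def voice_lang(voice: str) -> str:
    """Return the language code encoded in a voice name such as 'pf_dora'."""
    return voice[0]


print(normalize_lang("pt-br"), voice_lang("pf_dora"))
```

A check such as `normalize_lang(lang) == voice_lang(voice)` can catch a mismatched voice and language before a request is sent.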

References

Model lineage: yl4579/StyleTTS2-LJSpeech (base model), fine-tuned as hexgrad/Kokoro-82M, which this repository exports to ONNX.

Papers:

StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models (arXiv 2306.07691, June 13, 2023)

iSTFTNet: Fast and Lightweight Mel-Spectrogram Vocoder Incorporating Inverse Short-Time Fourier Transform (arXiv 2203.02395, March 4, 2022)