Kokoro-82M Streaming ONNX
Fixed-bucket ONNX exports of hexgrad/Kokoro-82M for low-latency local streaming TTS with FastKokoro and ONNX Runtime.
These checkpoints use static token buckets, length-aware padding masks, decoder FP16 weights, opset 17, and graph patches that remove the previous dynamic tail and scatter-heavy export patterns. The current family uses 480 output samples per predicted alignment frame plus a 3120-sample fixed tail margin.
TensorRT EP with engine/timing cache is the recommended GPU runtime. The models also run with CUDA EP and CPU EP, but the latency expectations below are from TensorRT cache-hit measurements.
Files
onnx/kokoro-82m-streaming-b64-fp16.onnx is the recommended default because it
prioritizes smoother short-phrase continuity while still measuring about 105 ms
cache-hit model-call latency with TensorRT EP on the measured GTX 1650 host.
b48 remains the lower-latency balanced option, b24 is available for extremely
low latency when 2-3 word chunks are acceptable, and b16 is experimental and
should be treated as a single-word checkpoint.
| File | Bucket | Intended use |
|---|---|---|
| onnx/kokoro-82m-streaming-b16-fp16.onnx | 16 | Lowest latency, single words only |
| onnx/kokoro-82m-streaming-b24-fp16.onnx | 24 | Ultra-low-latency opt-in |
| onnx/kokoro-82m-streaming-b32-fp16.onnx | 32 | Short phrases |
| onnx/kokoro-82m-streaming-b48-fp16.onnx | 48 | Balanced low-latency option |
| onnx/kokoro-82m-streaming-b64-fp16.onnx | 64 | Recommended default |
| onnx/kokoro-82m-streaming-b96-fp16.onnx | 96 | Recommended large bucket |
| onnx/kokoro-82m-streaming-b128-fp16.onnx | 128 | Optional long-continuity bucket |
| voices.npz | - | FastKokoro voice/style embeddings |
| voices.txt | - | Voice index compatible with Kokoro/NVIDIA voice ordering |
Model Inputs and Outputs
Each bucket has the same tensor contract; only token width, duration width, and fixed waveform length change.
| Tensor | Shape | Type | Description |
|---|---|---|---|
| input_ids | [1, bucket] | int64 | Token IDs padded to the selected bucket |
| style | [1, 256] | float32 | Voice/style embedding from voices.npz |
| speed | [1] | float32 | Speed factor, usually 1.0 |
| input_lengths | [1] | int64 | Number of valid tokens before padding |

| Output | Shape | Type | Description |
|---|---|---|---|
| waveform | [samples] | float32 | Raw 24 kHz mono waveform |
| duration | [bucket] | int64 | Predicted token durations |
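The following is a minimal sketch of calling one bucket directly with ONNX Runtime, following the tensor contract above. The helper names, the pad ID of 0, and the assumption that `token_ids` already contains any positions the model reserves are illustrative and not confirmed by this repository; FastKokoro's own preprocessing is the reference.

```python
def pad_to_bucket(token_ids, bucket, pad_id=0):
    """Right-pad token IDs to the bucket width; return (padded, valid_length).

    pad_id=0 is an assumption, not something this model card states.
    """
    if len(token_ids) > bucket:
        raise ValueError(f"{len(token_ids)} tokens do not fit bucket {bucket}")
    padded = list(token_ids) + [pad_id] * (bucket - len(token_ids))
    return padded, len(token_ids)


def synthesize(model_path, token_ids, style, bucket=64, speed=1.0):
    """Run one fixed-bucket call; `style` must be array-like of shape [1, 256]."""
    # Imported lazily so the padding helper works without these packages.
    import numpy as np
    import onnxruntime as ort

    padded, valid = pad_to_bucket(token_ids, bucket)
    session = ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])
    feeds = {
        "input_ids": np.asarray([padded], dtype=np.int64),     # [1, bucket]
        "style": np.asarray(style, dtype=np.float32),          # [1, 256]
        "speed": np.asarray([speed], dtype=np.float32),        # [1]
        "input_lengths": np.asarray([valid], dtype=np.int64),  # [1]
    }
    # Outputs are requested by the names listed in the output table.
    waveform, duration = session.run(["waveform", "duration"], feeds)
    return waveform, duration
```

In a real server the session would be created once and reused across calls, since session creation (and any TensorRT engine build) dominates the cost of a single request.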
Bucket Guidance
Two token positions are reserved by the model, so usable text capacity is
bucket - 2 phoneme tokens. Practical word capacity is lower and depends on
language, punctuation, voice, and speed because the model also predicts
duration. The table below is the observed safe expectation from
English/Portuguese probes at speed 0.85; speed 1.0 can fit roughly one
extra word in some buckets.
The default sentence strategy uses this conservative capacity before falling
back to the model's real phonemized token width, so long sentences are split
before late words land at the tail of a near-full fixed-output window.
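As a rough illustration of the token-capacity rule, the sketch below picks the smallest published bucket whose usable width (bucket - 2) holds a given phoneme token count. The function names are hypothetical, and the check covers only the hard token limit, not the more conservative word capacity that the sentence strategy applies first.

```python
BUCKETS = (16, 24, 32, 48, 64, 96, 128)  # published bucket widths
RESERVED_TOKENS = 2                      # positions reserved by the model


def usable_tokens(bucket):
    """Phoneme tokens that fit in a bucket after the reserved positions."""
    return bucket - RESERVED_TOKENS


def smallest_bucket(token_count):
    """Return the smallest bucket that holds token_count, or None if none does."""
    for bucket in BUCKETS:
        if token_count <= usable_tokens(bucket):
            return bucket
    return None  # the text must be split into smaller segments
```

For example, 62 tokens still fit b64, while 63 tokens already require b96.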
| Bucket | Usable tokens | Alignment frames | Output samples | Expected words | TensorRT p50 | Notes |
|---|---|---|---|---|---|---|
| 16 | 14 | 40 | 22,320 | 1-2 | 33 ms | Experimental; single words only |
| 24 | 22 | 56 | 30,000 | 3 | 45 ms | Ultra-low-latency opt-in |
| 32 | 30 | 72 | 37,680 | 4-5 | 54 ms | Short phrases |
| 48 | 46 | 104 | 53,040 | 6 | 76 ms | Balanced low-latency option |
| 64 | 62 | 136 | 68,400 | 6 | 105 ms | Recommended default |
| 96 | 94 | 200 | 99,120 | 14 | 134 ms | Recommended large bucket |
| 128 | 126 | 264 | 129,840 | 18 | 272 ms | Optional long-continuity bucket |
Latency was measured on a GTX 1650 (SM75), TensorRT 10.11 / ONNX Runtime 1.22, with TensorRT engine and timing cache already built. First-time TensorRT engine builds took roughly 2-4 minutes per bucket on that host.
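The alignment-frame and output-sample columns follow from the arithmetic used in the export recipe (2 * bucket + 8 frames, 480 samples per frame, 3120-sample tail margin). The sketch below reproduces that arithmetic; the function names are illustrative only.

```python
SAMPLE_RATE = 24_000        # Hz, mono output
SAMPLES_PER_FRAME = 480     # output samples per predicted alignment frame
TAIL_MARGIN_SAMPLES = 3_120 # fixed tail margin


def alignment_frames(bucket):
    """Fixed alignment frame count for a bucket."""
    return 2 * bucket + 8


def output_samples(bucket):
    """Fixed waveform length in samples for a bucket."""
    return alignment_frames(bucket) * SAMPLES_PER_FRAME + TAIL_MARGIN_SAMPLES


def window_seconds(bucket):
    """Duration of the fixed output window at 24 kHz."""
    return output_samples(bucket) / SAMPLE_RATE
```

For b64 this gives 136 frames and 68,400 samples, that is a 2.85 s fixed output window per model call.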
Quickstart
This repository is intended to be used with msgflux/fastkokoro, an OpenAI-compatible local TTS server for Kokoro ONNX models.
Install FastKokoro with an ONNX Runtime backend:
uv pip install "fastkokoro[gpu]"
For CPU-only environments:
uv pip install "fastkokoro[cpu]"
Start the server:
fastkokoro
Generate speech through the OpenAI-compatible endpoint:
curl http://localhost:8880/v1/audio/speech \
-H "Content-Type: application/json" \
-d '{
"model": "kokoro",
"voice": "pf_dora",
"input": "Oi, tudo bem? Este audio esta sendo gerado com o bucket streaming b64.",
"response_format": "pcm"
}' \
--output speech.pcm
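The `pcm` response has no header, so most players cannot open `speech.pcm` directly. The sketch below wraps raw PCM in a WAV container using only the standard library. It assumes the response is 16-bit signed little-endian mono at 24 kHz, as in OpenAI's `pcm` format; this card does not state the sample format, so verify it against your server before relying on it.

```python
import wave


def pcm_to_wav(pcm_bytes, wav_path, sample_rate=24_000, sample_width=2, channels=1):
    """Write raw PCM bytes to a WAV file.

    Defaults assume 16-bit mono at 24 kHz (an assumption, see the lead-in).
    """
    with wave.open(wav_path, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(sample_width)
        wav.setframerate(sample_rate)
        wav.writeframes(pcm_bytes)


# Example usage with the file produced by the curl command:
#     with open("speech.pcm", "rb") as f:
#         pcm_to_wav(f.read(), "speech.wav")
```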
Configuration
Recent FastKokoro releases use this repository as the default streaming model. If you need to configure it explicitly, set:
FASTKOKORO_MODEL_REPO=msgflux/Kokoro-82M-streaming-onnx
FASTKOKORO_MODEL_FILE=onnx/kokoro-82m-streaming-b64-fp16.onnx
FASTKOKORO_VOICES_FILE=voices.npz
TensorRT EP with CUDA and CPU fallback:
FASTKOKORO_ONNX_PROVIDERS=TensorrtExecutionProvider,CUDAExecutionProvider,CPUExecutionProvider
FASTKOKORO_ONNX_PROVIDER_OPTIONS='{"TensorrtExecutionProvider":{"trt_engine_cache_enable":"True","trt_engine_cache_path":"/models/trt-cache","trt_timing_cache_enable":"True","trt_timing_cache_path":"/models/trt-cache"}}'
CUDA EP fallback:
FASTKOKORO_ONNX_PROVIDERS=CUDAExecutionProvider,CPUExecutionProvider
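If you load a checkpoint with ONNX Runtime directly, the same two settings map onto the `providers` argument of `onnxruntime.InferenceSession`, which accepts provider names or `(name, options)` tuples. The sketch below shows one way to do that conversion; `build_providers` is a hypothetical helper, and how FastKokoro parses these variables internally is not shown in this document.

```python
import json


def build_providers(providers_csv, options_json=""):
    """Convert a comma-separated provider list and a JSON options map
    into the list form accepted by onnxruntime.InferenceSession(providers=...).
    """
    options = json.loads(options_json) if options_json else {}
    providers = []
    for name in (part.strip() for part in providers_csv.split(",")):
        if not name:
            continue  # tolerate stray commas
        provider_options = options.get(name)
        # Providers with options become (name, dict); others stay plain names.
        providers.append((name, provider_options) if provider_options else name)
    return providers
```

ONNX Runtime tries providers in list order, so TensorRT comes first and CUDA and CPU act as fallbacks.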
To choose a different latency/continuity tradeoff, set FASTKOKORO_MODEL_FILE to exactly one of the following:
FASTKOKORO_MODEL_FILE=onnx/kokoro-82m-streaming-b16-fp16.onnx
FASTKOKORO_MODEL_FILE=onnx/kokoro-82m-streaming-b24-fp16.onnx
FASTKOKORO_MODEL_FILE=onnx/kokoro-82m-streaming-b32-fp16.onnx
FASTKOKORO_MODEL_FILE=onnx/kokoro-82m-streaming-b48-fp16.onnx
FASTKOKORO_MODEL_FILE=onnx/kokoro-82m-streaming-b64-fp16.onnx
FASTKOKORO_MODEL_FILE=onnx/kokoro-82m-streaming-b96-fp16.onnx
FASTKOKORO_MODEL_FILE=onnx/kokoro-82m-streaming-b128-fp16.onnx
Use response_format: "pcm" for the lowest-overhead streaming response. Use
wav when you need a self-contained audio file with headers. Pass lang in
each request to force a specific language or locale, for example pt-br,
en-us, ja-jp, or es-es.
FastKokoro uses FASTKOKORO_STREAM_STRATEGY=sentence by default. It synthesizes
one sentence at a time and only splits further when the loaded ONNX bucket's real
phonemized token width requires it. adaptive and chunk remain available for
lower first-audio latency, but they use scheduled word-boundary chunks and can
sound less natural. FASTKOKORO_STREAM_MAX_SEGMENT_WORDS and
FASTKOKORO_STREAM_MAX_SEGMENT_CHARS are intentionally unset unless you want an
explicit user override for scheduled strategies; static buckets still cap the
request to their safe token-width capacity. PCM streaming inserts
FASTKOKORO_STREAM_BOUNDARY_SILENCE_MS=80 between adjacent generated text
segments by default, which keeps split transitions from sounding glued together.
Explicit [pause:...] tags control their own silence.
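To make the boundary-silence setting concrete: at 24 kHz, 80 ms corresponds to 1,920 samples. The sketch below joins PCM segments with that much silence between them. It is an illustration only, not FastKokoro's implementation, and it assumes 16-bit PCM (2 bytes per sample).

```python
SAMPLE_RATE = 24_000
BYTES_PER_SAMPLE = 2  # assumes 16-bit PCM


def silence_bytes(ms, sample_rate=SAMPLE_RATE):
    """Return zero-valued PCM covering `ms` milliseconds."""
    samples = sample_rate * ms // 1000
    return b"\x00" * (samples * BYTES_PER_SAMPLE)


def join_segments(segments, boundary_ms=80):
    """Concatenate PCM segments with silence between adjacent segments only."""
    return silence_bytes(boundary_ms).join(segments)
```

Silence is inserted only between adjacent segments, so a single-segment response is returned unchanged.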
Export Recipe
The current checkpoints were exported with PyTorch 2.5.1, Transformers 4.48.3,
opset 17, legacy ONNX export, decoder-only FP16, and Albert kept in FP32. Do not
use global --precision fp16; it converts Albert to FP16 and can break
downstream models.
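For a bucket B, the alignment frame count and fixed output length are derived from the bucket width as shown below. The default release bucket is 64.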
B=64
ALIGN=$((2 * B + 8))
SAMPLES=$((ALIGN * 480 + 3120))
uv run \
--with torch==2.5.1 \
--with transformers==4.48.3 \
--with onnx \
--with numpy \
--with huggingface-hub \
--with loguru \
--with 'misaki[en]>=0.9.4' \
python scripts/export_kokoro_torch_ttfc.py \
--kokoro-repo demo-output/reexport/hexgrad-kokoro \
--output "demo-output/reexport/family-frame480-margin3120/kokoro-82m-streaming-b${B}-decoder-fp16-frame480-margin3120.onnx" \
--bucket "$B" \
--fixed-alignment-frames "$ALIGN" \
--fixed-output-samples "$SAMPLES" \
--output-samples-per-frame 480 \
--output-tail-margin-samples 3120 \
--precision decoder-fp16 \
--opset 17 \
--legacy-export \
--length-aware \
--patch-fixed-lstm \
--patch-scatterless-sine-source \
--patch-split-adain \
--patch-albert-sdpa-bool-mask-scale \
--device cuda
All exported checkpoints in this family have 2707 ONNX nodes and 593 initializers.
Opset 24 / Attention Note
TensorRT 10.16 can parse ONNX opset 24 Attention when --stronglyTyped is
enabled, but TensorRT's native IAttention layer requires GPUs with CUDA
capability 8.0 or higher. To keep the published checkpoints usable on older
hardware such as Turing/SM75, this repository intentionally publishes the opset
17 graph family instead of an opset 24 Attention export.
Voices and Languages
The included voice list follows Kokoro's language-prefix convention.
| Language | Request lang values | Voices |
|---|---|---|
| American English | a, en-us, en | af_*, am_* |
| British English | b, en-gb | bf_*, bm_* |
| Japanese | j, ja, ja-jp | jf_*, jm_* |
| Mandarin Chinese | z, zh, zh-cn | zf_*, zm_* |
| Spanish | e, es, es-es | ef_dora, em_alex |
| French | f, fr, fr-fr | ff_siwis |
| Hindi | h, hi, hi-in | hf_*, hm_* |
| Italian | i, it, it-it | if_sara, im_nicola |
| Brazilian Portuguese | p, pt, pt-br | pf_dora, pm_alex, pm_santa |
See voices.txt for the exact voice ordering.
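The language-prefix convention can be checked programmatically. The sketch below maps the request `lang` values from the table to the single-letter prefix and tests whether a voice name belongs to that language. The helper names are hypothetical, and the alias table simply restates the values listed above.

```python
LANG_ALIASES = {
    "a": ("a", "en-us", "en"),
    "b": ("b", "en-gb"),
    "j": ("j", "ja", "ja-jp"),
    "z": ("z", "zh", "zh-cn"),
    "e": ("e", "es", "es-es"),
    "f": ("f", "fr", "fr-fr"),
    "h": ("h", "hi", "hi-in"),
    "i": ("i", "it", "it-it"),
    "p": ("p", "pt", "pt-br"),
}


def language_prefix(lang):
    """Return the one-letter language prefix for a request lang value."""
    key = lang.strip().lower()
    for prefix, aliases in LANG_ALIASES.items():
        if key in aliases:
            return prefix
    raise ValueError(f"unknown lang value: {lang!r}")


def voice_matches_lang(voice, lang):
    """True when the voice name starts with the prefix of the given language."""
    return voice[:1] == language_prefix(lang)
```

For example, `pf_dora` matches `pt-br`, while `ef_dora` does not because its prefix marks it as a Spanish voice.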