Kokoro-82M — LiteRT (free-text, bucketed)

⚠️ Labeled preview — FP32, CPU. Arbitrary free text → speech (not a single baked sentence). The neural graphs are LiteRT .tflite; two small steps run host-side (the hn-NSF source STFT and the final iSTFT overlap-add). GPU + quantization are the next steps.

A LiteRT (.tflite) conversion of hexgrad/Kokoro-82M (StyleTTS2 + ISTFTNet) for on-device free-text text-to-speech: arbitrary text in, 24 kHz speech out. Audio fidelity is ≈ 0.9994 magnitude-spectrogram correlation against the PyTorch reference, verified across multiple held-out sentences and not just the export sample.

Kokoro has one data-dependent length, the duration→alignment expansion L = sum(pred_dur), which litert_torch cannot keep dynamic (the LSTM sequence axis specializes). The model is therefore split into three fixed-bucket bundles with host steps between them. Arbitrary text works by right-padding to the bucket and trimming the output; longer text is split into sentences host-side, each ≤ the bucket:

text --(G2P, host)--> phoneme ids
1. kokoro_predictor.tflite : ids[1,128], ref_s[1,256], attn[1,128] -> duration, d, t_en
   host: pred_dur = round(duration); alignment one-hot aln[1,128,512]; frame_mask[1,512]
2. kokoro_prosody.tflite   : d, t_en, aln, ref_s, frame_mask       -> asr, F0, N
   host: har = STFT(SineGen(f0_upsamp(F0)))     (the hn-NSF excitation)
3. kokoro_vocoder.tflite   : asr, F0, N, har, ref_s, frame_mask    -> spec, phase
   host: iSTFT overlap-add(spec, phase) -> 24 kHz waveform; trim to L*600
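
The host step between bundles 1 and 2 can be sketched in NumPy as below. This is a minimal illustration, not the repo's shipped code: the function name `build_alignment`, the float32 dtype, the mask polarity (1 = real frame), and the clamp to at least one frame per token are all assumptions.

```python
import numpy as np

T_BUCKET = 128  # token bucket, from this card
L_BUCKET = 512  # frame bucket, from this card

def build_alignment(duration, n_tokens):
    """Expand predicted durations into a one-hot alignment and a frame mask.

    duration : float array of shape [T_BUCKET] (predictor output, padded)
    n_tokens : number of real (unpadded) tokens
    """
    # Round to whole frames; the min-1 clamp is an assumption of this sketch.
    pred_dur = np.clip(np.rint(duration[:n_tokens]).astype(np.int64), 1, None)
    total = int(pred_dur.sum())  # this is L = sum(pred_dur)
    if total > L_BUCKET:
        raise ValueError(f"{total} frames exceed the {L_BUCKET}-frame bucket")

    # token_of_frame[j] is the token that owns frame j.
    token_of_frame = np.repeat(np.arange(n_tokens), pred_dur)
    aln = np.zeros((1, T_BUCKET, L_BUCKET), dtype=np.float32)
    aln[0, token_of_frame, np.arange(total)] = 1.0

    # Real frames are 1, bucket padding is 0 (assumed polarity).
    frame_mask = np.zeros((1, L_BUCKET), dtype=np.float32)
    frame_mask[0, :total] = 1.0
    return aln, frame_mask, total
```

The returned `total` is the L used later when trimming the waveform to L*600 samples.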

Files

File | Precision | Size | Role
kokoro_predictor.tflite | fp32 | ~91 MB | PL-BERT + duration/text encoders (masked unrolled bi-LSTMs)
kokoro_prosody.tflite | fp32 | ~37 MB | shared prosody LSTM + F0/N (masked)
kokoro_vocoder.tflite | fp32 | ~236 MB | iSTFTNet decoder → magnitude/phase spectrogram
istft_Wr_f32.bin, istft_Wi_f32.bin | fp32 | 880 B each | inverse-DFT bases for the host-side iSTFT

Token bucket T = 128, frame bucket L = 512 (≈ 12.8 s of audio per chunk at 24 kHz). Bundles are voice-independent — the voice is the ref_s input (a voices/*.pt from the base repo, indexed by token-sequence length).
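
The bucket arithmetic and the host-side sentence split can be sketched as follows. The 600 samples-per-frame figure comes from the "trim to L*600" step above; `split_for_bucket` and its `count_tokens` callback are hypothetical names, and the greedy packing of short sentences into one chunk is a choice made for this sketch, not something the card specifies.

```python
import re

SAMPLE_RATE = 24_000       # Hz, from this card
SAMPLES_PER_FRAME = 600    # from the "trim to L*600" step
T_BUCKET = 128             # token bucket
L_BUCKET = 512             # frame bucket

def bucket_seconds(frames=L_BUCKET):
    """Audio duration covered by a given number of frames."""
    return frames * SAMPLES_PER_FRAME / SAMPLE_RATE

def split_for_bucket(text, count_tokens, max_tokens=T_BUCKET):
    """Greedily pack whole sentences into chunks of at most max_tokens.

    count_tokens is a caller-supplied function (hypothetical here) that
    returns the phoneme-id count the G2P front-end would produce.
    """
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and count_tokens(candidate) > max_tokens:
            chunks.append(current)   # current chunk is full, start a new one
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

print(bucket_seconds())  # 512 * 600 / 24000 seconds
```

A single sentence longer than the token bucket is passed through unchanged here and would need a finer split. Fitting the token bucket also does not guarantee that the predicted frame count fits the 512-frame bucket, so the host should still check L after the predictor runs.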

Specs

Task | Text-to-speech (English), free text, 24 kHz mono
Source | hexgrad/Kokoro-82M (StyleTTS2 + ISTFTNet)
Fidelity | magspec-corr 0.9994 vs PyTorch; waveform corr ≈ 0.98. The waveform gap comes from a bounded effect at the bucket pad boundary, and the magnitude spectrum is what listeners perceive.
Runtime | CPU (LiteRT CompiledModel API)
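
The two fidelity figures measure different things, which a small NumPy sketch can illustrate. The card does not state the exact STFT settings or correlation definition behind its numbers, so the `n_fft`, `hop`, Hann window, and Pearson correlation below are assumptions chosen for illustration only.

```python
import numpy as np

def magnitude_spectrogram(wave, n_fft=1024, hop=256):
    """Hann-windowed STFT magnitudes, shape [frames, n_fft // 2 + 1]."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(wave) - n_fft) // hop
    frames = np.stack(
        [wave[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    return np.abs(np.fft.rfft(frames, axis=1))

def pearson(a, b):
    """Pearson correlation between two arrays, flattened."""
    a = a.ravel() - a.mean()
    b = b.ravel() - b.mean()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def compare(ref, test):
    """Return (magnitude-spectrogram correlation, waveform correlation)."""
    n = min(len(ref), len(test))
    ref, test = ref[:n], test[:n]
    mag_corr = pearson(magnitude_spectrogram(ref), magnitude_spectrogram(test))
    return mag_corr, pearson(ref, test)
```

Waveform correlation is sensitive to phase: a sine and a cosine at the same frequency correlate near zero sample-by-sample while their magnitude spectrograms are almost identical. That is why a small phase disturbance can lower the waveform figure without a matching drop in the spectral one.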

How it was converted

  • Stock official converter (litert_torch), general path — Kokoro is StyleTTS2/ISTFTNet, not a transformer LLM, so the Generative-API re-authoring path does not apply.
  • Three fixed-bucket bundles because the dynamic alignment length can't stay symbolic through the converter (the dynamic-LSTM wall). Every workaround is load-bearing and numerically faithful: the 6 bidirectional LSTMs are unrolled as masked bi-LSTMs that carry state through right-padding (a fused nn.LSTM leaks pad tokens into the backward pass and wrecks prosody); the 58 AdaIN InstanceNorms normalize over real frames only so bucket pad frames don't poison the statistics; the hn-NSF source STFT runs host-side (its atan2 phase flips at the F0→0 pad boundary on-device).
  • iSTFT runs host-side (vocoder emits spec+phase): the in-graph conv-transpose iSTFT hits a converter weight-dedup bug that fuses the cos/sin DFT bases. The host overlap-add (no learned weights) is numerically exact.
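
The host-side iSTFT overlap-add can be sketched as below. Several details are assumptions rather than facts from this card: n_fft = 20 (consistent with 880 B = 11 × 20 float32 values per basis file), hop = 5, a periodic Hann window, center-style trimming of n_fft/2 samples per side, and `spec`/`phase` laid out as [bins, frames] with phase in radians. The sketch rebuilds the inverse-DFT bases instead of loading istft_Wr_f32.bin and istft_Wi_f32.bin, whose layout and scaling are not described here.

```python
import numpy as np

N_FFT, HOP = 20, 5  # assumed values, see the lead-in

def inverse_dft_bases(n_fft=N_FFT):
    """Real inverse-DFT bases, each of shape [n_fft // 2 + 1, n_fft]."""
    bins = n_fft // 2 + 1
    k = np.arange(bins)[:, None]
    n = np.arange(n_fft)[None, :]
    # Interior bins count twice (conjugate symmetry); DC and Nyquist once.
    scale = np.full((bins, 1), 2.0)
    scale[0] = scale[-1] = 1.0
    angle = 2.0 * np.pi * k * n / n_fft
    w_real = scale * np.cos(angle) / n_fft
    w_imag = -scale * np.sin(angle) / n_fft
    return w_real, w_imag

def istft_overlap_add(spec, phase, n_fft=N_FFT, hop=HOP):
    """spec, phase: [bins, frames] magnitude and phase (radians) -> waveform."""
    w_real, w_imag = inverse_dft_bases(n_fft)
    window = np.hanning(n_fft + 1)[:-1]            # periodic Hann
    real = spec * np.cos(phase)
    imag = spec * np.sin(phase)
    # Inverse DFT of every frame as two matmuls, then synthesis windowing.
    frames = (real.T @ w_real + imag.T @ w_imag) * window   # [frames, n_fft]

    n_frames = frames.shape[0]
    out = np.zeros(n_fft + hop * (n_frames - 1))
    norm = np.zeros_like(out)
    for t in range(n_frames):
        out[t * hop : t * hop + n_fft] += frames[t]
        norm[t * hop : t * hop + n_fft] += window ** 2

    pad = n_fft // 2                               # undo centered framing
    out, norm = out[pad:-pad], norm[pad:-pad]
    return out / np.maximum(norm, 1e-8)
```

Under these assumptions the output has hop × (frames − 1) samples. The result would still be trimmed to L*600 samples to drop the audio produced from bucket padding.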

Training data

Inherited from hexgrad/Kokoro-82M: a few hundred hours of permissive / non-copyrighted audio — public-domain audio, audio under permissive licenses (e.g. Koniwa tnc CC BY 3.0, SIWIS CC BY 4.0), and synthetic audio from large-provider TTS — paired with IPA phoneme labels. No proprietary or scraped personal recordings, no custom voice clones. This LiteRT artifact is a format conversion of the released weights and introduces no additional training data.

PII

No personally identifiable information is included. Per the base model's disclosure the training audio is permissive / public-domain / synthetic rather than scraped personal recordings; to the best of our knowledge the released weights contain no PII, and the conversion adds none.

Front-end (free-text G2P)

For arbitrary text with no dropped words (names, brands, numbers), pair with the neural grapheme-to-phoneme front-end: mlboydaisuke/Kokoro-G2P-en-US-LiteRT.

Roadmap

  • GPU: the attention blocks use a fused-QKV layout with more than 4 dimensions, plus EQUAL/SELECT mask ops, and both keep the graph on CPU. Decomposing attention into ≤4-D tensors would let it run on the GPU delegate. In-graph iSTFT on GPU additionally needs an FFT kernel in the LiteRT GPU delegate (ML Drift).
  • Quantization (int8/int4) is the obvious next step.

Status

Labeled preview: the converted model is parity-verified, but a clean runnable on-device LiteRT sample (type any text via the G2P front-end above) is still in progress, so this repo currently ships the model and weights only. The pipeline is three bucketed bundles plus three host-side steps: alignment, the hn-NSF source STFT, and the iSTFT.

License

Apache-2.0, inherited from hexgrad/Kokoro-82M.
