Kokoro-82M — LiteRT (free-text, bucketed)
⚠️ Labeled preview: FP32, CPU. Arbitrary free text → speech (not a single baked sentence). The neural graphs are LiteRT .tflite files; two small steps run host-side (the hn-NSF source STFT and the final iSTFT overlap-add). GPU and quantization are the next steps.
A LiteRT (.tflite) conversion of
hexgrad/Kokoro-82M (StyleTTS2 + ISTFTNet) for
on-device free-text text-to-speech (arbitrary text in, 24 kHz speech out). Audio fidelity
≈ 0.9994 magnitude-spectrogram correlation to the PyTorch reference (verified across multiple
held-out sentences, not just the export sample).
Kokoro has one data-dependent length (the duration→alignment expansion L = sum(pred_dur)),
which litert_torch cannot keep dynamic (the LSTM sequence axis specializes). It is therefore
split into three fixed-bucket bundles with host steps between them, so arbitrary text works
by right-padding to the bucket and trimming the output (longer text is split into sentences
host-side, each ≤ the bucket):
    text --(G2P, host)--> phoneme ids
    1. kokoro_predictor.tflite : ids[1,128], ref_s[1,256], attn[1,128] -> duration, d, t_en
         host: pred_dur = round(duration); alignment one-hot aln[1,128,512]; frame_mask[1,512]
    2. kokoro_prosody.tflite   : d, t_en, aln, ref_s, frame_mask -> asr, F0, N
         host: har = STFT(SineGen(f0_upsamp(F0)))   (the hn-NSF excitation)
    3. kokoro_vocoder.tflite   : asr, F0, N, har, ref_s, frame_mask -> spec, phase
         host: iSTFT overlap-add(spec, phase) -> 24 kHz waveform; trim to L*600
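The host step between the predictor and the prosody bundle can be sketched as follows. This is a dependency-free illustration, not the repo's code: the function name `build_alignment` is hypothetical, and clamping each rounded duration to at least one frame is an assumption carried over from how the base model is commonly run.

```python
def build_alignment(duration, token_bucket=128, frame_bucket=512):
    """Expand per-token durations into a one-hot alignment and a frame mask.

    duration: predicted durations (floats) for the real, unpadded tokens.
    Returns (aln, frame_mask, total) where aln is token_bucket x frame_bucket
    and total is L = sum(pred_dur).
    """
    if len(duration) > token_bucket:
        raise ValueError("more tokens than the token bucket holds")
    # Assumption: every real token gets at least one frame.
    pred_dur = [max(1, round(d)) for d in duration]
    total = sum(pred_dur)
    if total > frame_bucket:
        raise ValueError("predicted length exceeds the frame bucket; split the text")

    aln = [[0.0] * frame_bucket for _ in range(token_bucket)]
    frame = 0
    for token, dur in enumerate(pred_dur):
        for _ in range(dur):
            aln[token][frame] = 1.0  # this frame belongs to this token
            frame += 1

    # 1.0 for real frames, 0.0 for bucket padding on the right.
    frame_mask = [1.0 if f < total else 0.0 for f in range(frame_bucket)]
    return aln, frame_mask, total
```

The returned `total` is the L used for the final trim to L*600 samples. Raising when the predicted length overflows the frame bucket is one possible policy; a host could instead split the sentence further before running the predictor.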
Files
| File | Precision | Size | Role |
|---|---|---|---|
| kokoro_predictor.tflite | fp32 | ~91 MB | PL-BERT + duration/text encoders (masked unrolled bi-LSTMs) |
| kokoro_prosody.tflite | fp32 | ~37 MB | shared prosody LSTM + F0/N (masked) |
| kokoro_vocoder.tflite | fp32 | ~236 MB | iSTFTNet decoder → magnitude/phase spectrogram |
| istft_Wr_f32.bin, istft_Wi_f32.bin | fp32 | 880 B each | inverse-DFT bases for the host-side iSTFT |
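A minimal sketch of a host-side iSTFT overlap-add, for orientation only. Several things here are assumptions rather than facts about this repo: an FFT size of 20 with a hop of 5 (880 B happens to equal 11 × 20 fp32 values, but the layout of the shipped .bin files is not documented here, so the bases are computed analytically instead of loaded), a periodic Hann window, no centre padding, and `spec`/`phase` given as magnitude and angle in a frames × bins layout.

```python
import math

N_FFT = 20            # assumption, not confirmed by this model card
HOP = 5               # assumption, not confirmed by this model card
BINS = N_FFT // 2 + 1


def inverse_dft_bases(n_fft=N_FFT):
    """Real and imaginary inverse-rDFT bases, each BINS x n_fft."""
    wr, wi = [], []
    for k in range(n_fft // 2 + 1):
        # DC and Nyquist occur once in the full spectrum, other bins twice.
        scale = (1.0 if k in (0, n_fft // 2) else 2.0) / n_fft
        wr.append([scale * math.cos(2 * math.pi * k * n / n_fft) for n in range(n_fft)])
        wi.append([-scale * math.sin(2 * math.pi * k * n / n_fft) for n in range(n_fft)])
    return wr, wi


def istft(spec, phase, n_fft=N_FFT, hop=HOP):
    """Windowed overlap-add with squared-window normalisation."""
    wr, wi = inverse_dft_bases(n_fft)
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / n_fft) for n in range(n_fft)]
    out_len = n_fft + hop * (len(spec) - 1)
    out = [0.0] * out_len
    norm = [0.0] * out_len
    for t, (mags, angles) in enumerate(zip(spec, phase)):
        re = [m * math.cos(p) for m, p in zip(mags, angles)]
        im = [m * math.sin(p) for m, p in zip(mags, angles)]
        for n in range(n_fft):
            sample = sum(re[k] * wr[k][n] + im[k] * wi[k][n] for k in range(len(re)))
            out[t * hop + n] += sample * window[n]
            norm[t * hop + n] += window[n] ** 2
    # Samples covered only by a zero window value are left at 0.
    return [o / w if w > 1e-8 else 0.0 for o, w in zip(out, norm)]
```

Because the step is two small matrix products followed by an add, it has no learned weights and is cheap to run on the host in any language.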
Token bucket T = 128, frame bucket L = 512 (≈ 12.8 s of audio per chunk at 24 kHz).
Bundles are voice-independent — the voice is the ref_s input (a voices/*.pt from the
base repo, indexed by token-sequence length).
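The bucket arithmetic can be checked with a short sketch. The bucket sizes, the 600 samples per frame, and the 24 kHz rate come from this card; the helper names, the pad id of 0, and the mask polarity (1 for real tokens) are illustrative assumptions. Padding goes on the right, matching the right-padding described under "How it was converted".

```python
TOKEN_BUCKET = 128        # T
FRAME_BUCKET = 512        # L (maximum)
SAMPLES_PER_FRAME = 600   # from "trim to L*600"
SAMPLE_RATE = 24_000


def pad_to_bucket(ids, pad_id=0):
    """Right-pad phoneme ids to the token bucket and build a token mask."""
    if len(ids) > TOKEN_BUCKET:
        raise ValueError("sentence too long for one chunk; split it host-side")
    n_pad = TOKEN_BUCKET - len(ids)
    # Assumption: pad id 0 and mask value 1 for real tokens.
    return list(ids) + [pad_id] * n_pad, [1] * len(ids) + [0] * n_pad


def max_chunk_seconds():
    """Longest audio one chunk can hold."""
    return FRAME_BUCKET * SAMPLES_PER_FRAME / SAMPLE_RATE


def trim_waveform(waveform, total_frames):
    """Drop the samples that came from bucket padding."""
    return waveform[: total_frames * SAMPLES_PER_FRAME]
```

With these constants, `max_chunk_seconds()` evaluates to 512 × 600 / 24000 = 12.8, which is the per-chunk figure quoted above.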
Specs
| | |
|---|---|
| Task | Text-to-speech (English), free text, 24 kHz mono |
| Source | hexgrad/Kokoro-82M (StyleTTS2 + ISTFTNet) |
| Fidelity | magspec-corr 0.9994 vs PyTorch (waveform corr ≈ 0.98, a bounded effect of the bucket pad boundary; the spectrum is what's perceived) |
| Runtime | CPU (LiteRT CompiledModel API) |
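To make the fidelity row concrete, here is one way a magnitude-spectrogram correlation could be computed. The exact definition behind the 0.9994 figure is not given in this card, so the FFT size, hop, Hann window, and the use of a Pearson correlation over all bins are assumptions; the function names are hypothetical.

```python
import cmath
import math


def magnitude_spectrogram(x, n_fft=64, hop=16):
    """Hann-windowed STFT magnitudes as a list of frames (pure Python, O(N^2))."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / n_fft) for n in range(n_fft)]
    frames = []
    for start in range(0, len(x) - n_fft + 1, hop):
        seg = [x[start + n] * window[n] for n in range(n_fft)]
        frames.append([
            abs(sum(seg[n] * cmath.exp(-2j * math.pi * k * n / n_fft) for n in range(n_fft)))
            for k in range(n_fft // 2 + 1)
        ])
    return frames


def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)


def magspec_corr(reference, candidate):
    """Correlation between flattened magnitude spectrograms."""
    ref = [v for frame in magnitude_spectrogram(reference) for v in frame]
    cand = [v for frame in magnitude_spectrogram(candidate) for v in frame]
    return pearson(ref, cand)
```

Comparing a sine with the same tone shifted by a quarter period shows why the two numbers in the table can differ: the raw waveforms are nearly uncorrelated, while the magnitude spectrograms match almost exactly, because magnitudes discard phase.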
How it was converted
- Stock official converter (litert_torch), general path: Kokoro is StyleTTS2/ISTFTNet, not a transformer LLM, so the Generative-API re-authoring path does not apply.
- Three fixed-bucket bundles, because the dynamic alignment length can't stay symbolic through the converter (the dynamic-LSTM wall). Every workaround is load-bearing and numerically faithful: the 6 bidirectional LSTMs are unrolled as masked bi-LSTMs that carry state through right-padding (a fused nn.LSTM leaks pad tokens into the backward pass and wrecks prosody); the 58 AdaIN InstanceNorms normalize over real frames only so bucket pad frames don't poison the statistics; the hn-NSF source STFT runs host-side (its atan2 phase flips at the F0→0 pad boundary on-device).
- iSTFT runs host-side (vocoder emits spec+phase): the in-graph conv-transpose iSTFT hits a converter weight-dedup bug that fuses the cos/sin DFT bases. The host overlap-add (no learned weights) is numerically exact.
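The idea of normalizing over real frames only can be illustrated with a small masked instance norm. This is a sketch of the technique, not the exported graph: the function name is hypothetical, eps = 1e-5 is the PyTorch InstanceNorm default, zeroing the pad frames in the output is an assumption, and the style-conditioned AdaIN scale and shift that would follow are left out.

```python
import math


def masked_instance_norm(x, frame_mask, eps=1e-5):
    """Normalize each channel using statistics from unmasked frames only.

    x: channels x frames (list of lists); frame_mask: 1 for real frames, 0 for pad.
    """
    n_real = sum(frame_mask)
    if n_real == 0:
        raise ValueError("frame_mask has no real frames")
    out = []
    for channel in x:
        # Mean and biased variance ignore pad frames entirely.
        mean = sum(v * m for v, m in zip(channel, frame_mask)) / n_real
        var = sum(m * (v - mean) ** 2 for v, m in zip(channel, frame_mask)) / n_real
        inv_std = 1.0 / math.sqrt(var + eps)
        # Assumption: pad frames are zeroed in the output.
        out.append([(v - mean) * inv_std * m for v, m in zip(channel, frame_mask)])
    return out
```

With the mask in place, whatever values sit in the pad frames have no influence on the normalized real frames, which is the property the conversion relies on.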
Training data
Inherited from hexgrad/Kokoro-82M: a few hundred hours of permissive / non-copyrighted audio — public-domain audio, audio under permissive licenses (e.g. Koniwa tnc CC BY 3.0, SIWIS CC BY 4.0), and synthetic audio from large-provider TTS — paired with IPA phoneme labels. No proprietary or scraped personal recordings, no custom voice clones. This LiteRT artifact is a format conversion of the released weights and introduces no additional training data.
PII
No personally identifiable information is included. Per the base model's disclosure the training audio is permissive / public-domain / synthetic rather than scraped personal recordings; to the best of our knowledge the released weights contain no PII, and the conversion adds none.
Front-end (free-text G2P)
For arbitrary text with no dropped words (names, brands, numbers), pair with the neural grapheme-to-phoneme front-end: mlboydaisuke/Kokoro-G2P-en-US-LiteRT.
Roadmap
- GPU: the attention's fused-QKV >4-D layout + mask EQUAL/SELECT keep it on CPU; decomposing attention to ≤4-D would let it ride the GPU delegate. In-graph iSTFT on GPU additionally needs an FFT kernel in the LiteRT GPU delegate (ML Drift).
- Quantization (int8/int4) is the obvious next step.
Status
Labeled preview: the converted model is parity-verified, but a clean runnable on-device LiteRT sample (type-any-text via the G2P front-end above) is still in progress, so this repo currently ships the model and weights only. (Conversion is a 3-bundle bucketed pipeline; alignment, the hn-NSF source STFT, and the iSTFT run on the host.)
License
Apache-2.0, inherited from hexgrad/Kokoro-82M.