Kokoro G2P (en-US) — LiteRT (preview)

⚠️ Labeled preview — fixed-length [1, 96], FP32, CPU. Shared to complete the on-device Kokoro front-end per the LiteRT community direction ③.

A LiteRT (.tflite) conversion of DeepPhonemizer en_us_cmudict_forward, a small non-autoregressive forward Transformer, used as the neural grapheme-to-phoneme (G2P) front-end for on-device Kokoro-82M TTS. It gives Kokoro a phonemizer fallback: when the dictionary phonemizer misses a word (names, brands, numbers), the model supplies a pronunciation, so arbitrary free text synthesizes with zero dropped words.
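The fallback flow described above can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: `phonemize`, the `lexicon` mapping, and the `g2p` callable are hypothetical names, and the stub stands in for a call into the .tflite model.

```python
from typing import Callable, Dict, List


def phonemize(words: List[str],
              lexicon: Dict[str, str],
              g2p: Callable[[str], str]) -> List[str]:
    """Look each word up in the dictionary; fall back to neural G2P on a miss."""
    out = []
    for word in words:
        phones = lexicon.get(word.lower())
        if phones is None:
            # Dictionary miss: ask the G2P model instead of dropping the word.
            phones = g2p(word)
        out.append(phones)
    return out


# Hypothetical stand-ins for the real lexicon and the LiteRT model call.
toy_lexicon = {"hello": "HH AH L OW"}
stub_g2p = lambda word: f"<g2p:{word}>"
result = phonemize(["Hello", "Kokoro"], toy_lexicon, stub_g2p)
```

Every input word yields an output entry, which is the "no dropped words" property the fallback is meant to provide.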

Files

| File | Precision | Size |
| --- | --- | --- |
| dp_g2p_litert.tflite | fp32 | ~51 MB |

Specs

| Property | Value |
| --- | --- |
| Task | Grapheme-to-phoneme (English) |
| Source | DeepPhonemizer en_us_cmudict_forward |
| Input | 1 × 96 character IDs (fixed length 96, in-graph padding mask) |
| Output | per-position phoneme logits → ARPABET / IPA |
| Runtime | CPU (LiteRT CompiledModel API) |
| Verified | Pixel 8a: 12/12 vs the reference G2P, no dropped words |
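The fixed-length input and per-position output listed above can be sketched in plain Python. This is an illustration only: the toy symbol tables, the padding ID of 0, and the helper names `encode` and `decode` are assumptions. The real vocabularies come from the DeepPhonemizer checkpoint, and any further post-processing the checkpoint expects is not shown.

```python
from typing import List

SEQ_LEN = 96   # fixed input length from the Specs section
PAD_ID = 0     # assumption: 0 is the padding ID

# Hypothetical toy vocabularies, for illustration only.
CHAR_TO_ID = {ch: i + 1 for i, ch in enumerate("abcdefghijklmnopqrstuvwxyz'-")}
ID_TO_PHONEME = {0: "", 1: "K", 2: "AE", 3: "T"}


def encode(word: str) -> List[int]:
    """Map characters to IDs and right-pad to SEQ_LEN."""
    ids = [CHAR_TO_ID[ch] for ch in word.lower() if ch in CHAR_TO_ID]
    if len(ids) > SEQ_LEN:
        raise ValueError(f"input longer than {SEQ_LEN} characters")
    return ids + [PAD_ID] * (SEQ_LEN - len(ids))


def decode(logits: List[List[float]], n_chars: int) -> List[str]:
    """Greedy argmax per position over the first n_chars (non-padded) rows."""
    out = []
    for row in logits[:n_chars]:
        best = max(range(len(row)), key=row.__getitem__)
        symbol = ID_TO_PHONEME.get(best, "")
        if symbol:  # skip the empty (padding) symbol
            out.append(symbol)
    return out


ids = encode("cat")
# Fake logits standing in for the model output: one row per input position.
fake_logits = [[0.0, 5.0, 0.0, 0.0], [0.0, 0.0, 5.0, 0.0], [0.0, 0.0, 0.0, 5.0]]
fake_logits += [[5.0, 0.0, 0.0, 0.0]] * (SEQ_LEN - 3)
phonemes = decode(fake_logits, n_chars=3)
```

Because the graph is static, inputs longer than 96 characters have to be rejected or split by the caller; this sketch simply raises.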

How it was converted / why CPU

  • Stock official converter (litert_torch), static-shape graph. The dynamic-length export hits the same symbolic-sequence-length wall as the TTS model ("Shapes must be 1D sequences of concrete values…"), which is the C8 / dynamic-shape class. The workaround is a static [1, 96] graph with an in-graph padding mask, which converts cleanly and is numerically correct.
  • CPU-only. The attention's fused-QKV 5-D layout and the mask's EQUAL / SELECT_V2 ops keep the model off the GPU delegate; decomposing the attention to ≤ 4-D would clear that.
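To make the padding-mask idea above concrete, here is a minimal pure-Python sketch of masking padded positions before a softmax. It is not the model's actual graph: the padding ID of 0 and the function name `masked_softmax` are assumptions, and the comparison and selection steps only loosely correspond to the EQUAL and SELECT_V2 ops named above.

```python
import math
from typing import List

PAD_ID = 0  # assumption: 0 marks padded positions


def masked_softmax(scores: List[float], ids: List[int]) -> List[float]:
    """Softmax over one query's attention scores, ignoring padded key positions.

    Assumes at least one position is not padding.
    """
    is_pad = [i == PAD_ID for i in ids]                                # comparison step
    masked = [-math.inf if p else s for s, p in zip(scores, is_pad)]   # selection step
    peak = max(masked)                                                 # for numerical stability
    exps = [math.exp(m - peak) for m in masked]                        # exp(-inf) is 0.0
    total = sum(exps)
    return [e / total for e in exps]


# Two real characters followed by two padded slots with large raw scores.
weights = masked_softmax([1.0, 1.0, 9.0, 9.0], [5, 7, 0, 0])
```

The padded slots receive exactly zero weight regardless of their raw scores, which is why a fixed [1, 96] graph can give the same result as a variable-length one for short inputs.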

Training data

DeepPhonemizer en_us_cmudict_forward is trained on the CMU Pronouncing Dictionary (CMUdict), a public pronunciation lexicon of ~126k common English words paired with ARPABET pronunciations. It learns only the spelling-to-sound (grapheme→phoneme) mapping. This LiteRT artifact is a format conversion of the released checkpoint and introduces no additional training data.

PII

No personally identifiable information. CMUdict is a public dictionary of common English word pronunciations (no personal data); none is added during conversion.

Roadmap

  • Variable-length input, quantization, and (ideally) GPU support are gated on the dynamic-shape converter work (C8) and a ≤ 4-D attention re-author.

Status

Labeled preview — part of an on-device free-text Kokoro-82M LiteRT pipeline; a clean runnable Android sample is in progress.

License

MIT. Full attribution to DeepPhonemizer and the en_us_cmudict_forward checkpoint.
