PrimeTTS — tiny bilingual zh‑TW + English TTS (8 kHz, CPU)

A 4.63M‑parameter Mandarin (Taiwan) + English text‑to‑speech model that runs entirely on CPU and emits 8 kHz audio — sized for G.711 telephony and on‑device (Jetson‑class) use. One model, one young‑female voice: Chinese, English, and code‑mix through a single frontend (no language routing).

🔊 Live demo: https://huggingface.co/spaces/Luigi/PrimeTTS-vs-Inflect-Nano-v1 · 🧩 Base: owensong/Inflect-Nano-v1 (warm‑started fine‑tune, same frozen architecture)


Parameters	4.63M (3.47M acoustic + 1.17M vocoder)
Sample rate	8 kHz (telephony‑band)
Runtime	`onnxruntime`, CPU‑only, torch‑free at inference
Languages	zh‑TW (Traditional) + English + code‑mix, single voice
Voice	young female, Taiwan‑Mandarin accent
Architecture	FastSpeech‑style (no attention) + Snake‑HiFiGAN — frozen, no NAS
License	Apache‑2.0

Held‑out quality (eval_big, 36 unseen phone‑attendant sentences)

metric	this model	prior 8k release
zh‑CER (Breeze‑ASR‑25)	0.090	0.132
code‑mix CER	0.178	0.254
en‑WER (Whisper)	0.083	0.092
Taiwan‑accent gap¹	+0.088	+0.017
SQUIM PESQ	3.31	2.07
SQUIM MOS	4.24	3.30

¹ CER(generic ASR) − CER(Taiwan‑tuned Breeze‑ASR‑25) per zh clip; >0 ⇒ a Taiwan‑tuned recognizer understands it better ⇒ genuine Taiwan accent present.
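
The accent gap in the footnote is a per-clip difference of two character error rates. The following is a minimal sketch, assuming CER is the character-level Levenshtein distance divided by the reference length; `edit_distance`, `cer`, and `accent_gap` are illustrative names rather than functions from this repository, and the transcripts are made-up strings.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two character sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,               # drop a reference character
                           cur[j - 1] + 1,            # extra hypothesis character
                           prev[j - 1] + (r != h)))   # substitution or match
        prev = cur
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate of a hypothesis against its reference text."""
    return edit_distance(ref, hyp) / max(len(ref), 1)


def accent_gap(ref: str, hyp_generic: str, hyp_taiwan: str) -> float:
    """CER(generic ASR) minus CER(Taiwan-tuned ASR) for one clip."""
    return cer(ref, hyp_generic) - cer(ref, hyp_taiwan)


# Made-up transcripts: the generic recognizer gets one of four characters wrong.
print(accent_gap("歡迎使用", "歡迎是用", "歡迎使用"))  # 0.25
```

Averaging this value over the zh clips would give a figure comparable to the table row, provided both recognizers' outputs are normalized the same way (script, punctuation) before scoring.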

Quickstart (inference, CPU)

pip install onnxruntime numpy soundfile g2pw g2p_en cn2an
huggingface-cli download Luigi/PrimeTTS --local-dir PrimeTTS

# Python: run from inside the PrimeTTS dir (uses the bundled frontend + scripts)
import sys; sys.path.insert(0, "scripts")
import json, numpy as np, onnxruntime as ort, soundfile as sf
import frontend_bopomofo as F
from synth_from_text import host_regulate          # numpy length‑regulator

with open("meta.json", encoding="utf-8") as f:     # sample rate, frame bins, max frames
    meta = json.load(f)
enc = ort.InferenceSession("acoustic_encoder.onnx", providers=["CPUExecutionProvider"])
dec = ort.InferenceSession("acoustic_decoder.onnx", providers=["CPUExecutionProvider"])
voc = ort.InferenceSession("vocoder.onnx",          providers=["CPUExecutionProvider"])

o = F.text_to_ids("您好,歡迎使用 PrimeTTS。Thank you for calling.")   # text -> phone/tone/lang ids
ph, tn, lg = (np.array([o[k]], np.int64) for k in ("phone_ids", "tone_ids", "lang_ids"))
cond, dur, pitch = enc.run(None, {"phone": ph, "tone": tn,
                                  "lang": lg, "speaker": np.zeros(1, np.int64)})
reg = host_regulate(cond, dur, pitch, meta["abs_frame_bins"], meta["max_frames"])
mel = dec.run(None, {k: reg[k] for k in
      ["frames","frame_meta","local_ctx_raw","abs_pos","pitch_frame","frame_mask"]})[0]
wav = voc.run(None, {"mel": mel.astype(np.float32)})[0].reshape(-1)
sf.write("out.wav", wav, meta["sample_rate"])

The whole pipeline (acoustic_encoder.onnx → numpy length‑regulator → acoustic_decoder.onnx → vocoder.onnx) is torch‑free and runs as‑is on a Jetson Nano CPU. See scripts/synth_from_text.py for the full runtime.
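
The length-regulation step between encoder and decoder can be illustrated on its own. This is a generic FastSpeech-style sketch, not the bundled `host_regulate` (which also prepares pitch, position, and context inputs); the function name `length_regulate` and the array shapes are assumptions made for the example.

```python
import numpy as np


def length_regulate(cond, durations, max_frames):
    """Expand per-phone vectors to per-frame vectors.

    cond:      (n_phones, dim) encoder output, one row per phone
    durations: (n_phones,) predicted frames per phone
    Returns a zero-padded (max_frames, dim) array and a boolean frame mask.
    """
    counts = np.maximum(np.rint(durations).astype(np.int64), 0)
    frames = np.repeat(cond, counts, axis=0)[:max_frames]   # repeat row i counts[i] times
    padded = np.zeros((max_frames, cond.shape[1]), dtype=cond.dtype)
    padded[:len(frames)] = frames
    mask = np.zeros(max_frames, dtype=bool)
    mask[:len(frames)] = True
    return padded, mask


cond = np.array([[1.0, 0.0], [0.0, 1.0]], dtype=np.float32)
padded, mask = length_regulate(cond, np.array([2.0, 3.0]), max_frames=8)
print(int(mask.sum()))  # 5
```

With the snake_8k settings (hop 128 at 8 kHz) each frame covers 16 ms, so the five frames above would correspond to 80 ms of audio.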

Training data

Everything is distilled from a single teacher voice so zh / en / code‑mix share one timbre and accent.

Reference voice — a young Taiwan‑female speaker from Mozilla Common Voice zh‑TW, released CC0 / public domain (commercial‑use and voice‑cloning clear). ~13 s assembled from that one speaker's cleanest validated clips. This fixes the accent (Taiwan Mandarin comes from the reference, not from prompting) and keeps the model commercially shippable — no proprietary/voice‑likeness encumbrance.
Teacher — VoxCPM2 (openbmb/VoxCPM2) voice‑clones that one reference for every line, giving a consistent young‑female voice across all three languages (48 kHz, resampled to 8 kHz for training).
Text — Taiwan office / phone‑attendant register: diverse Mandarin, general + domain English, and frame‑bank code‑mix (gen_codemix.py) with English in varied positions.
ASR quality gate — every clip is transcribed and kept only if it matches its text, using a Taiwan‑tuned recognizer so the gate never penalizes the accent we want:
- zh & code‑mix → Breeze‑ASR‑25 Han‑level CER (≤ 0.12 zh, ≤ 0.15 mix)
- English → Whisper‑medium WER (≤ 0.20)
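
The gate rule above reduces to a per-split threshold check. A minimal sketch, assuming each manifest row already carries its error rate; the field names `split` and `error_rate` are hypothetical and may not match what asr_filter.py actually writes.

```python
# Thresholds as stated above: CER for zh and code-mix, WER for English.
THRESHOLDS = {"zh": 0.12, "mix": 0.15, "en": 0.20}


def passes_gate(split: str, error_rate: float) -> bool:
    """Keep a clip only if its ASR error rate is within the split's limit."""
    return error_rate <= THRESHOLDS[split]


def filter_manifest(rows):
    """Return the rows that survive the gate (field names are assumptions)."""
    return [r for r in rows if passes_gate(r["split"], r["error_rate"])]


rows = [
    {"id": "a", "split": "zh", "error_rate": 0.10},
    {"id": "b", "split": "zh", "error_rate": 0.14},   # above the 0.12 zh limit
    {"id": "c", "split": "mix", "error_rate": 0.14},  # within the 0.15 mix limit
]
print([r["id"] for r in filter_manifest(rows)])  # ['a', 'c']
```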

split	clean clips	dropped by gate
pure Chinese	2,077	16.9%
code‑mix (zh+en)	2,571	8.2%
pure English	1,975	1.2%
total	6,623	9.3%
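
As a consistency check on the table, the overall drop rate can be recomputed from the per-split rows. This sketch assumes each percentage is the share of generated clips that the gate rejected, so the pre-gate count is back-computed as kept / (1 − drop).

```python
kept = {"zh": 2077, "mix": 2571, "en": 1975}
drop = {"zh": 0.169, "mix": 0.082, "en": 0.012}

# Back-compute how many clips were generated before the gate (approximate,
# because the published percentages are rounded).
generated = {k: kept[k] / (1 - drop[k]) for k in kept}

total_kept = sum(kept.values())
total_generated = sum(generated.values())
overall_drop = 1 - total_kept / total_generated

print(total_kept)                     # 6623
print(round(overall_drop * 100, 1))   # 9.3
```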

How it was trained — the levers

Inflect‑Nano‑v1's 4.63M‑parameter architecture is not capacity‑limited for this task. Quality came from four levers, all of which keep the architecture frozen (no NAS, no parameter changes):

Phone‑level alignment (align_durations_v4.py): true per‑phone durations (espeak phoneme‑CTC + torchaudio.forced_align) instead of crude char/letter CTC. Skipping this step is what makes tiny TTS models garble.
Vocabulary coverage + diverse code‑mix: broad character coverage and a code‑mix frame bank (varied syntax, English in varied positions) so the model does not overfit to a few templates.
Teacher choice: the English a tiny model learns is only as native as the teacher's. A Taiwan‑biased teacher gave flat, accented English; VoxCPM2 gives clean, natural zh and en in one voice.
Warm‑start from Inflect‑Nano‑v1: the acoustic model is initialized from the English‑native v1 checkpoint (199/199 tensors copied, 0 skipped, because the bilingual symbol table already matches), so v1's English transfers directly; the corpus then teaches Taiwan Mandarin on top.
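
The "tensors copied / skipped" count in the warm-start lever follows from a name-and-shape match between two checkpoints. Here is a minimal sketch of that idea, using NumPy arrays as stand-ins for checkpoint tensors so it runs without PyTorch; `warm_start` is an illustrative name and this is not the trainer's actual loading code.

```python
import numpy as np


def warm_start(target, source):
    """Copy every source tensor whose name and shape match the target.

    target, source: dicts mapping parameter names to arrays.
    Returns (number copied, list of skipped names).
    """
    copied, skipped = 0, []
    for name, value in source.items():
        if name in target and target[name].shape == value.shape:
            target[name] = value.copy()
            copied += 1
        else:
            skipped.append(name)   # missing in target or shape mismatch
    return copied, skipped


target = {"emb": np.zeros((2, 3)), "proj": np.zeros(4)}
source = {"emb": np.ones((2, 3)), "proj": np.ones(5), "extra": np.ones(1)}
print(warm_start(target, source))  # (1, ['proj', 'extra'])
```

A mismatched symbol table would show up here as a skipped embedding tensor, which is why a full copy with nothing skipped indicates the two models share the same vocabulary layout.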

A 2D mel‑GAN discriminator (training‑only; the ONNX graph is unchanged) sharpens the mel after a 25k‑step pure‑reconstruction warmup, lifting PESQ/MOS. The shipped checkpoint is the 35k step: the held‑out sweet spot, before GAN over‑sharpening starts to cost held‑out intelligibility.

Architecture

Acoustic — MicroFastSpeech (~3.47M): depthwise Conv‑FFN, no attention, external durations + length regulator, frame‑pitch, BiGRU, postnet.
Vocoder — Snake‑HiFiGAN (~1.17M), 8 kHz variant snake_8k (sr 8000, n_fft 512, hop 128, 80 mels), retrained on the teacher corpus.
Frontend — g2pw (Taiwan bopomofo + polyphone disambiguation) + g2p_en (arpabet), merged into one phone sequence with per‑phone language ids → handles zh, en, and code‑mix in a single pass.
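
The single-pass handling of code-mix can be pictured as splitting the text into language runs and tagging each run before grapheme-to-phoneme conversion. The sketch below is a simplified illustration only, not the bundled frontend_bopomofo; the language id values, the `split_runs` name, and the Unicode range used to detect Chinese are assumptions.

```python
import re

ZH, EN = 0, 1  # hypothetical language ids

# One run is either a stretch of CJK ideographs or a Latin-letter word.
_RUN = re.compile(r"[\u4e00-\u9fff]+|[A-Za-z][A-Za-z']*")


def split_runs(text):
    """Return (run, lang_id) pairs in reading order; punctuation is dropped."""
    runs = []
    for match in _RUN.finditer(text):
        token = match.group()
        lang = ZH if "\u4e00" <= token[0] <= "\u9fff" else EN
        runs.append((token, lang))
    return runs


print(split_runs("歡迎使用 PrimeTTS"))  # [('歡迎使用', 0), ('PrimeTTS', 1)]
```

In a full frontend each Chinese run would go through g2pw and each English run through g2p_en, with every resulting phone inheriting its run's language id.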

Reproduce / fine‑tune your own

Pipeline: teacher corpus → ASR gate → align → train vocoder → warm‑start + train acoustic → export. Repo layout:

acoustic_encoder.onnx  acoustic_decoder.onnx  vocoder.onnx  meta.json  symbol_table.json   ← deployable weights
acoustic_tw_8k_35k.pt                                                                       ← shipped checkpoint
scripts/        frontend, aligner, corpus‑gen, diverse‑text, train, export, eval
inflect_nano/   the trainer (acoustic.py + vocoder.py), forked from Inflect‑Nano‑v1 (LICENSE included)

Prerequisites: Python 3.12, a GPU for training; pip install torch torchaudio transformers onnxruntime soundfile librosa g2pw g2p_en cn2an opencc faster-whisper edge-tts.

1 · Teacher corpus (one cloned voice)

# make a Taiwan‑female reference, then VoxCPM2‑clone every line in that voice
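edge-tts --voice zh-TW-HsiaoChenNeural --text "<ref sentence>" --write-media ref.mp3
ffmpeg -y -i ref.mp3 -ar 24000 -ac 1 ref.wav ; printf '%s' "<ref sentence>" > ref.txt   # wav + transcript for --ref / --ref-text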
python gen_voxcpm_corpus.py --texts texts.jsonl --ref ref.wav --ref-text ref.txt \
       --out-dir corpus --manifest manifest.jsonl

2 · ASR quality gate (Taiwan‑tuned)

python asr_filter.py --manifest manifest.jsonl --out manifest \
       --device cuda            # Breeze‑ASR‑25 (zh/mix) + Whisper‑medium (en) → manifest.clean.jsonl

3 · Phone‑level alignment ⭐ the key step

python scripts/align_durations_v4.py --manifest manifest.clean.jsonl --out align.jsonl
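
What the aligner ultimately has to produce is an integer frame count per phone. The sketch below shows one way to convert phone end times into mel-frame durations; it is not taken from align_durations_v4.py. The hop (128) and sample rate (8000) come from the snake_8k vocoder settings, while the function name and the input format (cumulative end times in seconds) are assumptions.

```python
def end_times_to_frames(end_times_s, hop=128, sample_rate=8000):
    """Convert cumulative phone end times (seconds) to integer frame durations.

    Rounding each cumulative boundary, rather than each duration, keeps the
    rounding error from accumulating across a long utterance.
    """
    frames_per_second = sample_rate / hop   # 62.5 frames per second
    durations, prev = [], 0
    for t in end_times_s:
        boundary = round(t * frames_per_second)
        durations.append(max(boundary - prev, 0))
        prev = max(boundary, prev)
    return durations


# Three phones ending at 0.10 s, 0.25 s and 0.40 s.
print(end_times_to_frames([0.10, 0.25, 0.40]))  # [6, 10, 9]
```

The durations sum to 25 frames, which matches 0.40 s at 62.5 frames per second.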

4 · Train the 8 kHz vocoder

PYTHONPATH=. python -m inflect_nano.vocoder --train-jsonl voc_rows.jsonl \
  --out-dir vocoder_8k --variant snake_8k --steps 40000 --stft-weight 2.0

5 · Warm‑start + train the acoustic (GAN recipe)

PYTHONPATH=. python -m inflect_nano.acoustic --durations-jsonl align.jsonl \
  --out-dir acoustic_8k --vocoder-variant snake_8k --sample-rate 8000 \
  --vocoder-checkpoint vocoder_8k/hifigan-snake_8k-final.pt --vocoder-mel-weight 1.0 \
  --init-checkpoint inflect_nano_v1_acoustic.pt \
  --mel-gan-weight 0.1 --gan-2d --gan-fm-auto --gan-r1-gamma 1.0 --gan-crop 128 --gan-warmup-steps 25000 \
  --steps 60000 --batch-size 16 --en-upsample 1

6 · Export to ONNX + evaluate

python scripts/export_8k.py --acoustic-ckpt acoustic_8k/…pt --vocoder-ckpt vocoder_8k/…pt --out-dir onnx/
python scripts/synth_from_text.py --onnx-dir onnx --out-dir syn --texts eval.jsonl
python scripts/assess_big.py --synth-dir syn        # offline CER/WER

Evaluate on ≥30 held‑out sentences — small eval sets are too noisy to trust. Sweep checkpoints and pick the held‑out sweet spot (the GAN keeps improving train‑set sharpness past the held‑out optimum).
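
The checkpoint sweep amounts to taking the step with the lowest held-out error. A minimal sketch, assuming the per-checkpoint scores have already been collected into a dict; `pick_checkpoint` is an illustrative name and the numbers below are placeholders, not measurements from this model.

```python
def pick_checkpoint(held_out_cer):
    """Return the training step with the lowest held-out CER.

    held_out_cer: dict mapping step -> CER on the held-out set.
    Ties go to the earlier step.
    """
    return min(sorted(held_out_cer), key=lambda step: held_out_cer[step])


# Placeholder scores for illustration only.
scores = {30000: 0.30, 35000: 0.20, 40000: 0.25}
print(pick_checkpoint(scores))  # 35000
```

Selecting on the held-out set rather than the training set matters here because, as noted above, training-set sharpness keeps improving after held-out quality has peaked.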

Train on your OWN voice — one command

Swap the reference voice; everything else (text pools, ASR gate, alignment, recipe) is fixed. Both vocoder and acoustic are retrained (both are voice-specific). Text pools + eval sets are bundled in data/ and at the repo root, so it reproduces exactly.

# 0. one venv with the deps (see prereqs in scripts/rebuild_voice.sh), PYTHONPATH=repo root,
#    and inflect_nano_v1_acoustic.pt from owensong/Inflect-Nano-v1 for the warm-start.
huggingface-cli download Luigi/PrimeTTS --local-dir PrimeTTS && cd PrimeTTS
cp data/*.jsonl data/*.txt .            # text pools at root

# 1. a ~10 s clip of your voice. For a commercial-clear reference, use a CC0 source such as
#    Mozilla Common Voice zh-TW (the shipped model uses a young-female Common Voice speaker). Or synth one:
edge-tts --voice zh-TW-HsiaoYuNeural --text "您好,歡迎來電。Thank you for calling." --write-media ref.mp3
ffmpeg -y -i ref.mp3 -ar 24000 -ac 1 ref.wav ; printf '%s' "您好,歡迎來電。Thank you for calling." > ref.txt

# 2. ONE command -> corpus -> gate -> align -> vocoder -> acoustic -> export
PY=/path/to/venv/bin/python ./scripts/rebuild_voice.sh ref.wav ref.txt myvoice
# -> pick best corpus_myvoice/onnx_<K>/  (~35k is the usual held-out sweet spot)

Time on dual RTX 5090: ≈ 9 h end-to-end (~6.5 h to a shippable 35k checkpoint). Breakdown: synth ~2 h, gate+align ~25 min, then vocoder (~3 h) ∥ acoustic (~4–7 h) in parallel, export ~15 min.

Credits & licenses

Base model / trainer: owensong/Inflect-Nano-v1 (Apache‑2.0; see inflect_nano/LICENSE.inflect-nano)
Teacher TTS: openbmb/VoxCPM2 · Reference voice: Mozilla Common Voice zh‑TW (CC0 / public domain)
Gate ASR: Breeze-ASR-25 (MediaTek Research, Taiwan Mandarin + code‑switch) · OpenAI Whisper‑medium
Aligner: facebook/wav2vec2-lv-60-espeak-cv-ft + torchaudio.forced_align
Frontend: g2pw (Taiwan readings) + g2p_en · Eval ASR: sherpa‑onnx X‑ASR (zh‑en Zipformer)

This repository: Apache‑2.0.
