Khmer Telephony ASR — v9 allsynth-lsall (epoch 7)

Streaming FastConformer–RNNT (BPE) acoustic model for Khmer call-center and telephony speech (8 kHz G.711 telephony audio, fed to the model at 16 kHz). This release fine-tunes on a large synthetic Khmer corpus combined with the full set of manually annotated telephony utterances, which are heavily oversampled so that the telephony domain dominates each epoch.

In-domain performance

Evaluated on the full ~7,600-utterance telephony set.

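Metric                       Value
CER (no-space)               4.24%
CER (raw, spaces counted)    4.36%
Per-utterance CER < 10%      6 743 utterances (88.6%)
Per-utterance CER 10–30%     466 utterances (6.1%)
Per-utterance CER 30–50%     135 utterances (1.8%)
Per-utterance CER 50–80%     152 utterances (2.0%)
Per-utterance CER ≥ 80%      114 utterances (1.5%)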

Note: this run trained on the full telephony set, so the figures above overlap the training data and are not a held-out generalization estimate. They characterize in-domain fit, not out-of-sample accuracy.

WER on Khmer is a poor metric because the language has no whitespace word boundaries — CER is the meaningful number.
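The two CER variants and the per-utterance buckets reported above can be illustrated with a minimal sketch. It assumes CER is computed over Unicode code points with a plain Levenshtein distance, that the no-space variant simply removes whitespace from both reference and hypothesis, and that the headline figure is aggregated at corpus level (total edits over total reference characters); the card does not state these details, and the function names are illustrative only.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance: insertions, deletions, substitutions all cost 1."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # delete a reference character
                curr[j - 1] + 1,         # insert a hypothesis character
                prev[j - 1] + (r != h),  # substitute, or match at cost 0
            ))
        prev = curr
    return prev[-1]


def strip_ws(text):
    """Remove all whitespace (assumed definition of the 'no-space' variant)."""
    return "".join(text.split())


def cer(ref, hyp, no_space=False):
    """Per-utterance character error rate as a fraction of reference length."""
    if no_space:
        ref, hyp = strip_ws(ref), strip_ws(hyp)
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def corpus_cer(pairs, no_space=False):
    """Corpus-level CER: total edits divided by total reference characters."""
    edits = chars = 0
    for ref, hyp in pairs:
        if no_space:
            ref, hyp = strip_ws(ref), strip_ws(hyp)
        edits += edit_distance(ref, hyp)
        chars += len(ref)
    return edits / chars


def bucket(value):
    """Map a per-utterance CER to the bucket labels used in the table above."""
    for upper, label in [(0.10, "< 10%"), (0.30, "10–30%"),
                         (0.50, "30–50%"), (0.80, "50–80%")]:
        if value < upper:
            return label
    return "≥ 80%"


# Latin placeholders stand in for Khmer text; only a space differs.
print(cer("abc def", "abcdef"))                 # raw: the missing space counts
print(cer("abc def", "abcdef", no_space=True))  # no-space: spacing is ignored
```

Because spaces in Khmer mark phrase breaks rather than words, a hypothesis that differs from the reference only in spacing is penalized by the raw variant and not by the no-space variant, which is consistent with the raw figure being slightly higher.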

Model details

  • Architecture: FastConformer encoder (17 layers, d_model 512, 8 heads, causal downsampling, chunked-limited attention [[70,13],[70,6],[70,1],[70,0]] for streaming) + RNN-T decoder + BPE/Unigram tokenizer (vocab 2 048).
  • Init: from an in-house Khmer FastConformer-RNNT base checkpoint.
  • Fine-tuning data: a large synthetic Khmer telephony corpus mixed with the full set of ~4 000 manually annotated short telephony utterances, with the telephony portion oversampled (×15) so that it forms a large share of each epoch.
  • Augmentation (training only):
    • G.711 transcode (prob 0.5)
    • speed perturbation 0.9×–1.1× (prob 0.3)
    • environmental noise mix, SNR 15–25 dB (prob 0.5)
    • gain −3/+6 dBFS (prob 0.5)
  • Epoch: 7 (val_wer 0.1840 — see note on WER above).
  • Optimizer: AdamW with a Noam annealing schedule (peak LR 0.5, warmup 200 steps), weight decay 1e-3.
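The augmentation settings listed above can be sketched in plain Python to show how the probabilities and ranges combine. This is not the training pipeline. It assumes that "gain −3/+6" means a uniformly sampled gain between −3 and +6 dB, that the G.711 step can be approximated by a μ-law compress, quantize, expand round trip at the working sample rate (real G.711 uses a segmented 8-bit code at 8 kHz and may be A-law), and that the stages run in the order shown. Speed perturbation is left out because it needs a resampler. All function names are hypothetical.

```python
import math
import random

MU = 255  # mu-law companding constant


def mu_law_roundtrip(samples):
    """Approximate G.711-style distortion: compress, quantize, expand."""
    out = []
    for x in samples:
        x = max(-1.0, min(1.0, x))
        y = math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)
        y = round(y * 127) / 127  # coarse (roughly 8-bit) quantization
        out.append(math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y))
    return out


def apply_gain(samples, gain_db):
    """Scale by a gain in dB and clip to the valid range [-1, 1]."""
    g = 10 ** (gain_db / 20)
    return [max(-1.0, min(1.0, s * g)) for s in samples]


def mix_noise(speech, noise, snr_db):
    """Add noise scaled so the speech-to-noise power ratio equals snr_db."""
    noise = [noise[i % len(noise)] for i in range(len(speech))]  # tile or trim
    p_speech = sum(s * s for s in speech) / len(speech)
    p_noise = sum(n * n for n in noise) / len(noise)
    if p_speech == 0 or p_noise == 0:
        return list(speech)
    scale = math.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return [s + scale * n for s, n in zip(speech, noise)]


def augment(speech, noise, rng):
    """Apply each stage independently with the probability from the list above."""
    out = list(speech)
    if rng.random() < 0.5:  # environmental noise, SNR 15 to 25 dB
        out = mix_noise(out, noise, rng.uniform(15, 25))
    if rng.random() < 0.5:  # gain, assumed uniform in [-3, +6] dB
        out = apply_gain(out, rng.uniform(-3, 6))
    if rng.random() < 0.5:  # G.711-like companding (assumed to run last)
        out = mu_law_roundtrip(out)
    return [max(-1.0, min(1.0, s)) for s in out]


rng = random.Random(0)
speech = [0.3 * math.sin(0.05 * i) for i in range(1600)]  # toy signal
noise = [rng.uniform(-1, 1) for _ in range(1600)]         # toy noise
augmented = augment(speech, noise, rng)
print(len(augmented))
```

Since each stage fires independently, an utterance can pass through any combination of them, including none.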

Files

Path                                                     What it is
asr-khmer-telephony.nemo / epoch7.nemo                   Canonical NeMo bundle (encoder + decoder + joint + tokenizer + config). Load via EncDecRNNTBPEModel.restore_from(...).
checkpoints/*.ckpt                                       Original Lightning checkpoint (~1.4 GB) for resume / surgery.
conf/khmer_finetune_allsynth_lsall.yaml                  Full training config (architecture + augmentation + optim).
tokenizer/{tokenizer.model,tokenizer.vocab,vocab.txt}    SentencePiece tokenizer (also embedded in .nemo).

Usage

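from nemo.collections.asr.models import EncDecRNNTBPEModel

# Load the bundled model (encoder, decoder, joint network, tokenizer, config).
model = EncDecRNNTBPEModel.restore_from("asr-khmer-telephony.nemo")

# transcribe() takes a list of audio file paths.
# The return type (plain strings or hypothesis objects) depends on the NeMo version.
text = model.transcribe(["call_segment_8k.wav"])
print(text)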

Input audio: 16 kHz mono PCM (8 kHz telephony should be upsampled to 16 kHz — the training augmentation simulates the G.711 distortion at 16 kHz).
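Before transcribing, it can help to confirm that a file actually matches the expected input format. The sketch below uses only the standard-library `wave` module; `check_wav` is a hypothetical helper, not part of NeMo or of this release. It reports mismatches and does not resample.

```python
import wave


def check_wav(path, expected_rate=16000):
    """Return a list of format problems; an empty list means the file matches.

    wave.open raises wave.Error for WAV files that are not linear PCM
    (for example, files that store mu-law samples directly).
    """
    problems = []
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
    if rate != expected_rate:
        problems.append(f"sample rate is {rate} Hz, expected {expected_rate} Hz")
    if channels != 1:
        problems.append(f"{channels} channels found, expected mono")
    return problems
```

A file recorded at 8 kHz would be reported with a sample-rate problem and should be upsampled to 16 kHz with an audio tool of your choice before being passed to the model.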

Notes

  • Streaming chunk sizes are encoded in att_context_size; 4 latency modes are bundled in the same model.
  • This model is domain-specialized for telephony; out-of-domain read-speech accuracy is not representative.
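To relate the four att_context_size entries to latency, the sketch below converts the right-context (look-ahead) frame count into milliseconds. It assumes a 10 ms feature frame shift and 8× encoder subsampling, which are typical FastConformer settings but are not stated in this card; check conf/khmer_finetune_allsynth_lsall.yaml for the actual values.

```python
# Attention contexts from the model details: [left frames, right frames].
ATT_CONTEXT_SIZES = [[70, 13], [70, 6], [70, 1], [70, 0]]

FRAME_SHIFT_MS = 10  # assumption: feature window stride of 10 ms
SUBSAMPLING = 8      # assumption: 8x downsampling in the encoder


def lookahead_ms(att_context, frame_shift_ms=FRAME_SHIFT_MS,
                 subsampling=SUBSAMPLING):
    """Look-ahead in milliseconds implied by the right-context frame count."""
    _left, right = att_context
    return right * frame_shift_ms * subsampling


for ctx in ATT_CONTEXT_SIZES:
    print(ctx, "->", lookahead_ms(ctx), "ms look-ahead")
```

Under these assumptions the four modes correspond to 1040, 480, 80, and 0 ms of look-ahead. In NeMo, multi-lookahead models usually let you select a mode with `model.encoder.set_default_att_context_size([70, 13])` before calling transcribe; verify that this method is available in your NeMo version.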

License

Apache-2.0 (model weights). Training data is internal.
