FastConformer-RNNT-BPE-Streaming-Khmer

Cache-aware streaming FastConformer RNN-T model for Khmer ASR, trained with NVIDIA NeMo. This is the v9b checkpoint.

Architecture: FastConformer encoder + RNN-T decoder & joint (no CTC head)
Tokenizer: SentencePiece unigram BPE, vocab size 4096, trained on the v9 Khmer no-space training manifest filtered to 0.1–10.0 s
Sample rate: 16 kHz
Streaming: cache-aware, multi-latency
Best checkpoint: val_wer = 0.1616 @ epoch 108 (no-space internal val set)

Files

FastConformer-RNNT-BPE-Streaming-Khmer.nemo / epoch108.nemo — single-file NeMo bundle (weights + tokenizer + config). Identical contents; the unversioned name is the canonical entry point. Load with NeMo.
checkpoints/FastConformer-RNNT-BPE-Streaming-Khmer-v9b--val_wer=0.1616-epoch=108.ckpt — Lightning checkpoint, useful for resuming training.
conf/khmer_finetune_v9b.yaml — training config for v9b.
tokenizer/ — standalone copy of the SentencePiece BPE tokenizer (tokenizer.model, tokenizer.vocab, vocab.txt).

Usage

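import nemo.collections.asr as nemo_asr

# Fetch the .nemo bundle from the Hugging Face Hub and restore the model.
model = nemo_asr.models.EncDecRNNTBPEModel.from_pretrained(
    "actableai/fastconformer-rnnt-bpe-streaming-khmer"
)
# Offline (non-streaming) transcription of a list of audio files.
transcripts = model.transcribe(["audio.wav"])
# The return type depends on the installed NeMo version.
print(transcripts)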

For cache-aware streaming inference, see NeMo's examples/asr/asr_cache_aware_streaming/ scripts.
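
A quick check such as the following can flag files that differ from the 16 kHz sample rate listed above before they are passed to transcribe. This is a minimal sketch using only the Python standard library: check_wav is a hypothetical helper, not part of NeMo, and the mono requirement is an assumption rather than something this card states. It only reads PCM WAV headers and does not resample.

```python
import wave

EXPECTED_RATE = 16_000  # sample rate stated on this card


def check_wav(path: str) -> list[str]:
    """Return a list of problems found in a PCM WAV file (empty if none)."""
    problems = []
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
    if rate != EXPECTED_RATE:
        problems.append(f"sample rate is {rate} Hz, expected {EXPECTED_RATE} Hz")
    # Assumption: the model is fed single-channel audio.
    if channels != 1:
        problems.append(f"{channels} channels, expected mono")
    return problems
```

For example, check_wav("audio.wav") returns an empty list when the file is 16 kHz mono, and a list of human-readable messages otherwise.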

Training

Init: encoder weights warm-started from the v9 .nemo; decoder + joint reinitialized because the tokenizer changed (vocab 4096 BPE).
Hardware: 4 × A100 40 GB, bf16.
Loss: RNN-T (warp-rnnt, FastEmit λ = 0.002).
Data window: 0.1–10.0 s, sampled from in-house Khmer manifests with whitespace removed from transcripts.
Augmentation pipeline (each step has its own probability): speed perturbation (0.77–1.00) → noise injection (env, 5–20 dB SNR) → babble injection (Khmer, 0.4–1.5 s segments) → G.711 transcode → gain (−20 to 0 dBFS).
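
The data window and whitespace removal described above could be applied to a NeMo-style JSON-lines manifest roughly as follows. This is a sketch, not the preprocessing actually used for v9b: filter_manifest is a hypothetical helper, the inclusive bounds are an assumption, and the field names (audio_filepath, duration, text) follow the usual NeMo manifest convention.

```python
import json

MIN_DUR, MAX_DUR = 0.1, 10.0  # data window from this card, in seconds


def filter_manifest(lines):
    """Yield manifest entries inside the duration window, with whitespace removed from the text."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        entry = json.loads(line)
        # Assumption: both ends of the window are inclusive.
        if not (MIN_DUR <= entry["duration"] <= MAX_DUR):
            continue
        # str.split() with no arguments splits on spaces, tabs and newlines.
        # It does NOT remove the zero-width space (U+200B) common in Khmer text;
        # whether the original pipeline removed it is not stated on this card.
        entry["text"] = "".join(entry["text"].split())
        yield entry


sample = [
    '{"audio_filepath": "a.wav", "duration": 3.2, "text": "ab cd  ef"}',
    '{"audio_filepath": "b.wav", "duration": 12.5, "text": "too long"}',
]
kept = list(filter_manifest(sample))
```

Here kept contains only the first entry, with its text collapsed to "abcdef"; the second entry falls outside the 0.1–10.0 s window.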

Evaluation

Internal validation (no-space transcripts), top epochs:

epoch	val_wer
108	0.1616 (selected)
112	0.1633
110	0.1644
106	0.1669

External test sets (numbers from epoch 106, the closest epoch with a clean external eval):

set	n	CER (raw)	CER (no-space)	WER
FLEURS test (Khmer)	771	0.296	0.274	0.999
Label Studio Metfone (in-house calls)	1820	0.394	0.372	0.839

The model is trained on transcripts without spaces, so its output contains no word boundaries and word-level WER against space-segmented references (FLEURS) is not meaningful. Use CER (no-space) as the primary external metric.
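
For readers who want to score their own outputs the same way, a no-space CER can be sketched as below. This is an illustrative implementation, not the evaluation script behind the numbers in the table: it assumes errors are pooled over the whole test set (corpus-level CER) and counts characters as Unicode code points, so Khmer combining marks are counted individually.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer_no_space(refs, hyps) -> float:
    """Corpus-level CER after removing whitespace from references and hypotheses."""
    errors = total = 0
    for ref, hyp in zip(refs, hyps):
        ref = "".join(ref.split())
        hyp = "".join(hyp.split())
        errors += edit_distance(ref, hyp)
        total += len(ref)
    return errors / total if total else 0.0
```

With this definition, a reference "a b c" and a hypothesis "abc" score 0.0, because spacing differences are ignored before the edit distance is computed.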

Limitations

Trained on a mix of in-house Khmer call data and synthetic speech. Accuracy can drop on out-of-domain audio, code-switched speech, and noise profiles that differ from call audio.
Output is whitespace-free; downstream consumers that need word-level segmentation must add a separate segmenter.

License

Released under CC-BY-4.0.
