FastConformer-RNNT-BPE-Streaming-Khmer

Cache-aware streaming FastConformer RNN-T model for Khmer ASR, trained with NVIDIA NeMo. This is the v9b checkpoint.

  • Architecture: FastConformer encoder + RNN-T decoder & joint (no CTC head)
  • Tokenizer: SentencePiece unigram BPE, vocab size 4096, trained on the v9 Khmer no-space training manifest filtered to 0.1–10.0 s
  • Sample rate: 16 kHz
  • Streaming: cache-aware, multi-latency
  • Best checkpoint: val_wer = 0.1616 @ epoch 108 (no-space internal val set)
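
The tokenizer text above is described as a no-space manifest filtered to 0.1–10.0 s. A minimal sketch of that kind of preprocessing follows, assuming the usual NeMo JSON-lines manifest layout (one object per line with `audio_filepath`, `duration`, `text`). The helper name `filter_manifest_lines` and the choice to also drop zero-width spaces (U+200B) are illustrative assumptions, not the project's actual script.

```python
import json

MIN_DUR, MAX_DUR = 0.1, 10.0  # seconds; the duration window stated in this card


def filter_manifest_lines(lines, min_dur=MIN_DUR, max_dur=MAX_DUR):
    """Keep entries inside the duration window and remove whitespace from text.

    Hypothetical helper: assumes each line is a JSON object with `duration`
    (seconds) and `text` fields, as in NeMo-style manifests.
    """
    kept = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        entry = json.loads(line)
        if not (min_dur <= entry["duration"] <= max_dur):
            continue  # outside the 0.1-10.0 s window
        # str.split() removes Unicode whitespace; U+200B is not classed as
        # whitespace, so it is removed explicitly (an assumption of this sketch).
        entry["text"] = "".join(entry["text"].split()).replace("\u200b", "")
        kept.append(entry)
    return kept
```

The filtered `text` fields could then be written one per line and passed to a SentencePiece trainer; that step is omitted here because the trainer settings are not given in this card.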

Files

  • FastConformer-RNNT-BPE-Streaming-Khmer.nemo / epoch108.nemo: single-file NeMo bundle (weights + tokenizer + config). Identical contents; the unversioned name is the canonical entry point. Load with NeMo.
  • checkpoints/FastConformer-RNNT-BPE-Streaming-Khmer-v9b--val_wer=0.1616-epoch=108.ckpt: Lightning checkpoint, useful for resuming training.
  • conf/khmer_finetune_v9b.yaml: training config for v9b.
  • tokenizer/: standalone copy of the SentencePiece BPE tokenizer (tokenizer.model, tokenizer.vocab, vocab.txt).

Usage

import nemo.collections.asr as nemo_asr

model = nemo_asr.models.EncDecRNNTBPEModel.from_pretrained(
    "actableai/fastconformer-rnnt-bpe-streaming-khmer"
)
transcripts = model.transcribe(["audio.wav"])
print(transcripts)

For cache-aware streaming inference, see NeMo's examples/asr/asr_cache_aware_streaming/ scripts.

Training

  • Init: encoder weights warm-started from the v9 .nemo; decoder + joint reinitialized because the tokenizer changed (vocab 4096 BPE).
  • Hardware: 4 × A100 40 GB, bf16.
  • Loss: RNN-T (warp-rnnt, FastEmit λ = 0.002).
  • Data window: 0.1–10.0 s, sampled from in-house Khmer manifests with whitespace removed from transcripts.
  • Augmentation pipeline (each step has its own probability): speed perturbation (0.77–1.00) → noise injection (env, 5–20 dB SNR) → babble injection (Khmer, 0.4–1.5 s segments) → G.711 transcode → gain (−20 to 0 dBFS).
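
To illustrate how a chain of independently gated augmentation steps can work, here is a minimal pure-Python sketch covering only two of the steps listed above: noise mixing at a target SNR and a random gain. The per-step probabilities (0.5) are placeholders because this card does not state them, the gain is interpreted as a plain dB scale factor, and all function names are hypothetical. Speed perturbation, babble injection, and G.711 transcoding are left out. This is not the training code.

```python
import math
import random


def rms(x):
    """Root-mean-square level of a list of float samples."""
    return math.sqrt(sum(v * v for v in x) / len(x)) if x else 0.0


def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech, scaled so that speech RMS / noise RMS matches snr_db."""
    s, n = rms(speech), rms(noise)
    if s == 0.0 or n == 0.0:
        return list(speech)
    scale = s / (n * 10 ** (snr_db / 20.0))
    # The noise is looped if it is shorter than the speech.
    return [a + scale * noise[i % len(noise)] for i, a in enumerate(speech)]


def apply_gain_db(x, gain_db):
    """Scale samples by gain_db decibels (assumed reading of the gain step)."""
    g = 10 ** (gain_db / 20.0)
    return [v * g for v in x]


def augment(samples, steps, rng):
    """Run steps in order; each (probability, fn) pair fires independently."""
    for prob, fn in steps:
        if rng.random() < prob:
            samples = fn(samples, rng)
    return samples


def make_steps(noise):
    """Placeholder probabilities; the value ranges are taken from this card."""
    return [
        (0.5, lambda x, r: mix_at_snr(x, noise, r.uniform(5.0, 20.0))),
        (0.5, lambda x, r: apply_gain_db(x, r.uniform(-20.0, 0.0))),
    ]
```

Calling `augment(samples, make_steps(noise), random.Random(seed))` returns a new sample list. Passing a seeded `random.Random` keeps the gating decisions reproducible.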

Evaluation

Internal validation (no-space transcripts), top epochs:

epoch   val_wer
108     0.1616 (selected)
112     0.1633
110     0.1644
106     0.1669

External test sets (numbers are from epoch 106, the epoch closest to the selected checkpoint that has a clean external evaluation):

set                                     n      CER (raw)   CER (no-space)   WER
FLEURS test (Khmer)                     771    0.296       0.274            0.999
Label Studio Metfone (in-house calls)   1820   0.394       0.372            0.839

The model is trained without spaces, so WER against space-segmented references (FLEURS) is not meaningful: the whole unsegmented hypothesis counts as a single word, which pushes WER toward 1.0. Use CER (no-space) as the primary external metric.
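
A small worked sketch of the three metrics makes the effect concrete. It assumes that "CER (raw)" counts spaces as ordinary characters and that "CER (no-space)" removes all whitespace from both reference and hypothesis before scoring. The actual evaluation script is not part of this card, so treat these definitions and the function names as assumptions. Latin placeholder strings stand in for Khmer text.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (row-by-row DP)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = cur
    return prev[-1]


def strip_ws(text):
    """Remove all whitespace characters."""
    return "".join(text.split())


def cer(ref, hyp, no_space=False):
    """Character error rate; optionally ignore whitespace on both sides."""
    if no_space:
        ref, hyp = strip_ws(ref), strip_ws(hyp)
    return edit_distance(ref, hyp) / max(len(ref), 1)


def wer(ref, hyp):
    """Word error rate on whitespace-delimited tokens."""
    ref_words, hyp_words = ref.split(), hyp.split()
    return edit_distance(ref_words, hyp_words) / max(len(ref_words), 1)


ref = "ab cd ef"   # space-segmented reference
hyp = "abcdef"     # whitespace-free model output, otherwise perfect
print(cer(ref, hyp, no_space=True))  # 0.0
print(cer(ref, hyp))                 # 0.25
print(wer(ref, hyp))                 # 1.0
```

Even with every character correct, the unsegmented hypothesis scores a WER of 1.0 here, while CER (no-space) is 0.0.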

Limitations

  • Trained on a mix of in-house Khmer call data and synthetic speech. Quality may degrade on out-of-domain audio, code-switched speech, or non-call noise profiles.
  • Output is whitespace-free; downstream consumers that need word-level segmentation must add a separate segmenter.
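
As a rough illustration of what a downstream segmenter could look like, the sketch below applies greedy longest-match segmentation against a word list. The function name, the lexicon, and the Latin placeholder strings are all hypothetical. A production setup would more likely use a trained Khmer word segmenter, which this card does not provide.

```python
def segment_longest_match(text, lexicon, max_len=None):
    """Greedy left-to-right longest-match segmentation.

    At each position, take the longest lexicon entry that matches; characters
    not covered by any entry are emitted as single-character tokens.
    """
    if max_len is None:
        max_len = max((len(w) for w in lexicon), default=1)
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                i += size
                break
    return tokens


lexicon = {"ab", "abc", "de"}  # hypothetical word list
print(" ".join(segment_longest_match("abcdeX", lexicon)))  # abc de X
```

Greedy matching is simple but can choose the wrong boundary when a shorter word followed by another word is the intended reading, so it is best treated as a baseline.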

License

Released under CC-BY-4.0.
