Khmer ASR Encoder Benchmark and Pretrained Checkpoints

This repository contains pretrained checkpoints and benchmarking results used to identify the most suitable speech encoder backbone for a Khmer Automatic Speech Recognition (ASR) system.

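The goal of this project was to compare how well different pretrained speech representations transfer to Khmer ASR before investing in large-scale fine-tuning. WavLM and Wav2Vec2 are self-supervised models, while Whisper is trained with weak supervision on transcribed audio.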

Overview

Three widely used pretrained speech encoders were evaluated:

  • Whisper Base (openai/whisper-base)
  • WavLM Base
  • Wav2Vec2 Base

To ensure a fair comparison, each encoder was:

  1. Initialized from publicly available pretrained weights.
  2. Frozen during training.
  3. Connected to a lightweight CTC classification head.
  4. Trained and evaluated using the same Khmer speech datasets.

The resulting validation CTC loss (lower is better) was used as the primary evaluation metric.


Training Methodology

Phase 1: Frozen Encoder Evaluation

For each backbone:

  • The encoder parameters were frozen.
  • A randomly initialized CTC head was attached.
  • Only the CTC head was trained.
  • Validation CTC loss was used to assess encoder quality.

This approach evaluates how useful the pretrained speech representations are for Khmer ASR without any encoder fine-tuning.
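The frozen-encoder setup described above can be sketched in PyTorch as follows. This is a minimal illustration under stated assumptions, not the repository's training code: `DummyEncoder` is a hypothetical stand-in for a pretrained backbone (so the example runs without downloading weights), and `FrozenEncoderCTC`, the 80-dimensional input features, the batch shapes, the learning rate, and the choice of index 0 for the CTC blank are all assumptions. The hidden size of 512 and the vocabulary size of 108 are taken from this README.

```python
import torch
import torch.nn as nn


class DummyEncoder(nn.Module):
    """Hypothetical stand-in for a pretrained speech encoder."""

    def __init__(self, feat_dim=80, hidden_size=512):
        super().__init__()
        self.proj = nn.Linear(feat_dim, hidden_size)

    def forward(self, feats):  # feats: (batch, time, feat_dim)
        return torch.tanh(self.proj(feats))  # (batch, time, hidden_size)


class FrozenEncoderCTC(nn.Module):
    """Frozen encoder followed by a trainable linear CTC head."""

    def __init__(self, encoder, hidden_size=512, vocab_size=108):
        super().__init__()
        self.encoder = encoder
        for param in self.encoder.parameters():
            param.requires_grad = False  # freeze every encoder weight
        self.encoder.eval()  # disable dropout and similar layers
        self.head = nn.Linear(hidden_size, vocab_size)  # randomly initialized

    def forward(self, feats):
        with torch.no_grad():  # no gradients flow into the encoder
            hidden = self.encoder(feats)
        return self.head(hidden).log_softmax(dim=-1)


torch.manual_seed(0)
model = FrozenEncoderCTC(DummyEncoder())

# Only the head's parameters are handed to the optimizer.
optimizer = torch.optim.AdamW(model.head.parameters(), lr=1e-3)
ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)  # blank index 0 is assumed

# Random tensors standing in for one batch of features and label ids.
feats = torch.randn(2, 50, 80)
targets = torch.randint(1, 108, (2, 10))  # labels avoid the blank index
input_lengths = torch.full((2,), 50, dtype=torch.long)
target_lengths = torch.full((2,), 10, dtype=torch.long)

log_probs = model(feats).transpose(0, 1)  # CTCLoss expects (time, batch, vocab)
loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)

optimizer.zero_grad()
loss.backward()
optimizer.step()
```

Because the encoder never receives gradients, the cost of each step is dominated by the encoder's forward pass, which is what makes this kind of comparison cheap relative to full fine-tuning. Note that calling `model.train()` would switch the encoder back to training mode, so a real training loop would need to call `model.encoder.eval()` again afterwards.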


Results

Encoder          Validation CTC Loss
Whisper Base     0.5663
WavLM Base       0.6031
Wav2Vec2 Base    0.7836

Ranking

🥇 Whisper Base: 0.5663

🥈 WavLM Base: 0.6031

🥉 Wav2Vec2 Base: 0.7836

Based on these experiments, Whisper Base produced the strongest transferable speech representations for Khmer ASR and was selected as the backbone for subsequent training stages.


Selected Backbone

{
  "best_backbone_key": "whisper-base",
  "best_backbone_id": "openai/whisper-base",
  "hidden_size": 512,
  "checkpoint_path": "./whisper-base_best.pt"
}
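The selection recorded above follows directly from the reported losses. The sketch below shows one way such a record could be derived; it is an illustration, not the script that produced the metadata. The keys `wavlm-base` and `wav2vec2-base`, the helper `rank_backbones`, and the extra field `best_val_ctc_loss` are assumptions, while the loss values and the `whisper-base` key come from this README.

```python
import json

# Validation CTC losses reported in the Results section.
val_ctc_loss = {
    "whisper-base": 0.5663,
    "wavlm-base": 0.6031,      # key name is assumed
    "wav2vec2-base": 0.7836,   # key name is assumed
}


def rank_backbones(losses):
    """Return (key, loss) pairs sorted from lowest to highest loss."""
    return sorted(losses.items(), key=lambda item: item[1])


ranking = rank_backbones(val_ctc_loss)
best_key, best_loss = ranking[0]

selection = {
    "best_backbone_key": best_key,
    "best_val_ctc_loss": best_loss,  # hypothetical extra field
    "checkpoint_path": f"./{best_key}_best.pt",
}
print(json.dumps(selection, indent=2))
```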

Best Checkpoint

whisper-base_best.pt

Training Datasets

The benchmark used a combination of multiple Khmer speech datasets.

1. Khmer GRKPP Speech

Dataset:

seanghay/khmer_grkpp_speech

2. KM Speech Corpus

Dataset:

seanghay/km-speech-corpus

3. FLEURS + OpenSLR42 + MPWT

Dataset:

KrorngAI/fleurs_openslr42_mpwt

For balanced experimentation, up to 3,000 samples per dataset were used during backbone evaluation.
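The per-dataset cap can be sketched as below. The README does not say whether the samples were drawn randomly or taken from the start of each dataset, so the seeded random sampling here is an assumption, as are the helper `cap_samples` and the seed value. The toy lists stand in for the real corpora, and their sizes are placeholders rather than actual dataset counts.

```python
import random

MAX_SAMPLES_PER_DATASET = 3000  # cap stated in this README


def cap_samples(examples, max_samples=MAX_SAMPLES_PER_DATASET, seed=42):
    """Return at most max_samples examples, sampled without replacement."""
    if len(examples) <= max_samples:
        return list(examples)
    return random.Random(seed).sample(examples, max_samples)


# Toy stand-ins for the three corpora; sizes are placeholders, not real counts.
toy_corpora = {
    "seanghay/khmer_grkpp_speech": list(range(5000)),
    "seanghay/km-speech-corpus": list(range(1200)),
    "KrorngAI/fleurs_openslr42_mpwt": list(range(3000)),
}

capped = {name: cap_samples(items) for name, items in toy_corpora.items()}
for name, items in capped.items():
    print(name, len(items))
```

Capping each source keeps a single large corpus from dominating the comparison, at the cost of discarding data that later training stages could still use.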


Vocabulary

The model uses a character-level Khmer vocabulary containing:

  • Khmer consonants
  • Khmer vowels
  • Khmer diacritics
  • Khmer numerals
  • Khmer punctuation marks
  • Special CTC tokens

Special Tokens

[BLANK]
[PAD]
[MASK]
[UNK]

Vocabulary Size

108 tokens
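A character-level vocabulary of this kind could be assembled roughly as follows. This is a sketch under assumptions, not the repository's tokenizer: the helpers `build_vocab` and `encode`, the ordering of the special tokens (taken from the order they are listed in), and the restriction to the Unicode Khmer block (U+1780 to U+17FF) are illustrative choices, and the toy transcript will not reproduce the 108-token vocabulary.

```python
SPECIAL_TOKENS = ["[BLANK]", "[PAD]", "[MASK]", "[UNK]"]


def build_vocab(transcripts):
    """Map special tokens, then sorted Khmer characters, to integer ids."""
    chars = sorted(
        {ch for text in transcripts for ch in text if "\u1780" <= ch <= "\u17ff"}
    )
    tokens = SPECIAL_TOKENS + chars
    return {token: idx for idx, token in enumerate(tokens)}


def encode(text, vocab):
    """Convert text to ids, mapping unseen characters to [UNK]."""
    unk_id = vocab["[UNK]"]
    return [vocab.get(ch, unk_id) for ch in text]


# A Khmer greeting written with escapes: SA, vowel UA, SA, COENG, TA, vowel II.
greeting = "\u179f\u17bd\u179f\u17d2\u178f\u17b8"

vocab = build_vocab([greeting])
ids = encode(greeting, vocab)
print(len(vocab), ids)
```

Placing `[BLANK]` at index 0 matches the default blank index of common CTC loss implementations, but the actual index used by the checkpoint is not stated here and should be checked before decoding.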

Repository Contents

.
├── whisper-base_best.pt
├── README.md
└── benchmark metadata

Experimental Goal

This repository is not intended to be a production-ready ASR system.

Instead, it provides:

  • A Khmer ASR encoder benchmark
  • Pretrained CTC evaluation checkpoints
  • A comparison between Whisper, WavLM, and Wav2Vec2 representations
  • A foundation for future Khmer ASR research and fine-tuning

Key Findings

  • Whisper Base achieved the lowest validation CTC loss.
  • WavLM Base performed competitively and ranked second.
  • Wav2Vec2 Base showed weaker transferability on the evaluated Khmer datasets.
  • Frozen encoder evaluation provides an efficient way to compare speech backbones before expensive full-model training.

Future Work

Planned improvements include:

  • Full encoder fine-tuning
  • Larger Khmer speech datasets
  • Language model integration
  • Beam search decoding
  • Character Error Rate (CER) evaluation
  • Word Error Rate (WER) evaluation
  • Deployment-ready Khmer ASR models
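For reference, Character Error Rate is usually defined as the Levenshtein edit distance between the hypothesis and the reference, divided by the reference length. The sketch below is one way the planned metric could be computed; it is not part of the current benchmark. The helpers `edit_distance` and `cer` are hypothetical, and counting by Unicode code point (rather than by grapheme cluster, which may matter for Khmer stacked consonants) is an assumption.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, ref_item in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_item in enumerate(hyp, start=1):
            cost = 0 if ref_item == hyp_item else 1
            curr.append(
                min(
                    prev[j] + 1,         # deletion
                    curr[j - 1] + 1,     # insertion
                    prev[j - 1] + cost,  # substitution or match
                )
            )
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character Error Rate: edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


print(cer("abcd", "abed"))
```

Word Error Rate follows the same formula over word sequences, for example `edit_distance(ref.split(), hyp.split()) / len(ref.split())`, although Khmer text is typically written without spaces between words and would need a word segmenter first.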

Citation

If you use this repository in your research, please cite:

@misc{uk2026khmerasrbenchmark,
  title={Khmer ASR Encoder Benchmark and Pretrained Checkpoints},
  author={Uk, Panhapich},
  year={2026},
  publisher={Hugging Face},
  url={https://huggingface.co/Panhapich/pre-trained_whisper_wavLM}
}

Author

Panhapich Uk

Independent research project focused on:

  • Khmer Automatic Speech Recognition (ASR)
  • Speech representation learning
  • Low-resource language technologies
  • Self-supervised speech models

License

This project is released under the MIT License.
