Whisper large-v3-turbo-singlish

Whisper large-v3-turbo-singlish is a fine-tuned automatic speech recognition (ASR) model optimized for Singlish. Built on OpenAI's Whisper model, it has been adapted using Singlish-specific data to accurately capture the unique phonetic and lexical nuances of Singlish speech.

Model Details

  • Developed by: Ming Jie Wong
  • Base Model: openai/whisper-large-v3-turbo
  • Model Type: Encoder-decoder
  • Metrics: Word Error Rate (WER)
  • Languages Supported: English (with a focus on Singlish)
  • License: MIT

Description

Whisper large-v3-turbo-singlish is developed using an internal dataset of 66.9k audio-transcript pairs. The dataset is derived exclusively from the Part 3 Same Room Environment Close-talk Mic recordings of IMDA's NSC Corpus.

The original Part 3 of the National Speech Corpus comprises approximately 1,000 hours of conversational speech from around 1,000 local English speakers, recorded in pairs. These conversations cover everyday topics and include interactive game-based dialogues. Recordings were conducted in two environments:

  • Same Room, where speakers shared a room and were recorded using a close-talk mic and a boundary mic.
  • Separate Room, where each speaker was recorded individually using a standing mic and a telephone (IVR).

Audio segments for the internal dataset were extracted using these criteria:

  • Minimum Word Count: 10 words

    This threshold was chosen to ensure that each audio segment contains sufficient linguistic context for the model to better understand instructions in Singlish. Shorter segments may bias the model towards specific utterances or phrases, limiting its overall comprehension.

  • Maximum Duration: 20 seconds

    This threshold was chosen to provide enough context for accurate transcription while minimizing noise and computational complexity for longer audio segments.

  • Sampling Rate: All audio segments are down-sampled to 16kHz.
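The word-count and duration criteria above can be sketched as a simple filter. The function and field names below are hypothetical illustrations, not the actual preprocessing code (the 16kHz down-sampling is a separate resampling step, omitted here):

```python
def keep_segment(transcript: str, duration_s: float,
                 min_words: int = 10, max_duration_s: float = 20.0) -> bool:
    """Apply the minimum word-count and maximum duration criteria to one segment."""
    return len(transcript.split()) >= min_words and duration_s <= max_duration_s

# An 11-word, 8-second segment passes; a 4-word segment is rejected.
print(keep_segment("wah the queue at the hawker centre damn long today sia", 8.0))  # True
print(keep_segment("can lah no problem", 3.0))                                      # False
```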

Full experiment details will be added soon.

Fine-Tuning Details

Fine-tuning was performed on a single A100-80GB GPU.

Training Hyperparameters

The following hyperparameters were used:

  • batch_size: 16
  • gradient_accumulation_steps: 1
  • learning_rate: 1e-6
  • warmup_steps: 300
  • max_steps: 5000
  • fp16: true
  • eval_batch_size: 16
  • eval_step: 300
  • max_grad_norm: 1.0
  • generation_max_length: 225
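The list above maps naturally onto keyword arguments in the style of transformers' Seq2SeqTrainingArguments. The sketch below shows them as a plain dictionary (other required arguments such as the output directory are omitted, and the exact argument names used in training are an assumption):

```python
# Hyperparameters from the list above, in Seq2SeqTrainingArguments-style naming.
training_kwargs = dict(
    per_device_train_batch_size=16,
    gradient_accumulation_steps=1,
    learning_rate=1e-6,
    warmup_steps=300,
    max_steps=5000,
    fp16=True,
    per_device_eval_batch_size=16,
    eval_steps=300,
    max_grad_norm=1.0,
    generation_max_length=225,
)

# With no gradient accumulation, the effective batch size equals the per-device size.
effective_batch = (training_kwargs["per_device_train_batch_size"]
                   * training_kwargs["gradient_accumulation_steps"])
print(effective_batch)  # 16
```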

Training Results

The table below summarizes the model's progress across training steps, showing the training loss, evaluation loss, and Word Error Rate (WER).

| Steps | Train Loss | Eval Loss | WER       |
|------:|-----------:|----------:|----------:|
|   300 |     0.8992 |    0.3501 | 13.376788 |
|   600 |     0.4157 |    0.3241 | 12.769994 |
|   900 |     0.3520 |    0.3124 | 12.168367 |
|  1200 |     0.3415 |    0.3079 | 12.517532 |
|  1500 |     0.3620 |    0.3077 | 12.344057 |
|  1800 |     0.3609 |    0.2996 | 12.315267 |
|  2100 |     0.3348 |    0.2963 | 12.231113 |
|  2400 |     0.3715 |    0.2927 | 12.005226 |
|  2700 |     0.3445 |    0.2923 | 11.829537 |
|  3000 |     0.3753 |    0.2884 | 11.954291 |
|  3300 |     0.3469 |    0.2881 | 11.951338 |
|  3600 |     0.3325 |    0.2857 | 12.145483 |
|  3900 |     0.3168 |    0.2846 | 11.549023 |
|  4200 |     0.3250 |    0.2837 | 11.740215 |
|  4500 |     0.2855 |    0.2834 | 11.634654 |
|  4800 |     0.2936 |    0.2836 | 11.651632 |

The final checkpoint is the one that achieved the lowest WER across the 4,800 training steps.
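Selecting that checkpoint amounts to taking the step with the minimum evaluation WER from the table above, e.g.:

```python
# (step, eval WER) pairs from the training-results table above.
wer_by_step = {
    300: 13.376788, 600: 12.769994, 900: 12.168367, 1200: 12.517532,
    1500: 12.344057, 1800: 12.315267, 2100: 12.231113, 2400: 12.005226,
    2700: 11.829537, 3000: 11.954291, 3300: 11.951338, 3600: 12.145483,
    3900: 11.549023, 4200: 11.740215, 4500: 11.634654, 4800: 11.651632,
}

# Pick the checkpoint step with the lowest WER.
best_step = min(wer_by_step, key=wer_by_step.get)
print(best_step, wer_by_step[best_step])  # 3900 11.549023
```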

Benchmark Performance

We evaluated Whisper large-v3-turbo-singlish on SASRBench-v1, a benchmark dataset for evaluating ASR performance on Singlish.

Disclaimer

While this model has been fine-tuned to better recognize Singlish, users may experience inaccuracies, biases, or unexpected outputs, particularly in challenging audio conditions or with speakers using non-standard variations. Use of this model is at your own risk; the developers and distributors are not liable for any consequences arising from its use. Please validate results before deploying in any sensitive or production environment.

How to use the model

The model can be loaded with the automatic-speech-recognition pipeline like so:

from transformers import pipeline

model = "mjwong/whisper-large-v3-turbo-singlish"
# chunk_length_s enables chunked inference, so recordings longer than
# Whisper's 30-second window can be transcribed in one call.
pipe = pipeline("automatic-speech-recognition", model, chunk_length_s=30)

You can then use this pipeline to transcribe audio of arbitrary length.

from datasets import load_dataset

# Load a sample from the SASRBench-v1 test split and transcribe it.
dataset = load_dataset("mjwong/SASRBench-v1", split="test")
sample = dataset[0]["audio"]

result = pipe(sample)
print(result["text"])

Contact

For more information, please reach out to mingjwong@hotmail.com.

Acknowledgements

  1. https://www.jensenlwt.com/blog/singlish-whisper-finetuning-asr-for-singapore-unique-english
  2. https://github.com/huggingface/community-events/blob/main/whisper-fine-tuning-event/README.md
  3. https://medium.com/htx-dsai/finetuning-whisper-for-the-singaporean-home-team-context-a3ae1a6ae809