Distil-Whisper Small zh-HK - Alvin

  • This model is a distilled version of alvanlii/whisper-small-cantonese for Cantonese.
  • Achieves a CER of 9.7% (without punctuation) and 11.59% (with punctuation) on Common Voice 16.0.
  • Has 3 decoder layers instead of the 12 in the regular Whisper small model.
  • Uses ~2 GB of GPU VRAM.
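The reduction from 12 to 3 decoder layers is consistent with the parameter counts in the comparison table further down (157M versus 242M). The following back-of-the-envelope check is only a sketch: it assumes the standard Whisper small dimensions (hidden size 768, feed-forward width 3072), which this card does not state, and it counts weight matrices only, ignoring biases and layer norms.

```python
# Rough estimate of the parameters removed by dropping 9 decoder layers.
# D_MODEL and FFN_DIM are assumptions (standard Whisper small sizes),
# not values taken from this model card.
D_MODEL = 768
FFN_DIM = 3072
REMOVED_LAYERS = 12 - 3  # regular decoder depth minus distilled depth


def decoder_layer_weights(d_model: int = D_MODEL, ffn_dim: int = FFN_DIM) -> int:
    """Weight-matrix parameters in one decoder layer (biases and norms ignored)."""
    attention = 4 * d_model * d_model     # q, k, v and output projections
    feed_forward = 2 * d_model * ffn_dim  # up and down projections
    # One self-attention block, one cross-attention block, one feed-forward block
    return 2 * attention + feed_forward


removed_millions = REMOVED_LAYERS * decoder_layer_weights() / 1e6
print(f"weights removed: ~{removed_millions:.1f}M")        # ~84.9M
print(f"estimated size:  ~{242 - removed_millions:.0f}M")  # ~157M
```

Under these assumptions the estimate lands on the reported 157M, which suggests the encoder is left at full size.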

Training and evaluation data

For training:

  • CantoMap: Winterstein, Grégoire, Tang, Carmen and Lai, Regine (2020) "CantoMap: a Hong Kong Cantonese MapTask Corpus", in Proceedings of The 12th Language Resources and Evaluation Conference, Marseille: European Language Resources Association, pp. 2899-2906.
  • Cantonese-ASR: Yu, Tiezheng, Frieske, Rita, Xu, Peng, Cahyawijaya, Samuel, Yiu, Cheuk Tung, Lovenia, Holy, Dai, Wenliang, Barezi, Elham, Chen, Qifeng, Ma, Xiaojuan, Shi, Bertram, Fung, Pascale (2022) "Automatic Speech Recognition Datasets in Cantonese: A Survey and New Dataset". Link: https://arxiv.org/pdf/2201.02419.pdf
  • Common Voice yue and zh-HK train sets

For evaluation, the Common Voice 16.0 yue test set is used.
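The card reports character error rate (CER) both with and without punctuation. As an illustration of how such a metric can be computed, here is a minimal pure-Python sketch: CER is the character-level edit distance divided by the reference length. The punctuation filter (dropping every Unicode "P*" category character) is an assumption; the exact text normalization used for the reported numbers is not described here.

```python
import unicodedata


def strip_punctuation(text: str) -> str:
    """Drop every character whose Unicode category is punctuation (P*)."""
    return "".join(ch for ch in text if not unicodedata.category(ch).startswith("P"))


def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str, keep_punctuation: bool = True) -> float:
    """Character error rate of hypothesis `hyp` against reference `ref`."""
    if not keep_punctuation:
        ref, hyp = strip_punctuation(ref), strip_punctuation(hyp)
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


reference = "你好，世界。"
hypothesis = "你好世界"
print(cer(reference, hypothesis, keep_punctuation=True))   # 2 missing marks out of 6 characters
print(cer(reference, hypothesis, keep_punctuation=False))  # 0.0
```

A transcript that gets every word right but omits punctuation is penalized only in the "with punctuation" score, which is why that figure is the higher of the two.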

Comparisons to Whisper Small

Metric                                    alvanlii/distil-whisper-small-cantonese   alvanlii/whisper-small-cantonese
CER (lower is better)                     0.097                                     0.089
GPU inference time (sdpa) [s/sample]      0.027                                     0.055
GPU inference time (regular) [s/sample]   0.027                                     0.308
CPU inference time [s/sample]             1.3                                       2.57
Params [M]                                157                                       242

Note: inference time is the average time per sample over the Common Voice 16.0 yue test set.
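A per-sample average like this can be measured with a small timing loop. The sketch below is not the script used for the numbers above; `mean_seconds_per_sample` and the `transcribe` callable are hypothetical names, and the stand-in function only sleeps so the example runs without a model.

```python
import time
from typing import Callable, Iterable


def mean_seconds_per_sample(
    transcribe: Callable[[object], object], samples: Iterable[object]
) -> float:
    """Average wall-clock time of `transcribe` over `samples`, in seconds."""
    total = 0.0
    count = 0
    for sample in samples:
        start = time.perf_counter()
        transcribe(sample)
        total += time.perf_counter() - start
        count += 1
    if count == 0:
        raise ValueError("no samples to time")
    return total / count


def fake_transcribe(sample: object) -> str:
    """Stand-in for a real model call; sleeps briefly instead of running inference."""
    time.sleep(0.01)
    return ""


average = mean_seconds_per_sample(fake_transcribe, range(5))
print(f"{average:.3f} s/sample")
```

When timing on a GPU, CUDA work is launched asynchronously, so the wrapped function would typically need to call `torch.cuda.synchronize()` before returning for the measurement to be meaningful.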

Using the Model

import librosa
from transformers import WhisperForConditionalGeneration, WhisperProcessor

MODEL_NAME = "alvanlii/distil-whisper-small-cantonese"

# Load the audio and resample it to 16 kHz, the rate Whisper expects
y, sr = librosa.load('audio.mp3', sr=16000)

processor = WhisperProcessor.from_pretrained(MODEL_NAME)
model = WhisperForConditionalGeneration.from_pretrained(MODEL_NAME)

# Clear forced decoder prompts and token suppression, and disable the KV cache
model.config.forced_decoder_ids = None
model.config.suppress_tokens = []
model.config.use_cache = False

# Convert the waveform to log-mel input features
processed_in = processor(y, sampling_rate=sr, return_tensors="pt")
gout = model.generate(
    input_features=processed_in.input_features,
    output_scores=True, return_dict_in_generate=True
)
# Decode the generated token IDs into text
transcription = processor.batch_decode(gout.sequences, skip_special_tokens=True)[0]
print(transcription)
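  • Alternatively, you can use Hugging Face pipelines:
import torch
from transformers import pipeline

MODEL_NAME = "alvanlii/distil-whisper-small-cantonese"
lang = "zh"
device = "cuda:0" if torch.cuda.is_available() else "cpu"  # use the GPU when available
file = "audio.mp3"  # path to the audio file to transcribe

pipe = pipeline(
    task="automatic-speech-recognition",
    model=MODEL_NAME,
    chunk_length_s=30,
    device=device,
)
# Force the decoder to transcribe in the given language
pipe.model.config.forced_decoder_ids = pipe.tokenizer.get_decoder_prompt_ids(language=lang, task="transcribe")
text = pipe(file)["text"]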
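The `chunk_length_s=30` argument lets the pipeline handle recordings longer than Whisper's 30-second input window by processing them in pieces. As a rough illustration of the idea only, the sketch below splits a waveform into fixed 30-second chunks; this is a simplification, not the pipeline's actual algorithm, which also overlaps neighbouring chunks and merges their transcripts. `split_into_chunks` is a hypothetical helper.

```python
from typing import List, Sequence

SAMPLE_RATE = 16_000  # Hz, the rate used when loading audio in the example above
CHUNK_SECONDS = 30    # matches chunk_length_s=30


def split_into_chunks(
    samples: Sequence[float],
    sample_rate: int = SAMPLE_RATE,
    chunk_seconds: int = CHUNK_SECONDS,
) -> List[Sequence[float]]:
    """Split a waveform into consecutive, non-overlapping fixed-length chunks."""
    chunk_size = sample_rate * chunk_seconds
    return [samples[i:i + chunk_size] for i in range(0, len(samples), chunk_size)]


# 70 seconds of silence as a stand-in for real audio
waveform = [0.0] * (70 * SAMPLE_RATE)
chunks = split_into_chunks(waveform)
print([len(c) / SAMPLE_RATE for c in chunks])  # [30.0, 30.0, 10.0]
```

Cutting at fixed boundaries can split a word in half, which is one reason the pipeline uses overlapping windows instead.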