This model is based on wav2vec2-large-xlsr-53 and was fine-tuned on Common Voice/zh-HK/6.1.0.

The training code is similar to that of user ctl, except that the number of training epochs was doubled to 80 and fp16_backend was set to apex. The model was trained on a single RTX 3090 using the Docker image nvidia/cuda:11.1-cudnn8-devel.

The character error rate (CER) is 15.11% when evaluated against the Common Voice zh-HK test set.
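
For readers unfamiliar with the metric, the following is a minimal sketch of how a character error rate can be computed: the Levenshtein edit distance between reference and predicted characters, divided by the number of reference characters. The function names `edit_distance` and `cer` are hypothetical, and the choices to strip spaces and to pool counts over the whole corpus are assumptions; the source does not state exactly how the 15.11% figure was normalized or aggregated.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance: insertions, deletions and substitutions each cost 1."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(references, hypotheses):
    """Corpus-level CER: total character edits / total reference characters.

    Assumption: spaces are ignored, which is a common choice for Chinese text.
    """
    total_edits = 0
    total_chars = 0
    for ref, hyp in zip(references, hypotheses):
        ref_chars = list(ref.replace(" ", ""))
        hyp_chars = list(hyp.replace(" ", ""))
        total_edits += edit_distance(ref_chars, hyp_chars)
        total_chars += len(ref_chars)
    if total_chars == 0:
        raise ValueError("references contain no characters")
    return total_edits / total_chars
```

With this definition, one wrong character in a three-character reference gives a CER of 1/3, so a CER of 15.11% corresponds to roughly 15 character edits per 100 reference characters.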

Result (CER)


Source Code

See the GitHub repo cantonese-selfish-project and the demo video.


Usage


import soundfile as sf
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# load pretrained model
processor = Wav2Vec2Processor.from_pretrained("scottykwok/wav2vec2-large-xlsr-cantonese")
model = Wav2Vec2ForCTC.from_pretrained("scottykwok/wav2vec2-large-xlsr-cantonese")

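# load audio - must be 16 kHz mono, so fail early if it is not
audio_input, sample_rate = sf.read("audio.wav")
if sample_rate != 16000 or audio_input.ndim != 1:
    raise ValueError("audio.wav must be 16 kHz mono; resample or downmix it first")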

# preprocess the waveform and return a PyTorch tensor
input_values = processor(audio_input, sampling_rate=sample_rate, return_tensors="pt").input_values

# retrieve logits without tracking gradients, then take the argmax
with torch.no_grad():
    logits = model(input_values).logits
predicted_ids = torch.argmax(logits, dim=-1)

# transcribe
transcription = processor.decode(predicted_ids[0])
print("-" * 20)
print("Transcription:\n", transcription.lower())
print("-" * 20)
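
The argmax-then-decode step above is greedy CTC decoding: pick the most likely token for each audio frame, merge consecutive repeats, and drop the blank token. The sketch below illustrates that idea in plain Python. It is not the library's implementation; the function name `ctc_greedy_decode`, the toy vocabulary, and `blank_id=0` are all hypothetical, and the handling of word-delimiter tokens that a real tokenizer performs is omitted.

```python
def ctc_greedy_decode(frame_ids, id_to_char, blank_id=0):
    """Collapse per-frame token ids into text using the CTC rule.

    frame_ids:  one predicted token id per audio frame (the argmax output)
    id_to_char: hypothetical mapping from token id to character
    blank_id:   id of the CTC blank token (assumed to be 0 in this sketch)
    """
    chars = []
    previous = None
    for token_id in frame_ids:
        # Emit a character only when the id changes and is not the blank.
        if token_id != previous and token_id != blank_id:
            chars.append(id_to_char[token_id])
        previous = token_id
    return "".join(chars)


# Toy vocabulary for illustration only; the real model has its own vocabulary.
toy_vocab = {0: "<blank>", 1: "你", 2: "好"}
print(ctc_greedy_decode([1, 1, 0, 2, 2, 2, 0], toy_vocab))
```

The blank token is what allows a genuinely repeated character to survive: the frames `[1, 0, 1]` decode to two characters, whereas `[1, 1]` collapses to one.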