Wav2vec2-large-xlsr-cantonese

This model is based on wav2vec2-large-xlsr-53, fine-tuned on the Common Voice zh-HK 6.1.0 dataset.

The training code is similar to that of user ctl, except that the number of training epochs was doubled to 80 and fp16_backend was set to apex. The model was trained on a single RTX 3090 using the nvidia/cuda:11.1-cudnn8-devel Docker image.
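
For concreteness, a minimal sketch of how those two settings might be expressed with Hugging Face TrainingArguments; only num_train_epochs and fp16_backend come from this card, every other value is an illustrative placeholder:

from transformers import TrainingArguments

# Sketch only: num_train_epochs and fp16_backend reflect this card;
# all other values are placeholders, not the actual training config.
training_args = TrainingArguments(
    output_dir="./wav2vec2-large-xlsr-cantonese",  # placeholder path
    num_train_epochs=80,            # doubled relative to the reference recipe
    fp16=True,
    fp16_backend="apex",            # requires NVIDIA apex to be installed
    per_device_train_batch_size=8,  # placeholder
    save_steps=400,                 # placeholder
)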

The character error rate (CER) is 15.11% when evaluated against the Common Voice zh-HK test set.

Result (CER)

15.11%
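
A sketch of how this evaluation could be reproduced, assuming the jiwer package for CER and the common_voice loader from datasets (exact dataset naming and any text normalization may differ from the original evaluation):

import torch
import jiwer  # assumed CER implementation; pip install jiwer
from datasets import load_dataset, Audio
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained("scottykwok/wav2vec2-large-xlsr-cantonese")
model = Wav2Vec2ForCTC.from_pretrained("scottykwok/wav2vec2-large-xlsr-cantonese")
model.eval()

# Load the Common Voice zh-HK test split and resample to 16 kHz
test_set = load_dataset("common_voice", "zh-HK", split="test")
test_set = test_set.cast_column("audio", Audio(sampling_rate=16_000))

predictions, references = [], []
for sample in test_set:
    inputs = processor(sample["audio"]["array"], sampling_rate=16_000, return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values).logits
    pred_ids = torch.argmax(logits, dim=-1)
    predictions.append(processor.decode(pred_ids[0]))
    references.append(sample["sentence"])

print("CER:", jiwer.cer(references, predictions))

Note that punctuation handling and text normalization choices can shift CER by a few points, so an exact match with 15.11% is not guaranteed.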

Source Code

See the GitHub repo cantonese-selfish-project and the demo video.

Usage

import soundfile as sf
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# load pretrained processor and model
processor = Wav2Vec2Processor.from_pretrained("scottykwok/wav2vec2-large-xlsr-cantonese")
model = Wav2Vec2ForCTC.from_pretrained("scottykwok/wav2vec2-large-xlsr-cantonese")

# load audio - must be 16 kHz mono
audio_input, sample_rate = sf.read('audio.wav')

# normalize and pad input values, return a PyTorch tensor
input_values = processor(audio_input, sampling_rate=sample_rate, return_tensors="pt").input_values

# inference: retrieve logits and take the argmax over the vocabulary
with torch.no_grad():
    logits = model(input_values).logits
predicted_ids = torch.argmax(logits, dim=-1)

# decode the predicted ids into a transcription
transcription = processor.decode(predicted_ids[0])
print("-" * 20)
print("Transcription:\n", transcription.lower())
print("-" * 20)
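
If the source audio is not already 16 kHz mono, it can be converted first; a small sketch using librosa (an extra dependency assumed here, not required by the snippet above):

import librosa

# librosa.load resamples to the requested rate and downmixes to mono
audio_input, sample_rate = librosa.load('audio.wav', sr=16000, mono=True)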