---
language: zh
tags:
- automatic-speech-recognition
license: cc-by-sa-4.0
datasets:
- common_voice
metrics:
- cer
---

# Wav2vec2-large-xlsr-cantonese

This model is based on [wav2vec2-large-xlsr-53](https://huggingface.co/facebook/wav2vec2-large-xlsr-53) and was fine-tuned on Common Voice zh-HK 6.1.0. The training code is similar to that of [user ctl](https://huggingface.co/ctl/wav2vec2-large-xlsr-cantonese), except that the number of training epochs was 80 (doubled) and the fp16_backend was apex. The model was trained on a single RTX 3090 using the Docker image nvidia/cuda:11.1-cudnn8-devel. The CER is 15.11% when evaluated against the Common Voice zh-HK test set.

# Result (CER)

15.11%

# Source Code

See the GitHub repo [cantonese-selfish-project](https://github.com/scottykwok/cantonese-selfish-project/) and the [demo video](https://youtu.be/k_9RQ-ilGEc).

# Usage

```python
import soundfile as sf
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# load pretrained processor and model
processor = Wav2Vec2Processor.from_pretrained("scottykwok/wav2vec2-large-xlsr-cantonese")
model = Wav2Vec2ForCTC.from_pretrained("scottykwok/wav2vec2-large-xlsr-cantonese")

# load audio - must be 16kHz mono
audio_input, sample_rate = sf.read('audio.wav')

# pad input values and return pt tensor
input_values = processor(audio_input, sampling_rate=sample_rate, return_tensors="pt").input_values

# INFERENCE
# retrieve logits & take argmax (no gradients needed for inference)
with torch.no_grad():
    logits = model(input_values).logits
predicted_ids = torch.argmax(logits, dim=-1)

# transcribe
transcription = processor.decode(predicted_ids[0])
print("-" * 20)
print("Transcription:\n", transcription.lower())
print("-" * 20)
```
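The model expects 16 kHz mono input. If your recording has a different sample rate or is stereo, resample it before running inference. Below is a minimal sketch using librosa; the file name `audio_48k.wav` is only a placeholder and librosa is an assumption, not part of the original instructions.

```python
import librosa
import soundfile as sf

# hypothetical input file at an arbitrary sample rate; replace with your own recording
speech, sr = librosa.load('audio_48k.wav', sr=16000, mono=True)  # resample to 16 kHz mono

# write out a 16 kHz mono WAV that the snippet above can read with soundfile
sf.write('audio.wav', speech, 16000)
```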
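The model card does not include the script used to obtain the 15.11% CER. Purely for illustration, a sketch of one way to evaluate against the Common Voice zh-HK test set is shown below. It assumes an older `datasets` version where the `common_voice` loading script and `load_metric("cer")` (which requires `jiwer`) are available, 48 kHz source clips, and a simple punctuation filter; it is not the author's evaluation code.

```python
import re
import torch
import torchaudio
from datasets import load_dataset, load_metric
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# NOTE: illustrative sketch only, not the original evaluation script
processor = Wav2Vec2Processor.from_pretrained("scottykwok/wav2vec2-large-xlsr-cantonese")
model = Wav2Vec2ForCTC.from_pretrained("scottykwok/wav2vec2-large-xlsr-cantonese")

test_set = load_dataset("common_voice", "zh-HK", split="test")
cer = load_metric("cer")

chars_to_ignore = r'[\,\?\.\!\-\;\:"\“\%\‘\”\�]'
resampler = torchaudio.transforms.Resample(48_000, 16_000)  # assumes 48 kHz clips

def prepare(batch):
    # load the clip, downsample to 16 kHz, and strip punctuation from the reference
    speech, _ = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech).squeeze().numpy()
    batch["sentence"] = re.sub(chars_to_ignore, "", batch["sentence"])
    return batch

test_set = test_set.map(prepare)

def predict(batch):
    inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(inputs.input_values).logits
    batch["prediction"] = processor.batch_decode(torch.argmax(logits, dim=-1))
    return batch

results = test_set.map(predict, batched=True, batch_size=8)
print("CER: {:.2f}%".format(100 * cer.compute(predictions=results["prediction"],
                                              references=results["sentence"])))
```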