metadata

license: apache-2.0
language:
  - cy
tags:
  - speech

Pre-training wav2vec2 models for Welsh speech recognition

At the moment, the best Welsh speech recognition models are obtained by fine-tuning the https://huggingface.co/facebook/wav2vec2-large-xlsr-53 and https://huggingface.co/facebook/wav2vec2-xls-r-1b models from Facebook/Meta AI.

This model is an experiment in pre-training with more Welsh-language speech, with the aim of lowering WER scores even further in subsequently fine-tuned models. The work draws heavily on resources and documentation from the Hugging Face examples:

https://github.com/huggingface/transformers/tree/main/examples/pytorch/speech-pretraining

This initial base model has been pre-trained with scripts at

https://github.com/techiaith/docker-wav2vec2-cy/tree/main/train/pre-train

using English speech from LibriSpeech's minimal subsets (validation and test), and 184 hours and 47 minutes of Welsh speech from various playlists on YouTube. The script build_youtube_playlists_corpus.sh lists the playlists used.

Until we have collected thousands of hours of Welsh speech, rather than hundreds, WER scores after fine-tuning will remain very high. The following WERs are from tests on a Welsh Common Voice test set as well as a second set of YouTube videos with corrected transcriptions.

Test Set	WER	CER	WER (+LM)	CER (+LM)
CV CY 10	94.83	85.55	92.31	82.25
YouTube	95.43	90.26	93.60	89.33
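
For readers unfamiliar with the metrics in the table, the following is a minimal sketch of how WER and CER are conventionally computed: the edit distance between reference and hypothesis, divided by the reference length, at word level and character level respectively. It assumes the figures above are percentages. The function names (edit_distance, wer, cer) and the Welsh example sentence are illustrative only and are not taken from the project's evaluation scripts, which may normalise text differently before scoring.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference item
                curr[j - 1] + 1,     # insertion of a hypothesis item
                prev[j - 1] + cost,  # substitution (or match when cost is 0)
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate as a percentage of the reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return 100.0 * edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate as a percentage of the reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(reference, hypothesis) / len(reference)


# Illustrative sentence pair (hypothetical, not from either test set).
reference = "mae hi'n braf heddiw"
hypothesis = "mae hi braf heddi"
print(f"WER: {wer(reference, hypothesis):.2f}")
print(f"CER: {cer(reference, hypothesis):.2f}")
```

Because insertions count as errors, both rates can exceed 100 when the hypothesis is much longer than the reference.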