techiaith/wav2vec2-base-cy

Automatic Speech Recognition


Language Technologies, Bangor University


	---
	license: apache-2.0
	---

	# Pre-training wav2vec2 models for Welsh speech recognition

	At the moment, the best Welsh speech recognition models are obtained by fine-tuning the https://huggingface.co/facebook/wav2vec2-large-xlsr-53 and https://huggingface.co/facebook/wav2vec2-xls-r-1b models by Facebook/Meta AI.

	This model is an experiment in pre-training with more Welsh-language speech, with the aim of lowering WER scores even further in subsequently fine-tuned models. The work draws heavily on resources and documentation from the Hugging Face examples:

	https://github.com/huggingface/transformers/tree/main/examples/pytorch/speech-pretraining

	This initial base model has been pre-trained with scripts at

	https://github.com/techiaith/docker-wav2vec2-cy/tree/main/train/pre-train

	using English speech from LibriSpeech's minimal subsets (`validation` and `test`), and 184 hours and 47 minutes of Welsh speech from various playlists on YouTube. The script [`build_youtube_playlists_corpus.sh`](https://github.com/techiaith/docker-wav2vec2-cy/blob/main/inference/python/build_youtube_playlists_corpus.sh) lists the playlists used.

	Until we have collected thousands, rather than hundreds, of hours of Welsh speech, the WER scores after fine-tuning will remain very high. The following WER and CER scores are from tests on a Welsh Common Voice test set as well as a [second set of YouTube videos with corrected transcriptions](https://git.techiaith.bangor.ac.uk/data-porth-technolegau-iaith/corpws-profi-adnabod-lleferydd/-/tree/master/data/trawsgrifio).

	| Test Set | WER | CER | WER (+LM) | CER (+LM) |
	| -------- | --- | --- | --------- | --------- |
	| CV CY 10 | 94.83 | 85.55 | 92.31 | 82.25 |
	| YouTube | 95.43 | 90.26 | 93.60 | 89.33 |
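
For readers unfamiliar with the metrics in the table, the following is a minimal sketch of how word error rate (WER) and character error rate (CER) are conventionally defined: the Levenshtein edit distance between the reference and the hypothesis, divided by the reference length and expressed as a percentage. This card does not say which tool produced the scores above, so the function names here (`edit_distance`, `wer`, `cer`) and the example sentences are illustrative assumptions, not the evaluation code actually used.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: minimum substitutions, insertions and deletions."""
    # prev[j] holds the distance between the first i-1 reference items
    # and the first j hypothesis items.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate in percent, for a single utterance."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return 100.0 * edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate in percent, for a single utterance."""
    if not reference:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(reference, hypothesis) / len(reference)


# Hypothetical example sentences, not taken from either test set.
reference = "mae hi'n braf heddiw"
hypothesis = "mae hin braf heddi"
print(f"WER: {wer(reference, hypothesis):.2f}")
print(f"CER: {cer(reference, hypothesis):.2f}")
```

Because insertions count as errors, WER can exceed 100. A score reported for a whole test set is typically computed by summing edit distances and reference lengths over all utterances before dividing, rather than by averaging per-utterance scores.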