techiaith
/

wav2vec2-xlsr-53-ft-cy-en

Language Technologies, Bangor University

	---
	language:
	- cy
	- en
	datasets:
	- common_voice
	metrics:
	- wer
	tags:
	- automatic-speech-recognition
	- speech
	license: apache-2.0
	model-index:
	- name: wav2vec2-xlsr-ft-en-cy
	  results:
	  - task:
	      name: Speech Recognition
	      type: automatic-speech-recognition
	    dataset:
	      name: Common Voice cy
	      type: common_voice
	      args: cy
	    metrics:
	    - name: Test WER
	      type: wer
	      value: 17.70%
	---

	# wav2vec2-xlsr-ft-en-cy

	A speech recognition acoustic model for Welsh and English, fine-tuned from [facebook/wav2vec2-large-xlsr-53](https://huggingface.co/facebook/wav2vec2-large-xlsr-53) on balanced English/Welsh data derived from version 11 of the respective Common Voice datasets (https://commonvoice.mozilla.org/cy/datasets). Custom bilingual Common Voice train/dev and test splits were built using the scripts at https://github.com/techiaith/docker-commonvoice-custom-splits-builder#introduction.
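
As a rough illustration of what "balanced" bilingual data can mean, the sketch below downsamples each language to the size of the smallest one and then shuffles the result. This is not the splits-builder's actual logic (see the linked repository for that); `balance_by_count`, the toy file names, and the choice to balance by clip count rather than by audio duration are all assumptions made for the example.

```python
import random


def balance_by_count(clips_by_lang, seed=0):
    """Downsample every language to the size of the smallest one.

    Hypothetical helper for illustration only; the real splits may be
    balanced differently (for example by total audio duration).
    """
    rng = random.Random(seed)
    target = min(len(clips) for clips in clips_by_lang.values())
    balanced = []
    for lang, clips in sorted(clips_by_lang.items()):
        # Draw `target` clips without replacement from each language
        balanced.extend((lang, clip) for clip in rng.sample(clips, target))
    rng.shuffle(balanced)
    return balanced


# Toy clip lists standing in for Common Voice entries
toy = {
    "cy": [f"cy_{i}.mp3" for i in range(5)],
    "en": [f"en_{i}.mp3" for i in range(12)],
}
mixed = balance_by_count(toy)
print(len(mixed), sum(1 for lang, _ in mixed if lang == "cy"))
```

With the toy input, both languages end up contributing the same number of clips, regardless of how many English clips were available to begin with.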

	Source code and scripts for training wav2vec2-xlsr-ft-en-cy can be found at [https://github.com/techiaith/docker-wav2vec2-cy](https://github.com/techiaith/docker-wav2vec2-cy/blob/main/train/fine-tune/python/run_en_cy.sh).



	## Usage

	The wav2vec2-xlsr-ft-en-cy model can be used directly as follows:

	```python
	import librosa
	import torch

	from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

	processor = Wav2Vec2Processor.from_pretrained("techiaith/wav2vec2-xlsr-ft-en-cy")
	model = Wav2Vec2ForCTC.from_pretrained("techiaith/wav2vec2-xlsr-ft-en-cy")

	# Placeholder path: point this at your own recording
	audio_file = "speech.wav"

	# Load the audio as mono and resample it to the 16 kHz rate the model expects
	audio, rate = librosa.load(audio_file, sr=16000)

	inputs = processor(audio, sampling_rate=16_000, return_tensors="pt", padding=True)

	with torch.no_grad():
	    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

	# greedy decoding
	predicted_ids = torch.argmax(logits, dim=-1)

	print("Prediction:", processor.batch_decode(predicted_ids))

	```
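
To make the greedy decoding step more concrete, here is a minimal pure-Python sketch of what happens to the per-frame argmax IDs when they are decoded: consecutive repeats are merged, the CTC blank token is dropped, and the word delimiter becomes a space. The toy vocabulary, the `<pad>` blank, the `|` delimiter, and the function name `ctc_greedy_collapse` are assumptions for illustration; the model's real vocabulary lives in its tokenizer files.

```python
from itertools import groupby


def ctc_greedy_collapse(ids, id_to_token, blank="<pad>", word_delimiter="|"):
    """Turn per-frame token IDs into text using the CTC collapsing rule."""
    # 1. Merge runs of identical consecutive IDs into a single ID
    collapsed = [key for key, _ in groupby(ids)]
    # 2. Drop the blank token, which only separates repeated characters
    tokens = [id_to_token[i] for i in collapsed if id_to_token[i] != blank]
    # 3. Join characters and turn word delimiters into spaces
    return "".join(tokens).replace(word_delimiter, " ").strip()


# Toy vocabulary, not the model's real one
vocab = {0: "<pad>", 1: "|", 2: "d", 3: "i", 4: "o", 5: "l", 6: "c", 7: "h"}

frame_ids = [2, 2, 0, 3, 3, 4, 0, 5, 5, 0, 0, 6, 7, 1, 1]
print(ctc_greedy_collapse(frame_ids, vocab))     # diolch
print(ctc_greedy_collapse([5, 5, 0, 5], vocab))  # ll
```

The second call shows why the blank matters: without a blank between the two runs of `l`, a doubled letter such as the Welsh "ll" would collapse into a single character.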

	## Evaluation


	On a balanced English+Welsh test set derived from Common Voice version 11, techiaith/wav2vec2-xlsr-ft-en-cy achieves a WER of 17.7%.

	However, when evaluated on language-specific test sets, the model performs considerably better on Welsh than on English.

	| Common Voice Test Set Language | WER (%) | CER (%) |
	| ------------------------------ | ------- | ------- |
	| EN+CY | 17.07 | 7.32 |
	| EN | 27.54 | 11.6 |
	| CY | 7.13 | 2.2 |
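
For readers unfamiliar with the metrics, the sketch below shows how WER and CER are commonly computed: the Levenshtein (edit) distance between reference and hypothesis, over words or characters respectively, divided by the reference length. It is a single-utterance illustration only. The function names and the Welsh example sentence are made up for this sketch, and any text normalisation applied in the actual evaluation is not reproduced here.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate in percent for one utterance."""
    ref_words = reference.split()
    return 100.0 * edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate in percent for one utterance."""
    return 100.0 * edit_distance(list(reference), list(hypothesis)) / len(reference)


reference = "mae hi'n braf heddiw"
hypothesis = "mae hin braf heddiw"
print(wer(reference, hypothesis))  # 25.0 (1 wrong word out of 4)
print(cer(reference, hypothesis))  # 5.0  (1 missing character out of 20)
```

Corpus-level scores are usually obtained by summing edit distances and reference lengths over all utterances before dividing, so a pooled score weights each language by its share of reference words and need not equal the simple mean of the per-language scores.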