---
language: ga
datasets:
- common_voice
metrics:
- wer
tags:
- audio
- automatic-speech-recognition
- speech
- xlsr-fine-tuning-week
license: apache-2.0
model-index:
- name: XLSR Wav2Vec2 Irish by Jim O'Regan
  results:
  - task:
      name: Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Common Voice ga-IE
      type: common_voice
      args: ga-IE
    metrics:
    - name: Test WER
      type: wer
      value: 47.4
---
# Wav2Vec2-Large-XLSR-Irish

Fine-tuned [facebook/wav2vec2-large-xlsr-53](https://huggingface.co/facebook/wav2vec2-large-xlsr-53)
on the [Irish Common Voice dataset](https://huggingface.co/datasets/common_voice).

When using this model, make sure that your speech input is sampled at 16 kHz.

## Usage

The model can be used directly (without a language model) as follows:
```python
import torch
import torchaudio
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

test_dataset = load_dataset("common_voice", "ga-IE", split="test[:2%]")

processor = Wav2Vec2Processor.from_pretrained("jimregan/wav2vec2-large-xlsr-irish-basic")
model = Wav2Vec2ForCTC.from_pretrained("jimregan/wav2vec2-large-xlsr-irish-basic")

# Common Voice audio is 48 kHz; the model expects 16 kHz
resampler = torchaudio.transforms.Resample(48_000, 16_000)

# Preprocessing the datasets.
# We need to read the audio files as arrays
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset["speech"][:2], sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

predicted_ids = torch.argmax(logits, dim=-1)

print("Prediction:", processor.batch_decode(predicted_ids))
print("Reference:", test_dataset["sentence"][:2])
```
## Evaluation

The model can be evaluated as follows on the Irish test data of Common Voice.
```python
import torch
import torchaudio
from datasets import load_dataset, load_metric
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import re

test_dataset = load_dataset("common_voice", "ga-IE", split="test")
wer = load_metric("wer")

processor = Wav2Vec2Processor.from_pretrained("jimregan/wav2vec2-large-xlsr-irish-basic")
model = Wav2Vec2ForCTC.from_pretrained("jimregan/wav2vec2-large-xlsr-irish-basic")
model.to("cuda")

# So, tolower() for Irish is a bit complicated: tAthair -> t-athair
# toupper() is non-deterministic :)
def is_upper_vowel(letter):
    return letter in ['A', 'E', 'I', 'O', 'U', 'Á', 'É', 'Í', 'Ó', 'Ú']

def irish_lower(word):
    if len(word) > 1 and word[0] in ['n', 't'] and is_upper_vowel(word[1]):
        return word[0] + '-' + word[1:].lower()
    else:
        return word.lower()

def irish_lower_sentence(sentence):
    return " ".join([irish_lower(w) for w in sentence.split(" ")])

chars_to_ignore_regex = r'[,\?\.\!\;\:\"\“\%\‘\”\(\)\*]'

def remove_special_characters(sentence):
    tmp = re.sub('’ ', ' ', sentence)
    tmp = re.sub("’$", '', tmp)
    tmp = re.sub('’', '\'', tmp)
    tmp = re.sub(chars_to_ignore_regex, '', tmp)
    sentence = irish_lower_sentence(tmp) + ' '
    return sentence

# Common Voice audio is 48 kHz; the model expects 16 kHz
resampler = torchaudio.transforms.Resample(48_000, 16_000)

# Preprocessing the datasets.
# We need to read the audio files as arrays
def speech_file_to_array_fn(batch):
    batch["sentence"] = remove_special_characters(batch["sentence"])
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)

# Run inference in batches and collect the decoded predictions
def evaluate(batch):
    inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(inputs.input_values.to("cuda"), attention_mask=inputs.attention_mask.to("cuda")).logits
    pred_ids = torch.argmax(logits, dim=-1)
    batch["pred_strings"] = processor.batch_decode(pred_ids)
    return batch

result = test_dataset.map(evaluate, batched=True, batch_size=8)
print("WER: {:.2f}".format(100 * wer.compute(predictions=result["pred_strings"], references=result["sentence"])))
```
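
The Irish-specific lowercasing in the evaluation script keeps t- and n-prefixes separate from the following capitalised vowel, which naive `.lower()` would merge. A standalone sketch of that logic, for a quick sanity check:

```python
# Standalone copy of the Irish-aware lowercasing used in the
# evaluation script above, shown on a few illustrative words.
def is_upper_vowel(letter):
    return letter in ['A', 'E', 'I', 'O', 'U', 'Á', 'É', 'Í', 'Ó', 'Ú']

def irish_lower(word):
    if len(word) > 1 and word[0] in ['n', 't'] and is_upper_vowel(word[1]):
        return word[0] + '-' + word[1:].lower()
    return word.lower()

print(irish_lower("tAthair"))   # -> t-athair
print(irish_lower("nÉireann"))  # -> n-éireann
print(irish_lower("Gaeilge"))   # -> gaeilge
```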
**Test Result**: 43.7 %

## Training

The Common Voice `train` and `validation` datasets were used for training.

The script used for training can be found [here](https://github.com/jimregan/wav2vec2-sprint/blob/main/irish/fine-tune-xlsr-wav2vec2-on-irish-asr-with-transformers.ipynb).