---
language: et
datasets:
- common_voice
- NST Estonian ASR Database
metrics:
- wer
- cer
tags:
- audio
- automatic-speech-recognition
- speech
- xlsr-fine-tuning-week
license: apache-2.0
model-index:
- name: XLSR Wav2Vec2 Large 53 - Estonian by Vasilis
  results:
  - task:
      name: Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Common Voice et
      type: common_voice
      args: et
    metrics:
       - name: Test WER
         type: wer
         value: 30.658320
       - name: Test CER
         type: cer
         value: 5.261490
---

# Wav2Vec2-Large-XLSR-53-Estonian

Fine-tuned [facebook/wav2vec2-large-xlsr-53](https://huggingface.co/facebook/wav2vec2-large-xlsr-53) on Estonian using the [Common Voice](https://huggingface.co/datasets/common_voice) dataset.
When using this model, make sure that your speech input is sampled at 16 kHz.

## Usage

The model can be used directly (without a language model) as follows:

```python
import torch
import torchaudio
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

test_dataset = load_dataset("common_voice", "et", split="test[:2%]")

processor = Wav2Vec2Processor.from_pretrained("vasilis/wav2vec2-large-xlsr-53-Estonian")
model = Wav2Vec2ForCTC.from_pretrained("vasilis/wav2vec2-large-xlsr-53-Estonian")

# The clips are assumed to be recorded at 48 kHz; the model expects 16 kHz.
resampler = torchaudio.transforms.Resample(48_000, 16_000)

# Preprocessing the datasets.
# We need to read the audio files as arrays.
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset["speech"][:2], sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

# Greedy CTC decoding: take the most likely token at every frame.
predicted_ids = torch.argmax(logits, dim=-1)

print("Prediction:", processor.batch_decode(predicted_ids))
print("Reference:", test_dataset["sentence"][:2])
```
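
Because the model expects 16 kHz input, it can help to check a file's sampling rate before deciding whether to resample, rather than assuming 48 kHz as the snippet above does. The following is a minimal sketch using only the standard-library `wave` module. It assumes uncompressed WAV input (`wave` cannot read MP3 clips), and the helper names `wav_sample_rate` and `needs_resampling` are hypothetical, not part of any library.

```python
import wave

EXPECTED_RATE = 16_000  # sampling rate the model expects, in Hz


def wav_sample_rate(path):
    """Return the sampling rate (Hz) stored in a WAV file's header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(path, expected_rate=EXPECTED_RATE):
    """Return True if the file must be resampled before it is passed to the processor."""
    return wav_sample_rate(path) != expected_rate
```

A file for which `needs_resampling` returns `True` would then go through a resampler built for its actual rate, as in the evaluation script's per-rate dictionary.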

## Evaluation

The model can be evaluated on the Estonian test data of Common Voice as follows:

```python
import torch
import torchaudio
from datasets import load_dataset, load_metric
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import re

test_dataset = load_dataset("common_voice", "et", split="test")
wer = load_metric("wer")

processor = Wav2Vec2Processor.from_pretrained("vasilis/wav2vec2-large-xlsr-53-Estonian")
model = Wav2Vec2ForCTC.from_pretrained("vasilis/wav2vec2-large-xlsr-53-Estonian")
model.to("cuda")

# Characters stripped from the reference transcripts before scoring.
chars_to_ignore_regex = r"[\,\?\.\!\-\;\:\"\“\%\‘\”\�\']"

# One resampler per source sampling rate; every clip is converted to 16 kHz.
resampler = {
    48_000: torchaudio.transforms.Resample(48_000, 16_000),
    44100: torchaudio.transforms.Resample(44100, 16_000),
    32000: torchaudio.transforms.Resample(32000, 16_000)
}

# Preprocessing the datasets.
# We need to read the audio files as arrays.
def speech_file_to_array_fn(batch):
    batch["sentence"] = re.sub(chars_to_ignore_regex, '', batch["sentence"]).lower()
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler[sampling_rate](speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)

# Run batched inference on the GPU and decode the predictions greedily.
def evaluate(batch):
    inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(inputs.input_values.to("cuda"), attention_mask=inputs.attention_mask.to("cuda")).logits
    pred_ids = torch.argmax(logits, dim=-1)
    batch["pred_strings"] = processor.batch_decode(pred_ids)
    return batch

result = test_dataset.map(evaluate, batched=True, batch_size=8)

print("WER: {:2f}".format(100 * wer.compute(predictions=result["pred_strings"], references=result["sentence"])))
# CER reuses the WER metric on space-separated characters.
print("CER: {:2f}".format(100 * wer.compute(predictions=[" ".join(list(entry)) for entry in result["pred_strings"]], references=[" ".join(list(entry)) for entry in result["sentence"]])))
```

**Test Result**: WER 30.658320 %, CER 5.261490 %
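
To make the two numbers concrete, here is a minimal pure-Python sketch of what word and character error rates measure: the edit distance between prediction and reference, divided by the reference length. It is not the implementation behind `load_metric("wer")`. The function names are hypothetical, corpus-level aggregation (total edits over total reference units) is an assumption, and spaces are dropped for the character rate on the assumption that the metric tokenizes the space-joined strings on whitespace.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, ref_item in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_item in enumerate(hyp, start=1):
            cost = 0 if ref_item == hyp_item else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def error_rate(references, predictions, unit="word"):
    """Total edits divided by total reference units, over all sentence pairs."""
    if unit == "word":
        tokenize = str.split
    else:
        # Characters without whitespace (assumed to mirror the space-joined trick).
        tokenize = lambda text: [ch for ch in text if not ch.isspace()]
    edits = 0
    total = 0
    for ref, hyp in zip(references, predictions):
        ref_units, hyp_units = tokenize(ref), tokenize(hyp)
        edits += edit_distance(ref_units, hyp_units)
        total += len(ref_units)
    return edits / total


references = ["tere tulemast koju"]
predictions = ["tere tulemas koju"]
print("WER:", error_rate(references, predictions, unit="word"))
print("CER:", error_rate(references, predictions, unit="char"))
```

In this toy pair a single missing letter makes one of three words wrong but only one of sixteen characters, which illustrates why the CER is typically much lower than the WER.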

## Training

The Common Voice `train` and `validation` sets were used for fine-tuning
for 20000 steps (approx. 116 epochs). Both the `feature extractor` (`Wav2Vec2FeatureExtractor`) and
`feature projection` (`Wav2Vec2FeatureProjection`) layers were frozen. Only the `encoder` layer (`Wav2Vec2EncoderStableLayerNorm`) was fine-tuned.
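
The freezing scheme described above can be sketched in PyTorch by disabling gradients on the frozen sub-modules. This is not the training script used for this model: `ToyWav2Vec2` is a hypothetical stand-in whose three attributes only borrow the stage names from the text, and its layer sizes are arbitrary.

```python
import torch.nn as nn


class ToyWav2Vec2(nn.Module):
    """Hypothetical stand-in with the three stages named in the text."""

    def __init__(self):
        super().__init__()
        self.feature_extractor = nn.Conv1d(1, 8, kernel_size=3)
        self.feature_projection = nn.Linear(8, 16)
        self.encoder = nn.Linear(16, 16)


def freeze(module):
    """Exclude all of a sub-module's parameters from gradient updates."""
    for param in module.parameters():
        param.requires_grad = False


model = ToyWav2Vec2()
freeze(model.feature_extractor)
freeze(model.feature_projection)

# Only the encoder's parameters remain trainable.
trainable = [name for name, param in model.named_parameters() if param.requires_grad]
print(trainable)
```

With the real model, the same loop would presumably be applied to the corresponding sub-modules of the `Wav2Vec2ForCTC` instance before the optimizer is created.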