---
language: lt
datasets:
- common_voice
metrics:
- wer
tags:
- audio
- automatic-speech-recognition
- speech
- xlsr-fine-tuning-week
license: apache-2.0
model-index:
- name: Lithuanian XLSR Wav2Vec2 Large 53 by Anton Lozhkov
  results:
  - task:
      name: Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Common Voice lt
      type: common_voice
      args: lt
    metrics:
    - name: Test WER
      type: wer
      value: 49.00
---
# Wav2Vec2-Large-XLSR-53-Lithuanian

Fine-tuned [facebook/wav2vec2-large-xlsr-53](https://huggingface.co/facebook/wav2vec2-large-xlsr-53) on Lithuanian using the [Common Voice](https://huggingface.co/datasets/common_voice) dataset.
When using this model, make sure that your speech input is sampled at 16 kHz.
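If your audio comes at a different rate, resample it first. A minimal sketch with torchaudio (the file name `audio.wav` is just a placeholder):

```python
import torchaudio

speech_array, sampling_rate = torchaudio.load("audio.wav")
if sampling_rate != 16_000:
    # resample to the 16 kHz rate the model expects
    speech_array = torchaudio.transforms.Resample(sampling_rate, 16_000)(speech_array)
```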
## Usage

The model can be used directly (without a language model) as follows:

```python
import torch
import torchaudio
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

test_dataset = load_dataset("common_voice", "lt", split="test[:2%]")

processor = Wav2Vec2Processor.from_pretrained("anton-l/wav2vec2-large-xlsr-53-lithuanian")
model = Wav2Vec2ForCTC.from_pretrained("anton-l/wav2vec2-large-xlsr-53-lithuanian")

resampler = torchaudio.transforms.Resample(48_000, 16_000)

# Preprocessing the datasets.
# We need to read the audio files as arrays
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset["speech"][:2], sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

predicted_ids = torch.argmax(logits, dim=-1)

print("Prediction:", processor.batch_decode(predicted_ids))
print("Reference:", test_dataset["sentence"][:2])
```
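The `argmax` over the logits followed by `batch_decode` is plain greedy CTC decoding. Building on the block above, a hedged sketch of transcribing a single local file (the path `my_audio.wav` is a placeholder; `processor`, `model`, and the imports are reused from the snippet above):

```python
speech_array, sampling_rate = torchaudio.load("my_audio.wav")
speech = torchaudio.transforms.Resample(sampling_rate, 16_000)(speech_array).squeeze().numpy()

inputs = processor(speech, sampling_rate=16_000, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

# greedy CTC decoding: take the most likely token at each frame, then collapse
print(processor.batch_decode(torch.argmax(logits, dim=-1))[0])
```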
## Evaluation

The model can be evaluated as follows on the Lithuanian test data of Common Voice.

```python
import torch
import torchaudio
import urllib.request
import tarfile
import pandas as pd
from tqdm.auto import tqdm
from datasets import load_metric
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Download the raw data instead of using HF datasets to save disk space
data_url = "https://voice-prod-bundler-ee1969a6ce8178826482b88e843c335139bd3fb4.s3.amazonaws.com/cv-corpus-6.1-2020-12-11/lt.tar.gz"
filestream = urllib.request.urlopen(data_url)
data_file = tarfile.open(fileobj=filestream, mode="r|gz")
data_file.extractall()

wer = load_metric("wer")

processor = Wav2Vec2Processor.from_pretrained("anton-l/wav2vec2-large-xlsr-53-lithuanian")
model = Wav2Vec2ForCTC.from_pretrained("anton-l/wav2vec2-large-xlsr-53-lithuanian")
model.to("cuda")

cv_test = pd.read_csv("cv-corpus-6.1-2020-12-11/lt/test.tsv", sep='\t')
clips_path = "cv-corpus-6.1-2020-12-11/lt/clips/"

def clean_sentence(sent):
    sent = sent.lower()
    # normalize apostrophes
    sent = sent.replace("’", "'")
    # replace non-alpha characters with space
    sent = "".join(ch if ch.isalpha() or ch == "'" else " " for ch in sent)
    # remove repeated spaces
    sent = " ".join(sent.split())
    return sent

targets = []
preds = []

for i, row in tqdm(cv_test.iterrows(), total=cv_test.shape[0]):
    row["sentence"] = clean_sentence(row["sentence"])
    speech_array, sampling_rate = torchaudio.load(clips_path + row["path"])
    resampler = torchaudio.transforms.Resample(sampling_rate, 16_000)
    row["speech"] = resampler(speech_array).squeeze().numpy()

    inputs = processor(row["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

    with torch.no_grad():
        logits = model(inputs.input_values.to("cuda"), attention_mask=inputs.attention_mask.to("cuda")).logits

    pred_ids = torch.argmax(logits, dim=-1)

    targets.append(row["sentence"])
    preds.append(processor.batch_decode(pred_ids)[0])

print("WER: {:.2f}".format(100 * wer.compute(predictions=preds, references=targets)))
```
**Test Result**: 49.00 %
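For reference, the WER reported above is the standard word error rate: substitutions plus insertions plus deletions, divided by the number of reference words. A quick sanity check of the same `load_metric("wer")` metric on a toy pair:

```python
from datasets import load_metric

wer = load_metric("wer")
# one substituted word out of three reference words -> WER = 1/3
print(wer.compute(predictions=["labas rytas pasauli"], references=["labas vakaras pasauli"]))
```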
## Training

The Common Voice `train` and `validation` datasets were used for training.
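The training script itself is not included here, but loading those splits follows standard `datasets` syntax; a minimal sketch:

```python
from datasets import load_dataset

# combined train + validation splits, as used for fine-tuning
train_dataset = load_dataset("common_voice", "lt", split="train+validation")
```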