---
language: ru
datasets:
- SberDevices/Golos
- bond005/sova_rudevices
- bond005/rulibrispeech
metrics:
- wer
- cer
tags:
- audio
- automatic-speech-recognition
- speech
- xlsr-fine-tuning-week
license: apache-2.0
widget:
- example_title: test sound with Russian speech "нейросети это хорошо"
  src: https://huggingface.co/bond005/wav2vec2-large-ru-golos/resolve/main/test_sound_ru.flac
model-index:
- name: XLSR Wav2Vec2 Russian by Ivan Bondarenko
  results:
  - task:
      name: Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Sberdevices Golos (crowd)
      type: SberDevices/Golos
      args: ru
    metrics:
      - name: Test WER
        type: wer
        value: 10.144
      - name: Test CER
        type: cer
        value: 2.168
  - task:
      name: Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Sberdevices Golos (farfield)
      type: SberDevices/Golos
      args: ru
    metrics:
      - name: Test WER
        type: wer
        value: 20.353
      - name: Test CER
        type: cer
        value: 6.030
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Common Voice ru
      type: common_voice
      args: ru
    metrics:
      - name: Test WER
        type: wer
        value: 18.548
      - name: Test CER
        type: cer
        value: 4.000
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Sova RuDevices
      type: bond005/sova_rudevices
      args: ru
    metrics:
      - name: Test WER
        type: wer
        value: 25.410
      - name: Test CER
        type: cer
        value: 7.965
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Russian Librispeech
      type: bond005/rulibrispeech
      args: ru
    metrics:
      - name: Test WER
        type: wer
        value: 21.872
      - name: Test CER
        type: cer
        value: 4.469
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Voxforge Ru
      type: dangrebenkin/voxforge-ru-dataset
      args: ru
    metrics:
      - name: Test WER
        type: wer
        value: 27.084
      - name: Test CER
        type: cer
        value: 6.986
---

# Wav2Vec2-Large-Ru-Golos

This Wav2Vec2 model is based on [facebook/wav2vec2-large-xlsr-53](https://huggingface.co/facebook/wav2vec2-large-xlsr-53) and fine-tuned for Russian on [Sberdevices Golos](https://huggingface.co/datasets/SberDevices/Golos) with audio augmentations such as pitch shift, acceleration/deceleration of sound, and reverberation.

When using this model, make sure that your speech input is sampled at 16 kHz.
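
If your recordings use a different sampling rate, they need to be resampled first. The following is a minimal sketch of one way to do that, assuming a mono waveform held in a NumPy array and that SciPy is installed; the helper name `resample_to_16khz` is hypothetical and not part of this model's code.

```python
from math import gcd

import numpy as np
from scipy.signal import resample_poly

TARGET_SAMPLING_RATE = 16_000  # the rate this model expects


def resample_to_16khz(waveform: np.ndarray, source_rate: int) -> np.ndarray:
    """Resample a mono waveform to 16 kHz with a polyphase filter (hypothetical helper)."""
    if source_rate == TARGET_SAMPLING_RATE:
        return waveform  # nothing to do
    # reduce the rate ratio to the smallest integer up/down factors
    divisor = gcd(TARGET_SAMPLING_RATE, source_rate)
    up = TARGET_SAMPLING_RATE // divisor
    down = source_rate // divisor
    return resample_poly(waveform, up, down).astype(np.float32)


# synthetic input: one second of a 440 Hz tone recorded at 44.1 kHz
source_rate = 44_100
t = np.arange(source_rate) / source_rate
tone = np.sin(2 * np.pi * 440.0 * t).astype(np.float32)

resampled = resample_to_16khz(tone, source_rate)
print(resampled.shape)
```

The resampled array can then be passed to the processor with `sampling_rate=16000`.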

## Usage

To transcribe audio files, the model can be used as a standalone acoustic model as follows:

```python
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import torch

# load model and processor (feature extractor + tokenizer)
processor = Wav2Vec2Processor.from_pretrained("bond005/wav2vec2-large-ru-golos")
model = Wav2Vec2ForCTC.from_pretrained("bond005/wav2vec2-large-ru-golos")

# load the test part of the Golos dataset and read the first sound file
ds = load_dataset("bond005/sberdevices_golos_10h_crowd", split="test")
audio = ds[0]["audio"]

# extract input features; the audio must be sampled at 16 kHz
processed = processor(
    audio["array"], sampling_rate=audio["sampling_rate"],
    return_tensors="pt", padding="longest"
)  # batch size 1

# retrieve logits without tracking gradients
with torch.no_grad():
    logits = model(processed.input_values, attention_mask=processed.attention_mask).logits

# take argmax and decode
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)[0]
print(transcription)
```
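
The "take argmax and decode" step above is greedy CTC decoding: pick the most likely token for each frame, collapse consecutive repeats, and drop the blank token. The sketch below illustrates the idea in plain Python. The toy vocabulary, the blank id `0`, and the function name `greedy_ctc_decode` are assumptions made for illustration; the real processor uses the model's own vocabulary and its own handling of word boundaries.

```python
from itertools import groupby


def greedy_ctc_decode(frame_ids, id_to_char, blank_id=0):
    """Collapse repeated frame predictions, then remove blanks (illustrative only)."""
    # groupby merges runs of identical consecutive ids into one id
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    # the blank token only separates repeats, so it is dropped from the text
    return "".join(id_to_char[token_id] for token_id in collapsed if token_id != blank_id)


# hypothetical toy vocabulary: id 0 plays the role of the CTC blank
toy_vocab = {0: "<pad>", 1: "д", 2: "а", 3: " "}

# per-frame argmax ids, as they might come out of torch.argmax(logits, dim=-1)
frames = [1, 1, 0, 2, 2, 2, 0, 0, 3, 1, 0, 2]
decoded = greedy_ctc_decode(frames, toy_vocab)
print(decoded)
```

Note that a doubled letter survives only when a blank separates the two occurrences: `[2, 0, 2]` decodes to two characters, while `[2, 2]` decodes to one.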

## Evaluation

This code snippet shows how to evaluate **bond005/wav2vec2-large-ru-golos** on the "crowd" and "farfield" test sets of the Golos dataset.

```python
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import torch
from jiwer import wer, cer  # we need word error rate (WER) and character error rate (CER)

# load the test part of Golos Crowd and remove samples with empty "true" transcriptions
golos_crowd_test = load_dataset("bond005/sberdevices_golos_10h_crowd", split="test")
golos_crowd_test = golos_crowd_test.filter(
    lambda it1: (it1["transcription"] is not None) and (len(it1["transcription"].strip()) > 0)
)

# load the test part of Golos Farfield and remove samples with empty "true" transcriptions
golos_farfield_test = load_dataset("bond005/sberdevices_golos_100h_farfield", split="test")
golos_farfield_test = golos_farfield_test.filter(
    lambda it2: (it2["transcription"] is not None) and (len(it2["transcription"].strip()) > 0)
)

# load model and processor (feature extractor + tokenizer)
model = Wav2Vec2ForCTC.from_pretrained("bond005/wav2vec2-large-ru-golos").to("cuda")
processor = Wav2Vec2Processor.from_pretrained("bond005/wav2vec2-large-ru-golos")

# recognize one sound
def map_to_pred(batch):
    # tokenize and vectorize
    processed = processor(
        batch["audio"]["array"], sampling_rate=batch["audio"]["sampling_rate"],
        return_tensors="pt", padding="longest"
    )
    input_values = processed.input_values.to("cuda")
    attention_mask = processed.attention_mask.to("cuda")

    # recognize
    with torch.no_grad():
        logits = model(input_values, attention_mask=attention_mask).logits
    predicted_ids = torch.argmax(logits, dim=-1)

    # decode
    transcription = processor.batch_decode(predicted_ids)
    batch["text"] = transcription[0]
    return batch

# calculate WER and CER on the crowd domain
crowd_result = golos_crowd_test.map(map_to_pred, remove_columns=["audio"])
crowd_wer = wer(crowd_result["transcription"], crowd_result["text"])
crowd_cer = cer(crowd_result["transcription"], crowd_result["text"])
print("Word error rate on the Crowd domain:", crowd_wer)
print("Character error rate on the Crowd domain:", crowd_cer)

# calculate WER and CER on the farfield domain
farfield_result = golos_farfield_test.map(map_to_pred, remove_columns=["audio"])
farfield_wer = wer(farfield_result["transcription"], farfield_result["text"])
farfield_cer = cer(farfield_result["transcription"], farfield_result["text"])
print("Word error rate on the Farfield domain:", farfield_wer)
print("Character error rate on the Farfield domain:", farfield_cer)
```

*Result (WER, %)*:

| "crowd" | "farfield" |
|---------|------------|
|  10.144 |     20.353 |

*Result (CER, %)*:

| "crowd" | "farfield" |
|---------|------------|
|   2.168 |      6.030 |
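
The tables report percentages, whereas `jiwer` returns error rates as fractions, so multiply its output by 100 to compare. To make clear what the two metrics measure, here is a minimal pure-Python sketch based on Levenshtein distance. It is not `jiwer`'s implementation; the function names are hypothetical, and in this sketch spaces count as characters for CER.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two sequences (words or characters)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_item in enumerate(reference, start=1):
        current = [i]
        for j, hyp_item in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_item != hyp_item)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def error_rate_percent(references, hypotheses, unit="word"):
    """Total edit errors divided by total reference length, as a percentage."""
    errors = 0
    total = 0
    for ref, hyp in zip(references, hypotheses):
        ref_items = ref.split() if unit == "word" else list(ref)
        hyp_items = hyp.split() if unit == "word" else list(hyp)
        errors += edit_distance(ref_items, hyp_items)
        total += len(ref_items)
    return 100.0 * errors / total


# toy example: the hypothesis drops the last letter of the last word
references = ["нейросети это хорошо"]
hypotheses = ["нейросети это хорош"]

toy_wer = error_rate_percent(references, hypotheses, unit="word")
toy_cer = error_rate_percent(references, hypotheses, unit="char")
print(f"WER: {toy_wer:.3f}%  CER: {toy_cer:.3f}%")
```

One wrong word out of three gives a WER of about 33.3%, while one missing character out of twenty gives a CER of 5%, which is why CER is typically much lower than WER on the same data.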

The evaluation script for other datasets, including Russian Librispeech and SOVA RuDevices, is available on my Kaggle page: https://www.kaggle.com/code/bond005/wav2vec2-ru-eval

## Citation
If you want to cite this model, you can use this:

```bibtex
@misc{bondarenko2022wav2vec2-large-ru-golos,
  title={XLSR Wav2Vec2 Russian by Ivan Bondarenko},
  author={Bondarenko, Ivan},
  publisher={Hugging Face},
  journal={Hugging Face Hub},
  howpublished={\url{https://huggingface.co/bond005/wav2vec2-large-ru-golos}},
  year={2022}
}
```