---
language: ru
datasets:
- SberDevices/Golos
metrics:
- wer
- cer
tags:
- audio
- automatic-speech-recognition
- speech
- xlsr-fine-tuning-week
license: apache-2.0
widget:
- example_title: test sound with Russian speech
  src: https://huggingface.co/bond005/wav2vec2-large-ru-golos/blob/main/test_sound_ru.flac
model-index:
- name: XLSR Wav2Vec2 Russian by Ivan Bondarenko
  results:
  - task:
      name: Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Sberdevices Golos (crowd)
      type: SberDevices/Golos
      args: ru
    metrics:
       - name: Test WER
         type: wer
         value: 6.358
       - name: Test CER
         type: cer
         value: 1.711
  - task:
      name: Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Sberdevices Golos (farfield)
      type: SberDevices/Golos
      args: ru
    metrics:
       - name: Test WER
         type: wer
         value: 15.402
       - name: Test CER
         type: cer
         value: 4.315
---

# Wav2Vec2-Large-Ru-Golos

This Wav2Vec2 model is based on [facebook/wav2vec2-large-xlsr-53](https://huggingface.co/facebook/wav2vec2-large-xlsr-53) and fine-tuned for Russian on [Sberdevices Golos](https://huggingface.co/datasets/SberDevices/Golos) with audio augmentations such as pitch shifting, speeding up or slowing down the audio, and reverberation.

When using this model, make sure that your speech input is sampled at 16 kHz.
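If a recording was captured at a different rate, it has to be resampled first. The following is a minimal sketch of one way to do that, assuming NumPy and SciPy are installed; `resample_to_16k` is a hypothetical helper name, not part of this model or of `transformers`, and a synthetic tone stands in for a real recording.

```python
from math import gcd

import numpy as np
from scipy.signal import resample_poly

TARGET_RATE = 16_000  # sampling rate expected by the model


def resample_to_16k(waveform: np.ndarray, source_rate: int) -> np.ndarray:
    """Resample a mono waveform to 16 kHz (hypothetical helper)."""
    if source_rate == TARGET_RATE:
        return waveform.astype(np.float32)
    # Reduce the rate ratio to the smallest integer up/down factors.
    divisor = gcd(source_rate, TARGET_RATE)
    up = TARGET_RATE // divisor
    down = source_rate // divisor
    # Polyphase resampling applies an anti-aliasing filter internally.
    return resample_poly(waveform, up, down).astype(np.float32)


# One second of a 440 Hz tone at 44.1 kHz stands in for a real recording.
source_rate = 44_100
t = np.arange(source_rate) / source_rate
tone = np.sin(2 * np.pi * 440 * t)

resampled = resample_to_16k(tone, source_rate)
print(resampled.shape)
```

The resulting array can then be passed to the processor in place of the dataset audio used in the usage example.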

## Usage

To transcribe audio files, the model can be used as a standalone acoustic model as follows:

```python
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
from datasets import load_dataset
import torch

# load the processor (feature extractor + tokenizer) and the model
processor = Wav2Vec2Processor.from_pretrained("bond005/wav2vec2-large-ru-golos")
model = Wav2Vec2ForCTC.from_pretrained("bond005/wav2vec2-large-ru-golos")

# load the test split of the Golos dataset
ds = load_dataset("bond005/sberdevices_golos_10h_crowd", split="test")

# extract input features from the first sound file (batch size 1);
# sampling_rate assumes the audio is already at 16 kHz
processed = processor(
    ds[0]["audio"]["array"],
    sampling_rate=16_000,
    return_tensors="pt",
    padding="longest",
)

# retrieve logits without tracking gradients (inference only)
with torch.no_grad():
    logits = model(processed.input_values, attention_mask=processed.attention_mask).logits

# take the most likely token per frame and decode
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)[0]
print(transcription)
```
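The argmax-and-decode step above is greedy CTC decoding: consecutive repeated token ids are collapsed, blank tokens are dropped, and the word delimiter is turned into a space. The sketch below illustrates that idea in plain Python. The toy vocabulary, the `ctc_greedy_decode` name, and the choice of id 0 as the blank are assumptions made for illustration; the real vocabulary is defined by the model's tokenizer, and `processor.batch_decode` remains the function to use in practice.

```python
from itertools import groupby


def ctc_greedy_decode(ids, id_to_char, blank_id=0, word_delimiter="|"):
    """Collapse repeated ids, drop blanks, and join characters (illustrative only)."""
    # Step 1: merge runs of identical consecutive ids into one id.
    collapsed = [key for key, _ in groupby(ids)]
    # Step 2: remove the CTC blank token.
    chars = [id_to_char[i] for i in collapsed if i != blank_id]
    # Step 3: turn the word delimiter into a space.
    return "".join(chars).replace(word_delimiter, " ").strip()


# Hypothetical toy vocabulary; the real one has many more entries.
toy_vocab = {0: "<pad>", 1: "|", 2: "д", 3: "а", 4: "н", 5: "е", 6: "т"}

# Per-frame argmax ids, as produced by torch.argmax(logits, dim=-1) for one utterance.
frame_ids = [2, 2, 0, 3, 3, 1, 1, 0, 4, 5, 5, 0, 6]

print(ctc_greedy_decode(frame_ids, toy_vocab))  # да нет
```

Because a blank between two identical ids prevents them from being merged, CTC can still emit doubled letters.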

## Citation
To cite this model, use the following BibTeX entry:

```bibtex
@misc{bondarenko2022wav2vec2-large-ru-golos,
  title={XLSR Wav2Vec2 Russian by Ivan Bondarenko},
  author={Bondarenko, Ivan},
  publisher={Hugging Face},
  journal={Hugging Face Hub},
  howpublished={\url{https://huggingface.co/bond005/wav2vec2-large-ru-golos}},
  year={2022}
}
```