---
language: ru
datasets:
- SberDevices/Golos
metrics:
- wer
- cer
tags:
- audio
- automatic-speech-recognition
- speech
- xlsr-fine-tuning-week
license: apache-2.0
widget:
- example_title: test sound with Russian speech
  src: https://huggingface.co/bond005/wav2vec2-large-ru-golos/resolve/main/test_sound_ru.flac
model-index:
- name: XLSR Wav2Vec2 Russian by Ivan Bondarenko
  results:
  - task:
      name: Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Sberdevices Golos (crowd)
      type: SberDevices/Golos
      args: ru
    metrics:
    - name: Test WER
      type: wer
      value: 5.860
    - name: Test CER
      type: cer
      value: 1.228
  - task:
      name: Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Sberdevices Golos (farfield)
      type: SberDevices/Golos
      args: ru
    metrics:
    - name: Test WER
      type: wer
      value: 15.330
    - name: Test CER
      type: cer
      value: 4.299
---
# Wav2Vec2-Large-Ru-Golos
This Wav2Vec2 model is based on [facebook/wav2vec2-large-xlsr-53](https://huggingface.co/facebook/wav2vec2-large-xlsr-53) and was fine-tuned for Russian on [Sberdevices Golos](https://huggingface.co/datasets/SberDevices/Golos) with audio augmentations such as pitch shift, speeding up and slowing down the audio, and reverberation.
When using this model, make sure that your speech input is sampled at 16 kHz.
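As a quick guard for the 16 kHz requirement, the sampling rate of a WAV file can be read from its header before the audio is passed to the model. The sketch below uses only Python's standard `wave` module; the helper names (`wav_sampling_rate`, `needs_resampling`, `make_silent_wav`) are hypothetical and not part of this model or of `transformers`.

```python
import io
import wave

EXPECTED_RATE = 16_000  # sampling rate the model expects, in Hz


def wav_sampling_rate(source) -> int:
    """Return the sampling rate (Hz) stored in a WAV file's header.

    `source` may be a file path or a binary file-like object.
    """
    with wave.open(source, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(source, expected_rate: int = EXPECTED_RATE) -> bool:
    """Return True if the WAV file is not already at the expected rate."""
    return wav_sampling_rate(source) != expected_rate


def make_silent_wav(rate: int, n_frames: int = 160) -> io.BytesIO:
    """Build a tiny mono 16-bit WAV in memory (hypothetical demo fixture)."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)
        wav_file.setframerate(rate)
        wav_file.writeframes(b"\x00\x00" * n_frames)
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    for rate in (16_000, 44_100):
        print(rate, "needs resampling:", needs_resampling(make_silent_wav(rate)))
```

This only detects a mismatch and only handles WAV input. Actual resampling, or reading other formats such as FLAC, needs an audio library. With the `datasets` library, casting the audio column via `ds.cast_column("audio", Audio(sampling_rate=16_000))` resamples on access.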
## Usage
To transcribe audio files the model can be used as a standalone acoustic model as follows:
```python
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
from datasets import load_dataset
import torch
# load model and processor (feature extractor + tokenizer)
processor = Wav2Vec2Processor.from_pretrained("bond005/wav2vec2-large-ru-golos")
model = Wav2Vec2ForCTC.from_pretrained("bond005/wav2vec2-large-ru-golos")
# load the test part of Golos dataset and read first soundfile
ds = load_dataset("bond005/sberdevices_golos_10h_crowd", split="test")
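# extract input features; passing the sampling rate lets the processor reject a mismatch
sample = ds[0]["audio"]
processed = processor(
    sample["array"], sampling_rate=sample["sampling_rate"],
    return_tensors="pt", padding="longest"
)  # batch size 1
# retrieve logits (no gradients are needed for inference)
with torch.no_grad():
    logits = model(processed.input_values, attention_mask=processed.attention_mask).logits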
# take argmax and decode
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)[0]
print(transcription)
```
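The "take argmax and decode" step in the usage example is greedy CTC decoding: pick the most likely label per frame, collapse consecutive repeats, then drop the blank label. The sketch below illustrates the idea on a toy vocabulary. The vocabulary, the ids, and the function name `greedy_ctc_decode` are hypothetical; it assumes `<pad>` serves as the CTC blank and `|` as the word delimiter, and it is not the tokenizer's actual implementation.

```python
from itertools import groupby

# Hypothetical toy vocabulary; the real one ships with the model's tokenizer.
BLANK = "<pad>"         # assumed CTC blank token
WORD_DELIMITER = "|"    # assumed word-boundary token
ID_TO_TOKEN = {0: BLANK, 1: WORD_DELIMITER, 2: "д", 3: "а", 4: "н", 5: "е", 6: "т"}


def greedy_ctc_decode(frame_ids, id_to_token=ID_TO_TOKEN):
    """Collapse repeated frame labels, drop blanks, turn delimiters into spaces."""
    # 1. merge runs of identical consecutive labels into one label
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    # 2. map ids to tokens
    tokens = [id_to_token[token_id] for token_id in collapsed]
    # 3. remove blanks and replace word delimiters with spaces
    chars = [" " if tok == WORD_DELIMITER else tok for tok in tokens if tok != BLANK]
    return "".join(chars).strip()


# Per-frame argmax ids, as torch.argmax(logits, dim=-1) would give for one utterance
frame_ids = [0, 2, 2, 0, 3, 3, 1, 1, 4, 0, 5, 5, 6, 0, 0]
print(greedy_ctc_decode(frame_ids))
```

The blank is what allows doubled letters: `[3, 0, 3]` decodes to two characters, while `[3, 3]` collapses to one.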
## Evaluation
This code snippet shows how to evaluate **bond005/wav2vec2-large-ru-golos** on the "crowd" and "farfield" test sets of the Golos dataset.
```python
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import torch
from jiwer import wer, cer # we need word error rate (WER) and character error rate (CER)
# load the test part of Golos Crowd and remove samples with empty "true" transcriptions
golos_crowd_test = load_dataset("bond005/sberdevices_golos_10h_crowd", split="test")
golos_crowd_test = golos_crowd_test.filter(
    lambda it1: (it1["transcription"] is not None) and (len(it1["transcription"].strip()) > 0)
)
# load the test part of Golos Farfield and remove samples with empty "true" transcriptions
golos_farfield_test = load_dataset("bond005/sberdevices_golos_100h_farfield", split="test")
golos_farfield_test = golos_farfield_test.filter(
    lambda it2: (it2["transcription"] is not None) and (len(it2["transcription"].strip()) > 0)
)
# load the model under evaluation and its processor
model = Wav2Vec2ForCTC.from_pretrained("bond005/wav2vec2-large-ru-golos").to("cuda")
processor = Wav2Vec2Processor.from_pretrained("bond005/wav2vec2-large-ru-golos")
# recognize one sound
def map_to_pred(batch):
    # tokenize and vectorize
    processed = processor(
        batch["audio"]["array"], sampling_rate=batch["audio"]["sampling_rate"],
        return_tensors="pt", padding="longest"
    )
    input_values = processed.input_values.to("cuda")
    attention_mask = processed.attention_mask.to("cuda")
    # recognize
    with torch.no_grad():
        logits = model(input_values, attention_mask=attention_mask).logits
    predicted_ids = torch.argmax(logits, dim=-1)
    # decode
    transcription = processor.batch_decode(predicted_ids)
    batch["text"] = transcription[0]
    return batch
# calculate WER and CER on the crowd domain
# (jiwer returns fractions; multiply by 100 to compare with the percentages below)
crowd_result = golos_crowd_test.map(map_to_pred, remove_columns=["audio"])
crowd_wer = wer(crowd_result["transcription"], crowd_result["text"])
crowd_cer = cer(crowd_result["transcription"], crowd_result["text"])
print("Word error rate on the Crowd domain:", crowd_wer)
print("Character error rate on the Crowd domain:", crowd_cer)
# calculate WER and CER on the farfield domain
farfield_result = golos_farfield_test.map(map_to_pred, remove_columns=["audio"])
farfield_wer = wer(farfield_result["transcription"], farfield_result["text"])
farfield_cer = cer(farfield_result["transcription"], farfield_result["text"])
print("Word error rate on the Farfield domain:", farfield_wer)
print("Character error rate on the Farfield domain:", farfield_cer)
```
*Result (WER, %)*:
| "crowd" | "farfield" |
|---------|------------|
| 5.860 | 15.330 |
*Result (CER, %)*:
| "crowd" | "farfield" |
|---------|------------|
| 1.228 | 4.299 |
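For reference, both metrics are edit distances normalized by reference length: WER counts word-level substitutions, deletions, and insertions, and CER does the same over characters. The pure-Python sketch below illustrates that definition. It is not jiwer's implementation; the function names and the example sentences are hypothetical, and it assumes corpus-level aggregation (total edits divided by total reference length) with no text normalization.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two sequences, computed row by row."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_item in enumerate(reference, start=1):
        current = [i]
        for j, hyp_item in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_item != hyp_item)
            deletion = previous[j] + 1       # reference item left unmatched
            insertion = current[j - 1] + 1   # extra item in the hypothesis
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def error_rate(references, hypotheses, unit="word"):
    """Corpus-level error rate: total edits / total reference length."""
    split = str.split if unit == "word" else list
    total_edits = 0
    total_length = 0
    for ref, hyp in zip(references, hypotheses):
        ref_units, hyp_units = split(ref), split(hyp)
        total_edits += edit_distance(ref_units, hyp_units)
        total_length += len(ref_units)
    return total_edits / total_length


# Hypothetical example: one wrong character in one of four words
references = ["включи свет на кухне"]
hypotheses = ["включи цвет на кухне"]
print("WER:", error_rate(references, hypotheses, unit="word"))
print("CER:", error_rate(references, hypotheses, unit="char"))
```

A single wrong character makes a whole word count as an error, which is why WER is typically several times higher than CER on the same output.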
## Citation
If you want to cite this model you can use this:
```bibtex
@misc{bondarenko2022wav2vec2-large-ru-golos,
  title={XLSR Wav2Vec2 Russian by Ivan Bondarenko},
  author={Bondarenko, Ivan},
  publisher={Hugging Face},
  journal={Hugging Face Hub},
  howpublished={\url{https://huggingface.co/bond005/wav2vec2-large-ru-golos}},
  year={2022}
}
```