---
language: ru
datasets:
- SberDevices/Golos
- common_voice
metrics:
- wer
- cer
tags:
- audio
- automatic-speech-recognition
- speech
- common_voice
- SberDevices/Golos
license: apache-2.0
widget:
- example_title: test Russian speech "нейросети это хорошо" (in English, "neural networks are good")
  src: https://huggingface.co/bond005/wav2vec2-large-ru-golos-with-lm/resolve/main/test_sound_ru.flac
model-index:
- name: XLSR Wav2Vec2 Russian with Language Model by Ivan Bondarenko
  results:
  - task:
      name: Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Sberdevices Golos (crowd)
      type: SberDevices/Golos
      args: ru
    metrics:
       - name: Test WER
         type: wer
         value: 4.272
       - name: Test CER
         type: cer
         value: 0.983
  - task:
      name: Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Sberdevices Golos (farfield)
      type: SberDevices/Golos
      args: ru
    metrics:
       - name: Test WER
         type: wer
         value: 11.405
       - name: Test CER
         type: cer
         value: 3.628
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Common Voice ru
      type: common_voice
      args: ru
    metrics:
    - name: Test WER
      type: wer
      value: 19.053
    - name: Test CER
      type: cer
      value: 4.876
---
# Wav2Vec2-Large-Ru-Golos-With-LM

This Wav2Vec2 model is based on [facebook/wav2vec2-large-xlsr-53](https://huggingface.co/facebook/wav2vec2-large-xlsr-53) and fine-tuned for Russian on [SberDevices Golos](https://huggingface.co/datasets/SberDevices/Golos) with audio augmentations such as pitch shifting, speeding up or slowing down the audio, and reverberation.

The 2-gram language model is built on a Russian text corpus obtained from three open sources:

- a random 10% subset of [Taiga](https://tatianashavrina.github.io/taiga_site)
- [Russian Wikipedia](https://ru.wikipedia.org)
- [Russian Wikinews](https://ru.wikinews.org)
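
To illustrate what a 2-gram model captures, here is a toy sketch that counts word pairs and scores a sentence with add-one smoothing. It is not the tooling used to build the language model shipped with this checkpoint; the function names and the three-sentence corpus are made up for the example.

```python
import math
from collections import Counter


def train_bigram_counts(sentences):
    """Count contexts and word pairs over whitespace-tokenized sentences."""
    context_counts = Counter()
    bigram_counts = Counter()
    vocabulary = set()
    for sentence in sentences:
        # <s> and </s> mark the sentence boundaries.
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        vocabulary.update(tokens)
        for prev, word in zip(tokens, tokens[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, word)] += 1
    return context_counts, bigram_counts, vocabulary


def log_prob(sentence, context_counts, bigram_counts, vocabulary):
    """Natural-log probability of a sentence under add-one smoothing."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    vocab_size = len(vocabulary)
    total = 0.0
    for prev, word in zip(tokens, tokens[1:]):
        numerator = bigram_counts[(prev, word)] + 1
        denominator = context_counts[prev] + vocab_size
        total += math.log(numerator / denominator)
    return total


# Hypothetical toy corpus, only for illustration.
corpus = ["найди боевики", "найди телеканал", "покажи телеканал"]
counts = train_bigram_counts(corpus)

# A word order seen in the corpus scores higher than the reversed order.
print(log_prob("найди боевики", *counts) > log_prob("боевики найди", *counts))
```

During decoding, scores of this kind let the decoder prefer word sequences that are plausible in Russian text over acoustically similar but unlikely ones.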

## Usage

When using this model, make sure that your speech input is sampled at 16 kHz.
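
One way to check this before inference is to read the sampling rate from the file header. The sketch below uses only the standard-library `wave` module, so it assumes uncompressed WAV input. The helper names `wav_sample_rate` and `needs_resampling` are hypothetical and are not part of this model's code.

```python
import wave

EXPECTED_RATE = 16_000  # sampling rate (Hz) that this model card asks for


def wav_sample_rate(path):
    """Return the sampling rate (Hz) stored in a WAV file header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(path, expected_rate=EXPECTED_RATE):
    """Return True if the file's rate differs from the expected rate."""
    return wav_sample_rate(path) != expected_rate
```

If a file is not at 16 kHz, `librosa.load(path, sr=16_000)` resamples it while loading.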

You can use this model by writing your own inference script:

```python
import os
import warnings

import librosa
import nltk
import numpy as np

import torch
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2ProcessorWithLM

MODEL_ID = "bond005/wav2vec2-large-ru-golos-with-lm"
DATASET_ID = "bond005/sberdevices_golos_10h_crowd"
SAMPLES = 20

nltk.download('punkt')
# os.cpu_count() can return None, so fall back to a single process.
num_processes = os.cpu_count() or 1

test_dataset = load_dataset(DATASET_ID, split=f"test[:{SAMPLES}]")
processor = Wav2Vec2ProcessorWithLM.from_pretrained(MODEL_ID)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)

# Preprocessing the datasets.
# We need to read the audio files as arrays
def speech_file_to_array_fn(batch):
    speech_array = batch["audio"]["array"]
    batch["speech"] = np.asarray(speech_array, dtype=np.float32)
    return batch

removed_columns = set(test_dataset.column_names)
removed_columns -= {'transcription', 'speech'}
removed_columns = sorted(list(removed_columns))
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    test_dataset = test_dataset.map(
        speech_file_to_array_fn,
        num_proc=num_processes,
        remove_columns=removed_columns
    )

inputs = processor(test_dataset["speech"], sampling_rate=16_000,
                   return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(inputs.input_values,
                   attention_mask=inputs.attention_mask).logits
predicted_sentences = processor.batch_decode(
    logits=logits.numpy(),
    num_processes=num_processes
).text

with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    for i, predicted_sentence in enumerate(predicted_sentences):
        print("-" * 100)
        print("Reference:", test_dataset[i]["transcription"])
        print("Prediction:", predicted_sentence)
```

```text
----------------------------------------------------------------------------------------------------
Reference:  шестьдесят тысяч тенге сколько будет стоить
Prediction: шестьдесят тысяч тенге сколько будет стоить
----------------------------------------------------------------------------------------------------
Reference:  покажи мне на смотрешке телеканал синергия тв
Prediction: покажи мне на смотрешке телеканал синергия тв
----------------------------------------------------------------------------------------------------
Reference:  заказать яблоки зеленые
Prediction: заказать яблоки зеленые
----------------------------------------------------------------------------------------------------
Reference:  алиса закажи килограммовый торт графские развалины
Prediction: алиса закажи килограммовый торт графские развалины
----------------------------------------------------------------------------------------------------
Reference:  ищи телеканал про бизнес на тиви
Prediction: ищи телеканал про бизнес на тви
----------------------------------------------------------------------------------------------------
Reference:  михаила мурадяна
Prediction: михаила мурадяна
----------------------------------------------------------------------------------------------------
Reference:  любовницы две тысячи тринадцать пятнадцатый сезон
Prediction: любовница две тысячи тринадцать пятнадцатый сезон
----------------------------------------------------------------------------------------------------
Reference:  найди боевики
Prediction: найди боевики
----------------------------------------------------------------------------------------------------
Reference:  гетто сезон три
Prediction: гетта сезон три
----------------------------------------------------------------------------------------------------
Reference:  хочу посмотреть ростов папа на телевизоре
Prediction: хочу посмотреть ростов папа на телевизоре
----------------------------------------------------------------------------------------------------
Reference:  сбер какое твое самое ненавистное занятие
Prediction: сбер какое твое самое ненавистное занятие
----------------------------------------------------------------------------------------------------
Reference:  афина чем платят у китайцев
Prediction: афина чем платят у китайцев
----------------------------------------------------------------------------------------------------
Reference:  джой как работает досрочное погашение кредита
Prediction: джой как работает досрочное погашение кредита
----------------------------------------------------------------------------------------------------
Reference:  у тебя найдется люк кейдж
Prediction: у тебя найдется люк кейдж
----------------------------------------------------------------------------------------------------
Reference:  у тебя будет лучшая часть пинк
Prediction: у тебя будет лучшая часть пинк
----------------------------------------------------------------------------------------------------
Reference:  пожалуйста пополните мне счет
Prediction: пожалуйста пополните мне счет
----------------------------------------------------------------------------------------------------
Reference:  анне павловне шабуровой
Prediction: анне павловне шабуровой
----------------------------------------------------------------------------------------------------
Reference:  врубай на смотрешке муз тв
Prediction: врубай на смотрешке муз тиви
----------------------------------------------------------------------------------------------------
Reference:  найди на смотрешке лдпр тв
Prediction: найди на смотрешке лдпр тв
----------------------------------------------------------------------------------------------------
Reference:  сбер мне нужен педикюр забей мне место
Prediction: сбер мне нужен педикюр забелье место
```


A Google Colab version of [this script](https://colab.research.google.com/drive/1SnQmrt6HmMNV-zK-UCPajuwl1JvoCqbX?usp=sharing) is also available.

## Evaluation
This model was evaluated on the test subsets of [SberDevices Golos](https://huggingface.co/datasets/SberDevices/Golos) and [Common Voice 6.0](https://huggingface.co/datasets/common_voice) (Russian part). It was trained only on the training subset of SberDevices Golos.
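
The reported metrics are word error rate (WER) and character error rate (CER). The exact scoring script is not given here, so the following is only a minimal sketch of how both metrics are commonly defined: the Levenshtein distance between reference and prediction, divided by the reference length, as a percentage. It assumes whitespace tokenization for WER, counts spaces as characters for CER, and applies no text normalization. The function names are illustrative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, ref_item in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_item in enumerate(hyp, start=1):
            cost = 0 if ref_item == hyp_item else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(references, hypotheses, unit="word"):
    """Corpus-level error rate in percent; unit is "word" (WER) or "char" (CER)."""
    total_errors = 0
    total_units = 0
    for ref, hyp in zip(references, hypotheses):
        ref_units = ref.split() if unit == "word" else list(ref)
        hyp_units = hyp.split() if unit == "word" else list(hyp)
        total_errors += edit_distance(ref_units, hyp_units)
        total_units += len(ref_units)
    if total_units == 0:
        raise ValueError("references must contain at least one unit")
    return 100.0 * total_errors / total_units


# Two reference/prediction pairs taken from the sample output above.
references = ["гетто сезон три", "найди боевики"]
hypotheses = ["гетта сезон три", "найди боевики"]
print(f"WER: {error_rate(references, hypotheses, unit='word'):.2f}%")  # WER: 20.00%
print(f"CER: {error_rate(references, hypotheses, unit='char'):.2f}%")  # CER: 3.57%
```

Errors and lengths are summed over the whole corpus before dividing, so long utterances weigh more than short ones.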

## Citation
To cite this model, you can use the following BibTeX entry:

```bibtex
@misc{bondarenko2022wav2vec2-large-ru-golos,
  title={XLSR Wav2Vec2 Russian with 2-gram Language Model by Ivan Bondarenko},
  author={Bondarenko, Ivan},
  publisher={Hugging Face},
  journal={Hugging Face Hub},
  howpublished={\url{https://huggingface.co/bond005/wav2vec2-large-ru-golos-with-lm}},
  year={2022}
}
```