---
license: cc-by-4.0
language:
- mk
library_name: speechbrain
metrics:
- wer
- cer
pipeline_tag: automatic-speech-recognition
base_model:
- jonatasgrosman/wav2vec2-large-xlsr-53-russian

model-index:
  - name: wav2vec2-aed-macedonian-asr
    results:
      - task:
          name: Automatic Speech Recognition
          type: automatic-speech-recognition
        dataset:
          name: Macedonian Common Voice V.18.0
          type: macedonian-common-voice-v.18.0
        metrics:
          - name: Test WER
            type: test-wer
            value: 5.66
          - name: Test CER
            type: test-cer
            value: 1.43

---

# Fine-tuned XLSR-53-russian large model for speech recognition in Macedonian

Authors:
1. Dejan Porjazovski
2. Ilina Jakimovska
3. Ordan Chukaliev
4. Nikola Stikov

This collaboration is part of the activities of the Center for Advanced Interdisciplinary Research (CAIR) at UKIM.

## Data used for training

To train the model, we used the following data sources:
1. Digital Archive for Ethnological and Anthropological Resources (DAEAR) at the Institute of Ethnology and Anthropology, PMF, UKIM.
2. Audio version of the international journal "EthnoAnthropoZoom" at the Institute of Ethnology and Anthropology, PMF, UKIM.
3. The podcast "Обични луѓе" ("Ordinary People") by Ilina Jakimovska.
4. Scientific videos from the series "Наука за деца" ("Science for Kids") by the KANTAROT foundation.
5. The Macedonian portion of Mozilla Common Voice (version 18.0).


## Model description

This model is an attention-based encoder-decoder (AED). The encoder is a wav2vec 2.0 model initialized from `jonatasgrosman/wav2vec2-large-xlsr-53-russian` and fine-tuned on Macedonian. The decoder is RNN-based.
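At each output step, an AED decoder attends over the encoder's frame-level outputs to build a context vector that conditions the next token prediction. The plain-Python sketch below illustrates one such attention step. It uses simple dot-product scoring as an assumption for brevity: the attention type and layer sizes actually used by this model are set in `hyperparams.yaml` and are not stated in this card, and `attention_step` is a hypothetical name.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_step(encoder_outputs, decoder_state):
    """One dot-product attention step (illustrative assumption, not this model's exact attention).

    encoder_outputs: list of T frame vectors, each of length D
    decoder_state:   query vector of length D
    Returns (weights, context): T weights that sum to 1, and a context
    vector of length D that is the weighted sum of the encoder frames.
    """
    # Score each encoder frame by its dot product with the decoder state.
    scores = [sum(h * q for h, q in zip(frame, decoder_state)) for frame in encoder_outputs]
    weights = softmax(scores)
    dim = len(decoder_state)
    # Context vector: attention-weighted average of the encoder frames.
    context = [sum(w * frame[d] for w, frame in zip(weights, encoder_outputs)) for d in range(dim)]
    return weights, context


# Toy example: 3 encoder frames of dimension 2 and one decoder query.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [2.0, 0.0]
weights, context = attention_step(frames, query)
print([round(w, 3) for w in weights], [round(c, 3) for c in context])
```

Frames that align with the query (the first and third here) receive most of the weight, so the context vector is dominated by them. A real decoder computes the query from its RNN state and learned projections rather than using it directly.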


## Usage

The model is developed using the [SpeechBrain](https://speechbrain.github.io) toolkit. To use it, you need to install SpeechBrain with:
```
pip install speechbrain
```
SpeechBrain relies on the Transformers library, so you also need to install it:
```
pip install transformers
```

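This repository includes an external module, `custom_interface.py`, that defines a custom predictor class named `ASR`. It is loaded with the `foreign_class` function from `speechbrain.inference.interfaces` (`speechbrain.pretrained.interfaces` in SpeechBrain versions before 1.0), which lets you load a custom model class from a Hugging Face repository by passing the module name as `pymodule_file`.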

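```python
import torch
from speechbrain.inference.interfaces import foreign_class

# Use the GPU if one is available, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Download the repo and instantiate the ASR class defined in custom_interface.py.
asr_classifier = foreign_class(source="Macedonian-ASR/wav2vec2-aed-macedonian-asr", pymodule_file="custom_interface.py", classname="ASR")
asr_classifier = asr_classifier.to(device)
# Transcribe a local audio file with the custom ASR class.
predictions = asr_classifier.classify_file("audio_file.wav", device)
print(predictions)
```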
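The XLSR-53 model that the encoder derives from was pretrained on 16 kHz audio. Because this card does not say whether `custom_interface.py` resamples its input, a quick check of the file passed to `classify_file` can catch a mismatch early. The sketch below uses only the standard-library `wave` module, so it handles uncompressed PCM WAV files only; `check_wav_format` is a hypothetical helper, and the 16 kHz mono target is an assumption.

```python
import os
import tempfile
import wave


def check_wav_format(path, expected_rate=16000, expected_channels=1):
    """Return a list of format problems in a PCM WAV file (empty list if none).

    The 16 kHz mono default is an assumption based on XLSR-53 pretraining,
    not a documented requirement of custom_interface.py.
    """
    problems = []
    with wave.open(path, "rb") as wav:
        if wav.getframerate() != expected_rate:
            problems.append(f"sample rate is {wav.getframerate()} Hz, expected {expected_rate} Hz")
        if wav.getnchannels() != expected_channels:
            problems.append(f"{wav.getnchannels()} channels, expected {expected_channels}")
    return problems


# Demo: write 10 ms of 44.1 kHz stereo silence and check it.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.wav")
with wave.open(demo_path, "wb") as out:
    out.setnchannels(2)
    out.setsampwidth(2)  # 16-bit samples
    out.setframerate(44100)
    out.writeframes(b"\x00\x00" * 2 * 441)  # 441 stereo frames
problems = check_wav_format(demo_path)
print(problems)
```

If the check reports a mismatch, resample or downmix the audio with a tool of your choice before transcribing it.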

## Training

To fine-tune this model, you need to run:
```
python train.py hyperparams.yaml
```

The `train.py` file contains the functions needed to train the model, and `hyperparams.yaml` contains the hyperparameters. For more details on training, refer to the [SpeechBrain](https://speechbrain.github.io) documentation.
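The model card metadata reports test WER and CER on Macedonian Common Voice 18.0. As a minimal sketch of how these metrics are defined, both are the Levenshtein edit distance (substitutions, insertions, and deletions) between reference and hypothesis, divided by the reference length. WER counts words and CER counts characters. The helper names below are hypothetical, this sketch counts spaces as characters (some implementations do not), and the text normalization behind the reported scores is not stated here.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences; every edit costs 1."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            curr[j] = min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution (or match, cost 0)
            )
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate in percent (reference must be non-empty)."""
    ref_words = reference.split()
    return 100.0 * edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate in percent, spaces included (reference must be non-empty)."""
    return 100.0 * edit_distance(list(reference), list(hypothesis)) / len(reference)


ref = "добар ден на сите"
hyp = "добар ден на сита"
print(f"WER: {wer(ref, hyp):.2f}%  CER: {cer(ref, hyp):.2f}%")
```

For a full test set, sum the edit distances over all utterances and divide by the total reference length, rather than averaging per-utterance rates. SpeechBrain recipes usually track these metrics with `ErrorRateStats` from `speechbrain.utils.metric_stats`.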