File size: 2,892 Bytes
1ac22db
bd7057e
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1ac22db
a5706d8
 
 
caf900a
3f87c4d
34edfbb
 
8072f41
 
 
 
3f87c4d
545584b
 
 
 
 
 
 
 
 
 
3f87c4d
 
649a76a
a5706d8
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
3687bc0
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
bf35794
3687bc0
 
 
55124d8
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
---
base_model: facebook/w2v-bert-2.0
datasets:
  - common_voice_10_0
metrics:
  - wer
model-index:
  - name: w2v-bert-2.0-uk
    results:
      - task:
          name: Automatic Speech Recognition
          type: automatic-speech-recognition
        dataset:
          name: common_voice_10_0
          type: common_voice_10_0
          config: uk
          split: test
          args: uk
        metrics:
          - name: Wer
            type: wer
            value: 0.0655
---

# wav2vec2-bert-uk

🇺🇦 Join our **Discord server** - https://discord.gg/yVAjkBgmt4 - where we're talking about Data Science, Machine Learning, Deep Learning, and Artificial Intelligence

🇺🇦 Join our Speech Recognition Group in Telegram: https://t.me/speech_recognition_uk

## Google Colab

You can run this model using a Google Colab notebook: https://colab.research.google.com/drive/1QoKw2DWo5a5XYw870cfGE3dJf1WjZgrj?usp=sharing

## Metrics

- AM:
  - WER: 0.0727
  - CER: 0.0151
  - Accuracy: 92.73%
- AM + LM:
  - WER: 0.0655
  - CER: 0.0139
  - Accuracy: 93.45%

## Hyperparameters

This model was trained with the following hparams using 2 RTX A4000:

```
torchrun --standalone --nnodes=1 --nproc-per-node=2 ../train_w2v2_bert.py \
  --custom_set ~/cv10/train.csv \
  --custom_set_eval ~/cv10/test.csv \
  --num_train_epochs 15 \
  --tokenize_config . \
  --w2v2_bert_model facebook/w2v-bert-2.0 \
  --batch 4 \
  --num_proc 5 \
  --grad_accum 1 \
  --learning_rate 3e-5 \
  --logging_steps 20 \
  --eval_step 500 \
  --group_by_length \
  --attention_dropout 0.0 \
  --activation_dropout 0.05 \
  --feat_proj_dropout 0.05 \
  --feat_quantizer_dropout 0.0 \
  --hidden_dropout 0.05 \
  --layerdrop 0.0 \
  --final_dropout 0.0 \
  --mask_time_prob 0.0 \
  --mask_time_length 10 \
  --mask_feature_prob 0.0 \
  --mask_feature_length 10
```

## Usage

```python
# pip install -U torch soundfile transformers

import torch
import soundfile as sf
from transformers import AutoModelForCTC, Wav2Vec2BertProcessor

# Config
model_name = 'Yehor/w2v-bert-2.0-uk'
device = 'cuda:1' # or cpu
sampling_rate = 16_000

# Load the model
asr_model = AutoModelForCTC.from_pretrained(model_name).to(device)
processor = Wav2Vec2BertProcessor.from_pretrained(model_name)

paths = [
  'sample1.wav',
]

# Extract audio
audio_inputs = []
for path in paths:
  audio_input, _ = sf.read(path)
  audio_inputs.append(audio_input)

# Transcribe the audio
inputs = processor(audio_inputs, sampling_rate=sampling_rate).input_features
features = torch.tensor(inputs).to(device)

with torch.no_grad():
  logits = asr_model(features).logits

predicted_ids = torch.argmax(logits, dim=-1)
predictions = processor.batch_decode(predicted_ids)

# Log results
print('Predictions:')
print(predictions)
```

### Licenses

- Acoustic Model: Apache 2
- Language Model (from https://huggingface.co/Yehor/kenlm-ukrainian): cc-by-nc-sa-4.0