---
language: sv
arxiv: https://arxiv.org/abs/2205.03026
datasets:
- common_voice
- NST_Swedish_ASR_Database
- P4
metrics:
- wer
tags:
- audio
- automatic-speech-recognition
- speech
- hf-asr-leaderboard
license: cc0-1.0
model-index:
- name: Wav2vec 2.0 large VoxRex Swedish
  results:
  - task:
      name: Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Common Voice
      type: common_voice
      args: sv-SE
    metrics:
    - name: Test WER
      type: wer
      value: 8.49
---
# Wav2vec 2.0 large VoxRex Swedish (C)

Fine-tuned version of KB's [VoxRex large](https://huggingface.co/KBLab/wav2vec2-large-voxrex) model using Swedish radio broadcasts, NST and Common Voice data. Evaluated without a language model, WER on the NST + Common Voice test set (2% of total sentences) is **2.5%**. WER on the Common Voice test set is **8.49%** without a language model and **7.37%** with a 4-gram language model.

When using this model, make sure that your speech input is sampled at 16 kHz.
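
One way to catch a mismatched sample rate early is to read it from the file header before running inference. The sketch below is only an illustration and assumes uncompressed WAV input; the helper names `wav_sample_rate` and `needs_resampling` are hypothetical and not part of this model or of `transformers`. It uses Python's standard `wave` module.

```python
import wave

# The model expects 16 kHz input (see the note above).
EXPECTED_RATE = 16_000


def wav_sample_rate(path: str) -> int:
    """Return the sample rate (in Hz) stored in a WAV file header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(path: str, expected_rate: int = EXPECTED_RATE) -> bool:
    """True if the file's sample rate differs from the expected rate."""
    return wav_sample_rate(path) != expected_rate
```

If `needs_resampling` returns `True`, the audio should be resampled to 16 kHz first, for example with `torchaudio.transforms.Resample` as in the Usage section.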

**Update 2022-01-10:** Updated to VoxRex-C version.

**Update 2022-05-16:** Paper is [here](https://arxiv.org/abs/2205.03026).

# Performance\*

![Comparison](comparison.png "Comparison")
<center><del>*<i>Chart shows performance without the additional 20k steps of Common Voice fine-tuning</i></del></center>

## Training
This model has been fine-tuned for 120000 updates on NST + CommonVoice<del> and then for an additional 20000 updates on CommonVoice only. The additional fine-tuning on CommonVoice hurts performance on the NST+CommonVoice test set somewhat and, unsurprisingly, improves it on the CommonVoice test set. It seems to perform generally better though [citation needed]</del>.

![WER during training](chart_1.svg "WER")

## Usage
The model can be used directly (without a language model) as follows:
```python
import torch
import torchaudio
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
test_dataset = load_dataset("common_voice", "sv-SE", split="test[:2%]")
processor = Wav2Vec2Processor.from_pretrained("KBLab/wav2vec2-large-voxrex-swedish")
model = Wav2Vec2ForCTC.from_pretrained("KBLab/wav2vec2-large-voxrex-swedish")
resampler = torchaudio.transforms.Resample(48_000, 16_000)
# Preprocessing the datasets.
# We need to read the audio files as arrays
# (Common Voice clips are resampled from 48 kHz to 16 kHz here)
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch
test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset["speech"][:2], sampling_rate=16_000, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits
predicted_ids = torch.argmax(logits, dim=-1)
print("Prediction:", processor.batch_decode(predicted_ids))
print("Reference:", test_dataset["sentence"][:2])
```
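
To compare predictions with references numerically, word error rate (WER) can be computed as the word-level edit distance divided by the number of reference words. The following is a minimal pure-Python sketch, not the evaluation script behind the figures reported in this card; the function names are illustrative, and no text normalization (casing, punctuation) is applied, so results will depend on how the strings are prepared.

```python
from typing import Sequence


def word_edit_distance(ref_words: Sequence[str], hyp_words: Sequence[str]) -> int:
    """Levenshtein distance between two word sequences."""
    # prev[j] holds the distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


def corpus_wer(references: Sequence[str], hypotheses: Sequence[str]) -> float:
    """Total word edits divided by total reference words over all pairs."""
    if len(references) != len(hypotheses):
        raise ValueError("references and hypotheses must have the same length")
    total_edits = 0
    total_words = 0
    for reference, hypothesis in zip(references, hypotheses):
        ref_words = reference.split()
        total_edits += word_edit_distance(ref_words, hypothesis.split())
        total_words += len(ref_words)
    if total_words == 0:
        raise ValueError("references contain no words")
    return total_edits / total_words
```

With the variables from the snippet above, `corpus_wer(test_dataset["sentence"][:2], processor.batch_decode(predicted_ids))` would give the WER for the two decoded examples as a fraction (multiply by 100 for a percentage).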

## Citation

https://arxiv.org/abs/2205.03026

```
@misc{malmsten2022hearing,
      title={Hearing voices at the National Library -- a speech corpus and acoustic model for the Swedish language}, 
      author={Martin Malmsten and Chris Haffenden and Love Börjeson},
      year={2022},
      eprint={2205.03026},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```