---
language:
- da
datasets:
- common-voice-9
- nst
tags:
- speech-to-text
- hf-asr-leaderboard
license: apache-2.0
model-index:
- name: xls-r-300m-nst-cv9-da
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Common Voice 9.0 (Danish)
      type: mozilla-foundation/common_voice_9_0
      config: da
      split: test
      args:
        language: da
    metrics:
    - name: Test WER
      type: wer
      value: 11.9
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Alvenir ASR da eval
      type: Alvenir/alvenir_asr_da_eval
      config: default
      split: test
      args:
        language: da
    metrics:
    - name: Test WER
      type: wer
      value: 10.5
---

# xls-r-300m-nst-cv9-da

This is a version of [facebook/wav2vec2-xls-r-300m](https://huggingface.co/facebook/wav2vec2-xls-r-300m) fine-tuned for Danish ASR on the training set of the public NST dataset and the Danish part of Common Voice 9. The model was trained on 16 kHz audio, so make sure your input uses the same sample rate.
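
If your audio is not already 16 kHz, the `datasets` library can resample it on the fly via its `Audio` feature. A minimal sketch, using the evaluation set from this card as a stand-in for your own data:

```python
from datasets import Audio, load_dataset

# Cast the audio column so every example is decoded at 16 kHz,
# the sample rate the model was trained on.
ds = load_dataset("Alvenir/alvenir_asr_da_eval", split="test")
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))
```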

The model was trained with fairseq using [this config](https://github.com/centre-for-humanities-computing/Gjallarhorn/blob/main/fairseq_configs/finetuning/xlrs_finetune.yaml) for 120,000 steps.


## Usage
```python
import torch
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# load model and processor
processor = Wav2Vec2Processor.from_pretrained(
    "chcaa/xls-r-300m-nst-cv9-da")
model = Wav2Vec2ForCTC.from_pretrained(
    "chcaa/xls-r-300m-nst-cv9-da")

# load dataset and read sound files
ds = load_dataset("Alvenir/alvenir_asr_da_eval", split="test")

# preprocess a single example (batch size 1); pass the sample rate explicitly
input_values = processor(
    ds[0]["audio"]["array"], sampling_rate=16_000,
    return_tensors="pt", padding="longest"
).input_values

# retrieve logits without tracking gradients
with torch.no_grad():
    logits = model(input_values).logits

# take argmax and decode
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)
print(transcription)
```
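
To reproduce WER scores like those reported below, greedy-decode an entire split and score it against the references. A minimal sketch continuing from the snippet above, using the `evaluate` library (an assumption; the card does not state how its numbers were computed, and the reference column name is a guess):

```python
import torch
from evaluate import load

wer_metric = load("wer")

def transcribe(batch):
    # greedy CTC decoding, as in the usage example above
    inputs = processor(
        batch["audio"]["array"], sampling_rate=16_000, return_tensors="pt"
    )
    with torch.no_grad():
        logits = model(inputs.input_values).logits
    batch["prediction"] = processor.batch_decode(
        torch.argmax(logits, dim=-1))[0]
    return batch

ds = ds.map(transcribe)
# "sentence" is an assumed reference column; check the dataset's actual schema
print(wer_metric.compute(predictions=ds["prediction"], references=ds["sentence"]))
```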

## Performance
The table below shows the word error rate (WER) of four Danish ASR models on three publicly available datasets (lower is better).

|Model                                  | [Alvenir](https://huggingface.co/datasets/Alvenir/alvenir_asr_da_eval)|   [NST](https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-19/)|     [CV9.0](https://huggingface.co/datasets/mozilla-foundation/common_voice_9_0)|
|:--------------------------------------|------:|-----:|-----:|
|[Alvenir/wav2vec2-base-da-ft-nst](https://huggingface.co/Alvenir/wav2vec2-base-da-ft-nst)        |  0.202| 0.099| 0.238|
|[chcaa/alvenir-wav2vec2-base-da-nst-cv9](https://huggingface.co/chcaa/alvenir-wav2vec2-base-da-nst-cv9) |  0.233| 0.126| 0.256|
|chcaa/xls-r-300m-nst-cv9-da (this model)   |  0.105| 0.060| 0.119|
|[chcaa/xls-r-300m-danish-nst-cv9](https://huggingface.co/chcaa/xls-r-300m-danish-nst-cv9)        |  0.082| 0.051| 0.108|

The model was fine-tuned in collaboration with [Alvenir](https://alvenir.ai).