language:
  - da
datasets:
  - common-voice-9
  - nst
tags:
  - speech-to-text
  - hf-asr-leaderboard
license: apache-2.0
model-index:
  - name: xls-r-300m-nst-cv9-da
    results:
      - task:
          name: Automatic Speech Recognition
          type: automatic-speech-recognition
        dataset:
          name: Common Voice 9.0 (Danish)
          type: mozilla-foundation/common_voice_9_0
          config: default
          split: test
          args:
            language: da
        metrics:
          - name: Test WER
            type: wer
            value: 10.8
      - task:
          name: Automatic Speech Recognition
          type: automatic-speech-recognition
        dataset:
          name: Alvenir ASR da eval
          type: Alvenir/alvenir_asr_da_eval
          config: default
          split: test
          args:
            language: da
        metrics:
          - name: Test WER
            type: wer
            value: 8.2

xls-r-300m-danish-nst-cv9

This is a version of facebook/wav2vec2-xls-r-300m fine-tuned for Danish ASR on the training set of the public NST dataset and the Danish part of Common Voice 9. The model was trained on audio sampled at 16 kHz, so make sure your input uses the same sample rate.
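
If your recordings use a different sample rate, they need to be resampled before being passed to the model. The following is a minimal sketch, assuming NumPy and SciPy are installed; the helper name `resample_to_16k` and the synthetic tone are illustrative and not part of this repository.

```python
from math import gcd

import numpy as np
from scipy.signal import resample_poly

TARGET_SR = 16_000  # sample rate the model expects


def resample_to_16k(audio: np.ndarray, orig_sr: int) -> np.ndarray:
    """Resample a mono waveform to 16 kHz with polyphase filtering."""
    if orig_sr == TARGET_SR:
        return audio
    # Reduce the ratio TARGET_SR / orig_sr to the smallest integer factors.
    factor = gcd(TARGET_SR, orig_sr)
    up, down = TARGET_SR // factor, orig_sr // factor
    return resample_poly(audio, up, down)


# One second of a 440 Hz tone at 48 kHz (synthetic example data).
t = np.arange(48_000) / 48_000
tone_48k = np.sin(2 * np.pi * 440 * t).astype(np.float32)
tone_16k = resample_to_16k(tone_48k, orig_sr=48_000)
print(tone_16k.shape)
```

When the audio is loaded through `datasets`, casting the column with `ds.cast_column("audio", Audio(sampling_rate=16_000))` resamples on the fly instead.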

The model was trained with fairseq using this config for 120,000 steps.

Usage

import torch
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# load model and processor (feature extractor + tokenizer)
processor = Wav2Vec2Processor.from_pretrained(
    "chcaa/xls-r-300m-nst-cv9-da")
model = Wav2Vec2ForCTC.from_pretrained(
    "chcaa/xls-r-300m-nst-cv9-da")

# load dataset and read soundfiles
ds = load_dataset("Alvenir/alvenir_asr_da_eval", split="test")

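# extract input features; passing sampling_rate makes the processor
# raise an error if the audio is not sampled at 16 kHz
audio = ds[0]["audio"]
input_values = processor(
    audio["array"], sampling_rate=audio["sampling_rate"],
    return_tensors="pt", padding="longest"
).input_values  # Batch size 1

# retrieve logits (no gradients are needed for inference)
with torch.no_grad():
    logits = model(input_values).logits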

# take argmax and decode
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)
print(transcription)
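
The argmax-and-decode step above is greedy CTC decoding: take the most likely token for each frame, merge consecutive repeats, then drop the blank token. The sketch below illustrates this on a toy example in pure Python. The vocabulary, the blank id `0`, and the helper `greedy_ctc_decode` are made up for illustration; the real mapping lives in the processor's tokenizer, where `|` is the default word delimiter.

```python
from itertools import groupby

# Toy vocabulary (hypothetical): id 0 is the CTC blank, "|" separates words.
VOCAB = {0: "<pad>", 1: "|", 2: "h", 3: "e", 4: "j", 5: "d", 6: "u"}
BLANK_ID = 0


def greedy_ctc_decode(frame_ids: list[int]) -> str:
    """Collapse repeated ids, remove blanks, and map ids to text."""
    # 1) merge runs of identical ids: [2, 2, 0, 3] -> [2, 0, 3]
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    # 2) drop blanks, which only serve to separate genuine repeats
    tokens = [VOCAB[i] for i in collapsed if i != BLANK_ID]
    # 3) join characters and turn word delimiters into spaces
    return "".join(tokens).replace("|", " ").strip()


# Per-frame argmax ids, in the form torch.argmax(logits, dim=-1)[0].tolist() returns.
frame_ids = [2, 2, 0, 3, 3, 4, 0, 1, 1, 5, 0, 6, 6]
print(greedy_ctc_decode(frame_ids))  # hej du
```

Greedy decoding uses no language model, so each frame is decided independently of the surrounding words.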

Performance

The table below shows the word error rate (WER) of four Danish ASR models on three publicly available datasets (lower is better). WER is given as a fraction, so 0.082 corresponds to 8.2%.

| Model | Alvenir | NST | CV9.0 |
|---|---|---|---|
| Alvenir/wav2vec2-base-da-ft-nst | 0.202 | 0.099 | 0.238 |
| chcaa/alvenir-wav2vec2-base-da-nst-cv9 | 0.233 | 0.126 | 0.256 |
| chcaa/xls-r-300m-nst-cv9-da | 0.105 | 0.060 | 0.119 |
| chcaa/xls-r-300m-danish-nst-cv9 | 0.082 | 0.051 | 0.108 |
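
For reference, WER is the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and the model output, divided by the number of reference words. Below is a minimal pure-Python sketch of that definition; the function name `word_error_rate` is illustrative, and the scores in the table may have been computed with different tooling and text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edits needed to turn the reference words seen so far
    # into the first j hypothesis words
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of four reference words.
print(word_error_rate("jeg bor i aarhus", "jeg bor i århus"))  # 0.25
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.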

The model was fine-tuned in collaboration with Alvenir.