---
language: sv
datasets:
  - common_voice
  - NST Swedish ASR Database
  - P4
metrics:
  - wer
tags:
  - audio
  - automatic-speech-recognition
  - speech
license: cc0-1.0
model-index:
  - name: Wav2vec 2.0 large VoxRex Swedish 4-gram
---

# KBLab's wav2vec 2.0 large VoxRex Swedish (C) with 4-gram model

Training of the acoustic model is the work of KBLab. See VoxRex-C for more details. This repo extends the acoustic model with a social media 4-gram language model for boosted performance.

## Model description

VoxRex-C is extended with a 4-gram language model estimated on a subset of The Swedish Culturomics Gigaword Corpus from Språkbanken. The subset contains 40M words from the social media genre, written between 2010 and 2015.
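For intuition, an n-gram language model assigns higher probability to sequences that resemble its training text, which is what lets it re-rank CTC hypotheses during decoding. A minimal sketch of querying such a model with the `kenlm` Python bindings; the file name `4gram.arpa` is an illustrative assumption, not an artifact shipped with this repo:

```python
import kenlm  # pip install kenlm

# Illustrative file name: assumed to be a 4-gram ARPA model like the
# one described above.
lm = kenlm.Model("4gram.arpa")

# score() returns the log10 probability of a sequence. Text resembling
# the social media training corpus should score higher than gibberish.
print(lm.score("DET HÄR ÄR EN MENING", bos=True, eos=True))
print(lm.score("XQZ VJW KKP", bos=True, eos=True))
```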

## How to use

Audio should be downsampled to 16 kHz.

```python
import torch
import torchaudio
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

test_dataset = load_dataset("common_voice", "sv-SE", split="test[:2%]")

processor = Wav2Vec2Processor.from_pretrained("KBLab/wav2vec2-large-voxrex-swedish")
model = Wav2Vec2ForCTC.from_pretrained("KBLab/wav2vec2-large-voxrex-swedish")
resampler = torchaudio.transforms.Resample(48_000, 16_000)

# Preprocessing the dataset: read the audio files as arrays and
# resample them from 48 kHz to 16 kHz.
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset["speech"][:2], sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

predicted_ids = torch.argmax(logits, dim=-1)
print("Prediction:", processor.batch_decode(predicted_ids))
print("Reference:", test_dataset["sentence"][:2])
```

## Training procedure

Text data for the n-gram model is pre-processed by removing characters that are not part of the wav2vec 2.0 vocabulary and uppercasing all remaining characters. Each pre-processed text sample is then stored on its own line in a text file, from which a KenLM model is estimated, as sketched below. See this tutorial for more details.
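A minimal sketch of these two steps. The file names and the exact character set are illustrative assumptions; the vocabulary should match the acoustic model's actual tokenizer vocabulary:

```python
# Assumed character set: uppercase Swedish letters, apostrophe and space.
# Check the model's tokenizer vocabulary for the authoritative set.
VOCAB_CHARS = set("ABCDEFGHIJKLMNOPQRSTUVWXYZÅÄÖ' ")

def preprocess_line(line: str) -> str:
    # Uppercase, then drop any character outside the vocabulary.
    return "".join(ch for ch in line.upper() if ch in VOCAB_CHARS)

# Illustrative file names for the raw corpus and the cleaned output.
with open("gigaword_social_media.txt") as src, open("corpus_clean.txt", "w") as dst:
    for line in src:
        cleaned = preprocess_line(line.strip())
        if cleaned:
            dst.write(cleaned + "\n")

# The 4-gram model can then be estimated with KenLM's lmplz tool, e.g.:
#   lmplz -o 4 < corpus_clean.txt > 4gram.arpa
```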

## Evaluation results

The model was evaluated on the full Common Voice test set version 6.1. VoxRex-C achieved a WER of 9.03% without the language model and 6.47% with the language model.
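For reference, a minimal sketch of how a WER figure like this can be computed from predictions and references, here using the `jiwer` package (an assumed choice; any WER implementation works the same way, and the example sentences are made up):

```python
import jiwer  # pip install jiwer

references = ["DET VAR EN GÅNG", "HON GICK HEM"]
predictions = ["DET VAR EN GÅNG", "HON GICK HEM IGEN"]

# Word error rate: (substitutions + deletions + insertions) divided by
# the number of words in the references.
print(f"WER: {jiwer.wer(references, predictions):.2%}")
```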