Fine-tuned XLS-R 1B model for speech recognition in Polish

This model is facebook/wav2vec2-xls-r-1b fine-tuned on Polish using the train and validation splits of Common Voice 8.0, Multilingual LibriSpeech, and Voxpopuli. When using this model, make sure that your speech input is sampled at 16 kHz.
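As a quick sanity check for the 16 kHz requirement, the sketch below reads the sampling rate from a WAV header using only Python's standard `wave` module. The helper names `wav_sample_rate` and `needs_resampling` are hypothetical, not part of HuggingSound or transformers, and the check only covers PCM WAV files (not MP3).

```python
import wave

EXPECTED_RATE = 16_000  # sampling rate (Hz) the model card asks for


def wav_sample_rate(path):
    """Return the sampling rate (Hz) stored in a PCM WAV file header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(path, expected_rate=EXPECTED_RATE):
    """Return True when the file's rate differs from the expected rate."""
    return wav_sample_rate(path) != expected_rate
```

If a file does need resampling, loading it with `librosa.load(path, sr=16_000)`, as the inference script in this card does, resamples it on the fly.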

This model was fine-tuned with the HuggingSound tool, thanks to GPU credits generously provided by OVHcloud :)


Using the HuggingSound library:

from huggingsound import SpeechRecognitionModel

model = SpeechRecognitionModel("jonatasgrosman/wav2vec2-xls-r-1b-polish")
audio_paths = ["/path/to/file.mp3", "/path/to/another_file.wav"]

transcriptions = model.transcribe(audio_paths)

Writing your own inference script:

import torch
import librosa
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

LANG_ID = "pl"
MODEL_ID = "jonatasgrosman/wav2vec2-xls-r-1b-polish"
SAMPLES = 10  # number of test clips to transcribe (assumed value; adjust as needed)

test_dataset = load_dataset("common_voice", LANG_ID, split=f"test[:{SAMPLES}]")

processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)

# Preprocessing the datasets.
# We need to read the audio files as arrays
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = librosa.load(batch["path"], sr=16_000)
    batch["speech"] = speech_array
    batch["sentence"] = batch["sentence"].upper()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

predicted_ids = torch.argmax(logits, dim=-1)
predicted_sentences = processor.batch_decode(predicted_ids)
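The script above stops once `predicted_sentences` is available. One way to score those predictions against the reference sentences is sketched below, assuming word error rate (WER) is the metric of interest. The helper `word_error_rate` is hypothetical and not part of transformers or HuggingSound, and the example strings are illustrative rather than real model output.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()

    # previous[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current

    # Guard against an empty reference to avoid dividing by zero.
    return previous[-1] / max(len(ref_words), 1)


# Illustrative pair: one substituted word out of three reference words.
example_wer = word_error_rate("ALA MA KOTA", "ALA MA PSA")
```

With the variables from the script, the same function could be applied pairwise, for example over `zip(test_dataset["sentence"], predicted_sentences)`.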

Evaluation Commands

  1. To evaluate on mozilla-foundation/common_voice_8_0 with split test
python eval.py --model_id jonatasgrosman/wav2vec2-xls-r-1b-polish --dataset mozilla-foundation/common_voice_8_0 --config pl --split test
  2. To evaluate on speech-recognition-community-v2/dev_data
python eval.py --model_id jonatasgrosman/wav2vec2-xls-r-1b-polish --dataset speech-recognition-community-v2/dev_data --config pl --split validation --chunk_length_s 5.0 --stride_length_s 1.0
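The second command passes `--chunk_length_s 5.0 --stride_length_s 1.0`. Assuming `eval.py` forwards these values to a chunked-inference routine in which each chunk overlaps its neighbours by the stride on both sides (an assumption, since the script itself is not shown here), the resulting windows over a long recording can be pictured with the simplified sketch below. The function `chunk_windows` is hypothetical and does not reproduce any library's exact logic.

```python
def chunk_windows(total_s, chunk_length_s=5.0, stride_length_s=1.0):
    """Return (start, end) times in seconds of overlapping chunks.

    Consecutive chunks advance by chunk_length_s - 2 * stride_length_s,
    so neighbouring chunks share 2 * stride_length_s seconds of audio.
    """
    step = chunk_length_s - 2 * stride_length_s
    if step <= 0:
        raise ValueError("chunk length must exceed twice the stride")

    windows = []
    start = 0.0
    while start < total_s:
        end = min(start + chunk_length_s, total_s)
        windows.append((start, end))
        if end >= total_s:
            break  # the last chunk reached the end of the audio
        start += step
    return windows


# A 12-second recording with the flag values from the command above.
windows = chunk_windows(12.0)
```

The overlap gives the model some acoustic context on each side of a chunk, which is the usual motivation for combining a chunk length with a stride on long recordings.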


If you want to cite this model you can use this:

  title={Fine-tuned {XLS-R} 1{B} model for speech recognition in {P}olish},
  author={Grosman, Jonatas},