
Fine-tuned XLS-R 1B model for speech recognition in Portuguese

This model is facebook/wav2vec2-xls-r-1b fine-tuned on Portuguese using the train and validation splits of Common Voice 8.0, CORAA, Multilingual TEDx, and Multilingual LibriSpeech. When using this model, make sure that your speech input is sampled at 16 kHz.
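Since the model expects 16 kHz input, it can help to check a file's sampling rate before transcribing it. The following is a minimal sketch using only the Python standard library; it handles uncompressed WAV files only, and the helper names `wav_sample_rate` and `needs_resampling` are hypothetical, not part of HuggingSound or transformers.

```python
import wave

EXPECTED_RATE = 16_000  # sampling rate (Hz) this model card asks for


def wav_sample_rate(path):
    """Return the sampling rate (Hz) stored in a WAV file header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(path, expected_rate=EXPECTED_RATE):
    """Return True if the file's sampling rate differs from the expected one."""
    return wav_sample_rate(path) != expected_rate
```

If a file does need resampling, loading it with `librosa.load(path, sr=16_000)`, as the inference script in this card does, resamples it on the fly.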

This model was fine-tuned with the HuggingSound tool, thanks to GPU credits generously provided by OVHcloud :)


Using the HuggingSound library:

from huggingsound import SpeechRecognitionModel

model = SpeechRecognitionModel("jonatasgrosman/wav2vec2-xls-r-1b-portuguese")
audio_paths = ["/path/to/file.mp3", "/path/to/another_file.wav"]

transcriptions = model.transcribe(audio_paths)

Writing your own inference script:

import torch
import librosa
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

LANG_ID = "pt"
MODEL_ID = "jonatasgrosman/wav2vec2-xls-r-1b-portuguese"
SAMPLES = 10  # number of test samples to transcribe; the value is an assumption, adjust as needed

test_dataset = load_dataset("common_voice", LANG_ID, split=f"test[:{SAMPLES}]")

processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)

# Preprocessing the datasets.
# We need to read the audio files as arrays
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = librosa.load(batch["path"], sr=16_000)
    batch["speech"] = speech_array
    batch["sentence"] = batch["sentence"].upper()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

predicted_ids = torch.argmax(logits, dim=-1)
predicted_sentences = processor.batch_decode(predicted_ids)
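The last two lines of the script perform greedy CTC decoding: take the most likely token per frame, then collapse the frame-level sequence into text. A simplified sketch of that collapsing step is shown here for illustration. The toy vocabulary and the blank index are hypothetical, and the real `processor.batch_decode` also handles details such as the word delimiter token, which this sketch ignores.

```python
def ctc_greedy_decode(frame_ids, id_to_char, blank_id=0):
    """Collapse per-frame token ids into a string, CTC-style.

    Consecutive repeats are merged first, then blank tokens are dropped,
    so a blank between two identical ids keeps both characters.
    """
    chars = []
    previous = None
    for token_id in frame_ids:
        if token_id != previous and token_id != blank_id:
            chars.append(id_to_char[token_id])
        previous = token_id
    return "".join(chars)


# Hypothetical toy vocabulary; id 0 plays the role of the CTC blank.
toy_vocab = {0: "<pad>", 1: "O", 2: "L", 3: "A"}
frame_ids = [0, 1, 1, 0, 2, 2, 2, 0, 3, 3]
decoded = ctc_greedy_decode(frame_ids, toy_vocab)
print(decoded)  # OLA
```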

Evaluation Commands

  1. To evaluate on mozilla-foundation/common_voice_8_0 with split test
python eval.py --model_id jonatasgrosman/wav2vec2-xls-r-1b-portuguese --dataset mozilla-foundation/common_voice_8_0 --config pt --split test
  2. To evaluate on speech-recognition-community-v2/dev_data
python eval.py --model_id jonatasgrosman/wav2vec2-xls-r-1b-portuguese --dataset speech-recognition-community-v2/dev_data --config pt --split validation --chunk_length_s 5.0 --stride_length_s 1.0
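The second command passes `--chunk_length_s 5.0 --stride_length_s 1.0`, which suggests that long recordings are split into overlapping chunks before transcription. The sketch below illustrates one way such windows could be laid out, assuming the stride is applied on both sides of each chunk so that consecutive chunks start `chunk_length_s - 2 * stride_length_s` seconds apart. The `eval.py` script itself is not shown in this card, so its exact behavior is an assumption, and `chunk_windows` is a hypothetical helper.

```python
def chunk_windows(total_s, chunk_length_s=5.0, stride_length_s=1.0):
    """Return (start, end) windows in seconds covering a recording.

    Each window is at most chunk_length_s long and neighbours overlap,
    because the step between starts is shortened by the stride on both sides.
    """
    step = chunk_length_s - 2 * stride_length_s
    if step <= 0:
        raise ValueError("chunk_length_s must exceed twice stride_length_s")

    windows = []
    start = 0.0
    while start < total_s:
        end = min(start + chunk_length_s, total_s)
        windows.append((start, end))
        if end >= total_s:
            break  # the recording is fully covered
        start += step
    return windows


# A 12-second recording with the values used in the command above.
windows = chunk_windows(12.0)
print(windows)  # [(0.0, 5.0), (3.0, 8.0), (6.0, 11.0), (9.0, 12.0)]
```

The overlap gives the model some context on either side of each chunk boundary, which tends to reduce errors on words cut at the edge of a window.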


If you want to cite this model you can use this:

  title={Fine-tuned {XLS-R} 1{B} model for speech recognition in {P}ortuguese},
  author={Grosman, Jonatas},