metadata
language:
  - pt
license: apache-2.0
tags:
  - automatic-speech-recognition
  - hf-asr-leaderboard
  - mozilla-foundation/common_voice_8_0
  - pt
  - robust-speech-event
datasets:
  - mozilla-foundation/common_voice_8_0
base_model: facebook/wav2vec2-xls-r-1b
model-index:
  - name: XLS-R Wav2Vec2 Portuguese by Jonatas Grosman
    results:
      - task:
          type: automatic-speech-recognition
          name: Automatic Speech Recognition
        dataset:
          name: Common Voice 8
          type: mozilla-foundation/common_voice_8_0
          args: pt
        metrics:
          - type: wer
            value: 8.7
            name: Test WER
          - type: cer
            value: 2.55
            name: Test CER
          - type: wer
            value: 6.04
            name: Test WER (+LM)
          - type: cer
            value: 1.98
            name: Test CER (+LM)
      - task:
          type: automatic-speech-recognition
          name: Automatic Speech Recognition
        dataset:
          name: Robust Speech Event - Dev Data
          type: speech-recognition-community-v2/dev_data
          args: pt
        metrics:
          - type: wer
            value: 24.23
            name: Dev WER
          - type: cer
            value: 11.3
            name: Dev CER
          - type: wer
            value: 19.41
            name: Dev WER (+LM)
          - type: cer
            value: 10.19
            name: Dev CER (+LM)
      - task:
          type: automatic-speech-recognition
          name: Automatic Speech Recognition
        dataset:
          name: Robust Speech Event - Test Data
          type: speech-recognition-community-v2/eval_data
          args: pt
        metrics:
          - type: wer
            value: 18.8
            name: Test WER

Fine-tuned XLS-R 1B model for speech recognition in Portuguese

This is facebook/wav2vec2-xls-r-1b fine-tuned on Portuguese using the train and validation splits of Common Voice 8.0, CORAA, Multilingual TEDx, and Multilingual LibriSpeech. When using this model, make sure that your speech input is sampled at 16 kHz.
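
Because the model expects 16 kHz input, it can help to check a file's declared sample rate before transcribing. The sketch below uses only Python's standard `wave` module, so it covers uncompressed PCM WAV files only. The helper names `wav_sample_rate` and `needs_resampling` are hypothetical and not part of any library. For files at another rate, `librosa.load(path, sr=16_000)`, as used in the inference script in this card, resamples while loading.

```python
import wave

# Rate the model expects, per the note above.
TARGET_SAMPLE_RATE = 16_000


def wav_sample_rate(path):
    """Return the sample rate (in Hz) declared in a PCM WAV file header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(path, target_rate=TARGET_SAMPLE_RATE):
    """Return True when the file's declared rate differs from the target rate."""
    return wav_sample_rate(path) != target_rate
```

A call such as `needs_resampling("/path/to/another_file.wav")` returns `True` when the file should be resampled before being passed to the model.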

This model was fine-tuned with the HuggingSound tool, thanks to GPU credits generously provided by OVHcloud :)

Usage

Using the HuggingSound library:

from huggingsound import SpeechRecognitionModel

model = SpeechRecognitionModel("jonatasgrosman/wav2vec2-xls-r-1b-portuguese")
audio_paths = ["/path/to/file.mp3", "/path/to/another_file.wav"]

transcriptions = model.transcribe(audio_paths)

Writing your own inference script:

import torch
import librosa
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

LANG_ID = "pt"
MODEL_ID = "jonatasgrosman/wav2vec2-xls-r-1b-portuguese"
SAMPLES = 10

test_dataset = load_dataset("common_voice", LANG_ID, split=f"test[:{SAMPLES}]")

processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)

# Preprocessing the datasets.
# We need to read the audio files as arrays
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = librosa.load(batch["path"], sr=16_000)
    batch["speech"] = speech_array
    batch["sentence"] = batch["sentence"].upper()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

predicted_ids = torch.argmax(logits, dim=-1)
predicted_sentences = processor.batch_decode(predicted_ids)
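
The last two lines of the script take the per-frame argmax of the logits and hand the resulting IDs to `batch_decode`. For a CTC model, greedy decoding amounts to collapsing consecutive repeated IDs and then dropping the blank token. The following is a minimal sketch of that idea on a toy example. The vocabulary, `BLANK_ID`, and the function names are illustrative assumptions; the real vocabulary and blank token come from the model's processor, and `batch_decode` also handles details such as word delimiters that are not shown here.

```python
from itertools import groupby

# Toy values for illustration only; the real ones ship with the model's processor.
BLANK_ID = 0
TOY_VOCAB = {1: "O", 2: "L", 3: "A"}


def ctc_greedy_collapse(frame_ids, blank_id=BLANK_ID):
    """Collapse consecutive repeated IDs, then drop blank tokens."""
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    return [token_id for token_id in collapsed if token_id != blank_id]


def decode(frame_ids, vocab=TOY_VOCAB):
    """Map the collapsed IDs to characters and join them into a string."""
    return "".join(vocab[token_id] for token_id in ctc_greedy_collapse(frame_ids))


# Per-frame argmax output, comparable to one row of `predicted_ids`.
frames = [0, 1, 1, 0, 2, 2, 2, 0, 3, 3, 0]
print(decode(frames))  # prints "OLA"
```

A blank between two identical IDs keeps both characters: `decode([2, 0, 2])` yields `"LL"`, whereas `decode([2, 2])` yields `"L"`.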

Evaluation Commands

  1. To evaluate on mozilla-foundation/common_voice_8_0 with split test
python eval.py --model_id jonatasgrosman/wav2vec2-xls-r-1b-portuguese --dataset mozilla-foundation/common_voice_8_0 --config pt --split test
  2. To evaluate on speech-recognition-community-v2/dev_data
python eval.py --model_id jonatasgrosman/wav2vec2-xls-r-1b-portuguese --dataset speech-recognition-community-v2/dev_data --config pt --split validation --chunk_length_s 5.0 --stride_length_s 1.0
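
The metrics reported for this model are word error rate (WER) and character error rate (CER). As a rough illustration of what those numbers mean, the sketch below computes both from the edit distance between a reference and a predicted sentence. It is not the implementation used by `eval.py`, which may rely on a metrics library and apply its own text normalization. The function names are hypothetical, and the reference is assumed to be non-empty.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    previous = list(range(len(hyp) + 1))
    for i, ref_item in enumerate(ref, start=1):
        current = [i]
        for j, hyp_item in enumerate(hyp, start=1):
            cost = 0 if ref_item == hyp_item else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance over reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance over reference length."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)
```

For example, `wer("o gato preto", "o gato branco")` is one substitution over three reference words, about 0.33. The values in the model metadata are expressed as percentages, so a WER of 8.7 corresponds to 0.087 on this scale.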

Citation

If you want to cite this model you can use this:

@misc{grosman2021xlsr-1b-portuguese,
  title={Fine-tuned {XLS-R} 1{B} model for speech recognition in {P}ortuguese},
  author={Grosman, Jonatas},
  howpublished={\url{https://huggingface.co/jonatasgrosman/wav2vec2-xls-r-1b-portuguese}},
  year={2022}
}