---
language: fr
license: apache-2.0
library_name: transformers
tags:
  - automatic-speech-recognition
  - hf-asr-leaderboard
  - robust-speech-event
  - CTC
  - Wav2vec2
datasets:
  - common_voice
  - mozilla-foundation/common_voice_11_0
  - facebook/multilingual_librispeech
  - facebook/voxpopuli
  - gigant/african_accented_french
metrics:
  - wer
base_model: LeBenchmark/wav2vec2-FR-7K-large
model-index:
  - name: Fine-tuned wav2vec2-FR-7K-large model for ASR in French
    results:
      - task:
          type: automatic-speech-recognition
          name: Automatic Speech Recognition
        dataset:
          name: Common Voice 11.0
          type: mozilla-foundation/common_voice_11_0
          args: fr
        metrics:
          - type: wer
            value: 11.44
            name: Test WER
          - type: wer
            value: 9.66
            name: Test WER (+LM)
      - task:
          type: automatic-speech-recognition
          name: Automatic Speech Recognition
        dataset:
          name: Multilingual LibriSpeech (MLS)
          type: facebook/multilingual_librispeech
          args: french
        metrics:
          - type: wer
            value: 5.93
            name: Test WER
          - type: wer
            value: 5.13
            name: Test WER (+LM)
      - task:
          type: automatic-speech-recognition
          name: Automatic Speech Recognition
        dataset:
          name: VoxPopuli
          type: facebook/voxpopuli
          args: fr
        metrics:
          - type: wer
            value: 9.33
            name: Test WER
          - type: wer
            value: 8.51
            name: Test WER (+LM)
      - task:
          type: automatic-speech-recognition
          name: Automatic Speech Recognition
        dataset:
          name: African Accented French
          type: gigant/african_accented_french
          args: fr
        metrics:
          - type: wer
            value: 16.22
            name: Test WER
          - type: wer
            value: 15.39
            name: Test WER (+LM)
      - task:
          type: automatic-speech-recognition
          name: Automatic Speech Recognition
        dataset:
          name: Robust Speech Event - Dev Data
          type: speech-recognition-community-v2/dev_data
          args: fr
        metrics:
          - type: wer
            value: 16.56
            name: Test WER
          - type: wer
            value: 12.96
            name: Test WER (+LM)
      - task:
          type: automatic-speech-recognition
          name: Automatic Speech Recognition
        dataset:
          name: Fleurs
          type: google/fleurs
          args: fr_fr
        metrics:
          - type: wer
            value: 10.1
            name: Test WER
          - type: wer
            value: 8.84
            name: Test WER (+LM)
---

Fine-tuned wav2vec2-FR-7K-large model for ASR in French

This model is a fine-tuned version of LeBenchmark/wav2vec2-FR-7K-large, trained on a composite dataset comprising over 2,200 hours of French speech audio, drawn from the train and validation splits of Common Voice 11.0, Multilingual LibriSpeech, VoxPopuli, Multilingual TEDx, MediaSpeech, and African Accented French. When using the model, make sure that your speech input is sampled at 16 kHz.
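
For a quick check, the high-level pipeline API also works: it decodes and resamples common audio formats through ffmpeg before inference. A minimal sketch (the file name is illustrative; depending on your installed dependencies, the pipeline may or may not pick up the bundled language model):

from transformers import pipeline

# Input files are decoded and resampled to 16 kHz automatically.
asr = pipeline("automatic-speech-recognition", model="bhuang/asr-wav2vec2-french")
print(asr("example.wav")["text"])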

Usage

  1. To use on a local audio file with the language model
import torch
import torchaudio

from transformers import AutoModelForCTC, Wav2Vec2ProcessorWithLM

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

model = AutoModelForCTC.from_pretrained("bhuang/asr-wav2vec2-french").to(device)
processor_with_lm = Wav2Vec2ProcessorWithLM.from_pretrained("bhuang/asr-wav2vec2-french")
model_sample_rate = processor_with_lm.feature_extractor.sampling_rate

wav_path = "example.wav"  # path to your audio file
waveform, sample_rate = torchaudio.load(wav_path)
waveform = waveform.squeeze(dim=0)  # drop the channel dimension (assumes mono audio)

# resample
if sample_rate != model_sample_rate:
    resampler = torchaudio.transforms.Resample(sample_rate, model_sample_rate)
    waveform = resampler(waveform)

# normalize
input_dict = processor_with_lm(waveform, sampling_rate=model_sample_rate, return_tensors="pt")

with torch.inference_mode():
    logits = model(input_dict.input_values.to(device)).logits

# decode with the n-gram language model (beam search)
predicted_sentence = processor_with_lm.batch_decode(logits.cpu().numpy()).text[0]
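
The LM decoder exposes pyctcdecode's beam-search parameters through batch_decode. Replacing the last line of the example above, with illustrative values (these are assumptions, not the settings behind the reported WER figures):

predicted_sentence = processor_with_lm.batch_decode(
    logits.cpu().numpy(),
    beam_width=100,  # wider beams trade decoding speed for accuracy
    alpha=0.5,       # language model weight
    beta=1.5,        # word insertion bonus
).text[0]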
  2. To use on a local audio file without the language model
import torch
import torchaudio

from transformers import AutoModelForCTC, Wav2Vec2Processor

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

model = AutoModelForCTC.from_pretrained("bhuang/asr-wav2vec2-french").to(device)
processor = Wav2Vec2Processor.from_pretrained("bhuang/asr-wav2vec2-french")
model_sample_rate = processor.feature_extractor.sampling_rate

wav_path = "example.wav"  # path to your audio file
waveform, sample_rate = torchaudio.load(wav_path)
waveform = waveform.squeeze(dim=0)  # drop the channel dimension (assumes mono audio)

# resample
if sample_rate != model_sample_rate:
    resampler = torchaudio.transforms.Resample(sample_rate, model_sample_rate)
    waveform = resampler(waveform)

# normalize
input_dict = processor(waveform, sampling_rate=model_sample_rate, return_tensors="pt")

with torch.inference_mode():
    logits = model(input_dict.input_values.to(device)).logits

# decode
predicted_ids = torch.argmax(logits, dim=-1)
predicted_sentence = processor.batch_decode(predicted_ids)[0]
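
For long recordings, the pipeline API can transcribe in overlapping chunks; this mirrors the chunk_length_s and stride_length_s flags used in the evaluation commands below. A minimal sketch (the file name is illustrative):

import torch
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="bhuang/asr-wav2vec2-french",
    device=0 if torch.cuda.is_available() else -1,
)

# Transcribe in 30-second windows with 5 seconds of striding context on each side.
print(asr("long_example.wav", chunk_length_s=30.0, stride_length_s=5.0)["text"])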

Evaluation

  1. To evaluate on mozilla-foundation/common_voice_11_0
python eval.py \
  --model_id "bhuang/asr-wav2vec2-french" \
  --dataset "mozilla-foundation/common_voice_11_0" \
  --config "fr" \
  --split "test" \
  --log_outputs \
  --outdir "outputs/results_mozilla-foundatio_common_voice_11_0_with_lm"
  2. To evaluate on speech-recognition-community-v2/dev_data
python eval.py \
  --model_id "bhuang/asr-wav2vec2-french" \
  --dataset "speech-recognition-community-v2/dev_data" \
  --config "fr" \
  --split "validation" \
  --chunk_length_s 30.0 \
  --stride_length_s 5.0 \
  --log_outputs \
  --outdir "outputs/results_speech-recognition-community-v2_dev_data_with_lm"