Edit model card

Fine-tuned wav2vec2-FR-7K-large model for ASR in French

Model architecture Model size Language

This model is a fine-tuned version of LeBenchmark/wav2vec2-FR-7K-large, trained on a composite dataset comprising of over 2200 hours of French speech audio, using the train and validation splits of Common Voice 11.0, Multilingual LibriSpeech, Voxpopuli, Multilingual TEDx, MediaSpeech, and African Accented French. When using the model make sure that your speech input is also sampled at 16Khz.

Usage

  1. To use on a local audio file with the language model
import torch
import torchaudio

from transformers import AutoModelForCTC, Wav2Vec2ProcessorWithLM

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

model = AutoModelForCTC.from_pretrained("bhuang/asr-wav2vec2-french").to(device)
processor_with_lm = Wav2Vec2ProcessorWithLM.from_pretrained("bhuang/asr-wav2vec2-french")
model_sample_rate = processor_with_lm.feature_extractor.sampling_rate

wav_path = "example.wav"  # path to your audio file
waveform, sample_rate = torchaudio.load(wav_path)
waveform = waveform.squeeze(axis=0)  # mono

# resample
if sample_rate != model_sample_rate:
    resampler = torchaudio.transforms.Resample(sample_rate, model_sample_rate)
    waveform = resampler(waveform)

# normalize
input_dict = processor_with_lm(waveform, sampling_rate=model_sample_rate, return_tensors="pt")

with torch.inference_mode():
    logits = model(input_dict.input_values.to(device)).logits

predicted_sentence = processor_with_lm.batch_decode(logits.cpu().numpy()).text[0]
  1. To use on a local audio file without the language model
import torch
import torchaudio

from transformers import AutoModelForCTC, Wav2Vec2Processor

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

model = AutoModelForCTC.from_pretrained("bhuang/asr-wav2vec2-french").to(device)
processor = Wav2Vec2Processor.from_pretrained("bhuang/asr-wav2vec2-french")
model_sample_rate = processor.feature_extractor.sampling_rate

wav_path = "example.wav"  # path to your audio file
waveform, sample_rate = torchaudio.load(wav_path)
waveform = waveform.squeeze(axis=0)  # mono

# resample
if sample_rate != model_sample_rate:
    resampler = torchaudio.transforms.Resample(sample_rate, model_sample_rate)
    waveform = resampler(waveform)

# normalize
input_dict = processor(waveform, sampling_rate=model_sample_rate, return_tensors="pt")

with torch.inference_mode():
    logits = model(input_dict.input_values.to(device)).logits

# decode
predicted_ids = torch.argmax(logits, dim=-1)
predicted_sentence = processor.batch_decode(predicted_ids)[0]

Evaluation

  1. To evaluate on mozilla-foundation/common_voice_11_0
python eval.py \
  --model_id "bhuang/asr-wav2vec2-french" \
  --dataset "mozilla-foundation/common_voice_11_0" \
  --config "fr" \
  --split "test" \
  --log_outputs \
  --outdir "outputs/results_mozilla-foundatio_common_voice_11_0_with_lm"
  1. To evaluate on speech-recognition-community-v2/dev_data
python eval.py \
  --model_id "bhuang/asr-wav2vec2-french" \
  --dataset "speech-recognition-community-v2/dev_data" \
  --config "fr" \
  --split "validation" \
  --chunk_length_s 30.0 \
  --stride_length_s 5.0 \
  --log_outputs \
  --outdir "outputs/results_speech-recognition-community-v2_dev_data_with_lm"
Downloads last month
444
Safetensors
Model size
315M params
Tensor type
F32
Β·

Datasets used to train bofenghuang/asr-wav2vec2-ctc-french

Spaces using bofenghuang/asr-wav2vec2-ctc-french 2

Collection including bofenghuang/asr-wav2vec2-ctc-french

Evaluation results