---
license: apache-2.0
language: fr
library_name: transformers
thumbnail: null
tags:
- automatic-speech-recognition
- hf-asr-leaderboard
- robust-speech-event
- CTC
- Wav2vec2
datasets:
- common_voice
- mozilla-foundation/common_voice_11_0
- facebook/multilingual_librispeech
- facebook/voxpopuli
- gigant/african_accented_french
metrics:
- wer
model-index:
- name: Fine-tuned wav2vec2-FR-7K-large model for ASR in French
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Common Voice 11.0
      type: mozilla-foundation/common_voice_11_0
      args: fr
    metrics:
    - name: Test WER
      type: wer
      value: 11.44
    - name: Test WER (+LM)
      type: wer
      value: 9.66
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Multilingual LibriSpeech (MLS)
      type: facebook/multilingual_librispeech
      args: french
    metrics:
    - name: Test WER
      type: wer
      value: 5.93
    - name: Test WER (+LM)
      type: wer
      value: 5.13
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: VoxPopuli
      type: facebook/voxpopuli
      args: fr
    metrics:
    - name: Test WER
      type: wer
      value: 9.33
    - name: Test WER (+LM)
      type: wer
      value: 8.51
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: African Accented French
      type: gigant/african_accented_french
      args: fr
    metrics:
    - name: Test WER
      type: wer
      value: 16.22
    - name: Test WER (+LM)
      type: wer
      value: 15.39
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Robust Speech Event - Dev Data
      type: speech-recognition-community-v2/dev_data
      args: fr
    metrics:
    - name: Test WER
      type: wer
      value: 16.56
    - name: Test WER (+LM)
      type: wer
      value: 12.96
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Fleurs
      type: google/fleurs
      args: fr_fr
    metrics:
    - name: Test WER
      type: wer
      value: 10.10
    - name: Test WER (+LM)
      type: wer
      value: 8.84
---

# Fine-tuned wav2vec2-FR-7K-large model for ASR in French

![Model architecture](https://img.shields.io/badge/Model_Architecture-Wav2Vec2--CTC-lightgrey) ![Model size](https://img.shields.io/badge/Params-315M-lightgrey) ![Language](https://img.shields.io/badge/Language-French-lightgrey)

This model is a fine-tuned version of [LeBenchmark/wav2vec2-FR-7K-large](https://huggingface.co/LeBenchmark/wav2vec2-FR-7K-large), trained on a composite dataset comprising over 2,200 hours of French speech audio, using the train and validation splits of [Common Voice 11.0](https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0), [Multilingual LibriSpeech](https://huggingface.co/datasets/facebook/multilingual_librispeech), [Voxpopuli](https://github.com/facebookresearch/voxpopuli), [Multilingual TEDx](http://www.openslr.org/100), [MediaSpeech](https://www.openslr.org/108), and [African Accented French](https://huggingface.co/datasets/gigant/african_accented_french). When using the model, make sure that your speech input is also sampled at 16 kHz.

## Usage
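For a quick test, the checkpoint can also be run through the high-level `pipeline` API. The snippet below is a minimal sketch rather than part of the original recipe: `example.wav` is a placeholder path, decoding local files this way requires `ffmpeg`, and the chunking values are only illustrative.

```python
from transformers import pipeline

# Load the fine-tuned French model; device=0 uses the first GPU, -1 runs on CPU.
asr = pipeline(
    "automatic-speech-recognition",
    model="bhuang/asr-wav2vec2-french",
    device=0,
)

# Transcribe a local file; chunking with overlap helps on long recordings.
result = asr("example.wav", chunk_length_s=30, stride_length_s=5)
print(result["text"])
```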
1. To use on a local audio file with the language model

```python
import torch
import torchaudio
from transformers import AutoModelForCTC, Wav2Vec2ProcessorWithLM

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

model = AutoModelForCTC.from_pretrained("bhuang/asr-wav2vec2-french").to(device)
processor_with_lm = Wav2Vec2ProcessorWithLM.from_pretrained("bhuang/asr-wav2vec2-french")
model_sample_rate = processor_with_lm.feature_extractor.sampling_rate

wav_path = "example.wav"  # path to your audio file
waveform, sample_rate = torchaudio.load(wav_path)
waveform = waveform.squeeze(axis=0)  # mono

# resample
if sample_rate != model_sample_rate:
    resampler = torchaudio.transforms.Resample(sample_rate, model_sample_rate)
    waveform = resampler(waveform)

# normalize
input_dict = processor_with_lm(waveform, sampling_rate=model_sample_rate, return_tensors="pt")

with torch.inference_mode():
    logits = model(input_dict.input_values.to(device)).logits

predicted_sentence = processor_with_lm.batch_decode(logits.cpu().numpy()).text[0]
```

2. To use on a local audio file without the language model

```python
import torch
import torchaudio
from transformers import AutoModelForCTC, Wav2Vec2Processor

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

model = AutoModelForCTC.from_pretrained("bhuang/asr-wav2vec2-french").to(device)
processor = Wav2Vec2Processor.from_pretrained("bhuang/asr-wav2vec2-french")
model_sample_rate = processor.feature_extractor.sampling_rate

wav_path = "example.wav"  # path to your audio file
waveform, sample_rate = torchaudio.load(wav_path)
waveform = waveform.squeeze(axis=0)  # mono

# resample
if sample_rate != model_sample_rate:
    resampler = torchaudio.transforms.Resample(sample_rate, model_sample_rate)
    waveform = resampler(waveform)

# normalize
input_dict = processor(waveform, sampling_rate=model_sample_rate, return_tensors="pt")

with torch.inference_mode():
    logits = model(input_dict.input_values.to(device)).logits

# decode
predicted_ids = torch.argmax(logits, dim=-1)
predicted_sentence = processor.batch_decode(predicted_ids)[0]
```

## Evaluation

1. To evaluate on `mozilla-foundation/common_voice_11_0`

```bash
python eval.py \
    --model_id "bhuang/asr-wav2vec2-french" \
    --dataset "mozilla-foundation/common_voice_11_0" \
    --config "fr" \
    --split "test" \
    --log_outputs \
    --outdir "outputs/results_mozilla-foundation_common_voice_11_0_with_lm"
```

2. To evaluate on `speech-recognition-community-v2/dev_data`

```bash
python eval.py \
    --model_id "bhuang/asr-wav2vec2-french" \
    --dataset "speech-recognition-community-v2/dev_data" \
    --config "fr" \
    --split "validation" \
    --chunk_length_s 30.0 \
    --stride_length_s 5.0 \
    --log_outputs \
    --outdir "outputs/results_speech-recognition-community-v2_dev_data_with_lm"
```
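If you want to score your own transcriptions against reference text without the evaluation script above, a minimal word error rate check can be done with the `jiwer` library. This is a sketch for illustration, not part of the original evaluation setup, and the strings below are placeholders.

```python
from jiwer import wer

# Reference transcript and model hypothesis for the same utterance (placeholders).
reference = "bonjour à tous et bienvenue"
hypothesis = "bonjour à tout et bienvenue"

# WER = (substitutions + deletions + insertions) / number of reference words.
print(f"WER: {wer(reference, hypothesis):.2%}")
```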