---
language:
- fr
license: apache-2.0
tags:
- automatic-speech-recognition
- mozilla-foundation/common_voice_9_0
- generated_from_trainer
- hf-asr-leaderboard
- robust-speech-event
datasets:
- common_voice
- mozilla-foundation/common_voice_9_0
model-index:
- name: Fine-tuned Wav2Vec2 XLS-R 1B model for ASR in French
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Common Voice 9
      type: mozilla-foundation/common_voice_9_0
      args: fr
    metrics:
    - name: Test WER
      type: wer
      value: 12.72
    - name: Test CER
      type: cer
      value: 3.78
    - name: Test WER (+LM)
      type: wer
      value: 10.60
    - name: Test CER (+LM)
      type: cer
      value: 3.41
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Robust Speech Event - Dev Data
      type: speech-recognition-community-v2/dev_data
      args: fr
    metrics:
    - name: Test WER
      type: wer
      value: 24.28
    - name: Test CER
      type: cer
      value: 11.46
    - name: Test WER (+LM)
      type: wer
      value: 20.85
    - name: Test CER (+LM)
      type: cer
      value: 11.09
---

# Fine-tuned Wav2Vec2 XLS-R 1B model for ASR in French

This model is a fine-tuned version of [facebook/wav2vec2-xls-r-1b](https://huggingface.co/facebook/wav2vec2-xls-r-1b) on the MOZILLA-FOUNDATION/COMMON_VOICE_9_0 - FR dataset.

## Usage

1. To use on a local audio file without the language model

```python
import torch
import torchaudio
from transformers import AutoModelForCTC, Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained("bhuang/wav2vec2-xls-r-1b-cv9-fr")
model = AutoModelForCTC.from_pretrained("bhuang/wav2vec2-xls-r-1b-cv9-fr").cuda()

# path to your audio file
wav_path = "/projects/bhuang/corpus/speech/multilingual-tedx/fr-fr/flac/09UU0I9gLNc_0.flac"

waveform, sample_rate = torchaudio.load(wav_path)
waveform = waveform.squeeze(0)  # mono

# resample
if sample_rate != 16_000:
    resampler = torchaudio.transforms.Resample(sample_rate, 16_000)
    waveform = resampler(waveform)

# normalize
input_dict = processor(waveform, sampling_rate=16_000, return_tensors="pt")

with torch.inference_mode():
    logits = model(input_dict.input_values.to("cuda")).logits

# decode
predicted_ids = torch.argmax(logits, dim=-1)
predicted_sentence = processor.batch_decode(predicted_ids)[0]
```

2. To use on a local audio file with the language model

```python
import torch
import torchaudio
from transformers import AutoModelForCTC, Wav2Vec2ProcessorWithLM

processor_with_lm = Wav2Vec2ProcessorWithLM.from_pretrained("bhuang/wav2vec2-xls-r-1b-cv9-fr")
model = AutoModelForCTC.from_pretrained("bhuang/wav2vec2-xls-r-1b-cv9-fr").cuda()

model_sampling_rate = processor_with_lm.feature_extractor.sampling_rate

# path to your audio file
wav_path = "/projects/bhuang/corpus/speech/multilingual-tedx/fr-fr/flac/09UU0I9gLNc_0.flac"

waveform, sample_rate = torchaudio.load(wav_path)
waveform = waveform.squeeze(0)  # mono

# resample
if sample_rate != model_sampling_rate:
    resampler = torchaudio.transforms.Resample(sample_rate, model_sampling_rate)
    waveform = resampler(waveform)

# normalize
input_dict = processor_with_lm(waveform, sampling_rate=model_sampling_rate, return_tensors="pt")

with torch.inference_mode():
    logits = model(input_dict.input_values.to("cuda")).logits

# decode with the n-gram language model
predicted_sentence = processor_with_lm.batch_decode(logits.cpu().numpy()).text[0]
```
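
For quick experiments, the same checkpoint can also be wrapped in the `transformers` `pipeline` API. The snippet below is a minimal sketch and not part of the original recipe: `audio.wav` is a placeholder path, and the pipeline is left to handle resampling, feature extraction, and decoding internally.

```python
import torch
import torchaudio
from transformers import pipeline

# Sketch (assumption, not from the original card): wrap the checkpoint in the
# ASR pipeline and let it handle preprocessing and decoding.
asr = pipeline(
    "automatic-speech-recognition",
    model="bhuang/wav2vec2-xls-r-1b-cv9-fr",
    device=0 if torch.cuda.is_available() else -1,
)

# "audio.wav" is a placeholder path for illustration
waveform, sample_rate = torchaudio.load("audio.wav")
result = asr({"raw": waveform.squeeze(0).numpy(), "sampling_rate": sample_rate})
print(result["text"])
```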
## Evaluation

1. To evaluate on `mozilla-foundation/common_voice_9_0`

```bash
python eval.py \
    --model_id "bhuang/wav2vec2-xls-r-1b-cv9-fr" \
    --dataset "mozilla-foundation/common_voice_9_0" \
    --config "fr" \
    --split "test" \
    --log_outputs \
    --outdir "outputs/results_mozilla-foundation_common_voice_9_0_with_lm"
```

2. To evaluate on `speech-recognition-community-v2/dev_data`

```bash
python eval.py \
    --model_id "bhuang/wav2vec2-xls-r-1b-cv9-fr" \
    --dataset "speech-recognition-community-v2/dev_data" \
    --config "fr" \
    --split "validation" \
    --chunk_length_s 5.0 \
    --stride_length_s 1.0 \
    --log_outputs \
    --outdir "outputs/results_speech-recognition-community-v2_dev_data_with_lm"
```
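
For a quick sanity check without the `eval.py` script, WER and CER can also be computed on a small slice of the Common Voice test split with the `datasets` and `evaluate` libraries. The sketch below is an assumption: it uses plain greedy decoding and lowercasing only, so the scores will not exactly match the normalization behind the reported results, and the gated dataset requires accepting its terms and logging in with `huggingface-cli login`.

```python
import torch
from datasets import Audio, load_dataset
from evaluate import load
from transformers import AutoModelForCTC, Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained("bhuang/wav2vec2-xls-r-1b-cv9-fr")
model = AutoModelForCTC.from_pretrained("bhuang/wav2vec2-xls-r-1b-cv9-fr").eval().cuda()

# small slice of the test split; the dataset is gated on the Hub
ds = load_dataset("mozilla-foundation/common_voice_9_0", "fr", split="test[:16]")
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))

wer_metric, cer_metric = load("wer"), load("cer")

predictions, references = [], []
for sample in ds:
    inputs = processor(sample["audio"]["array"], sampling_rate=16_000, return_tensors="pt")
    with torch.inference_mode():
        logits = model(inputs.input_values.to("cuda")).logits
    # greedy CTC decoding, no language model
    predictions.append(processor.batch_decode(torch.argmax(logits, dim=-1))[0].lower())
    references.append(sample["sentence"].lower())

print("WER:", wer_metric.compute(predictions=predictions, references=references))
print("CER:", cer_metric.compute(predictions=predictions, references=references))
```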