---
language:
- fr
license: apache-2.0
tags:
- whisper-event
- generated_from_trainer
datasets:
- mozilla-foundation/common_voice_11_0
metrics:
- wer
- cer
base_model: openai/whisper-large-v2
model-index:
- name: Whisper Large French
  results:
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: mozilla-foundation/common_voice_11_0 fr
      type: mozilla-foundation/common_voice_11_0
      config: fr
      split: test
      args: fr
    metrics:
    - type: wer
      value: 9.086701085988961
      name: WER
    - type: cer
      value: 3.327312134958326
      name: CER
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: google/fleurs fr_fr
      type: google/fleurs
      config: fr_fr
      split: test
      args: fr_fr
    metrics:
    - type: wer
      value: 8.6863088842391
      name: WER
    - type: cer
      value: 5.089870653452041
      name: CER
---

# Whisper Large French

This model is a version of [openai/whisper-large-v2](https://huggingface.co/openai/whisper-large-v2) fine-tuned on French using the train split of [Common Voice 11](https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0).

## Usage

```python
from transformers import pipeline

# Load the fine-tuned checkpoint into a speech-recognition pipeline.
transcriber = pipeline(
    "automatic-speech-recognition",
    model="jonatasgrosman/whisper-large-fr-cv11"
)

# Force the decoder prompt to French transcription so that Whisper
# neither auto-detects the language nor translates.
transcriber.model.config.forced_decoder_ids = (
    transcriber.tokenizer.get_decoder_prompt_ids(
        language="fr",
        task="transcribe"
    )
)

# The pipeline returns a dict; the text is under the "text" key.
transcription = transcriber("path/to/my_audio.wav")
```

## Evaluation

I evaluated the model on the test split of two datasets: [Common Voice 11](https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0) (the same dataset used for fine-tuning) and [Fleurs](https://huggingface.co/datasets/google/fleurs) (not seen during fine-tuning). Because Whisper can transcribe casing and punctuation, I ran the evaluation in two scenarios: one on the raw text and one on normalized text (lowercased, with punctuation removed).
Additionally, for the Fleurs dataset, I evaluated the model on only those samples whose transcriptions contain no numerical values. Fleurs writes numerical values differently from Common Voice, the dataset used for fine-tuning, so this mismatch is expected to hurt the model's performance on that kind of transcription in Fleurs.

### Common Voice 11

| Model | CER (%) | WER (%) |
| --- | --- | --- |
| [jonatasgrosman/whisper-large-fr-cv11](https://huggingface.co/jonatasgrosman/whisper-large-fr-cv11) | 4.31 | 13.66 |
| [jonatasgrosman/whisper-large-fr-cv11](https://huggingface.co/jonatasgrosman/whisper-large-fr-cv11) + text normalization | 3.33 | 9.09 |
| [openai/whisper-large-v2](https://huggingface.co/openai/whisper-large-v2) | 7.17 | 18.99 |
| [openai/whisper-large-v2](https://huggingface.co/openai/whisper-large-v2) + text normalization | 5.74 | 12.82 |

### Fleurs

| Model | CER (%) | WER (%) |
| --- | --- | --- |
| [jonatasgrosman/whisper-large-fr-cv11](https://huggingface.co/jonatasgrosman/whisper-large-fr-cv11) | 4.96 | 14.24 |
| [jonatasgrosman/whisper-large-fr-cv11](https://huggingface.co/jonatasgrosman/whisper-large-fr-cv11) + text normalization | 5.09 | 8.69 |
| [jonatasgrosman/whisper-large-fr-cv11](https://huggingface.co/jonatasgrosman/whisper-large-fr-cv11) + keep only non-numeric samples | 3.14 | 12.10 |
| [jonatasgrosman/whisper-large-fr-cv11](https://huggingface.co/jonatasgrosman/whisper-large-fr-cv11) + text normalization + keep only non-numeric samples | 3.60 | 6.94 |
| [openai/whisper-large-v2](https://huggingface.co/openai/whisper-large-v2) | 3.55 | 12.81 |
| [openai/whisper-large-v2](https://huggingface.co/openai/whisper-large-v2) + text normalization | 3.76 | 7.59 |
| [openai/whisper-large-v2](https://huggingface.co/openai/whisper-large-v2) + keep only non-numeric samples | 3.12 | 11.24 |
| [openai/whisper-large-v2](https://huggingface.co/openai/whisper-large-v2) + text normalization + keep only non-numeric samples | 3.65 | 6.99 |
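
The evaluation scenarios above (raw text, normalized text, and non-numeric samples only) can be illustrated with a small dependency-free sketch. This is not the script that produced the numbers in the tables. The helpers `normalize`, `has_digits`, `edit_distance` and `error_rate` are hypothetical names, the exact normalization rule (which characters count as punctuation, keeping apostrophes) is an assumption, and so is the idea that "non-numeric" means "the reference contains no digits". Error rates are computed at corpus level, as total edits divided by total reference units.

```python
import string

# Assumption: punctuation = ASCII punctuation plus a few French marks,
# minus the apostrophe, which is kept because it is part of French words.
_PUNCTUATION = set(string.punctuation + "«»…") - {"'"}
_TABLE = str.maketrans({ch: " " for ch in _PUNCTUATION})


def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    return " ".join(text.lower().translate(_TABLE).split())


def has_digits(text: str) -> bool:
    """Assumed criterion for a 'numeric' sample: any digit in the text."""
    return any(ch.isdigit() for ch in text)


def edit_distance(ref, hyp) -> int:
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution (or match)
            ))
        prev = curr
    return prev[-1]


def error_rate(references, hypotheses, unit="word") -> float:
    """Corpus-level WER (unit='word') or CER (unit='char'), in percent."""
    errors = total = 0
    for ref, hyp in zip(references, hypotheses):
        ref_units = ref.split() if unit == "word" else list(ref)
        hyp_units = hyp.split() if unit == "word" else list(hyp)
        errors += edit_distance(ref_units, hyp_units)
        total += len(ref_units)
    return 100.0 * errors / total


# Toy data, made up for illustration only.
refs = ["Bonjour, le monde !", "Il a 25 ans."]
hyps = ["bonjour le monde", "il a vingt-cinq ans"]

# Scenario 1: raw text.
raw_wer = error_rate(refs, hyps)

# Scenario 2: normalized text.
norm_refs = [normalize(r) for r in refs]
norm_hyps = [normalize(h) for h in hyps]
norm_wer = error_rate(norm_refs, norm_hyps)

# Scenario 3: normalized text, keeping only non-numeric samples.
kept = [(r, h) for r, h in zip(norm_refs, norm_hyps) if not has_digits(r)]
kept_wer = error_rate([r for r, _ in kept], [h for _, h in kept])

print(raw_wer, norm_wer, kept_wer)
```

In the toy data the second reference writes the number as digits while the hypothesis spells it out, which is the kind of mismatch the non-numeric filter is meant to exclude.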