---
language:
- fr
license: apache-2.0
tags:
- whisper-event
- generated_from_trainer
datasets:
- mozilla-foundation/common_voice_11_0
metrics:
- wer
- cer
base_model: openai/whisper-large-v2
model-index:
- name: Whisper Large French
  results:
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: mozilla-foundation/common_voice_11_0 fr
      type: mozilla-foundation/common_voice_11_0
      config: fr
      split: test
      args: fr
    metrics:
    - type: wer
      value: 9.086701085988961
      name: WER
    - type: cer
      value: 3.327312134958326
      name: CER
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: google/fleurs fr_fr
      type: google/fleurs
      config: fr_fr
      split: test
      args: fr_fr
    metrics:
    - type: wer
      value: 8.6863088842391
      name: WER
    - type: cer
      value: 5.089870653452041
      name: CER
---
# Whisper Large French

This model is a fine-tuned version of [openai/whisper-large-v2](https://huggingface.co/openai/whisper-large-v2) on French, using the train split of Common Voice 11.
## Usage

```python
from transformers import pipeline

transcriber = pipeline(
    "automatic-speech-recognition",
    model="jonatasgrosman/whisper-large-fr-cv11"
)
transcriber.model.config.forced_decoder_ids = (
    transcriber.tokenizer.get_decoder_prompt_ids(
        language="fr",
        task="transcribe"
    )
)
transcription = transcriber("path/to/my_audio.wav")
```
## Evaluation

I've evaluated the model on the test splits of two datasets: Common Voice 11 (the same dataset used for fine-tuning) and Fleurs (a dataset not seen during fine-tuning). Since Whisper can transcribe casing and punctuation, I've evaluated the model in two scenarios: one using the raw text, and one using normalized text (lowercase + punctuation removal). Additionally, for the Fleurs dataset, I've evaluated the model on a subset with no transcriptions of numerical values, since numerical values are written out differently in Fleurs than in the fine-tuning dataset (Common Voice), and this mismatch is expected to hurt the model's performance on that type of transcription in Fleurs.
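The "text normalization" scenario can be approximated with a few lines of standard-library Python. This is only a sketch of the lowercase + punctuation-removal step described above, not the exact normalization script used for the reported numbers:

```python
import re
import string


def normalize(text: str) -> str:
    """Approximate the evaluation's text normalization:
    lowercase, strip punctuation, collapse whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()


print(normalize("Bonjour, le Monde !"))  # → bonjour le monde
```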
### Common Voice 11

| Model | CER | WER |
| --- | --- | --- |
| jonatasgrosman/whisper-large-fr-cv11 | 4.31 | 13.66 |
| jonatasgrosman/whisper-large-fr-cv11 + text normalization | 3.33 | 9.09 |
| openai/whisper-large-v2 | 7.17 | 18.99 |
| openai/whisper-large-v2 + text normalization | 5.74 | 12.82 |
### Fleurs

| Model | CER | WER |
| --- | --- | --- |
| jonatasgrosman/whisper-large-fr-cv11 | 4.96 | 14.24 |
| jonatasgrosman/whisper-large-fr-cv11 + text normalization | 5.09 | 8.69 |
| jonatasgrosman/whisper-large-fr-cv11 + keep only non-numeric samples | 3.14 | 12.10 |
| jonatasgrosman/whisper-large-fr-cv11 + text normalization + keep only non-numeric samples | 3.60 | 6.94 |
| openai/whisper-large-v2 | 3.55 | 12.81 |
| openai/whisper-large-v2 + text normalization | 3.76 | 7.59 |
| openai/whisper-large-v2 + keep only non-numeric samples | 3.12 | 11.24 |
| openai/whisper-large-v2 + text normalization + keep only non-numeric samples | 3.65 | 6.99 |
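Both WER and CER are Levenshtein edit distances normalized by reference length, computed over words and characters respectively. A minimal standard-library sketch of these metrics (not the exact evaluation script used for the tables above):

```python
def edit_distance(ref, hyp):
    """Classic Levenshtein distance between two sequences."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[-1][-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edits / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    return edit_distance(ref, hyp) / len(ref)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edits / reference length."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


# One substituted word out of four → WER 0.25
print(wer("bonjour tout le monde", "bonjour tous le monde"))  # → 0.25
```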