Whisper-Large-V3-Illuin-French

This model is a fine-tuned variant of OpenAI's whisper-large-v3 model. It has been fine-tuned on a dataset of more than 18,000 hours of French speech.

The model has been converted to several other formats, and tested in each, so that it can be used with the most popular inference frameworks:

transformers
openai-whisper
faster-whisper
whisper.cpp

The models can be found in this collection

Training details

Dataset Composition:

The dataset is a compilation of various popular French ASR (Automatic Speech Recognition) datasets, including:

CommonVoice 13 French
LibriSpeech French
African accented French
TEDx French
VoxPopuli French
Fleurs French

Together, these sources provide a little over 2,500 hours of speech data. The dataset also includes transcribed French speech scraped from the internet. In total, it exceeds 18,000 hours of speech data, which makes it one of the largest French ASR datasets assembled to date.

Dataset processing

We aggressively filtered and cleaned the raw internet dataset through extensive heuristic filtering, together with language verification and quality estimation models. The other data sources did not require as much preprocessing, but their transcripts were verified and rephrased by a Large Language Model (Mixtral 8x7B) to fix punctuation and make minor corrections. We further adapted the dataset to real-world conditions by stochastically subjecting the audio to various compression codecs and by simulating issues such as packet loss, to replicate call-center environments. This extensive preprocessing pipeline yields 18k hours of high-quality labeled French audio, which we use to train our SOTA French ASR models.
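As an illustration of the packet-loss augmentation described above, here is a minimal sketch. It is not the actual pipeline: the function name, the 20 ms packet length, the loss probability and the choice of replacing lost packets with silence are all assumptions made for the example.

```python
import numpy as np

def simulate_packet_loss(audio, sample_rate=16000, packet_ms=20, loss_prob=0.05, seed=None):
    """Return a copy of a mono waveform with randomly chosen packets silenced.

    Hypothetical helper: packet_ms and loss_prob are illustrative defaults,
    not values taken from the training pipeline.
    """
    rng = np.random.default_rng(seed)
    packet_len = int(sample_rate * packet_ms / 1000)   # samples per packet
    out = np.array(audio, dtype=np.float32, copy=True)  # never modify the input
    n_packets = int(np.ceil(len(out) / packet_len))
    lost = rng.random(n_packets) < loss_prob            # one independent draw per packet
    for i in np.flatnonzero(lost):
        out[i * packet_len:(i + 1) * packet_len] = 0.0  # dropped packet becomes silence
    return out
```

Applying the function to only a random subset of training samples would keep the augmentation stochastic, in the spirit of the paragraph above.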

Training

We trained for 2 epochs with an effective batch size of 256, a maximum learning rate of 1e-5 and a linear learning rate scheduler with 500 warmup steps. Because the full dataset is prohibitively large, we used MosaicML's StreamingDataset to stream the dataset samples and to resume instantly mid-epoch.
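The learning rate schedule can be sketched in a few lines of plain Python. The maximum learning rate and the warmup length come from the paragraph above; the linear decay to zero after warmup and the total_steps argument are assumptions, since the exact end point of the schedule is not stated.

```python
def linear_schedule_lr(step, total_steps, max_lr=1e-5, warmup_steps=500):
    """Learning rate at a given optimizer step.

    Linear warmup from 0 to max_lr over warmup_steps, then linear decay
    to 0 at total_steps (the decay target is an assumption).
    """
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    remaining = max(total_steps - step, 0)
    return max_lr * remaining / max(total_steps - warmup_steps, 1)
```

For example, with a hypothetical total_steps of 10,000, the rate climbs to 1e-5 at step 500 and then falls linearly until it reaches 0 at the last step.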

Performance

No publicly available French ASR dataset captured real call-center conditions, in the way the Switchboard dataset does for English. To address this gap, we filtered and cleaned the Accueil_UBS dataset sourced from Ortolang. This preparation made it possible to evaluate ASR models under conditions similar to those encountered in call-center environments.
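The metric is not named in this section. Assuming the usual choice for ASR, word error rate (WER), a minimal sketch of how a transcript could be scored against a reference looks like this. The function name is hypothetical, and text normalization (casing, punctuation) is deliberately left out.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance between the two strings, divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution or match
        prev = curr
    return prev[-1] / len(ref)
```

In practice both strings would typically be normalized the same way before scoring, so that punctuation and casing differences are not counted as errors.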

Inference

We offer the model in various formats to ensure compatibility with the most widely used inference frameworks. Note that the model was not fine-tuned with timestamps, so it cannot accurately predict timestamps on its own. However, leveraging cross-attention enables us to obtain more precise timestamps at a lower computational cost. In most frameworks, enabling this feature involves adding parameters such as without_timestamps=True and word_timestamps=True.

While the model can still receive previous text during inference, its performance under this condition has not been quantitatively evaluated. Additionally, it has been observed with the base OpenAI model that enabling this option raises the risk of hallucination. We therefore advise disabling this option to mitigate potential issues.

Examples:

transformers:

from transformers import AutomaticSpeechRecognitionPipeline, WhisperForConditionalGeneration
from transformers import AutoTokenizer, AutoFeatureExtractor

model_path = "BrunoHays/whisper-large-v3-french-illuin"
# Load the fine-tuned weights, the tokenizer and the audio feature extractor
model = WhisperForConditionalGeneration.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)
feature_extractor = AutoFeatureExtractor.from_pretrained(model_path)
pipe = AutomaticSpeechRecognitionPipeline(model=model, feature_extractor=feature_extractor, tokenizer=tokenizer)
# The model was not fine-tuned with timestamps, so none are requested here
transcript = pipe("audio_samples/short_rd.wav", return_timestamps=False)
print(transcript)

openai-whisper:

import whisper
whisper_model = whisper.load_model("converted_models/openai/whisper-large-small-yt-os-V2")
result = whisper_model.transcribe("long_audio.wav", temperature=0,
                                  condition_on_previous_text=False,
                                  language="french", without_timestamps=True, word_timestamps=True)

faster-whisper:

from faster_whisper import WhisperModel
model = WhisperModel("BrunoHays/whisper-large-v3-french-illuin-ctranslate2-fp16", device="cpu")

segments, info = model.transcribe("long_audio.wav",
                                  without_timestamps=True,
                                  word_timestamps=True,
                                  temperature=0,
                                  condition_on_previous_text=False,
                                  task="transcribe",
                                  language="fr")
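In faster-whisper, transcribe returns the segments lazily as a generator, so nothing is decoded until it is iterated. The sketch below shows one way to walk the result and collect the word-level timestamps enabled by word_timestamps=True. The helper name is hypothetical, and the stand-in objects with made-up values only mimic the words, word, start and end attributes so the snippet runs without downloading the model.

```python
from types import SimpleNamespace

def collect_word_timestamps(segments):
    """Flatten segments into (word, start_seconds, end_seconds) tuples."""
    words = []
    for segment in segments:  # with a real model, iterating is what triggers decoding
        for word in segment.words or []:  # words is None without word_timestamps=True
            words.append((word.word.strip(), float(word.start), float(word.end)))
    return words

# Stand-in data for illustration only; with the real model, pass the
# `segments` value returned by model.transcribe(...) instead.
fake_segments = [
    SimpleNamespace(words=[
        SimpleNamespace(word=" Bonjour", start=0.0, end=0.42),
        SimpleNamespace(word=" madame", start=0.42, end=0.9),
    ]),
    SimpleNamespace(words=None),
]
print(collect_word_timestamps(fake_segments))
```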

whisper.cpp:

 ./main -f long_audio.wav -l fr -mc 0 -m ggml-model.bin
