---
license: apache-2.0
language: fr
library_name: transformers
thumbnail: null
tags:
- automatic-speech-recognition
- hf-asr-leaderboard
- whisper-event
datasets:
- mozilla-foundation/common_voice_11_0
- facebook/multilingual_librispeech
- facebook/voxpopuli
- google/fleurs
- gigant/african_accented_french
metrics:
- wer
model-index:
- name: Fine-tuned whisper-medium model for ASR in French
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Common Voice 11.0
      type: mozilla-foundation/common_voice_11_0
      config: fr
      split: test
      args: fr
    metrics:
    - name: WER (Greedy)
      type: wer
      value: 9.03
    - name: WER (Beam 5)
      type: wer
      value: 8.73
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Multilingual LibriSpeech (MLS)
      type: facebook/multilingual_librispeech
      config: french
      split: test
      args: french
    metrics:
    - name: WER (Greedy)
      type: wer
      value: 4.60
    - name: WER (Beam 5)
      type: wer
      value: 4.44
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: VoxPopuli
      type: facebook/voxpopuli
      config: fr
      split: test
      args: fr
    metrics:
    - name: WER (Greedy)
      type: wer
      value: 9.53
    - name: WER (Beam 5)
      type: wer
      value: 9.46
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Fleurs
      type: google/fleurs
      config: fr_fr
      split: test
      args: fr_fr
    metrics:
    - name: WER (Greedy)
      type: wer
      value: 6.33
    - name: WER (Beam 5)
      type: wer
      value: 5.94
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: African Accented French
      type: gigant/african_accented_french
      config: fr
      split: test
      args: fr
    metrics:
    - name: WER (Greedy)
      type: wer
      value: 4.89
    - name: WER (Beam 5)
      type: wer
      value: 4.56
---

![Model architecture](https://img.shields.io/badge/Model_Architecture-seq2seq-lightgrey)
![Model size](https://img.shields.io/badge/Params-769M-lightgrey)
![Language](https://img.shields.io/badge/Language-French-lightgrey)

# Fine-tuned whisper-medium model for ASR in French

This model is a fine-tuned version of [openai/whisper-medium](https://huggingface.co/openai/whisper-medium), trained on a composite dataset comprising over 2,200 hours of French speech audio, using the train and validation splits of [Common Voice 11.0](https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0), [Multilingual LibriSpeech](https://huggingface.co/datasets/facebook/multilingual_librispeech), [VoxPopuli](https://github.com/facebookresearch/voxpopuli), [Fleurs](https://huggingface.co/datasets/google/fleurs), [Multilingual TEDx](http://www.openslr.org/100), [MediaSpeech](https://www.openslr.org/108), and [African Accented French](https://huggingface.co/datasets/gigant/african_accented_french). When using the model, make sure that your speech input is sampled at 16 kHz. **This model doesn't predict casing or punctuation.**

## Usage

Inference with 🤗 Pipeline

```python
import torch

from datasets import load_dataset
from transformers import pipeline

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Load pipeline
pipe = pipeline("automatic-speech-recognition", model="bofenghuang/whisper-medium-french", device=device)

# NB: set forced_decoder_ids so that generation uses French transcription
pipe.model.config.forced_decoder_ids = pipe.tokenizer.get_decoder_prompt_ids(language="fr", task="transcribe")

# Load data (streaming avoids downloading the whole test split)
ds_mcv_test = load_dataset("mozilla-foundation/common_voice_11_0", "fr", split="test", streaming=True)
test_segment = next(iter(ds_mcv_test))
waveform = test_segment["audio"]  # dict holding the audio array and its sampling rate

# Run
generated_sentences = pipe(waveform, max_new_tokens=225)["text"]  # greedy
# generated_sentences = pipe(waveform, max_new_tokens=225, generate_kwargs={"num_beams": 5})["text"]  # beam search

# Normalise predicted sentences if necessary
```

Inference with 🤗 low-level APIs

```python
import torch
import torchaudio

from datasets import load_dataset
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Load model
model = AutoModelForSpeechSeq2Seq.from_pretrained("bofenghuang/whisper-medium-french").to(device)
processor = AutoProcessor.from_pretrained("bofenghuang/whisper-medium-french", language="french", task="transcribe")

# NB: set forced_decoder_ids so that generation uses French transcription
model.config.forced_decoder_ids = processor.get_decoder_prompt_ids(language="fr", task="transcribe")

model_sample_rate = processor.feature_extractor.sampling_rate  # 16_000

# Load data (streaming avoids downloading the whole test split)
ds_mcv_test = load_dataset("mozilla-foundation/common_voice_11_0", "fr", split="test", streaming=True)
test_segment = next(iter(ds_mcv_test))
# Cast to float32 so the dtype matches the resampler's default kernel
waveform = torch.from_numpy(test_segment["audio"]["array"]).float()
sample_rate = test_segment["audio"]["sampling_rate"]

# Resample to the rate expected by the model
if sample_rate != model_sample_rate:
    resampler = torchaudio.transforms.Resample(sample_rate, model_sample_rate)
    waveform = resampler(waveform)

# Extract input features
inputs = processor(waveform, sampling_rate=model_sample_rate, return_tensors="pt")
input_features = inputs.input_features
input_features = input_features.to(device)

# Generate
generated_ids = model.generate(inputs=input_features, max_new_tokens=225)  # greedy
# generated_ids = model.generate(inputs=input_features, max_new_tokens=225, num_beams=5)  # beam search

# Detokenize
generated_sentences = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

# Normalise predicted sentences if necessary
```
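
Because the model outputs neither casing nor punctuation, reference transcripts usually need the same treatment before predictions are scored against them. The snippet below is a minimal sketch of that normalisation step followed by a plain word error rate (WER) computation. This card does not describe the normalisation behind the reported WER figures, so the rules here (lowercasing, unifying apostrophes, replacing other punctuation with spaces) are assumptions, and `normalise_text` and `word_error_rate` are hypothetical helper names rather than part of any library.

```python
import re


def normalise_text(text: str) -> str:
    """Lowercase, unify apostrophes, drop other punctuation, collapse whitespace."""
    text = text.lower().replace("’", "'")
    # Assumption: keep word characters, whitespace and apostrophes; all else becomes a space
    text = re.sub(r"[^\w\s']", " ", text)
    return " ".join(text.split())


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] = edit distance between the processed reference words and hyp[:j]
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1] / len(ref)


reference = normalise_text("Bonjour, comment allez-vous aujourd’hui ?")
hypothesis = normalise_text("bonjour comment allez vous aujourd'hui")
print(reference)
print(word_error_rate(reference, hypothesis))
```

For evaluation over a full test split, an established implementation such as the `wer` metric from the `evaluate` library or the `jiwer` package is likely a better choice than this hand-rolled function. The sketch only shows what the normalisation comment in the examples above refers to.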